The Ultimate Guide to XML DTD: Understanding the Legacy Approach to XML Validation

1. Introduction

In our ongoing exploration of how to ensure the structure and validity of XML documents, we now turn our attention to an older, yet still relevant technology: XML Document Type Definition (DTD). Before the advent of XML Schema (XSD), DTD was the primary mechanism used to define the structure, elements, attributes, and overall rules that an XML document should adhere to. While XSD has become the more powerful and feature-rich standard for schema definition, understanding DTD is still valuable, especially when dealing with legacy systems or older XML formats that continue to rely on it.

DTD provides a way to formally describe the grammar of an XML document. This includes specifying which elements can appear in the document, the order in which they can appear, whether they can contain other elements or just text, and what attributes each element can have. By validating an XML document against a DTD, you can ensure that it conforms to a predefined structure, which is crucial for data consistency and reliable processing.

This ultimate guide will provide you with a comprehensive understanding of XML DTD. We will explore its syntax, the different types of declarations it supports (including elements, attributes, and entities), and how to associate a DTD with an XML document. While DTD has limitations compared to XSD, it is still important to grasp its concepts and capabilities, especially when you encounter it in real-world scenarios. By the end of this guide, you will have a solid understanding of how DTD works and its role in the history and practice of XML validation.

2. Fundamentals of XML DTD

To begin our exploration of XML DTD, let’s understand its basic structure and key components.

  • DTD Structure: A DTD can be defined in two ways:
    • Internal DTD: The DTD declarations are included directly within the XML document itself, within a <!DOCTYPE> declaration.
    • External DTD: The DTD declarations are stored in a separate file (conventionally with a .dtd extension), and the XML document refers to this file using the <!DOCTYPE> declaration.
  • The <!DOCTYPE> Declaration: Every XML document that uses a DTD must have a <!DOCTYPE> declaration after the XML declaration (if present) and before the root element; only comments and processing instructions may appear in between. This declaration names the root element of the XML document and either contains the DTD declarations or points to their location. It takes one of the following three forms:
1. Internal DTD:
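
A minimal sketch of the internal form. rootElementName is the placeholder name used in this guide, and the single element declaration inside the brackets is only an illustrative assumption:

```xml
<?xml version="1.0"?>
<!-- The declarations sit between [ and ] inside the DOCTYPE itself -->
<!DOCTYPE rootElementName [
  <!ELEMENT rootElementName (#PCDATA)>
]>
<rootElementName>Some text</rootElementName>
```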

2. External DTD (using SYSTEM identifier):
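
A minimal sketch of the SYSTEM form. The file name rootElementName.dtd is a hypothetical example; it is resolved relative to the location of the XML document:

```xml
<?xml version="1.0"?>
<!-- The quoted string is a system identifier: a URI for the DTD file -->
<!DOCTYPE rootElementName SYSTEM "rootElementName.dtd">
<rootElementName>Some text</rootElementName>
```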

3. External DTD (using PUBLIC identifier):
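
A minimal sketch of the PUBLIC form. Both quoted strings below are made-up examples, not identifiers of a real DTD:

```xml
<?xml version="1.0"?>
<!-- First string: formal public identifier; second string: system identifier (URI) -->
<!DOCTYPE rootElementName PUBLIC "-//EXAMPLE//DTD Sample 1.0//EN"
  "http://example.com/dtd/sample.dtd">
<rootElementName>Some text</rootElementName>
```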

Here, rootElementName is the name of the root element of your XML document. For an external DTD, the SYSTEM keyword is followed by a system identifier: a URI (often a relative file path) that tells the parser where to find the DTD. The PUBLIC keyword is followed by a formal public identifier, which names a well-known DTD that a parser may resolve through a catalog, and then by a system identifier that serves as the fallback location.

  • DTD Declarations: The main part of a DTD consists of declarations that define the rules for the XML document. These include declarations for elements, attributes, entities, and notations.

3. Element Declarations in DTD

Element declarations in a DTD specify the elements that can appear in the XML document and what their content can be. The syntax for declaring an element is:
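
In general form, an element declaration looks like the sketch below, where elementName and contentSpec are placeholders rather than literal text:

```xml
<!ELEMENT elementName contentSpec>
```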

Here, elementName is the name of the element being declared, and contentSpec defines what content is allowed within that element. There are several options for contentSpec:

  • EMPTY: Indicates that the element must not have any content (it’s an empty element).
  • ANY: Indicates that the element can contain any content, including text and/or any other elements. This is a very permissive declaration and reduces the benefits of using a DTD for strict validation.
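  • Mixed Content: Indicates that the element can contain text, optionally mixed with other elements. The declaration must start with #PCDATA (Parsed Character Data). For text only, write (#PCDATA). To allow child elements as well, list their names after #PCDATA separated by | and follow the whole group with *, as in (#PCDATA | em | strong)*, where em and strong are hypothetical child elements. Occurrence indicators cannot be attached to the individual names, so mixed content cannot constrain the order or number of the child elements.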
  • Element Content: Specifies that the element can only contain other elements in a specific sequence or choice. You use parentheses to group elements and commas to indicate a sequence (elements must appear in the specified order). You can also use the pipe symbol | to indicate a choice (one of the listed elements must appear). Occurrence indicators can be used after each element or group:
    • ?: Zero or one occurrence (optional).
    • *: Zero or more occurrences.
    • +: One or more occurrences.
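
A small book vocabulary illustrates these content models. The sketch below is one way to write it as an internal DTD; the element names follow the description in this section, while the sample text in the document instance is made up:

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- Sequence: one title, then one or more authors, then zero or more chapters -->
  <!ELEMENT book (title, author+, chapter*)>
  <!ELEMENT chapter (title, paragraph+)>
  <!ELEMENT paragraph (sentence+)>
  <!-- Leaf elements hold text only -->
  <!ELEMENT sentence (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
]>
<book>
  <title>Learning DTD</title>
  <author>A. Writer</author>
  <chapter>
    <title>Getting Started</title>
    <paragraph>
      <sentence>DTDs describe document structure.</sentence>
    </paragraph>
  </chapter>
</book>
```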

In this example:

  • A book element must have one title, one or more author elements, and zero or more chapter elements in that order.
  • A chapter element must have one title followed by one or more paragraph elements.
  • A paragraph element must have one or more sentence elements.
  • The sentence, author, and title elements each contain only parsed character data (#PCDATA).

4. Attribute Declarations in DTD

Attribute declarations in a DTD specify the attributes that an element can have and any constraints on their values. The syntax for declaring attributes is:
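
In general form, an attribute-list declaration looks like the sketch below. All four names are placeholders, and a single declaration may list several attributes for the same element:

```xml
<!ATTLIST elementName attributeName attributeType attributeDefault>
```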

Here, elementName is the name of the element to which the attribute applies, attributeName is the name of the attribute, attributeType specifies the type of the attribute’s value, and attributeDefault specifies the default value or whether the attribute is required.

  • Attribute Types: DTD supports several attribute types:
    • CDATA: The attribute value is character data (plain text).
    • ID: The attribute value must be a unique identifier within the document and must be a valid XML name (so it cannot start with a digit). Only one attribute of type ID can be defined per element type.
    • IDREF: The attribute value must match the value of an ID attribute on some element in the document. Used to create references between elements.
    • IDREFS: The attribute value is a space-separated list of IDREF values.
    • ENTITY: The attribute value must be the name of an unparsed entity declared in the DTD.
    • ENTITIES: The attribute value is a space-separated list of ENTITY names.
    • NMTOKEN: The attribute value is a name token: one or more characters that are valid in XML names, with no whitespace. Unlike a name, a name token may begin with a digit.
    • NMTOKENS: The attribute value is a space-separated list of NMTOKEN values.
    • Enumerated Type: The attribute value must be one of the values specified in a list (case-sensitive).
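
As an illustration of the ID and enumerated types, consider a product element. The attribute names and allowed values below follow the description in this section; the element's content model and the sample instance are assumptions added to make the document complete:

```xml
<?xml version="1.0"?>
<!DOCTYPE product [
  <!ELEMENT product (#PCDATA)>
  <!ATTLIST product
    id   ID                              #REQUIRED
    type (book | electronic | clothing)  #IMPLIED>
]>
<!-- "p1001" is a valid ID because it is an XML name (it starts with a letter) -->
<product id="p1001" type="book">XML in Practice</product>
```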

Here, the product element must have an id attribute of type ID, and it can optionally have a type attribute whose value must be one of “book”, “electronic”, or “clothing”.

  • Attribute Defaults: The attributeDefault can be one of the following:
    • #REQUIRED: The attribute must be present on every instance of the element.
    • #IMPLIED: The attribute is optional; no default value is provided.
    • “defaultValue”: Provides a default value for the attribute. If the attribute is not specified in the element, the default value is used.
    • #FIXED “fixedValue”: Specifies a fixed value for the attribute. If the attribute is provided in the element, it must have this fixed value; if it’s not provided, the fixed value is assumed.
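
A sketch of a default value next to an optional attribute, using an item element. The CDATA types, the content model, and the sample instance are assumptions chosen for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE item [
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item
    quantity CDATA "1"
    discount CDATA #IMPLIED>
]>
<!-- quantity is omitted here, so a parser that reads the DTD supplies quantity="1" -->
<item discount="10%">Notebook</item>
```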

Here, the item element has a quantity attribute that defaults to “1” if not specified, and an optional discount attribute.

5. Entities in DTD

DTDs also allow you to define entities, which are shortcuts or named pieces of content that can be used within the XML document. There are two main types of entities:

  • General Entities: Used within the document content to represent text or other markup. They are declared using <!ENTITY name "value">.
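
A minimal sketch of a general entity named trademark. The character reference &#8482; is the code point of the trademark sign; the product element and its text are made up for the example:

```xml
<?xml version="1.0"?>
<!DOCTYPE product [
  <!ELEMENT product (#PCDATA)>
  <!-- &#8482; is the character reference for the trademark sign -->
  <!ENTITY trademark "&#8482;">
]>
<product>Acme Widget&trademark;</product>
```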

You can then use &trademark; in your XML document, and the XML processor will replace it with the trademark symbol.

  • Parameter Entities: Used within the DTD itself to represent reusable parts of declarations. They are declared using <!ENTITY % name "value">.
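
A sketch of a parameter entity named basic.attrs that holds an id attribute definition. The #REQUIRED default, the content model, and the file name are assumptions. Note that the XML specification allows a parameter-entity reference inside another declaration only in the external subset, so this fragment is meant to live in a separate .dtd file rather than in an internal DTD:

```xml
<!-- product.dtd (hypothetical external DTD file) -->
<!ENTITY % basic.attrs "id ID #REQUIRED">

<!ELEMENT product (#PCDATA)>
<!-- The reference expands to: id ID #REQUIRED -->
<!ATTLIST product %basic.attrs;>
```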

Here, %basic.attrs; is a parameter entity that is used to include the id attribute declaration in the product element’s attribute list.

6. Notations in DTD

Notations in DTD are used to declare the format of unparsed entities (e.g., images, audio). They provide information to applications about how to handle these external data types.
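
A sketch of a notation and an unparsed entity that uses it. The system identifier "image/gif", the file name logo.gif, and the company element with its image attribute are all assumptions made for illustration. An unparsed entity is referred to by name in an attribute of type ENTITY, not with the &name; syntax:

```xml
<?xml version="1.0"?>
<!DOCTYPE company [
  <!NOTATION GIF SYSTEM "image/gif">
  <!-- NDATA marks the entity as unparsed and ties it to the GIF notation -->
  <!ENTITY logo SYSTEM "logo.gif" NDATA GIF>

  <!ELEMENT company EMPTY>
  <!ATTLIST company image ENTITY #IMPLIED>
]>
<company image="logo"/>
```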

Here, a notation for GIF images is declared, and then an unparsed entity logo of type GIF is declared, referencing an external file.

7. Limitations of DTD Compared to XSD

While DTD was a pioneering technology for XML validation, it has several limitations compared to XML Schema:

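  • Limited Data Types: DTD types apply only to attributes, and element text is always plain #PCDATA. There is no way to require that a value be, for example, an integer, a date, or a string matching a pattern.
  • No Support for Namespaces: DTD does not have native support for XML namespaces. It treats a prefixed name such as ns:item as a single literal name, which can be problematic when dealing with documents that combine different vocabularies.
  • Less Expressive: DTD offers less control over the structure and content of XML documents compared to XSD. For instance, it cannot state numeric occurrence bounds such as "between 2 and 5", and it has no mechanism for deriving one type from another.
  • Syntax: DTD uses its own syntax rather than XML, so a DTD cannot be read or generated with ordinary XML tools, whereas an XSD is itself an XML document.

8. Conclusion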

In this ultimate guide, we have explored the XML Document Type Definition (DTD) in detail. We have examined its syntax for declaring elements, attributes, entities, and notations, and understood how it provides a way to define the structure and rules for XML documents. While DTD has been largely superseded by the more powerful XML Schema, it remains an important part of XML history and is still encountered in legacy systems and older XML formats. Understanding DTD is crucial for working with those systems and appreciating the evolution of XML validation technologies.
