The Ultimate Guide to XML DTD: Understanding the Legacy Approach to XML Validation
1. Introduction
In our ongoing exploration of ensuring the structure and validity of XML documents, we now turn our attention to an older, yet still relevant technology: XML Document Type Definition (DTD). Before the advent of XML Schema (XSD), DTD was the primary mechanism used to define the structure, elements, attributes, and overall rules that an XML document should adhere to. While XSD has become the more powerful and feature-rich standard for schema definition, understanding DTD is still valuable, especially when dealing with legacy systems or older XML formats that continue to rely on it.
DTD provides a way to formally describe the grammar of an XML document. This includes specifying which elements can appear in the document, the order in which they can appear, whether they can contain other elements or just text, and what attributes each element can have. By validating an XML document against a DTD, you can ensure that it conforms to a predefined structure, which is crucial for data consistency and reliable processing.
This ultimate guide will provide you with a comprehensive understanding of XML DTD. We will explore its syntax, the different types of declarations it supports (including elements, attributes, and entities), and how to associate a DTD with an XML document. While DTD has limitations compared to XSD, it is still important to grasp its concepts and capabilities, especially when you encounter it in real-world scenarios. By the end of this guide, you will have a solid understanding of how DTD works and its role in the history and practice of XML validation.
2. Fundamentals of XML DTD
To begin our exploration of XML DTD, let’s understand its basic structure and key components.
- DTD Structure: A DTD can be defined in two ways:
  - Internal DTD: The DTD declarations are included directly within the XML document itself, within a <!DOCTYPE> declaration.
  - External DTD: The DTD declarations are stored in a separate file (with a .dtd extension), and the XML document refers to this file using the <!DOCTYPE> declaration.
- The <!DOCTYPE> Declaration: Every XML document that uses a DTD must have a <!DOCTYPE> declaration after the XML declaration (if present) and before the root element. This declaration names the root element of the XML document and either contains the DTD declarations or points to the location of the DTD. It takes one of three forms:
  1. Internal DTD (the declarations go between the square brackets):
  <!DOCTYPE rootElementName [ declarations ]>
  2. External DTD (using SYSTEM identifier):
  <!DOCTYPE rootElementName SYSTEM "dtdFileName.dtd">
  3. External DTD (using PUBLIC identifier):
  <!DOCTYPE rootElementName PUBLIC "-//Organization//DTD Document Type//EN" "dtdURL.dtd">
  Here, rootElementName is the name of the root element of your XML document. For an external DTD, the SYSTEM identifier gives the location of the DTD as a URI, which can be a file on your local system or a URL on the network. The PUBLIC identifier is used for a publicly available DTD: it gives a formal public identifier that a processor may resolve on its own (for example through a catalog), followed by a system identifier that serves as the fallback location.
- DTD Declarations: The main part of a DTD consists of declarations that define the rules for the XML document. These include declarations for elements, attributes, entities, and notations.
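To make the internal form concrete, here is a minimal sketch of a complete document that carries its own DTD. The note, to, and body element names are invented for illustration and do not come from any particular standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Internal DTD: the declarations sit inside [ ... ] in the DOCTYPE. -->
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Reader</to>
  <body>This document carries its own DTD.</body>
</note>
```

For the external form, the three ELEMENT declarations would move into a separate file (say, a hypothetical note.dtd), and the DOCTYPE would shrink to <!DOCTYPE note SYSTEM "note.dtd">.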
3. Element Declarations in DTD
Element declarations in a DTD specify the elements that can appear in the XML document and what their content can be. The syntax for declaring an element is:
<!ELEMENT elementName contentSpec>
Here, elementName is the name of the element being declared, and contentSpec defines what content is allowed within that element. There are several options for contentSpec:
- EMPTY: Indicates that the element must not have any content (it’s an empty element).
<!ELEMENT br EMPTY>
- ANY: Indicates that the element can contain any content, including text and/or any other elements. This is a very permissive declaration and reduces the benefits of using a DTD for strict validation.
<!ELEMENT notes ANY>
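- Mixed Content: Indicates that the element can contain text mixed with other elements. The declaration must start with #PCDATA (Parsed Character Data). An element that holds only text is declared as (#PCDATA). To allow child elements among the text, list their names after #PCDATA separated by |, and follow the closing parenthesis with *. The child names cannot carry occurrence indicators of their own, and + or ? cannot take the place of the *, so a mixed-content declaration cannot constrain the order or number of the child elements.
<!ELEMENT paragraph (#PCDATA|bold|italic)*>
<!ELEMENT section (#PCDATA|paragraph|subsection)*>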
- Element Content: Specifies that the element can only contain other elements, in a specific sequence or choice. Parentheses group elements, commas indicate a sequence (elements must appear in the specified order), and the pipe symbol | indicates a choice (exactly one of the listed elements must appear). An occurrence indicator can follow any element name or group; with no indicator, the element or group must appear exactly once:
  - ?: Zero or one occurrence (optional).
  - *: Zero or more occurrences.
  - +: One or more occurrences.
<!ELEMENT book (title, author+, chapter*)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT paragraph (sentence)+>
<!ELEMENT sentence (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT title (#PCDATA)>
In this example:
- A book element must have one title, one or more author elements, and zero or more chapter elements, in that order.
- A chapter element must have one title followed by one or more paragraph elements.
- A paragraph element must have one or more sentence elements.
- The sentence, author, and title elements each contain parsed character data (#PCDATA).
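As an illustration of these rules, the following sketch shows one document that should validate against the declarations above. The titles, names, and sentences are invented sample data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author+, chapter*)>
  <!ELEMENT chapter (title, paragraph+)>
  <!ELEMENT paragraph (sentence)+>
  <!ELEMENT sentence (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
]>
<book>
  <title>Learning DTD</title>
  <author>A. Writer</author>
  <author>B. Editor</author>
  <chapter>
    <title>Getting Started</title>
    <paragraph>
      <sentence>Every chapter needs a title.</sentence>
      <sentence>It also needs at least one paragraph.</sentence>
    </paragraph>
  </chapter>
</book>
```

Moving an author element ahead of title, or removing the paragraph from the chapter, would leave the document well-formed but no longer valid. Only a validating parser reports that kind of error; a parser that checks well-formedness alone will not notice.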
4. Attribute Declarations in DTD
Attribute declarations in a DTD specify the attributes that an element can have and any constraints on their values. The syntax for declaring attributes is:
<!ATTLIST elementName attributeName attributeType attributeDefault>
Here, elementName is the name of the element to which the attribute applies, attributeName is the name of the attribute, attributeType specifies the type of the attribute's value, and attributeDefault specifies the default value or whether the attribute is required. A single <!ATTLIST> declaration can list several attributes for the same element, one after another.
- Attribute Types: DTD supports several attribute types:
  - CDATA: The attribute value is character data (plain text).
  - ID: The attribute value must be a valid XML name that is unique among all ID values in the document. Only one attribute of type ID can be defined per element type, and it must be declared #REQUIRED or #IMPLIED.
  - IDREF: The attribute value must match the value of an ID attribute on some element in the document. Used to create references between elements.
  - IDREFS: The attribute value is a space-separated list of IDREF values.
  - ENTITY: The attribute value must be the name of an unparsed entity declared in the DTD.
  - ENTITIES: The attribute value is a space-separated list of ENTITY names.
  - NMTOKEN: The attribute value is a name token: one or more name characters (such as letters, digits, hyphens, underscores, and periods) with no spaces. Unlike an XML name, it may start with a digit.
  - NMTOKENS: The attribute value is a space-separated list of NMTOKEN values.
  - NOTATION: The attribute value must be the name of one of the listed notations declared in the DTD.
  - Enumerated Type: The attribute value must be one of the values specified in a list (case-sensitive). Each listed value must be a name token.
<!ATTLIST product
  id ID #REQUIRED
  type (book|electronic|clothing) #IMPLIED>
Here, the product element must have an id attribute of type ID, and it can optionally have a type attribute whose value must be one of "book", "electronic", or "clothing".
- Attribute Defaults: The attributeDefault can be one of the following:
  - #REQUIRED: The attribute must be present on every instance of the element.
  - #IMPLIED: The attribute is optional; no default value is provided.
  - "defaultValue": Provides a default value for the attribute. If the attribute is not specified in the element, the default value is used.
  - #FIXED "fixedValue": Specifies a fixed value for the attribute. If the attribute is provided in the element, it must have this fixed value; if it's not provided, the fixed value is assumed.
<!ATTLIST item
  quantity CDATA "1"
  discount CDATA #IMPLIED>
Here, the item element has a quantity attribute that defaults to "1" if not specified, and an optional discount attribute.
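The attribute types and defaults can be seen working together in a small document. This is only a sketch: the order wrapper element, the ref attribute, and the currency attribute are assumptions added for illustration, while the product and item declarations follow the examples above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE order [
  <!ELEMENT order (product+, item+)>
  <!ELEMENT product (#PCDATA)>
  <!ELEMENT item EMPTY>
  <!ATTLIST product
    id   ID                         #REQUIRED
    type (book|electronic|clothing) #IMPLIED>
  <!-- ref and currency are hypothetical additions for this sketch -->
  <!ATTLIST item
    ref      IDREF #REQUIRED
    quantity CDATA "1"
    discount CDATA #IMPLIED
    currency CDATA #FIXED "USD">
]>
<order>
  <product id="p1" type="book">DTD Handbook</product>
  <product id="p2">Plain T-shirt</product>
  <!-- quantity is omitted here, so the processor supplies the default "1" -->
  <item ref="p1"/>
  <!-- currency may be written out, but only with its fixed value -->
  <item ref="p2" quantity="3" discount="10%" currency="USD"/>
</order>
```

A validating parser should reject this document if two products shared the same id, if an item's ref named an id that does not exist, or if type were set to a value outside the enumerated list. Note that ID values must be XML names, so id="1" would not be accepted.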
5. Entities in DTD
DTDs also allow you to define entities, which are shortcuts or named pieces of content that can be used within the XML document. There are two main types of entities:
- General Entities: Used within the document content to represent text or other markup. They are declared using <!ENTITY name "value">.
<!ENTITY trademark "™">
You can then use &trademark; in your XML document, and the XML processor will replace it with the trademark symbol.
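- Parameter Entities: Used within the DTD itself to represent reusable parts of declarations. They are declared using <!ENTITY % name "value">.
<!ENTITY % basic.attrs "id ID #REQUIRED">
<!ELEMENT product (name, price)>
<!ATTLIST product
  %basic.attrs;
  category CDATA #IMPLIED>
Here, %basic.attrs; is a parameter entity that is used to include the id attribute declaration in the product element's attribute list. Note that a parameter entity reference placed inside another declaration, as in this <!ATTLIST>, is only permitted in an external DTD. In an internal DTD, parameter entity references may appear only between declarations.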
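For general entities, a short self-contained sketch may help. It declares the trademark entity from above in an internal DTD and references it from the document content. The company entity and the sample product data are assumptions added for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE product [
  <!ENTITY trademark "™">
  <!-- company is a hypothetical second entity for this sketch -->
  <!ENTITY company "Example Corp">
  <!ELEMENT product (name, price)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
]>
<product>
  <name>Widget&trademark; by &company;</name>
  <price>9.99</price>
</product>
```

After parsing, an application reading the name element should see the text "Widget™ by Example Corp". Referencing an entity that was never declared, such as &copyright; in this document, is an error.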
6. Notations in DTD
Notations in DTD are used to declare the format of unparsed entities (e.g., images, audio). They provide information to applications about how to handle these external data types.
<!NOTATION GIF PUBLIC "-//CompuServe//DTD GIF 89a//EN">
<!ENTITY logo SYSTEM "images/logo.gif" NDATA GIF>
Here, a notation for GIF images is declared, and then an unparsed entity logo of type GIF is declared, referencing an external file.
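An unparsed entity is not referenced with &logo; in the content. Instead, its name is given as the value of an attribute of type ENTITY. The following sketch shows one way this could look; the page and image elements and the src attribute are hypothetical names chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE page [
  <!NOTATION GIF PUBLIC "-//CompuServe//DTD GIF 89a//EN">
  <!ENTITY logo SYSTEM "images/logo.gif" NDATA GIF>
  <!ELEMENT page (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image src ENTITY #REQUIRED>
]>
<page>
  <!-- src names the unparsed entity; writing &logo; here would be an error -->
  <image src="logo"/>
</page>
```

The XML processor does not load or interpret the GIF file. It only passes the entity's location and notation to the application, which decides how to handle the data.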
7. Limitations of DTD Compared to XSD
While DTD was a pioneering technology for XML validation, it has several limitations compared to XML Schema:
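- Limited Data Types: DTD has a very limited set of built-in data types. Element text is always just #PCDATA, and the attribute types are all string- or token-based, so a DTD cannot require that a value be a number, a date, or text matching a pattern.
- No Support for Namespaces: DTD does not have native support for XML namespaces. A prefixed name is treated as a single literal name, which can be problematic when dealing with documents that combine different vocabularies.
- Less Expressive: DTD offers less control over the structure and content of XML documents compared to XSD. For example, occurrence is limited to ?, *, and +, so a rule such as "between two and five item elements" cannot be stated without spelling out every case.
- Syntax: DTD is not itself written in XML, so it cannot be read or edited with ordinary XML tools, and it can be less intuitive for those familiar with XML.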
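To illustrate the first and third points, here is a brief XSD sketch of two constraints that a DTD cannot express directly. The price, items, and item element names are invented for the comparison.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DTD counterpart, <!ELEMENT price (#PCDATA)>, accepts any text. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- xs:decimal rejects <price>cheap</price> and accepts <price>9.99</price> -->
  <xs:element name="price" type="xs:decimal"/>
  <!-- minOccurs/maxOccurs allow exact ranges, not just ?, * and + -->
  <xs:element name="items">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" minOccurs="2" maxOccurs="5"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The schema is itself an XML document, which is the practical meaning of the syntax point: it can be parsed, transformed, and checked with the same tools as the documents it describes.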
8. Conclusion
In this ultimate guide, we have explored the XML Document Type Definition (DTD) in detail. We have examined its syntax for declaring elements, attributes, entities, and notations, and understood how it provides a way to define the structure and rules for XML documents. While DTD has been largely superseded by the more powerful XML Schema, it remains an important part of XML history and is still encountered in various contexts. Understanding DTD is crucial for working with these legacy systems and appreciating the evolution of XML validation technologies.