DTD Design Tips

July 06, 2003  

These tips come from two years of writing DTDs and XML processing code.  They are guidelines, not mandates.

See also:  DTD Overview 


Tip 1:


If two or more different types of elements can appear at the same level in a tree, create 'container' elements

Advantages:
  • Makes the document easier to read
  • Makes the document easier to process using the DOM: children of the same type all appear within one sub-tree, so it's easy to tell how many of them there are
Without this rule, it is common to see documents like:
<!ELEMENT Foo (Bar*, Baf*)>

<Foo>

    <Bar/>
    <Bar/>
    <Baf/>
    <Baf/>
    <Baf/>
</Foo>

 

By following the rule, the document instead looks like:
<!ELEMENT Foo (Bars?, Bafs?)>
<!ELEMENT Bars (Bar*)>
<!ELEMENT Bafs (Baf*)>

<Foo>

    <Bars>
        <Bar/>
        <Bar/>
    </Bars>
    <Bafs>
        <Baf/>
        <Baf/>
        <Baf/>
    </Bafs>
</Foo>
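
As a rough illustration of the DOM advantage, here is a minimal Python sketch using the standard library's xml.dom.minidom.  The sample strings mirror the two <Foo> layouts above; the helper names (element_children, count_flat, count_nested) are hypothetical and exist only for this example.

```python
from xml.dom.minidom import parseString

# Sample documents mirroring the two Foo layouts shown above.
FLAT = "<Foo><Bar/><Bar/><Baf/><Baf/><Baf/></Foo>"
NESTED = "<Foo><Bars><Bar/><Bar/></Bars><Bafs><Baf/><Baf/><Baf/></Bafs></Foo>"


def element_children(node):
    """Return only the element children of a DOM node (skips text nodes)."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def count_flat(xml_text, tag):
    """Flat layout: every child of Foo must be inspected and filtered by name."""
    root = parseString(xml_text).documentElement
    return sum(1 for c in element_children(root) if c.tagName == tag)


def count_nested(xml_text, container_tag):
    """Container layout: locate the one container, then count its children."""
    root = parseString(xml_text).documentElement
    for child in element_children(root):
        if child.tagName == container_tag:
            return len(element_children(child))
    return 0  # the container is optional in the DTD (Bars?, Bafs?)
```

With the container layout, each sub-tree can be handed to code that only knows about one kind of child, instead of every consumer filtering the full child list.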


Tip 2:


Use attributes when you can, and elements when you have to

Advantages:
  • Makes the document easier to process using SAX or the DOM
  • Makes documents smaller, so they can be processed more efficiently
This rule is fairly self-explanatory, but not everyone is convinced that it is the best approach.  Arguments against attributes usually assert that elements are easier for humans to read.  However, attributes reduce document size and processing work, and that advantage cannot be ignored if scalability is a priority.

You can't use attributes when an element needs "many" of something, because an attribute name can appear only once on an element.  For example, a car has many tires.  Use elements instead in this case.
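
The car example can be sketched in Python with xml.dom.minidom.  The document below is hypothetical (the names car, make, doors, tire and position are assumptions made for illustration): single-valued facts are attributes, while the repeated tires are elements.

```python
from xml.dom.minidom import parseString

# Hypothetical document: one-per-car facts as attributes, repeated tires as elements.
CAR = (
    '<car make="Acme" doors="4">'
    '<tire position="front-left"/>'
    '<tire position="front-right"/>'
    '<tire position="rear-left"/>'
    '<tire position="rear-right"/>'
    "</car>"
)

# The same "make" fact expressed as a child element instead of an attribute.
CAR_ELEMENT_STYLE = "<car><make>Acme</make></car>"


def read_make_from_attribute(xml_text):
    """Attribute style: a single lookup on the element."""
    return parseString(xml_text).documentElement.getAttribute("make")


def read_make_from_element(xml_text):
    """Element style: find the child element, then read its text node."""
    car = parseString(xml_text).documentElement
    nodes = car.getElementsByTagName("make")
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data


def tire_positions(xml_text):
    """Repeated items must be elements; each one can still carry attributes."""
    car = parseString(xml_text).documentElement
    return [t.getAttribute("position") for t in car.getElementsByTagName("tire")]
```

Both readers return the same value, but the attribute version needs no child traversal and no handling of a missing text node.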

Another argument against attributes used to be based on tools:  "Tool X" cannot handle attributes.  However, XML tools have matured to the point that such arguments are, for the most part, relics of the past.


Tip 3:


Reuse elements

Advantages:
  • Makes reuse possible without introducing awkward tags
Large XML projects invariably need to embed one document within another.  The best way to manage reuse with XML is to create separate DTDs, one for each "major" element in the system.

If you design for reuse up front, it will save you maintenance effort as the project gets larger.  Today, there are almost no tools to help with this job.  Reuse can be implemented in raw XML by embedding DTDs inside other DTDs using external parameter entities.  See the DTD page for more information.

A disadvantage of referring to external DTDs is that each DTD requires a round-trip during the validation step.  Each round-trip can carry a significant performance penalty.


Tip 4:


Avoid 'mixed' content

If you want mixed content (elements and text), the DTD must be written as:

<!ELEMENT Foo (#PCDATA|Bar)*>
<!ELEMENT Bar EMPTY>


This content model is very loose.  It says that <Foo> can contain any number of <Bar> elements and any number of text blocks, in any order:

<Foo>
    hi
    <Bar/>
    bye
    <Bar/>
    <Bar/>
    goodnight
</Foo>

Mixed content is useful for marking up free-form text.  However, unless this is exactly what you want, you should not define your document so liberally.  Instead, create a new element that is defined as #PCDATA:

<!ELEMENT Foo (Data|Bar)>
<!ELEMENT Data (#PCDATA)>
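
To show what the difference means for processing code, here is a small Python sketch using xml.dom.minidom.  The sample strings are condensed versions of the documents above (whitespace removed for clarity), and describe_children is a hypothetical helper for this example only.

```python
from xml.dom.minidom import parseString

# Mixed content: text and Bar elements interleaved directly under Foo.
MIXED = "<Foo>hi<Bar/>bye<Bar/><Bar/>goodnight</Foo>"

# Wrapped content: the text lives inside a dedicated Data element.
WRAPPED = "<Foo><Data>hi</Data></Foo>"


def describe_children(xml_text):
    """List the direct children of the root as (kind, value) pairs."""
    root = parseString(xml_text).documentElement
    described = []
    for child in root.childNodes:
        if child.nodeType == child.TEXT_NODE:
            described.append(("text", child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            described.append(("element", child.tagName))
    return described
```

For MIXED, the caller has to cope with text and element nodes arriving in any order.  For WRAPPED, the root has exactly one element child and the text is always found in a known place.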


Tip 5:


Plan for DTD maintenance

DTDs can be changed once they have been published, as long as certain guidelines are followed:
  1. Elements can not be removed
  2. Attributes can not be removed
  3. Attributes can not be changed from "implied" to "required"
  4. Default values should not be modified (generally)
  5. A "value" can not be removed from an attribute "value list"
  6. The required structure of a document can not be changed.  For example, ? can not become +, and a new element can not be required to appear inside an existing element.  Only ? and * can be used when adding to the document structure.
  7. #PCDATA can't be removed from an element

If these guidelines can't be followed, a new type of document must be created.

Another way to manage change is to plan for it.  For example, a top-level element could have a "version" attribute.  A document that conforms to the first version of the DTD would have version="1" and a document that conforms to the second version of the DTD would have version="2".  The version number would only change if the DTD were modified in a way that violates the above guidelines.  However, without diligent coding, this method will fail.  Any code that "forgets" to check the version will break when a new version is introduced to the system.
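
One way to avoid "forgetting" the check is to route all document loading through a single function that rejects unknown versions.  The following is only a sketch in Python with xml.dom.minidom; the names SUPPORTED_VERSIONS, UnsupportedVersionError and load_document are hypothetical, as is the choice to treat a missing version attribute as unsupported.

```python
from xml.dom.minidom import parseString

# Assumption for this sketch: the code base understands DTD versions 1 and 2.
SUPPORTED_VERSIONS = {"1", "2"}


class UnsupportedVersionError(ValueError):
    """Raised when a document declares a version this code does not handle."""


def load_document(xml_text):
    """Parse a document and refuse to continue if its version is unknown."""
    root = parseString(xml_text).documentElement
    version = root.getAttribute("version")  # empty string when absent
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedVersionError(
            "unsupported document version: %r" % version
        )
    return version, root
```

Failing loudly at load time means a version="3" document is rejected up front instead of being half-processed by code written for an older structure.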


Tip 6:


Use entities to encapsulate repetition

As an example, a traveler uses a vehicle to get to his destination:

<transitMode><car/></transitMode>

<transitMode><boat/></transitMode>

The DTD fragment for this is: 

<!ENTITY % VEHICLE "car | boat | train">
<!ELEMENT transitMode (%VEHICLE;)>

The VEHICLE entity is handy because it can be reused.  It also makes the DTD easier to maintain.  For example, if you add another type of vehicle, only the entity needs to be changed.  The rest of the DTD is unaffected.

By the way, XML Schemas address this need via two mechanisms:  substitution groups and inheritance (and it isn't easy to decide which one to use).