Document Design And UML

June 15, 2000


Document structure, or schema, is something that most XML developers take for granted.  With an existing schema, a developer can roll up his sleeves and start creating and processing documents.  But inevitably, there comes a time when a new type of document needs to be created from scratch.  Where to begin?

XML literature is still fairly silent when it comes to good document design practices.  Most of what exists comes from the SGML community.  I don't care for most of the SGML material because it's more text-focused (e.g., a memo) than data-focused (e.g., database rows in third normal form).  So, like everyone else, my methodology evolved as I designed documents.

First of all, I don't use element text unless the data is truly free-form text that needs HTML or other mark-up.  Once that's out of the way, the next task involves identifying the elements and their attributes.

To distinguish between elements and attributes, I use object-oriented methodologies such as UML.  An element maps to a class and an attribute maps to a data member.  So the element vs. attribute problem is really a matter of identifying the entities in the domain.  Once the entities have been identified, the remaining "stuff" is just data that can be stored in attributes.
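The class-to-element mapping can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the Customer class, its fields, and the to_element helper are hypothetical names that do not come from any particular schema or library.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class Customer:  # hypothetical domain entity, used only for illustration
    id: str
    city: str


def to_element(obj):
    """Build an element named after the class, with one attribute per data member."""
    elem = ET.Element(type(obj).__name__)
    for f in fields(obj):
        elem.set(f.name, str(getattr(obj, f.name)))
    return elem


customer_elem = to_element(Customer(id="c42", city="Boston"))
# Serializes to a single empty <Customer> element carrying id and city attributes.
print(ET.tostring(customer_elem, encoding="unicode"))
```

The entity becomes the element, and everything left over becomes an attribute on it.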

When people hear the words "object-oriented," they almost instantly think of inheritance.  Inheritance can be represented in XML either through "true inheritance" or through containment.  "True inheritance" means that an element has the same attributes and child elements as its "parent."  "Containment" means that the "parent" element is embedded inside the "child" element, or vice versa.  I prefer true inheritance.
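To make the two options concrete, here is a small sketch using Python's standard ElementTree module.  The Vehicle/Car hierarchy and the attribute names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: Car "inherits" from Vehicle (make, year) and adds bodyStyle.
TRUE_INHERITANCE = '<Car make="Saab" year="1999" bodyStyle="hatchback"/>'
CONTAINMENT = '<Car bodyStyle="hatchback"><Vehicle make="Saab" year="1999"/></Car>'


def make_of(car_xml):
    """Return the inherited 'make' value from either representation."""
    car = ET.fromstring(car_xml)
    # True inheritance: the parent's attributes sit directly on the child element.
    if "make" in car.attrib:
        return car.get("make")
    # Containment: the embedded parent element has to be located first.
    return car.find("Vehicle").get("make")


print(make_of(TRUE_INHERITANCE))
print(make_of(CONTAINMENT))
```

With true inheritance the consumer reads inherited data the same way it reads the element's own data; with containment it needs to know where the parent is nested.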

There is a fine line between "data" and "objects."  Complex data is best represented as an object that has multiple data values.  For example, a name seems like a simple value, but it's best represented as an element with multiple attributes:

<Name last="Smith" first="Joe" title="Mr."/>

Relationships such as containment and aggregation map well to XML.  Placing an element inside another element represents a "has-a" relationship.

Attributes should contain at most one value.  Multi-valued data is best represented as a set of elements:

<Car>
  <Door type="hatchback"/>
  <Door type="left_front"/>
</Car>

Instead of:

<Car doors="left_front, hatchback"/>

Storing delimited fields in attributes is not a good idea because a comma-delimited value reaches the application as a single string, and splitting it into individual values requires script code at the very least.  Multi-valued attributes are a pain to process with XSLT stylesheets.  Use multiple elements instead.
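The difference shows up as soon as the two forms are read back.  The following sketch uses Python's standard ElementTree module on the two Car examples above; the variable names are mine.

```python
import xml.etree.ElementTree as ET

ELEMENT_FORM = '<Car><Door type="hatchback"/><Door type="left_front"/></Car>'
ATTRIBUTE_FORM = '<Car doors="left_front, hatchback"/>'

# One element per value: the parser hands back each value directly.
car = ET.fromstring(ELEMENT_FORM)
doors_from_elements = [door.get("type") for door in car.findall("Door")]

# Delimited attribute: the parser returns one string, so the application
# must know the delimiter and trim the whitespace itself.
raw = ET.fromstring(ATTRIBUTE_FORM).get("doors")
doors_from_attribute = [part.strip() for part in raw.split(",")]

print(doors_from_elements)
print(doors_from_attribute)
```

The splitting logic in the second case is a convention that lives outside the document, so every consumer of the document has to reimplement it.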

Object-oriented methodologies represent decades of research on effective design.  It would be foolish to ignore OO concepts just because XML is a "new thing."  In my own experience, using an OO approach to design XML schemas results in usable documents that evolve gracefully.  Best of all, it works.  It's practical.

The final design decision is the most vexing.  When is a schema too rigid?  On the other hand, when is it too open-ended?  Database schema designers face the same problem.  I'll discuss this topic next time.