The Attribute/Text Conundrum
Attributes versus element text
May 25 2000
"Document-Centric" vs. "Data-Centric"You may have heard the phrases "document centric" and "data centric." What do these terms mean? HTML is "document centric" whereas XML is (at least it should be) data centric. How do document-centric and data-centric documents differ? XML documents can be either document-centric or data-centric. Document-centric design involves a liberal use of free-form text that is "marked up" with elements. Document-centric: Data-centric documents are more structured: Document-centric documents are easy to render on some sort of output device. For example, it's quite easy to format the first memo into HTML. That's because the first memo looks like HTML. On the other hand, data-centric documents are typically easier to process with computer programs because the data is better organized. Are document-centric better than data-centric documents? Sometimes there's no obvious answer. But since XML was created to represent data, not free-form text such as memos, XML documents should typically be data-centric. Attributes vs. Element TextNow that you've decided to write data-centric XML documents (at least I hope you have), you're faced with yet another decision: is the data stored in attributes or element text? The only substantial difference between attributes and element text is that the standard DOM (and perhaps SAX) APIs optionally strip whitespace from element text. If you've edited HTML documents by hand, you know that putting a carriage return in the middle of text doesn't accomplish anything -- you have to use <BR> or <P> instead. Any sequence of spaces, tabs, carriage returns is converted into a single space. Sometimes whitespace stripping is desirable and sometimes it isn't. It depends entirely on the nature of the data and what you intend to do with it. I have come to the conclusion that element text is usually unnecessary. If you need it (you usually don't), you can code whitespace stripping on your own because that operation is trivial. 
Element text is just an attribute with a default name -- no name. Data without a name is undesirable. That's precisely why HTML is "bad" -- it's not specific. The text for an element could just as easily be stored in an attribute with a generic name, such as "value." An attribute can hold exactly the same information that a text node can. The main difference between text and attributes is that only text can contain

An argument made for text is that the W3C XML Schema spec is less ambiguous about elements than about attributes. I find that hard to believe given the scoping rules of W3C schemas. The scoping rules apply equally to attributes and elements. Namespace references can resolve any other ambiguities. Text is ambiguous by its very nature.

is just a less precise way of writing

If you get a document like

APIs and Document Footprint

From a programming perspective, text is a pain. With the DOM, it's substantially easier to process:
Compared to:
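As a minimal sketch of this comparison, assume a hypothetical pair of documents (stand-ins, not necessarily the ones the article had in mind): one stores two values in attributes, the other stores the same values as element text. The helper names count_nodes and count_document_nodes are also hypothetical. Python's xml.dom.minidom is used because it exposes the W3C DOM node model.

```python
from xml.dom import minidom

# Hypothetical example documents carrying the same two values.
ATTRIBUTE_DOC = '<person><name value="Joe"/><age value="30"/></person>'
TEXT_DOC = '<person><name>Joe</name><age>30</age></person>'


def count_nodes(node) -> int:
    """Count this node plus every descendant reachable through childNodes.

    Attributes are not child nodes in the DOM, so they add nothing here.
    """
    return 1 + sum(count_nodes(child) for child in node.childNodes)


def count_document_nodes(xml: str) -> int:
    """Parse the XML string and count nodes starting at the root element."""
    return count_nodes(minidom.parseString(xml).documentElement)


print(count_document_nodes(ATTRIBUTE_DOC))  # 3 element nodes
print(count_document_nodes(TEXT_DOC))       # 3 element nodes + 2 text nodes = 5

# Reading a value: one call for an attribute, an extra hop for text.
root = minidom.parseString(ATTRIBUTE_DOC).documentElement
print(root.firstChild.getAttribute("value"))   # Joe
root = minidom.parseString(TEXT_DOC).documentElement
print(root.firstChild.firstChild.data)         # Joe
```

Under these assumptions the attribute form yields three nodes and the text form five, because each text value becomes its own node under its element.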
In the first document, only three nodes need to be visited. The second document requires five nodes to be visited. Extra node traversals hurt processing performance, and more memory is needed to represent the additional nodes. The document is bigger too, and document size affects parsing speed more than anything else.

Attributes are also more convenient than text because the parser ensures there is only one value per attribute -- if text is used, only validation can enforce that. Attributes can also have a "list of values" and even a default value. Tools such as XML Spy use attribute cardinality. For example, when editing an attribute, XML Spy displays a combo box containing the possible values.

Attributes are also easier to create via the DOM. In order to add text to a document, a text node must be created and appended to an element.

Mixed Content

A node that has text content can be extended only by giving it mixed content, which in a DTD means either the ANY construct or a (#PCDATA | ...)* content model. Let's say you have an element like:

<Name>Joe</Name>

and you find that you need to add another element, Surname, to Name:

<Name>Joe<Surname>Mr</Surname></Name>

<Name> now has "mixed" content, meaning it contains text and elements. The simplest DTD declaration for the Name element is <!ELEMENT Name ANY>. The stricter alternative, <!ELEMENT Name (#PCDATA | Surname)*>, still cannot constrain the order or number of the children. ANY is highly undesirable. Good DTDs are very specific when it comes to structure: they specify, without any ambiguity, which elements can appear inside other elements. DTDs that use ANY aren't specific. Ambiguous DTDs force the burden of semantic validation upon the application programmer.

A general guideline to follow is that text should be present on an element only when you're absolutely sure that the element won't have any child nodes.

Element Name Uniqueness

The development community has gained a substantial amount of experience with HTML and XML since the initial W3C XML 1.0 recommendation. Quite a lot of SGML and XML developers agree that element names should be unique within a schema. This idea has been propagated, for example, to the XHTML spec.
In XHTML, element names are unique. A <P> element in a <TD> element is the same as a <P> element in a <FORM> element.

Representing data with element text causes immediate namespace issues. For example, two "entities" in your data model may have the same "property" called "ID." However, the ID for one entity may be very different from the ID for another; for example, they may be represented by different datatypes. In the DTD realm, all element names must be unique. Period. In the XML Schema realm, element names can be "localized" to a particular parent element. Thus XML itself is wishy-washy regarding whether "overloading" element names is an acceptable practice. It depends on which validation strategy you choose. Reusing element names also confuses the hell out of XML editors.

Thus, if you choose to put data in element text, it's a good idea to make all element names unique. In my opinion this is cumbersome and just plain wrong. Logically distinct namespaces should not interfere with each other. In the object-oriented world, classes provide a unique namespace for their property names. This obvious convenience is enjoyed by millions of programmers. Users of XML deserve the same level of usability.

Multi-Valued Data

Arrays and such are particularly troublesome. How do you represent multi-valued data using attributes? One approach is to put delimited text into an attribute value. This is not without precedent. DTD-supported "IDREFS" attributes hold a space-delimited list of names. I was surprised to see the XML Schema draft support space-delimited strings. I thought that XML had finally done away with delimited text. After all, processing delimited text requires yet another parser. Delimited text is tricky -- what if a legitimate value contains the delimiter? It's no picnic. Therefore, the best way to represent multi-valued data is via sub-elements.
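The delimiter problem can be sketched in a few lines of Python. The element and attribute names (order, item, items, name) and both helper functions are hypothetical, chosen only to show a value that legitimately contains the delimiter.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the two intended values are "red widget" and "blue widget".
DELIMITED = '<order items="red widget blue widget"/>'
SUB_ELEMENTS = '<order><item name="red widget"/><item name="blue widget"/></order>'


def items_from_attribute(xml: str) -> list:
    """Split a space-delimited attribute value, as an IDREFS-style list would be."""
    return ET.fromstring(xml).get("items", "").split()


def items_from_children(xml: str) -> list:
    """Read one value per <item> sub-element; no second parser is needed."""
    return [item.get("name") for item in ET.fromstring(xml).findall("item")]


print(items_from_attribute(DELIMITED))    # ['red', 'widget', 'blue', 'widget']
print(items_from_children(SUB_ELEMENTS))  # ['red widget', 'blue widget']
```

The delimited form silently turns two values into four, while the sub-element form keeps each value intact and still stores it in an attribute.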
NULL Values

If you look closely at the W3C schema spec, there is a provision for NULL values in the sense of database NULL values.
However, if you want to roll your own solution, there are several available that use attributes:
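One such convention, offered here as an assumption rather than as the article's own list, is to treat a missing attribute as NULL and a present-but-empty attribute as an empty string. The document and the nickname attribute below are hypothetical. The sketch relies on the fact that Element.get in Python's ElementTree returns None when an attribute is absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: Joe has an empty nickname, Ann has no nickname at all.
DOC = '<people><person name="Joe" nickname=""/><person name="Ann"/></people>'

nicknames = {
    person.get("name"): person.get("nickname")  # None when the attribute is missing
    for person in ET.fromstring(DOC).findall("person")
}

print(nicknames)  # {'Joe': '', 'Ann': None}
```

This keeps the database distinction between "no value" and "empty value" without any extra markup, though it only works for attributes that the schema declares optional and without a default.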
Datatypes

Via the W3C schema spec, an element can specify the datatype of its text:
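A minimal sketch of in-line typing follows. It assumes the xsi:type attribute and the namespace URIs of the final XML Schema Recommendation, which postdate this article, and the reading element and CASTS table are hypothetical. ElementTree does not interpret xsi:type itself; the code simply reads it and picks a converter.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical document whose element declares the type of its own text.
DOC = (
    '<reading xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xsd="http://www.w3.org/2001/XMLSchema" '
    'xsi:type="xsd:int">42</reading>'
)

# Naive lookup keyed on the literal prefixed name; a real processor would
# resolve the "xsd" prefix to its namespace instead of trusting the prefix.
CASTS = {"xsd:int": int, "xsd:string": str}

elem = ET.fromstring(DOC)
declared = elem.get(f"{{{XSI}}}type")  # ElementTree keys attributes as {namespace}local
value = CASTS[declared](elem.text)

print(declared, value)  # xsd:int 42
```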
This feature does not exist for attributes. Datatypes can be defined for attributes, but they must be specified via an external schema. That's not all bad. External schemas reduce the size of documents considerably by eliminating redundant type information.

Preparing for Change

Since attribute values cannot be annotated further, you generally must add another attribute to an element whenever you need to add additional information. To some, this is incorrect if the new attribute annotates an existing attribute. Attributes should be independent, more or less, because that makes schema-based validation possible. XML purists prefer to create a new element that contains related attributes. Generally, it's easier to add attributes than to add elements. So from this perspective, using text is probably safer than using attributes. You can always add another attribute to the text's element.

Good Use of Text

UDDI uses text and attributes in a compelling fashion in which the difference between text and attributes is clear, logical, and consistent. See UDDI: an XML Web Service.

Conclusion

Attributes make a lot of sense to me. I see the various "simple-ML" initiatives that want to eliminate attributes and I just think, these guys got it exactly backwards! Attributes are easier to create and process via the most popular APIs. Attributes reduce document size compared to text. However, if you want to use the NULL and in-line typing features provided by W3C schemas, you must use element text.

There is an unfortunate bias towards element text throughout the XML community. If you learn just one thing from this article, let it be this: elements should have text only when you're absolutely sure that the element won't ever have any child nodes.

Furthermore, it's far more important to focus your mental energy on differentiating between elements and "data." Elements have sub-structure whereas "data" is represented by a simple datatype.
Turning a data value into an element will break any existing code. This is true regardless of whether attributes or text is used. I will further discuss this design decision in my next article.

A few additional rules of thumb were discussed in this article. Attributes should be used whenever possible. Sub-elements should be used to represent multi-valued data. When an element clearly has only one value associated with it, use text. If an element is analogous to an object, use attributes. Nullable values that have a default value must be stored as text. Finally, if you find all of this confusing, as most developers do, just use text. XML Schemas have much improved features for text, such as "simple" types (no internal structure), default values, and constraints.

Alas, advocating either attributes or text is a lost cause. XML supports both, so document authors are faced with an often arbitrary decision. Many XML developers prefer the "look" of element text over attributes. Many document designers, either through ignorance or time constraints, don't even consider using attributes. Many XML newbies gravitate towards element text because they already know HTML. HTML uses text liberally, and for good reason: HTML is a mark-up language for text. HTML is document-centric. In contrast, XML is a mark-up language for data. Therefore we should expect to see less free-form text in XML documents. We might even expect to see more attributes and less text. But more often than not, we see elements used in cases where attributes would work just fine:
The genie is out of the bottle, never to return.

Update: September 25 2002

Take a look at Simon St. Laurent's arguments for element text. All in all, there are no simple answers in computing. Do you suppose the human brain uses attributes or text? Er...

The bottom line is I like attributes. My brain is hard-wired toward components. I do not "see" objects, hierarchical data, and relational data as text that is just asking to be marked up. Mark-up simply makes no sense to me when it comes to data. On the other hand, I understand how to group data in terms of entities and the entities and properties that comprise them.

I do not consider the "fixed" vs. "open" argument to be the linchpin in XML design. Yes, an open schema is the key to longevity, but XML is inherently open because elements can be nested arbitrarily deep in elements.

We can argue until we're blue in the face about which method is better. I like wine, maybe you like beer. We agree to disagree. Ultimately, let's be happy if and when XML satisfies our needs.