Regarding Format Debt

An interesting (non-geeks, excuse yourself) discussion has arisen as the result of Bill de hÓra’s post Format Debt, what you can’t say. At the heart of the argument is how to appropriately indicate layering given the inadequacy of Internet media types to specify that information. The problem crops up often, especially in regard to HTML with nested microformats.

For example, an HTML document is served up with a media type of text/html, but the actual information in the page is a person’s contact information marked up using hCard. An application interested in utilizing that contact information has no way of knowing whether or not the page contains that data without resorting to parsing it.
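To make that cost concrete, here is a rough sketch of what "resorting to parsing" looks like. It uses Python's standard html.parser module and relies on the hCard convention that the root element carries the class name "vcard". The names HCardDetector and contains_hcard are made up for illustration, and a real consumer would go on to extract the individual properties rather than just detect the root.

```python
from html.parser import HTMLParser


class HCardDetector(HTMLParser):
    """Hypothetical helper: flags any element whose class list includes 'vcard'."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value and "vcard" in value.split():
                self.found = True


def contains_hcard(html_text):
    """Return True if the markup appears to contain an hCard root element."""
    detector = HCardDetector()
    detector.feed(html_text)
    detector.close()
    return detector.found
```

The point is that the text/html media type tells the client nothing; the whole body has to be fetched and scanned before the question can even be answered.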

We say we want layered formats, because that’s what the combination of IETF IDs, W3C Recommendations and deployed browsers and servers allow us to say. It’s the Web version of the Blub paradox. What we want is layered data. What we want is not just to qualify a media type, but to describe the ingredients in the entity whose “shell” is the media type.

I’d suggest that if you arrive at this point, it may indicate you’ve picked the wrong “shell.” Maybe I’m too dismissive, but I think there will always exist an impedance mismatch when attempting to define nested schemas for formats not suited to the purpose. The fact that we’ve reached a certain level of maturity with regard to extensibility (a huge engineering boon) is an exacerbating factor.

Before I get criticized for this viewpoint, I want to note explicitly that I am a proponent of microformats. I think they are incredible, and the enabling potential they offer to future browser extensions is well worth the effort in defining and developing them.

However, I believe this propensity towards nesting is ill-suited for scenarios involving structured data, which often imply machine-to-machine communication. Any inadequacies inherent in media types can be easily overcome: simply define more formats that don’t necessitate deeper nesting. As a bonus, this fully complies with the REST architectural constraints.

In the address book example, HTML pages with microformats can continue to be delivered to people surfing via browsers. However, applications that are interested in structured data can request the same information be returned, not as HTML, but as vCard, a format designed specifically for the purpose of storing contact information, and delivered with a media type of text/x-vcard.
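As a sketch of how a server might decide which representation to return, the following picks between the two media types based on the request's Accept header. Everything here is an assumption for illustration: the function names, the AVAILABLE tuple, and the simplified handling of q-values and wildcards, which covers far less than a full HTTP implementation would.

```python
# Representations this hypothetical address book resource can produce,
# listed in the server's own order of preference.
AVAILABLE = ("text/html", "text/x-vcard")


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_representation(accept_header, available=AVAILABLE):
    """Pick the media type to serve, or None if nothing is acceptable."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return None
```

A browser sending its usual Accept header would get the HTML page with embedded hCard, while an application sending "Accept: text/x-vcard" would get the vCard directly, with no nesting to describe.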

Nesting information in Atom (or RSS), while not obviating the media type issue, does not pose nearly the same problems. The reason is that Atom has a clear purpose as a container format, namely carrying a feed of items, such as news articles, that are generated over time. In contrast to the generic document markup of HTML, there are already constraints in place that should serve as guides when deciding whether or not Atom is an appropriate container.


Lonna Hanson
January 14, 2009 at 11:43 AM

Nice job of expressing your information. Being the non-engineer I am, I was a bit overwhelmed, but I still enjoyed reading it.
