Converting Unstructured Documents to XML

by W. Scott Means
03/22/2001

Although the migration to XML as a long-term document format is already a foregone conclusion, one thorny problem remains: what to do with all of those legacy documents. How nice it would be to be able to archive all of those old files, and just worry about creating new documents. But in the IT world the mantra is evolution, not revolution. So the problem of migrating old documents to XML formats will be with us for years to come.

Unfortunately for us, there are as many types of unstructured documents as there are people writing them. Word processor documents, spreadsheets, Web pages, and text files all provide unique challenges to the XML converter. But there are a few techniques that can make the translation process a little less painful.

Perhaps the best part of the Internet is having access to massive quantities of high-quality, free information. A great source of free information is the United States government. One agency, the National Weather Service, provides the weather forecasts and observations that Federal Aviation Administration briefers give to pilots who are planning a flight. The observations are published as METAR reports. (For the insatiably curious, you can read more about METAR reports.) [Editor's Note: The METAR acronym roughly translates from French as Aviation Routine Weather Report.]

For example, the following URL will download the most current weather observations for the Columbia Metropolitan Airport (located here in the fine city of Columbia, South Carolina):

ftp://tgftp.nws.noaa.gov/data/observations/metar/decoded/KCAE.TXT

Although the data is updated on an hourly basis, we'll be working with this particular report:

Columbia, Columbia Metropolitan Airport, SC, 
	United States (KCAE) 33-56-31N 081-07-05W 73M
Mar 09, 2001 - 11:56 PM EST / 2001.03.10 0456 UTC
Wind: from the N (010 degrees) at 5 MPH (5 KT):0
Visibility: 10 mile(s):0
Sky conditions: clear
Temperature: 41.0 F (5.0 C)
Dew Point: 36.0 F (2.2 C)
Relative Humidity: 82%
Pressure (altimeter): 29.98 in. Hg (1015 hPa)
ob: KCAE 100456Z 01005KT 10SM CLR 05/02 A2998 
	RMK AO2 SLP152 T00500022 401170022
cycle: 5

[Editor's note: Lines 1 and 10 were extended to two lines each to accommodate our Web site's formatting. Throughout this article, whenever a line of code or data has been broken up into two lines, the second line is indented.]


Now, we have our source document. Let's begin the process of converting it to XML.

Isolate Atomic Elements

No, this isn't part of a physics primer included by mistake. XML documents are composed of elements, so the first step to converting the source document is to recognize its underlying structure. The smallest indivisible pieces of information in the document will become the elements in the resulting XML document.

Let's work through the document line by line, starting with the first line:

Columbia, Columbia Metropolitan Airport, SC, 
	United States (KCAE) 33-56-31N 081-07-05W 73M

At first glance, there's a lot of different information on this line. But on closer examination, it is all provided to pinpoint the physical location where this weather information was collected. First, let's look at the human-readable location from the beginning of the line. It reads somewhat like a normal address, so a possible first encoding would be:

<station>Columbia Metropolitan Airport</station>
<city>Columbia</city>
<state>SC</state>
<country>United States</country>

Next comes something that looks like an airport identifier with the letter 'K' in front. Rather than make up a name for it, or take a wild guess, it's worth the savings in future confusion to just reference the text on the METAR site, and use the proper terminology. In general, if you're blazing new trails in XML encoding, you should assume that your work will live and spread far beyond the scope of your project. Using the right name for something now will save many headaches for you (and your successors). This four-letter code is called the "ICAO (International Civil Aviation Organization) location," so we'll add:

<ICAO-location>KCAE</ICAO-location>

The encoding of the rest of this line seems fairly obvious (other than the final number, which turns out to be the location's altitude above mean sea level).

<latitude>33-56-31N</latitude>
<longitude>081-07-05W</longitude>
<altitude>73M</altitude>
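
The same decomposition can be done mechanically. What follows is only a minimal sketch in Python: the regular expression and the `parse_header` helper are hypothetical, written against the single sample header shown in this article, and would almost certainly need adjusting for stations whose headers are laid out differently. Note that the value ending in N is treated as the latitude and the value ending in W as the longitude.

```python
import re

# Hypothetical pattern, written against the one sample header in this article.
# Assumed layout: "City, Station, ST, Country (ICAO) latitude longitude altitude".
HEADER = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<station>[^,]+),\s*(?P<state>[^,]+),\s*"
    r"(?P<country>.+?)\s*\((?P<icao>[A-Z0-9]{4})\)\s+"
    r"(?P<latitude>\S+)\s+(?P<longitude>\S+)\s+(?P<altitude>\S+)$"
)


def parse_header(text):
    """Return the location fields of a header as a dict of strings."""
    # Collapse the line break and tab that split the header across two lines.
    flattened = " ".join(text.split())
    match = HEADER.match(flattened)
    if match is None:
        raise ValueError("unrecognized header: " + flattened)
    return match.groupdict()


header = parse_header(
    "Columbia, Columbia Metropolitan Airport, SC, \n"
    "\tUnited States (KCAE) 33-56-31N 081-07-05W 73M"
)
print(header["station"], header["icao"], header["altitude"])
```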

Now, before we continue, we should consider the issue of how far to go when breaking a document down into its elements. Technically, we could have encoded the information above like this:

<latitude degrees="33" minutes="56" 
	seconds="31" direction="N"/>
<longitude degrees="081" minutes="07" 
	seconds="05" direction="W"/>
<altitude value="73" units="M"/>

The short answer is: Consider the audience for the document you're encoding. Who (or what) will be viewing it? Will it be important to them to be able to perform filtering based on the value of one or more of the parameters above? Is document size a concern? The first form is definitely more compact. In general, you'll never go wrong if you go to the very lowest level of granularity, but sometimes real-world concerns outweigh the additional flexibility. For instance, with thousands of reports being generated and updated every hour, are the expanded storage and processing requirements worth it?
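
To make the trade-off concrete, here is a rough Python sketch of what a consumer has to do when only the compact form is stored. The helper names `split_dms` and `to_decimal` are made up for illustration, and the sketch assumes every coordinate follows the degrees-minutes-seconds pattern seen in the sample report.

```python
import re


def split_dms(text):
    """Split a value such as '33-56-31N' into its four parts."""
    match = re.fullmatch(r"(\d{1,3})-(\d{2})-(\d{2})([NSEW])", text)
    if match is None:
        raise ValueError("unrecognized coordinate: " + text)
    degrees, minutes, seconds, direction = match.groups()
    return {
        "degrees": int(degrees),
        "minutes": int(minutes),
        "seconds": int(seconds),
        "direction": direction,
    }


def to_decimal(parts):
    """Convert the split parts to signed decimal degrees (S and W are negative)."""
    value = parts["degrees"] + parts["minutes"] / 60 + parts["seconds"] / 3600
    return -value if parts["direction"] in ("S", "W") else value


north = split_dms("33-56-31N")
west = split_dms("081-07-05W")
print(round(to_decimal(north), 4), round(to_decimal(west), 4))
```

If most consumers would end up running something like this on every record, the finer-grained encoding probably pays for itself. If they only ever display the string, the compact form is enough.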

Although it's not labeled, the next line in the example code obviously gives the date and time of the observation. The only interesting thing about this line is that two dates and times are provided, local and UTC (Coordinated Universal Time):

<local-date>Mar 09, 2001</local-date> 
<local-time>11:56 PM EST</local-time>
<UTC-date>2001.03.10</UTC-date>
<UTC-time>0456</UTC-time>
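
A small Python sketch of the same split is shown here. It assumes the separators seen in the sample line (" - " between local date and time, " / " before the UTC half) are stable across reports, which the single example cannot confirm. The `split_timestamp` name is hypothetical.

```python
from datetime import datetime


def split_timestamp(line):
    """Split 'local date - local time / UTC date UTC time UTC' into four fields."""
    local_part, utc_part = (part.strip() for part in line.split(" / "))
    local_date, local_time = (part.strip() for part in local_part.split(" - "))
    utc_date, utc_time, zone = utc_part.split()
    if zone != "UTC":
        raise ValueError("expected a UTC timestamp, got: " + utc_part)
    # Parsing the UTC half confirms that it is a real date and time.
    datetime.strptime(utc_date + " " + utc_time, "%Y.%m.%d %H%M")
    return {
        "local-date": local_date,
        "local-time": local_time,
        "UTC-date": utc_date,
        "UTC-time": utc_time,
    }


stamp = split_timestamp("Mar 09, 2001 - 11:56 PM EST / 2001.03.10 0456 UTC")
for name, value in stamp.items():
    print("<%s>%s</%s>" % (name, value, name))
```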

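The rest of the elements in the example code are fairly easy to encode. Each line follows the same format: a descriptive label and a value, separated by a colon. The simplest way to encode these elements is to use the original field label as the tag name of the resulting XML element, replacing spaces with hyphens because XML names cannot contain spaces. (The one exception is the "Pressure (altimeter)" label, which is shortened to <altimeter>.)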

<Wind>from the N (010 degrees) at 5 MPH (5 KT):0</Wind>
<Visibility>10 mile(s):0</Visibility>
<Sky-conditions>clear</Sky-conditions>
<Temperature>41.0 F (5.0 C)</Temperature>
<Dew-Point>36.0 F (2.2 C)</Dew-Point>
<Relative-Humidity>82%</Relative-Humidity>
<altimeter>29.98 in. Hg (1015 hPa)</altimeter>
<ob>KCAE 100456Z 01005KT 10SM CLR 05/02 A2998 
	RMK AO2 SLP152 T00500022 401170022</ob>
<cycle>5</cycle>
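
The label-to-tag rule is simple enough to sketch in Python. The helpers below are hypothetical and make two assumptions: only the first colon separates label from value (so the trailing ":0" stays in the value), and any run of characters that cannot appear in an XML name becomes a single hyphen. Under that rule "Pressure (altimeter)" comes out as `Pressure-altimeter`, so a hand-picked name such as `altimeter` would still need a small override table.

```python
import re
from xml.sax.saxutils import escape


def label_to_name(label):
    """Turn a field label into a usable XML element name."""
    # Replace runs of characters that are not safe in an XML name with a hyphen.
    name = re.sub(r"[^A-Za-z0-9._-]+", "-", label.strip()).strip("-")
    # XML names must start with a letter or underscore.
    if not name or not (name[0].isalpha() or name[0] == "_"):
        name = "_" + name
    return name


def line_to_element(line):
    """Encode one 'label: value' line as an XML element string."""
    label, separator, value = line.partition(":")
    if not separator:
        raise ValueError("no label found in: " + line)
    name = label_to_name(label)
    return "<%s>%s</%s>" % (name, escape(value.strip()), name)


for sample in (
    "Wind: from the N (010 degrees) at 5 MPH (5 KT):0",
    "Sky conditions: clear",
    "Relative Humidity: 82%",
):
    print(line_to_element(sample))
```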



Separate Presentation from Content

In general, the safest way to determine which parts of a document are there for readability and aesthetic value and which parts actually contain information is to compare two or more instances of the same type of document. When that isn't possible, it is necessary to look at each element for redundant information. For example, take the information presented in the <Wind> element:

<Wind>from the N (010 degrees) at 5 MPH (5 KT):0</Wind>

The short phrase "from the N" is followed by the actual measured direction of the wind (010 degrees). The point of the compass is less precise than the actual directional measurement, and given the measured direction it would be trivial to generate the compass point during the presentation phase. So a more compact (and more useful) version of the wind element would be:

<Wind direction="010">5 MPH (5 KT):0</Wind>
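
To show how little is lost by dropping the compass point, here is a Python sketch that regenerates it from the measured direction at presentation time. The 16-point scheme is an assumption on my part; the sample report only shows "N" for 010 degrees, so the exact set of points the original feed uses is not confirmed here.

```python
# Sixteen compass points, clockwise from north, each covering 22.5 degrees.
POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]


def compass_point(degrees):
    """Return the compass point nearest to a direction given in degrees."""
    # Adding 0.5 before truncating rounds to the nearest 22.5-degree sector.
    index = int((degrees % 360) / 22.5 + 0.5) % 16
    return POINTS[index]


direction = int("010")  # the value of the direction attribute
print("from the %s (%03d degrees)" % (compass_point(direction), direction))
```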

Add Structural Elements

So far, our converted document looks like this:

<station>Columbia Metropolitan Airport</station>
<city>Columbia</city>
<state>SC</state>
<country>United States</country>
<ICAO-location>KCAE</ICAO-location>
<latitude>33-56-31N</latitude>
<longitude>081-07-05W</longitude>
<altitude>73M</altitude>
<local-date>Mar 09, 2001</local-date> 
<local-time>11:56 PM EST</local-time>
<UTC-date>2001.03.10</UTC-date>
<UTC-time>0456</UTC-time>
<Wind direction="010">5 MPH (5 KT):0</Wind>
<Visibility>10 mile(s):0</Visibility>
<Sky-conditions>clear</Sky-conditions>
<Temperature>41.0 F (5.0 C)</Temperature>
<Dew-Point>36.0 F (2.2 C)</Dew-Point>
<Relative-Humidity>82%</Relative-Humidity>
<altimeter>29.98 in. Hg (1015 hPa)</altimeter>
<ob>KCAE 100456Z 01005KT 10SM CLR 05/02 A2998 
	RMK AO2 SLP152 T00500022 401170022</ob>
<cycle>5</cycle>

Of course, XML documents are allowed to have only one top-level element. Adding a top-level element and an XML declaration yields this document (click here to view the document).

This is a well-formed XML document, but the structure we have so far is completely flat. Looking at the information, certain elements are so closely related that they should be grouped together in a container element. For instance, the first eight elements (<station> through <altitude>) all relate to the physical location where the report was taken. So, it would be appropriate to add a <location> element to contain them:

<location ICAO-location="KCAE">
  <station>Columbia Metropolitan Airport</station>
  <city>Columbia</city>
  <state>SC</state>
  <country>United States</country>
  <latitude>33-56-31N</latitude>
  <longitude>081-07-05W</longitude>
  <altitude>73M</altitude>
</location>

But it looks as though the location information could be broken down further. The <station>, <city>, <state>, and <country> elements completely define the location in political terms. The <latitude>, <longitude>, and <altitude> elements completely define the location in geographical terms. Grouping each set in its own container gives an element that looks like this:

<location ICAO-location="KCAE">
  <political>
    <station>Columbia Metropolitan Airport</station>
    <city>Columbia</city>
    <state>SC</state>
    <country>United States</country>
  </political>
  <geographical>
    <latitude>33-56-31N</latitude>
    <longitude>081-07-05W</longitude>
    <altitude>73M</altitude>
  </geographical>
</location>

Making these relationships explicit now will simplify the task of presenting or processing the document later.
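
As an illustration of how the grouping might be produced by a program rather than by hand, here is a sketch using Python's standard `xml.etree.ElementTree` module. The `build_location` helper is hypothetical; the element and attribute names are the ones used in this article.

```python
import xml.etree.ElementTree as ET


def build_location(icao, political, geographical):
    """Build a <location> element with one container per group of fields."""
    location = ET.Element("location", {"ICAO-location": icao})
    for group_name, fields in (
        ("political", political),
        ("geographical", geographical),
    ):
        group = ET.SubElement(location, group_name)
        for tag, text in fields:
            ET.SubElement(group, tag).text = text
    return location


location = build_location(
    "KCAE",
    political=[
        ("station", "Columbia Metropolitan Airport"),
        ("city", "Columbia"),
        ("state", "SC"),
        ("country", "United States"),
    ],
    geographical=[
        ("latitude", "33-56-31N"),
        ("longitude", "081-07-05W"),
        ("altitude", "73M"),
    ],
)
print(ET.tostring(location, encoding="unicode"))
```

With the containers in place, a consumer can ask for `political/city` or `geographical/altitude` directly instead of scanning a flat list of siblings.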

Normalize the Data

Normalization is a concept that should be familiar to anyone who has done relational-database design before. It is a simple concept, and it applies just as well to XML as it does to SQL: don't duplicate data. If you have one piece of information, don't store copies of it elsewhere in your database (or document) unless you have a really good reason. In the database world, a really good reason is often performance related. Let's take a look at some of the redundant information we still have in our document:

The <location> Element

In all truth, this entire container could be omitted and replaced with a single element that contained the four-letter ICAO code. Because this would require looking up the station any time we wanted to display this record to a human, we'll leave it as is for the sake of performance.

The <Wind> Element

We still have the wind speed listed in both miles per hour and knots. I would recommend collapsing these into a single value (probably knots, because that is more familiar to pilots). However, there is that pesky ":0" value appended to the wind-speed string. Without going into the METAR specification and finding out what ":0" means, I would be reluctant to mess with this element. The moral of the story: "If you don't understand it, leave it alone."

The <local-date> and <UTC-date> Elements

Now, given the time zone and location of the station, it would be possible to calculate the local time and date given only the UTC information. However, once again, for performance and simplicity it is better to keep the extra elements around so they can be displayed to a user later.

The <Temperature>, <Dew-Point>, and <altimeter> Elements

Each of these three elements contains the same measurement given in two different units. The first two are temperatures given in both Fahrenheit and Celsius; the last is a barometer setting given in inches of mercury and hectopascals (hPa). For all three, I would vote for dropping the redundant measurement and moving the units to an element attribute. Although it would require executing a formula on display (if the user requested Celsius or hPa), by converting the value from a text string to a numeric value we open up many more options for further processing:

<Temperature units="F">41.0</Temperature>
<Dew-Point units="F">36.0</Dew-Point>
<altimeter units="in. Hg">29.98</altimeter>

For instance, with both the temperature and dew point encoded as numbers, a transformation script could compare the two and issue warnings about possible carburetor icing. An entire cycle of reports could be easily processed to monitor temperature change trends.
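
Here is a hedged Python sketch of that kind of processing. It reads the three normalized elements, regenerates the dropped Celsius and hPa values on demand, and computes the spread between temperature and dew point. The <metar> wrapper is used only to make the fragment parseable, and what spread should trigger an icing warning is deliberately left out, since that threshold is not something this article defines.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<metar>
  <Temperature units="F">41.0</Temperature>
  <Dew-Point units="F">36.0</Dew-Point>
  <altimeter units="in. Hg">29.98</altimeter>
</metar>"""

HPA_PER_INCH_HG = 33.8639  # hectopascals in one inch of mercury


def fahrenheit_to_celsius(value):
    return (value - 32.0) * 5.0 / 9.0


def inches_hg_to_hpa(value):
    return value * HPA_PER_INCH_HG


report = ET.fromstring(FRAGMENT)
temperature_f = float(report.findtext("Temperature"))
dew_point_f = float(report.findtext("Dew-Point"))
altimeter_in = float(report.findtext("altimeter"))

# The redundant values dropped from the document can be regenerated on display.
print("%.1f C" % fahrenheit_to_celsius(temperature_f))
print("%.1f C" % fahrenheit_to_celsius(dew_point_f))
print("%.0f hPa" % inches_hg_to_hpa(altimeter_in))

# Numeric values also allow comparisons that text strings do not.
spread_f = temperature_f - dew_point_f
print("temperature/dew point spread: %.1f F" % spread_f)
```

The regenerated figures agree with the ones in the original report (5.0 C, 2.2 C, and 1015 hPa), which is a useful sanity check that nothing was lost by dropping them.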

Identify Any Unique Keys

So far, we've been dealing with this single document in a vacuum. But in reality, it is just a snapshot in a continuous stream of reports for a single station, out of thousands of other stations that are generating similar reports. To make this document useful in a larger context, it is necessary to determine what makes it unique from all of the other possible METAR reports, and make these unique keys obvious to the consumers of this information.

There have been many debates about when and what types of information to encode using XML attributes. One group of XML users has even tried to formalize a Simple XML specification that doesn't include attributes at all. The following analysis is based on my personal experience with XML, and should not be taken as anything more than it is: opinion.

When I am developing a new XML document, I tend to think in terms of database design. This saves me from having to develop an entirely new set of criteria for designing documents, and, for the most part, lets me leverage all of the experience I've built in designing large databases. Since XML is hierarchical and most modern databases are not, the mapping is not perfect. But the following rules of thumb seem to work most of the time:

Leaf elements == fields: A single leaf element contains information that would normally be stored in a single column of a table in a database.

Collections of leaves == rows: For this example, the entire <metar> element would equate to a single row in a single database table.

Collections of collections == tables: If multiple <metar> elements were stored in a single document, the container that held them would equate to a database table.

Attributes == keys & indexes: Because of the ID mechanism defined in XML, some attributes are already effectively "unique keys" within an XML document. To make my own documents more consistent and make it more obvious how two of the same type of element are different, I place all of the data that makes an element unique in attributes. In some cases, it is convenient to use the ID mechanism of XML to make these relationships even more apparent, but the limitations on the values of ID and IDREF attributes make this difficult at times.

Based on these guidelines, what values make one METAR report unique from another? The METAR specification indicates that there can be only one report from a given station in a particular hour, so it appears that the two factors that make a report unique are a) location and b) time.

Since the four-letter identifier is unique to a particular station, including it as an attribute of the <metar> element will take care of the location axis. As for the time, since reports come in from all over the world, the UTC date and time fields would be the most logical choice. Also, the <cycle> tag provides bookkeeping information and it isn't really part of the report itself. And to keep with the spirit of normalization, if we include this data in the <metar> tag we should remove it from the body of the document.

After making these changes, click here to view the final translated document.

You'll notice that the ICAO location code is also included as an attribute on the <location> tag. Following with our database metaphor, the <location> tag and its contents really belong in a separate table that is referenced by the ICAO identifier. Including the four-letter code in this element makes that relationship more apparent. It also simplifies the task of extracting a complete database of locations by processing multiple METAR reports and extracting <location> elements.
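
The sketch below shows, in Python, how those keys might be used by a consumer. Several things in it are assumptions rather than facts from the final document: the attribute names on <metar> (`ICAO-location`, `UTC-date`, `UTC-time`), the <reports> wrapper element, and the third report, whose 0556 time is invented purely to show a second row for the same station.

```python
import xml.etree.ElementTree as ET

# Attribute names and the <reports> wrapper are assumptions for illustration.
SAMPLE = """<reports>
  <metar ICAO-location="KCAE" UTC-date="2001.03.10" UTC-time="0456">
    <location ICAO-location="KCAE"><station>Columbia Metropolitan Airport</station></location>
  </metar>
  <!-- the same report received a second time -->
  <metar ICAO-location="KCAE" UTC-date="2001.03.10" UTC-time="0456">
    <location ICAO-location="KCAE"><station>Columbia Metropolitan Airport</station></location>
  </metar>
  <!-- invented later report, not real data -->
  <metar ICAO-location="KCAE" UTC-date="2001.03.10" UTC-time="0556">
    <location ICAO-location="KCAE"><station>Columbia Metropolitan Airport</station></location>
  </metar>
</reports>"""


def index_reports(root):
    """Index reports by (station, date, time) and locations by station."""
    reports, locations = {}, {}
    for metar in root.iter("metar"):
        key = (
            metar.get("ICAO-location"),
            metar.get("UTC-date"),
            metar.get("UTC-time"),
        )
        # setdefault keeps the first report seen for each unique key.
        reports.setdefault(key, metar)
        location = metar.find("location")
        if location is not None:
            locations.setdefault(location.get("ICAO-location"), location)
    return reports, locations


reports, locations = index_reports(ET.fromstring(SAMPLE))
print(len(reports), "unique reports;", len(locations), "location")
```

The `reports` dictionary plays the role of the report table and `locations` the separate location table, both keyed by values that live in attributes.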

Summary

We've covered a lot of ground to come up with a workable XML document from our original text file. And in a real-world application we would still have a lot to do. Once the format is solidified, it would be a good idea to construct a DTD (document type definition) to validate METAR reports. It would also behoove us to create a program to perform these translations automatically (because even though I type fast, manually updating 1,000-plus reports per hour might push my incipient carpal tunnel over the edge).

Translating unstructured data to XML is still more of an art than a science. But leveraging the knowledge you've already built around technologies like database and programming data structures can give you the tools you need to get the job done.


W. Scott Means has been a professional software developer since 1988, when he joined Microsoft Corporation at the age of 17. He was one of the original developers of OS/2 1.1 and Windows NT, and did some of the early work on the Microsoft Network for the Advanced Technology and Business Development group. Most recently, he served as the CEO of Enterprise Web Machines, a South Carolina-based Internet infrastructure venture. He is currently writing full-time and consulting on XML and Internet topics.


O'Reilly & Associates recently released (January 2001) XML in a Nutshell.

