To XML and Beyond

HTML -> XHTML -> XML
Section 1 - To XML | Intermission | Section 2 - XML and XSL

Introduction

This lesson will show you:

...and you should end up with a few files to try out in Office 2003.

XML - Extensible Mark-up Language

The first section shows how formatting can be extracted from an XHTML file into a CSS style sheet, leaving behind only the content and thus generalising XHTML into XML. The second section shows what should be the more common task: translating an XML file into XHTML using an XSL style sheet.

The lesson contains about 10 parts and about 7 exercises. The exercises are much easier than the large amount of text might suggest. I'd recommend using a text or HTML editor: a WYSIWYG editor could be confusing, as we're dealing with the contents of files at the most fundamental level, and you're likely to end up keying into its code window anyway. The lesson can probably be completed in about two and a half hours if you use the file supplied, or a simple text-only file of your own. You should be able to use a file of your own, but in the course of writing this I came across lots of inconsistencies, out-of-date information and things that sometimes don't work.

Please note that the space at the start of some example tags is only there to get the tag to display nearly correctly in a browser, and should not be typed. Also, Internet Explorer 6 does not support CSS properly. This should only affect the top part of the page, and I've checked the browser and adjusted the style sheet, but if anything looks bad try Firefox, which runs faster anyway. And if you've installed Netscape Navigator 8, it prevents IE6 from parsing XML files, so you'll need to update your version of Navigator. What delights await in IE7 remains to be seen.

The lesson provides an introduction to the main components of an XML system. It covers only the most basic functions and syntax of XSL, does not deal with images, and does not go into the writing of schemas.


Part 1 - What XML is

XML is a free, open technology developed by the W3C. It does not belong to any one company, Microsoft included.

It's basically a very simple concept that has been in development for several years and is finally being implemented in common applications. Use of different file formats for data files will decrease and information will increasingly be held as XML.

XHTML was introduced as a stricter form of HTML, but more importantly it is almost an example of an XML language. It isn't quite one, as it relies to some extent on a browser for its formatting, whereas it should really be independent of the application being used.

XML files can contain any text, along with mark-up that identifies what that text is. Being text-only, they can be read by any device and processed in accordance with the mark-up.

In general:


Part 2 - Separating out formatting

We've seen how we can use style sheets to separate the appearance of our documents from their actual content, and how different style sheets might be used to specify different fonts, colours and positioning, and to make content appear differently on different devices.

However, we still have heading, paragraph and possibly other tags, which rely on the HTML formatting built into a browser. As a step towards independence, we could replace everything with paragraph tags, and rely more on our style sheet for formatting...
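As a rough sketch of what that replacement might look like, here is a small XHTML page in which every heading has become a paragraph with a class name. The class names ('pagetitle', 'subheading', 'text'), the text content and the style sheet name 'course.css' are all made up for illustration; they are not taken from the lesson's own files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Course details</title>
    <!-- Hypothetical style sheet; it defines .pagetitle, .subheading and .text -->
    <link rel="stylesheet" type="text/css" href="course.css" />
  </head>
  <body>
    <!-- was <h1> -->
    <p class="pagetitle">Course details</p>
    <!-- was <h2> -->
    <p class="subheading">Entry requirements</p>
    <!-- was a plain <p> -->
    <p class="text">No previous experience is needed.</p>
  </body>
</html>
```

All of the visual difference between a title and a paragraph now comes from the style sheet rather than from the browser's idea of what an 'h1' looks like.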

Exercise 1 - Become less reliant on HTML

Open this file, and its style sheet, in a text editor. Notice how the formatting is controlled mostly by the style class names, and that no HTML tags are specifically redefined in the style sheet. Convert one of your own pages so that it contains only paragraph tags and class names, and amend the style sheet to match. (If you don't have a text-only page, use this course details page for the other exercises.)

Part 2a - Note

I have not distinguished between the content of each sub-heading nor of each paragraph for a few reasons:

If the intention were to process the file in some manner, it would make more sense to mark up each part more like this, with each sub-heading and its associated paragraph(s) enclosed in appropriate tags.
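A minimal sketch of that kind of grouping, assuming invented element names ('course', 'section', 'subheading', 'paragraph') and placeholder text, might be:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<course>
  <pagetitle>Course details</pagetitle>
  <!-- Each sub-heading travels with the paragraphs that belong to it -->
  <section>
    <subheading>Entry requirements</subheading>
    <paragraph>No previous experience is needed.</paragraph>
  </section>
  <section>
    <subheading>Assessment</subheading>
    <paragraph>There is one assignment.</paragraph>
    <paragraph>There is no exam.</paragraph>
  </section>
</course>
```

A program processing this file could then pick out, say, the second section as a unit, which is not possible when headings and paragraphs simply follow one another.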


Part 3 - What about the remaining HTML ?

If all our tags are p's, and we don't even have any p's in the style sheet, why do we need the p's in the p tags ? Especially if we are trying to get rid of browser HTML formatting ?

Exercise 2 - Lose all the other HTML


Part 4 - Now it's XML !

Most browsers will parse XML files and report errors.

Exercise 3 - Has it worked ?

Open your XML file in a browser. You will probably have to add 'display: block;' to each of your style classes, and adjust the padding and margins, to restore the default HTML formatting which we have removed, but the page should look recognisable.
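For reference, a file at this stage might look roughly like the sketch below. It assumes that the old class names have become element names, and the element names, text and the style sheet name 'course.css' are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical style sheet name. Its rules now select on element names,
     and each rule needs 'display: block;' because the browser has no
     built-in formatting for elements it has never heard of. -->
<?xml-stylesheet href="course.css" type="text/css"?>
<course>
  <pagetitle>Course details</pagetitle>
  <subheading>Entry requirements</subheading>
  <paragraph>No previous experience is needed.</paragraph>
</course>
```

Without 'display: block;' the browser treats every element as inline, so all the text runs together on one line.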


Part 5 - Viewing structure

The browsers will show the structure of your file if you temporarily remove the style sheet processing instruction. Firefox has a proper DOM and style sheet inspector under 'Tools'.

Exercise 4 - In a browser

Temporarily remove the style sheet processing instruction and view your file in a browser.


Part 6 - What's a Schema ?

As we have extracted the formatting of our file into a style sheet, what we have left behind in the XML file is the content, and we have incidentally defined what each part of that content is by effectively creating our own mark-up language. This is why the 'X' in XML stands for 'Extensible'.

What we really should have is something like a DTD to control the structure of our mark-up language. DTDs can be used with XML files, and this is what we are doing when we specify an XHTML DTD at the start of our XHTML files; however, DTDs are being replaced by 'schemas'. You should find out the three main reasons why schemas are superior to DTDs. Pre-defined schemas are available for Chemistry (CML), Music (MML) and WAP devices (WML), and others, such as the Extensible Business Reporting Language (XBRL), TaxML, NewsML, and RIML for financial reporting, are being developed.

Schemas are usually presented early in XML text-books, in about the second to fourth chapters, but I'd recommend skipping those (and namespaces) at first. You should be able to open the XML file you have just created in Office 2003, so try opening it in a few Office 2003 programs. Office 2003 doesn't come with any schemas, which means some functions are not possible until you get one.


Part 7 - A couple of notes

Something often ignored in XML text-books is that if we have invented our own mark-up language, we need to communicate the meaning of that language to anyone who will be processing the file. It might be fairly obvious, but they need to know that, for example, a pair of 'name' tags encloses a name, and a pair of 'acno' tags encloses an account number.
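One possible way of passing that meaning on, offered only as a sketch, is to write it into a schema using the standard 'xs:annotation' and 'xs:documentation' elements. The 'account' root element and the wording of the descriptions are assumptions made for this example; only 'name' and 'acno' come from the text above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- 'account' is a hypothetical root element for this sketch -->
  <xs:element name="account">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string">
          <xs:annotation>
            <xs:documentation>The name of the account holder.</xs:documentation>
          </xs:annotation>
        </xs:element>
        <xs:element name="acno" type="xs:string">
          <xs:annotation>
            <xs:documentation>The account number.</xs:documentation>
          </xs:annotation>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A plain text file describing each tag would do the same job; the point is simply that the description has to exist somewhere.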

XML will be the default file format in the next version of Office (Office 12). However, this is Microsoft. If you examine one of their XML files in a text editor, you will see that there is a lot of unnecessary, bad and Microsoft-specific code. So while you should be able to use Office XML files in any (future) application, it's probably a good idea to plan on cleaning them up first. There should be tools to at least assist with that. Also, don't rely on any complex document formatting appearing in a non-Microsoft application precisely as it did in Office. And don't be surprised if your files look different in Office; CSS1 layout properties don't seem to work. All of which is contrary to what XML is all about, and normal practice for Microsoft: take a freely available standardised idea, implement only part of it and introduce compatibility problems.


Part 8 - Displaying an XML file

Lots of us will probably be engaged in converting lots of HTML and other files into XML as XML becomes more prevalent, though a fairly common management approach is to buy a whole new system. A more common task, as part of an XML system, would be to do the reverse of what we have done and display an XML file nicely formatted. We've seen how to attach a CSS style sheet to an XML file, so first here's a quick bit of revision with a different file...

Exercise 5 - A second XML/CSS

Take a look at this XML file, 'foodmenu.xml', in a text editor.

Notice that rather than being blocks of text, it looks more like what we might call a data file, just as the XML file you created from your XHTML document is now more like a data file than a document, since you moved all the formatting into a style sheet.
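If you don't have the file to hand, the sketch below gives an idea of its shape. The element names are the ones this lesson's XSL examples select on ('breakfast-menu', 'food', 'name', 'price', 'description', 'calories'); the values are invented and the real file may well differ in detail.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<breakfast-menu>
  <!-- One 'food' record per menu item; all values here are illustrative -->
  <food>
    <name>Porridge</name>
    <price>2.50</price>
    <description>Oats cooked in milk, served with honey.</description>
    <calories>350</calories>
  </food>
  <food>
    <name>Toast and jam</name>
    <price>1.80</price>
    <description>Two slices of toast with strawberry jam.</description>
    <calories>290</calories>
  </food>
</breakfast-menu>
```

Each 'food' element is effectively a record and each child element a field, which is why the file reads like data rather than a document.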

Create a CSS style sheet for 'foodmenu.xml' and attach it to the file.


Part 9 - Transforming XML

There is an XML technology for transforming XML files into XHTML called 'Extensible Stylesheet Language Transformations' (XSLT), which works by applying an 'XSL' style sheet to an XML file. There are basically two ways of writing an XSL style sheet. One is to put all the XHTML into a single template, with values of content being inserted from the XML file. The other (which is the preferred one) applies a series of templates. Previously no template was required, but Office 2003 requires at least one.

Exercise 6 - XSL

Take a look at this version of the XML file in a text editor.

Note that the second line (or maybe first depending on what you are using) is an XSL style sheet processing instruction similar to the CSS processing instruction in Exercise 2.

Internet Explorer, Firefox and Navigator 7 all support XSL, so now look at 'foodmenu2.xml' in one of those and notice that the XSL style sheet is applied.

Note

Many books show the style sheet processing instruction (for both CSS and XSL) as:

<?xml:stylesheet href="styletrans.xsl" type="text/xsl"?>

which doesn't always work, particularly with XSL style sheets. The ':' (colon) in 'xml:stylesheet' has to be a hyphen, giving 'xml-stylesheet'.
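For comparison, the top of a file using the hyphenated form might look like the sketch below. The style sheet name 'foodmenu.xsl' is the one used in this lesson; the body of the file is cut down to a single invented record.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Note 'xml-stylesheet': one word, with a hyphen rather than a colon -->
<?xml-stylesheet href="foodmenu.xsl" type="text/xsl"?>
<breakfast-menu>
  <food>
    <name>Porridge</name>
    <price>2.50</price>
    <description>Oats cooked in milk, served with honey.</description>
    <calories>350</calories>
  </food>
</breakfast-menu>
```

The same instruction attaches a CSS style sheet if the type is changed to "text/css" and the href points at a CSS file.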

Exercise 6a - Continued

Open the XSL style sheet 'foodmenu.xsl' in a text editor.

Most of the file looks like an XHTML file, but with a few 'xsl' tags.

The tag:

< xsl:for-each select="breakfast-menu/food">

- does pretty much what it says, and causes the XSLT processor to select each of the 'food' nodes in turn. Within each 'food' node, the values of each field (name, price, description and calories) are then inserted into the page by the four tags:

< xsl:value-of select="name" />
< xsl:value-of select="price" />
< xsl:value-of select="description" />
< xsl:value-of select="calories" />
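Put together, a single-template style sheet along these lines might look like the following. This is a sketch rather than a copy of 'foodmenu.xsl': the page title, the ' - ' separators and the exact arrangement of the 'firstline' and 'secondline' DIVs are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- The single template: matches the document root and builds the page -->
  <xsl:template match="/">
    <html>
      <head>
        <title>Food menu</title>
      </head>
      <body>
        <!-- Repeat the enclosed mark-up once for every 'food' node -->
        <xsl:for-each select="breakfast-menu/food">
          <div class="firstline">
            <xsl:value-of select="name" />
            <xsl:text> - </xsl:text>
            <xsl:value-of select="price" />
          </div>
          <div class="secondline">
            <xsl:value-of select="description" />
            <xsl:text> - </xsl:text>
            <xsl:value-of select="calories" />
          </div>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The 'xsl:text' elements are there because white space on its own between two 'xsl' tags is dropped, so without them the name and price would run together.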

Convert your original HTML file (or course.htm) into an XSL style sheet.

Now attach it to the XML file you created in Exercise 3 and view it in Firefox or Internet Explorer.

Exercise 7 - Another XSL with templates

Open the XSL style sheet 'menutemplates.xsl' in a text editor.

The first 'xsl:template' contains the basic page structure of 'HTML', 'HEAD', 'STYLE' and 'BODY' tags.

The -

< xsl:apply-templates />

tag tells the XSLT processor where to insert the results of applying the other templates.

There are four more templates which will be applied to each matching XML node with the -

 < xsl:value-of select="."/>

tags inserting the corresponding value of the node. The "." expression refers to the current node, and acts much like 'this' in JavaScript.
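The overall shape of such a style sheet can be sketched as below. It is not a copy of 'menutemplates.xsl': the class names and page title are invented, and the 'STYLE' block is left out to keep the example short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- First template: the page skeleton -->
  <xsl:template match="/">
    <html>
      <head>
        <title>Food menu</title>
      </head>
      <body>
        <!-- Results of the templates below are inserted here -->
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>

  <!-- Four more templates, one per field; "." is the node being matched -->
  <xsl:template match="name">
    <div class="name"><xsl:value-of select="." /></div>
  </xsl:template>
  <xsl:template match="price">
    <div class="price"><xsl:value-of select="." /></div>
  </xsl:template>
  <xsl:template match="description">
    <div class="description"><xsl:value-of select="." /></div>
  </xsl:template>
  <xsl:template match="calories">
    <div class="calories"><xsl:value-of select="." /></div>
  </xsl:template>
</xsl:stylesheet>
```

There is no template for 'breakfast-menu' or 'food' here; XSLT's built-in rules simply carry on down through those elements until they reach a node that one of the four templates matches.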

There can be any valid XHTML mark-up in a template, and you might find positioning content easier using 'BR' tags rather than 'DIV' style sheet properties as I have, especially if you're going to try to use the file in Office 2003. Note that I've changed the CSS style sheet, as I need to define properties for each element; I can't have a 'DIV' which spans two elements, as '.firstline' and '.secondline' did in 'foodmenu.xsl'.

Look at the 'foodmenu3.xml' file in a text editor, and notice that it's the same as 'foodmenu2.xml' apart from using a different XSL style sheet ('menutemplates.xsl'). Open the file in Internet Explorer and notice that, hopefully, it doesn't look much different from the result with the style sheet 'foodmenu.xsl'.


Part 10 - Conclusion

That's almost all.

You should be able to open any of the XML files in Word 2003, and in the Task Pane, view 'Data only', or 'Browse...' and apply an XSL style sheet, which Office seems to call a 'Dataview'.

Another XML language is 'Scalable Vector Graphics' (SVG), which uses an XML file to define an image. It's been around for years and, surprisingly, isn't a lot more common. (I don't know if they are, but the graphics in document templates in Office 2003 might be SVGs.)

X3D is an XML 3D graphics format in development. (Wonder if Longhorn uses X3D.)

XForms is a way of specifying form controls, which is probably what Office 2003 InfoPath uses, and is to be more of a feature of Office 12.

There are some nice, simple free XML editors available and some very complicated expensive ones.

Here is a set of files for a Book Shop for you to try in Office.

bookStore.xml - booksTable.xsl - bookSchema.xsd