Colin's Journal: A place for thoughts about politics, software, and daily life.
In the course of adding image support to PubTal’s OpenOffice converter, I noticed that the HTML it was generating was not always valid, and so I set about trying to fix it.
OpenOffice is a huge application with a wide range of features, and it has a correspondingly large file format. The specification is 571 pages long, and the book on the subject is inaccurate. The book appears to have been written by looking at the output of the program, rather than the excellent (but large) DTDs.
I’ve not had the time, nor the motivation, to write something that would handle the whole format. With OpenOffice using an XML format, I could however pluck out a few basic things that I could easily convert to HTML. The problem with this approach is that supported XML structures can appear in unexpected places within a file. This meant that several assumptions made by the conversion code, such as text:p never being nested, turned out to be wrong under hard-to-predict circumstances.
To correct this I’ve added a filter to the OpenOffice plugin. This filter silently blocks any XML structures that are not explicitly supported, and passes only the supported ones through to the conversion code. To make this useful I’ve had to trawl through the conversion code in conjunction with the DTDs, and work out exactly which XML fragments I can support.
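As a rough illustration of the idea, and not the actual PubTal code, here is a minimal sketch of such a filter written as a SAX content handler. The SUPPORTED table, the class names and the filter_events helper are all made up for the example, and it assumes that a blocked element takes its whole subtree (text included) with it.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical allow-list: each supported element maps to the set of
# parents it may appear in (None means "document root").
SUPPORTED = {
    "office:body": {None},
    "text:p": {"office:body"},
    "text:span": {"text:p"},
}


class AllowListFilter(ContentHandler):
    """Pass supported structures downstream, silently drop everything else."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream
        self.open_elements = []  # supported elements currently open
        self.blocked_depth = 0   # > 0 while inside a blocked subtree

    def startElement(self, name, attrs):
        parent = self.open_elements[-1] if self.open_elements else None
        if self.blocked_depth or parent not in SUPPORTED.get(name, ()):
            self.blocked_depth += 1
            return
        self.open_elements.append(name)
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        if self.blocked_depth:
            self.blocked_depth -= 1
            return
        self.open_elements.pop()
        self.downstream.endElement(name)

    def characters(self, content):
        if not self.blocked_depth:
            self.downstream.characters(content)


class Recorder(ContentHandler):
    """Stand-in for the conversion code: records the events it is given."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def filter_events(xml_text):
    """Parse xml_text and return the events that survive the filter."""
    recorder = Recorder()
    xml.sax.parseString(xml_text.encode("utf-8"), AllowListFilter(recorder))
    return recorder.events
```

With this table a text:p nested inside another text:p never reaches the recorder, so the conversion code can go on assuming that paragraphs are not nested.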
This reduces the chances of the code producing bad HTML, but it doesn’t eliminate the problem. The conversion code is modular, which means that one part might accidentally produce HTML that combines in an invalid way with the output of a different part. To solve this half of the problem I’ve written another filter, applied to the output of the conversion code.
This HTML filter increases the chances of valid HTML being written, by keeping track of what elements are valid within other elements. Ideally it would do full validation against a relevant DTD, but that seems like too much work, and would probably impose too much processing overhead.
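A minimal sketch of that kind of bookkeeping, again not the real PubTal filter: the ALLOWED_CHILDREN table is a hypothetical and heavily abbreviated stand-in for the rules in a DTD, and dropping a misplaced tag while keeping its text is just one possible policy, chosen here for simplicity.

```python
from html import escape

# Hypothetical, heavily abbreviated table of which elements may appear
# directly inside which others (None is the top level of the fragment).
ALLOWED_CHILDREN = {
    None: {"p", "ul"},
    "p": {"em", "strong"},
    "ul": {"li"},
    "li": {"p", "em", "strong"},
    "em": {"strong"},
    "strong": {"em"},
}


class NestingFilter:
    """Write tags only where the table says they are valid."""

    def __init__(self):
        self.output = []
        self.stack = []  # (name, emitted) for every element still open

    def _current_parent(self):
        # The nearest open element that was actually written out.
        for name, emitted in reversed(self.stack):
            if emitted:
                return name
        return None

    def start(self, name):
        allowed = name in ALLOWED_CHILDREN.get(self._current_parent(), set())
        self.stack.append((name, allowed))
        if allowed:
            self.output.append(f"<{name}>")

    def end(self, name):
        open_name, emitted = self.stack.pop()
        if open_name != name:
            raise ValueError(f"expected </{open_name}>, got </{name}>")
        if emitted:
            self.output.append(f"</{name}>")

    def text(self, data):
        # Text is always kept, even when the tag around it was dropped.
        self.output.append(escape(data))

    def result(self):
        return "".join(self.output)
```

Each lookup only checks a child against its immediate parent, which is far cheaper than validating against a full DTD but also misses things a validator would catch, such as required children or attributes.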
I’m fairly certain that the combination of these improvements will result in only valid HTML or XHTML being produced, but I can’t be sure without significantly more work.
At least the code now handles images.
Email: colin at owlfish.com