Colin's Journal

Colin's Journal: A place for thoughts about politics, software, and daily life.

June 26th, 2003

Broken HTML

I feel like I’ve been catching up ever since last weekend, and now another one is almost upon us. This time it’ll be different: with a national holiday on Tuesday and a day off on Monday, it’s going to be an extra-long weekend.

The combination of Harry Potter, various birthdays (including my own), parties, and work seems to have consumed most of my week. The latest distraction was implementing a workaround for broken HTML in my RSS aggregator.

My aggregator strips out HTML from RSS descriptions for a variety of reasons. Rendering HTML delivered via RSS is a security problem, is unreliable, and almost certainly means that the resulting web page will not be valid. To solve this I strip out all HTML tags and just display the plain text.
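As a rough illustration of parser-based tag stripping, here is a minimal sketch. It is not the TALAggregator code: it uses the `html.parser` module from current Python rather than the SGML parser, and the names `TextOnly` and `strip_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of plain text found between tags.
        self.parts.append(data)


def strip_tags(description):
    """Return the plain text of an RSS description (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(description)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Hello <b>world</b></p>"))
```

Because only the text is kept, none of the markup delivered by the feed ever reaches the rendered page.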

This worked well for many months, until the past couple of days. The problem is that a feed I subscribe to contains severely broken HTML. The author has entered some HTML comments in their RSS feed (sigh), only instead of using <!-- to start the comment they have put <! //

This was causing TALAggregator to log exceptions when trying to parse the feed, resulting in an email to me every 24 hours informing me that it was having difficulties. The solution I’ve used is to abandon using the standard SGML parser that comes with Python, and instead resort to some regular expressions.
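A regular expression approach along these lines might look like the sketch below. The pattern, the name `strip_markup`, and the sample description (including how the broken comment ends) are assumptions for illustration, not the actual expressions used in TALAggregator.

```python
import re

# Treat everything from "<" up to the next ">" as markup, whether or not
# it is well formed. This also swallows a malformed comment opener such
# as "<! //", as long as a ">" eventually follows it.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(description):
    """Remove tag-like spans and tidy up whitespace (hypothetical helper)."""
    text = TAG_RE.sub("", description)
    # Collapse the runs of whitespace left behind by removed markup.
    return " ".join(text.split())


# Made-up description containing a broken comment of the kind described.
sample = "Intro <! // hidden note //> and <b>bold</b> text"
print(strip_markup(sample))
```

The trade-off is that the pattern knows nothing about HTML: a comment containing a ">" would be cut short and leak its remaining text, and a "<" with no closing ">" is left in place. In exchange, it cannot raise a parse error on malformed input.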

Hopefully I can now turn my attention to something more interesting…


Copyright 2015 Colin Stewart

Email: colin at owlfish.com