Colin's Journal

Colin's Journal: A place for thoughts about politics, software, and daily life.

March 31st, 2003

Unicode and databases

I updated my RSS Aggregator this weekend to make it distinguish between changes to posts and new posts. Originally it would compare the title and description of every RSS item that it read in with those already in the database (via a checksum for performance reasons). A problem I kept encountering was that some items would be updated several times after they first appeared, and so my aggregator would treat them as new posts.

Now I use the <guid> element if it is present to distinguish unique items, or if these are not present I use the title and link of the items. If the description of the item has updated since the last time it was read, I update the version in my database, but leave the date of discovery the same so that the reverse chronological ordering isn’t affected. While doing this I encountered a problem when pulling data out of MySQL.

The problem is that the python module I use to access mysql (MySQL for Python), while happy to accept Unicode strings as parameters, will present any data retrieved from the database as a plain string. When doing a comparison between the Unicode extracted from the RSS feed and the results from the database query Python attempted to convert the string to Unicode, treating it as ASCII, which would cause an error if it contained latin1 characters.

Unfortunately MySQL doesn’t seem to support the storage of Unicode (certainly not at version 3.23.49), you have to store your strings in a particular character set. This will work fine for myself (latin1 will cover everything I need), but I can’t see how it would work if you subscribed to two RSS feeds, say one in big5 and one in latin1. The documentation for version 4.1 states that it adds “Extensive Unicode (UTF8) support.”, so hopefully once it makes it into Debian stable this problem will go away…

Comments are closed.

Copyright 2015 Colin Stewart

Email: colin at owlfish.com