Colin's Journal: A place for thoughts about politics, software, and daily life.
I’ve finished packaging together my RSS Aggregator. It’s at a point where you can use it on an everyday basis without hacking code or fiddling with the database.
I’m releasing it on the off chance that someone else might need software that does a similar sort of thing; it would be a shame for two people to have to write it!
If you are curious as to what it looks like, here’s a screen shot of my “recently updated articles” page.
Deploying a LAMP (Linux, Apache, MySQL, Python/Perl) application is difficult. I’ve just put together the briefest description of how to install my web-based, multi-user, RSS aggregation application – and frankly it requires a Unix administrator to do it. I knew it would be difficult (I wrote this for myself, and I’m just planning on releasing it on the off chance that someone else might want/need a similar thing), but when you finally write a document which describes the steps, it’s driven home.
For a start there are eight different software packages that it depends on, although it’s a fair guess that four of them are installed by the distribution of Linux you are using (in theory this is cross platform, but that’s just one complication too far). Then there is database creation, schema creation, basic configuration data setup, the apache configuration, and finally the application configuration. Then you can log-in to the system and start using it…
I see that there is going to be an attempt to stop the worst of the flooding that happens to Venice. I don’t know enough about the politics and plans surrounding this to comment on the significance of this particular announcement, but it does raise a thought. I wonder how many other dynamic flood defences like this exist in Europe? I know about the Thames Barrier, but there are probably others…
Firstly, it should be made clear that BitTorrent itself is not a piracy tool. It has many perfectly legitimate uses for transferring large files whose author has given permission for such free distribution. Having said that, there do appear to be many easily accessible sites, such as this and this, that are hosting the information required to get access to TV series, films, and music which cannot be legally distributed freely.
These sites only hold the .torrent files, which, as I explained in an earlier post, do not actually contain the copyrighted material. They instead point to a central server, which in turn keeps track of those IP addresses that are involved in distributing the material. It’s surprising that these sites have not been taken down yet. They are not hard to find, and while not many people have the time or bandwidth to download ~1GB files, the number who can is growing steadily.
It’s possible that, if the owners of one of these sites actually had the money available to take such a matter to court, there would be some countries where the hosting of these .torrent files would be found to be legal. They do not after all tell you directly where copyrighted material can be found, they simply point to an IP address that in turn lists people who do have such material. In most places this argument would probably fail, but you only really need one or two jurisdictions in which it’s legal to host these files, and they will continue to be available.
Those running trackers are far more vulnerable, as they are the closest thing to the central server used by Napster. The major difference is that while Napster had one central location that everyone knew about, with BitTorrent you can have many different trackers managing different or overlapping sets of files.
This means that while individual legal victories might be had at any level of the BitTorrent architecture (torrent hosts, trackers, or peer-to-peer clients), it would be very hard to stop the distribution of copyrighted material this way. However, taking action against the torrent hosts would slow down the spread of such material, pushing the location of .torrents underground onto IRC and other such networks. Ensuring that getting the material is more difficult than a search on Google would be at least a tactical victory for those trying to suppress the free distribution of copyrighted material.
Both Sunday evening and last night were spent playing various Cheap Ass games, and on the off chance that you have never heard of them before, be assured that they are great fun indeed. One of the new ones that we picked up at Ad Astra is Witch Trial, a fun card game with significant gambling elements, and a touch of role play to keep things interesting.
The premise of the game is that you are a lawyer during the witch trials in the US, and you are out to make money by prosecuting and defending cases. The play is varied enough that I think we’ll come back to playing it many times again in the future, joining Kill Doctor Lucky as a classic.
I’ve heard about BitTorrent before, but it was only today that I saw a great example of how it can change the nature of distribution of large files on the Internet. Red Hat released ISOs of version 9 of their Linux distribution to paying subscribers, and someone (legally) made them available through BitTorrent and announced their availability on slashdot. The result was that people could get hold of the ISOs through the peer-to-peer swarm more quickly than they could through the overloaded FTP site.
The reason the peer-to-peer network was faster is down to the way BitTorrent works, which is that each downloading client also becomes a provider of the file. A major strength of BitTorrent is that the downloading client doesn’t have to complete the download before it can offer uploads: whatever portions have already been downloaded are made available for upload to others in the swarm that might need them.
The architecture consists of three main components. The .torrent file contains a description of the file (or directory) that is to be downloaded, including name, file size, and a secure hash of each chunk of the file. It additionally contains the URL of a BitTorrent tracker.
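As a rough illustration of what that description holds, here is a minimal Python sketch that builds the same kind of metadata in memory. The key names follow the BitTorrent metainfo format and the per-chunk hash is SHA-1, but the function names, the tracker URL, and the tiny piece size are my own assumptions for the example, and the sketch skips the bencoding step that a real .torrent file needs.

```python
import hashlib


def piece_hashes(data, piece_length):
    """Return the SHA-1 digest of each fixed-size piece of data."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


def describe(name, data, tracker_url, piece_length):
    """Build an in-memory description of one file (not bencoded)."""
    return {
        "announce": tracker_url,  # where clients find the tracker
        "info": {
            "name": name,
            "length": len(data),
            "piece length": piece_length,
            # All 20-byte digests concatenated, one per piece.
            "pieces": b"".join(piece_hashes(data, piece_length)),
        },
    }


def verify_piece(metadata, index, piece):
    """Check a downloaded piece against the digest in the description."""
    digests = metadata["info"]["pieces"]
    expected = digests[index * 20:(index + 1) * 20]
    return len(expected) == 20 and hashlib.sha1(piece).digest() == expected


# Hypothetical file contents and tracker URL, for illustration only.
meta = describe("example.iso", b"abcdefghij",
                "http://tracker.example/announce", piece_length=4)
```

Because every chunk has its own hash, a client can check each piece as it arrives, which is what makes it safe to accept pieces from strangers in the swarm.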
The tracker maintains a list of peers currently involved in transferring a particular file (or directory), as well as some stats around what each peer is up to. The client, after parsing the .torrent file, connects to the tracker and gets the list of peers in the swarm. The client then contacts peers from this list directly, offering up portions of the file that the client already has, and asking for portions that it requires.
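The exchange above can be sketched as a toy model in Python. This is not the real tracker protocol (which runs over HTTP and carries the stats mentioned above); the class, function names, and peer addresses are hypothetical, and the point is only to show who talks to whom.

```python
class ToyTracker:
    """Keeps, per torrent, the set of peers currently in the swarm."""

    def __init__(self):
        self._swarms = {}  # torrent id -> set of peer addresses

    def announce(self, torrent_id, peer):
        """Register a peer and return the other peers already known."""
        swarm = self._swarms.setdefault(torrent_id, set())
        others = sorted(swarm - {peer})
        swarm.add(peer)
        return others


def pieces_to_request(have, peer_has):
    """Pieces the other peer can offer that this client still needs."""
    return sorted(set(peer_has) - set(have))


tracker = ToyTracker()
tracker.announce("redhat9-disc1", "10.0.0.1:6881")
peers = tracker.announce("redhat9-disc1", "10.0.0.2:6881")
# The second client now contacts the first peer directly and asks
# only for the piece numbers it is missing.
wanted = pieces_to_request(have={0, 1}, peer_has={1, 2, 3})
```

Note that the tracker never touches the file itself; it only introduces peers to each other.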
There is some load balancing to ensure that clients are uploading their fair share: you get faster downloads the more bandwidth you can provide on upload, and multiple downloads are performed at once (so enabling modem users to make a real contribution of bandwidth even to those on broadband connections).
It’s an excellent way of distributing large files without having to foot a huge bandwidth bill.
I updated my RSS Aggregator this weekend to make it distinguish between changes to posts and new posts. Originally it would compare the title and description of every RSS item that it read in with those already in the database (via a checksum for performance reasons). A problem I kept encountering was that some items would be updated several times after they first appeared, and so my aggregator would treat them as new posts.
Now I use the <guid> element if it is present to distinguish unique items, or if it is not present I use the title and link of the item. If the description of the item has been updated since the last time it was read, I update the version in my database, but leave the date of discovery the same so that the reverse chronological ordering isn’t affected. While doing this I encountered a problem when pulling data out of MySQL.
The problem is that the Python module I use to access MySQL (MySQL for Python), while happy to accept Unicode strings as parameters, will present any data retrieved from the database as a plain string. When doing a comparison between the Unicode extracted from the RSS feed and the results from the database query, Python attempted to convert the string to Unicode, treating it as ASCII, which would cause an error if it contained latin1 characters.
Unfortunately MySQL doesn’t seem to support the storage of Unicode (certainly not at version 3.23.49); you have to store your strings in a particular character set. This will work fine for me (latin1 will cover everything I need), but I can’t see how it would work if you subscribed to two RSS feeds, say one in big5 and one in latin1. The documentation for version 4.1 states that it adds “Extensive Unicode (UTF8) support.”, so hopefully once it makes it into Debian stable this problem will go away…
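A minimal sketch of that rule in Python follows. It is not the aggregator’s actual code: a plain dict stands in for the database, the function names are made up for the example, and comparing descriptions by SHA-256 checksum is an assumption carried over from the earlier checksum approach.

```python
import hashlib


def item_key(item):
    """Identify an item by its guid, or by title and link without one."""
    guid = item.get("guid")
    if guid:
        return ("guid", guid)
    return ("title+link", item.get("title", ""), item.get("link", ""))


def checksum(text):
    """Short fingerprint of the description, cheap to compare."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_item(db, item, now):
    """Insert or update an item; return 'new', 'updated' or 'unchanged'."""
    key = item_key(item)
    description = item.get("description", "")
    digest = checksum(description)
    existing = db.get(key)
    if existing is None:
        db[key] = {"discovered": now, "checksum": digest,
                   "description": description}
        return "new"
    if existing["checksum"] != digest:
        # Refresh the content but keep the original discovery date,
        # so the reverse chronological ordering does not change.
        existing["description"] = description
        existing["checksum"] = digest
        return "updated"
    return "unchanged"
```

An edited post therefore keeps its place in the listing instead of jumping back to the top as if it were new.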
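The fix amounts to decoding the database value with the character set it was stored in before comparing. Here is a small sketch in present-day Python, where the plain string from the database corresponds to a bytes value; the helper name and the sample title are assumptions made for the example.

```python
def to_text(value, charset="latin-1"):
    """Decode a byte string from the database; pass real text through."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


feed_title = "Caf\u00e9"                 # Unicode text parsed from the feed
db_title = feed_title.encode("latin-1")  # what the database hands back

# Treating the stored bytes as ASCII fails on the accented character.
try:
    db_title.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the right character set makes the two values comparable.
same = to_text(db_title) == feed_title
```

Doing the decoding in one helper at the point where rows leave the database keeps the rest of the code working purely with Unicode.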
I carry bad news with these words. Citron, our favourite restaurant in Toronto, has passed away. It is no more, replaced physically but not culinarily by a third version of the Butler’s Pantry. Its friendly staff, great selection of new world and fusion dishes, and delicious desserts will be most sorely missed. Citron was a great little restaurant for spending the entire evening in, relaxing with a bottle of wine and conversing over great food, with no worry about the passing of time. They updated their menu a few times during the year with the passing of the seasons, making it hard to tire of their offerings as it is so easy to do with favourite eateries.
The spread of SARS is being particularly felt in Ontario this week. We have had the request for voluntary quarantine of all those that had visited Toronto’s Scarborough Grace Hospital on or after the 16th of March. It’s estimated that this will affect thousands, although how many will actually place themselves into quarantine for ten days is questionable. In fact it seems like the perfect cover for 10 days sick leave – “Boss, I got this tan in quarantine!”.
Today it’s been announced that starting this weekend there will be screening of passengers at the airport to try and limit the further import and export of the disease. The total number of cases around the world, broken down by country, and other interesting information can be found at the WHO site. Currently it stands at 53 deaths and 1485 total cases.
It’s been a quiet start to the week, with work taking up most of my time. There are a few eager skaters around which is encouraging, and I’m thinking it’s nearly time for me to dust mine off and try and remember how to use them.
I seem to have got my aggregator working OK at this point, so at some point in the near future I’ll add some configuration options to it and look at releasing it to the world. I’m not sure how much the world will care, but someone somewhere might find it useful.
I’ve started reading the Tesseracts, a collection of short science fiction stories by Canadian authors, that I received at Ad Astra (see Friday). I’m travelling back in science fiction time, having already read (most of) the fourth collection of the series, which I got last year. We now have the third collection as well, with just the second to acquire at some point, maybe next year.
The first story is a variation on the Blade Runner world, not badly written, but not particularly interesting. The second is hard to describe, but I’ll try anyway. It is set in a far future with human immortality, the inability to reproduce, a decaying society, an automated baby factory, some off-worlders of indeterminate species, and finally the development of warrior children by encouraging them to fight to the death over Christmas. I doubt I’ve given the plot away somehow…
Email: colin at owlfish.com