Colin's Journal: A place for thoughts about politics, software, and daily life.
It’s been a while since I’ve taken my camera out and done any photography, and that’s part of the reason why I’ve not updated this web journal over the last two weeks. The other reason for lack of updates is that I’ve been refining the software that tries to answer the question asked above: How many people are reading this?
Every time a web server receives a request for a URL from a client (web browser, search engine, RSS aggregator, etc.) it logs the event into a file. By analysing the web server log file it's possible to approximate how many people loaded a particular web page, which hopefully gives an indication of how many people read it.
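To make the idea concrete, here is a minimal sketch of pulling the useful fields out of one log line. It assumes the Apache-style "combined" log format (IP address, timestamp, request line, status, referrer, user agent); the actual format this server uses may differ.

```python
import re

# Fields of one Apache combined-format log entry, captured as named groups.
# The format itself is an assumption, not necessarily this server's setup.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log entry, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_line(
    '192.0.2.1 - - [10/Oct/2004:13:55:36 -0400] '
    '"GET /journal/ HTTP/1.1" 200 2326 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)"'
)
```

Once each line is broken into fields like this, the interesting questions below come down to what you do with the `agent` and `ip` values.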
Determining the number of people, versus the number of search engines or other robots, requesting your web page is very difficult. Each log file entry contains the user agent string sent by the client making the request. Most web browsers provide their own distinct user agent string, enabling you to determine whether the request was made by Firefox, IE, Safari, or some other web browser. Unfortunately, IE's user agent string includes extensions, e.g. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)" means IE 6.0 with the .NET framework installed. The range of possible user agents for IE is huge, and so pattern matching is the only practical way to determine whether or not a request came from IE.
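The pattern-matching idea can be sketched as follows: any user agent claiming "MSIE <version>" inside a "Mozilla/4.0 (compatible; ...)" wrapper is treated as IE, regardless of which extension tokens (.NET CLR and so on) follow. This is an illustration of the technique, not the exact pattern used here.

```python
import re

# Any "Mozilla/x.y (compatible; MSIE n.n; ..." string counts as IE;
# the trailing extension tokens are deliberately ignored.
IE_PATTERN = re.compile(r'Mozilla/\d\.\d \(compatible; MSIE (\d+\.\d+);')

def ie_version(user_agent):
    """Return the claimed IE version, or None if the string isn't IE-shaped."""
    match = IE_PATTERN.search(user_agent)
    return match.group(1) if match else None
```

One version-capturing pattern covers the whole family of IE strings, which is why pattern matching beats maintaining an exhaustive list.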
Good search engines (such as Google, which uses "Googlebot/2.1 (+http://www.google.com/bot.html)") provide user-agents that look nothing like IE's and so are very easy to tell apart. Others, such as the "Girafabot", deliberately use user-agents that are easy to confuse with IE, such as "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http://www.girafa.com)".
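This is why any robot check has to run before the browser check: the Girafabot string matches an IE-shaped pattern, so testing for known robot tokens first avoids counting it as a reader. The token list below is purely illustrative, assuming a small hand-maintained list of known robots.

```python
# Hypothetical list of substrings that identify known robots.
# A real list would be much longer and maintained over time.
ROBOT_TOKENS = ("googlebot", "girafabot", "spider", "crawler")

def is_robot(user_agent):
    """True if the user agent contains any known robot token."""
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)
```

A request is only classified as a browser if `is_robot` says no first; ordering the checks the other way round would let IE-impersonating robots through.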
Even if the user-agent does match a known web browser this doesn’t necessarily mean that it really was that web browser that sent the request – a user can change their browser’s user agent to be anything they like. The only saving grace here is that the vast majority of users don’t bother as there isn’t really any point to changing it.
Having decided which requests are legitimately from a web browser rather than a robot, the next challenge is to determine what counts as a unique page visit. If I reload the web page within 5 minutes, it will generate two requests for the page and two entries in the log file. I probably want to count such reloads as a single page visit, up to a cut-off point (say 2 hours).
Detecting whether the same client has re-requested the page is fairly easy because the IP address is included in the information logged by the web server, and is unlikely to change between requests. Writing software that counts unique requests in a scalable manner is a challenge, because each request within the last 2 hours for every URL has to be remembered.
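The de-duplication step above can be sketched like this, assuming requests arrive in timestamp order (as they do when reading a log file) and using a 2-hour window. Keeping one entry per (IP, URL) pair is exactly the memory cost mentioned above; in a real implementation, entries older than the window could be pruned periodically to keep that cost bounded.

```python
WINDOW = 2 * 60 * 60  # cut-off point: two hours, in seconds

def count_unique_visits(requests):
    """Count unique page visits.

    requests: iterable of (timestamp, ip, url) tuples, sorted by timestamp.
    A request is a new visit if this (ip, url) pair hasn't been counted
    within the last WINDOW seconds; reloads inside the window are ignored.
    """
    last_counted = {}  # (ip, url) -> timestamp of the last counted visit
    visits = 0
    for timestamp, ip, url in requests:
        key = (ip, url)
        previous = last_counted.get(key)
        if previous is None or timestamp - previous > WINDOW:
            visits += 1
            last_counted[key] = timestamp
    return visits

count_unique_visits([
    (0, "192.0.2.1", "/journal/"),     # first request: counted
    (300, "192.0.2.1", "/journal/"),   # reload 5 minutes later: not counted
    (600, "192.0.2.2", "/journal/"),   # different client: counted
])
```

Two requests from the same IP for the same URL collapse into one visit unless more than two hours separate them, at which point a fresh visit is counted.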
Photo: Seaweed on the beach at Kejimkujik Seaside Adjunct, Nova Scotia.
Email: colin at owlfish.com