Colin's Journal

Colin's Journal: A place for thoughts about politics, software, and daily life.

October 13th, 2003

Writing content for web pages

Last week I wrote a short article on the importance of a template based solution for web page maintenance, and the sort of innovations that could be made to ease template design. At the end of that article I noted two other problems with web publication tools today: markup of the content, and the handling of non-journal style pages. This article addresses some thoughts I’ve had on the first of these two problems.

The most popular template based systems today are those provided by blogging software. They allow an author to enter new web content either using a web browser (thin client) or a small application (fat client). When the author decides to include a link, make some text bold, or apply some other markup the most common solution is to have them enter HTML codes.

Alternatives to entering the HTML manually include utilising IE specific enhanced textarea widgets, using a different markup language such as Textile, or providing buttons that automatically insert the HTML tags.

The markup in which an author’s content is written, and the markup in which it is published, must be treated separately, even if they happen to be the same. The reason for this is that an evolving web also means evolving markup for web pages, for example the transistion between HTML4 and XHTML1. When an author of a site chooses to move their pages from HTML to XHTML the software they use needs to be able to rebuild old pages using XHTML.

For software to be able to perform transformations of markup from one language to another it needs to be able to parse the original markup perfectly. If the original markup is HTML this poses two problems: writing and parsing correct HTML programmatically is fairly difficult, and if users enter markup by hand then there will be errors in it. The inability to convert cleanly to a new publishing markup language is a major defect in all of the blogging tools today that store and accept content using HTML markup. It is a hole that can be coded out of, but never in a 100% satisfactory way.

The solution to this problem requires a combination of three things:

  1. The adoption of a strict, easy to parse, format for authoring content.
  2. The rejection of any content which does not adhere completely to this strict format.
  3. The use of tools, rather than users, to generate this strict format.

The critical piece missing today of these three items is the third one: a GUI tool that allows the markup of web content in a strict, easy to parse, format. The bare minimum that such a tool should be able to support includes: links, text decoration (bold, italic, etc), lists (bullet and numeric), and images. There are lots of other types of markup which would be very useful (e.g. tables), but for most web content this limited list would suffice. Today there are many weblog authors who have tools and knowledge such that they don’t use the most basic of markup in their content. A GUI application supporting these features, and whoes output is in a strict format, would be enough to bring painless, sustainable content authoring to a much wider audiance.

Writing such a tool, while not technically difficult, does take time and effort. I hope to one day soon find an open source tool to do this. In the meantime however I have a partial solution: AbiWord.

AbiWord is an open source word processor. A word processor isn’t really the best choice of tool for editing web content, simply beacuse it has too many features that are not needed or do not apply to the web. For example AbiWord supports Mail Merge, multiple document sections with different headers and footers on the pages, and other such features that are needed for document creation, but not for editing web content.

Despite these drawbacks the use of AbiWord does bring some significant advantages:

  • It is a full GUI: the user has no opportunity or need to edit markup themselves.
  • It has many useful features such as spell check as you type.
  • The output format is in XML, making it fairly easy to convert into HTML/XHTML or any other markup language.

To see whether or not this can work I’ve written a plugin for PubTal which takes AbiWord documents, converts it to HTML markup, and then publishes it using PubTal templates. There is still much testing to be done, but it now handles: headings, text decoration (bold, italic, underline, strikeout, overline, superscript, subscript), pre-formated text, hyperlinks, bookmarks (anchor’s), bullet lists, numeric lists, footnotes/endnotes, and tables.

The biggest missing feature is the ability to include images in the content. The problem here is that AbiWord doesn’t record the original location of the image file – it just places the binary content (encoded using base64) into the XML file. I can probably live with that restriction for most pages, at least until I can find a better solution.

Comments are closed.

October 13th, 2003

Spam in blogs

Spam in blog comments was always inevitable because it brings two benefits to spammers:

  1. It gets lots of people seeing their message (in the same way as Usenet spam).
  2. It spams search engines into giving their website a higher ranking.

As is clear from the discussion on Making Light it is a loosing battle to try and block comment spammers based on their IP addresses.

I’m currently thinking that there are two likely approaches to blocking this kind of spam that might stand a chance. The first approach is to show an image of a random letter in a hard to OCR font, and then asking the user to enter the letter (or series of letters) into the form with their comment. This is used on several large sites today, but I don’t know how effective it actually is.

The second approach would be to apply statistical filtering to comments in the same way as it is used for email. This approach has been very successful in reducing email spam getting into in-boxes as can be seen by the technique’s continued roll-out. It seems like an easy enough extension to apply this kind of filtering to comments in weblogs.

I’m sure we’ll hear a lot more about weblog comment spam as time goes on.

Comments are closed.

Copyright 2015 Colin Stewart

Email: colin at owlfish.com