Too Busy For Words - The PaulWay Weblog

Wed, 20 Aug 2008

Error Message Hell
If there's one thing anyone that works with computers hates, it's an error message that is misleading or vague. "Syntax Error", "Bad Command Or File Name", "General Protection Fault", and so forth have haunted us for ages; kernel panics, strange reboots, devices that just don't seem to be recognised by the system, and programs mysteriously disappearing likewise. The trend has been to give people more information, and preferably a way to understand what they need to do to fix the problem.

I blog this because I've just been struggling with a problem in Django for the last day or so, and after much experimentation I've finally discovered what the error really means. Django, being written in Python, of course comes with huge backtraces, verbose error messages, and neat formatting of all the data in the hopes that it will give you more to work with when solving your problem. Unfortunately, this error message was both wrong - in that the error it was complaining about was not actually correct - and misleading - in that the real cause of the error was something else entirely.

Django has a urls.py file which defines a set of regular expressions for URLs, and the appropriate action to take when receiving each one. So you can set up r'/poll/(?P<poll_id>\d+)' as a URL, and it will call the associated view function, passing whatever the URL contained at that position as the parameter poll_id. In the spirit of Don't Repeat Yourself, you can also name this URL, for example:

url(r'/poll/(?P<poll_id>\d+)', 'view_poll', name = 'poll_view_one')

And then in your templates you can say:

<a href="{% url poll_view_one poll_id=poll.id %}">{{ poll.name }}</a>

Django will then find the URL with that name, feed the poll ID in at the appropriate place in the expression, and there you are - you don't have to go rewriting all your links when your site structure changes. This, to me, is a great idea.

The problem was that Django was reporting "Reverse for 'portal.address_new_in_street' not found." when the URL was clearly listed in a clearly working urls.py file. Finally, I started playing around with the regular expression to see what would work and what wouldn't. In this case, the pattern was:

new/in/(?P<suburb_id>\d+)/(?P[A-Za-z .'-]+)

When I changed this to:

new/in/(?P<suburb_id>.+)/(?P.+)

It suddenly came good. And then I discovered that the thing being fed into the 'suburb_id' was not a number, but a string. So what that error message really means is "The pattern you tried to use didn't match because of format differences between the parameters and the regular expression." Maybe the idea is that you can have several patterns with the same name, and Django will use the first one whose expression accepts the parameters. Either way, I'll remember this; and hopefully someone else trying to figure out this problem won't butt their head against a wall for a day like I did.
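To make that failure mode concrete, here is a minimal sketch in plain Python of what reversing a named pattern involves. It is not Django's implementation: the URL_PATTERNS table, the reverse helper and the 'street' group name are all invented for illustration, and only 'suburb_id' comes from the post above. The point is that one error message covers both "no such name" and "the arguments don't fit the expression".

```python
import re

# Hypothetical stand-in for a urls.py table: name -> regular expression.
# The group name "street" is an assumption made for this example.
URL_PATTERNS = {
    "address_new_in_street":
        r"^new/in/(?P<suburb_id>\d+)/(?P<street>[A-Za-z .'-]+)$",
}

# Matches simple named groups such as (?P<name>...) with no nested parentheses.
NAMED_GROUP = re.compile(r"\(\?P<(\w+)>[^()]*\)")


class NoReverseMatch(Exception):
    """Raised when no pattern with that name accepts the arguments."""


def reverse(name, **kwargs):
    """Build a URL from a named pattern, or raise NoReverseMatch."""
    pattern = URL_PATTERNS.get(name)
    if pattern is None:
        raise NoReverseMatch(f"Reverse for '{name}' not found.")

    def fill(match):
        group_name = match.group(1)
        if group_name not in kwargs:
            raise NoReverseMatch(f"Reverse for '{name}' not found.")
        return str(kwargs[group_name])

    # Substitute each named group with the supplied value...
    candidate = NAMED_GROUP.sub(fill, pattern.strip("^$"))
    # ...then check that the result still satisfies the original expression.
    if re.match(pattern, candidate) is None:
        raise NoReverseMatch(f"Reverse for '{name}' not found.")
    return candidate
```

Calling reverse("address_new_in_street", suburb_id="Ainslie", street="High St") raises the same "not found" message as a misspelt name would, even though the pattern is listed; only the \d+ group rejected the value.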

posted at: 16:13 | path: /tech/web | permanent link to this entry

Tue, 29 Jul 2008

Django 101
At work I've started working on a portal written in Python using the Django framework. And I have to say I'm pretty impressed. Django does large quantities of magic to make the model data accessible, the templating language is pretty spiffy (it's about on a par with ClearSilver, which I'm more familiar with - each has bits that the other doesn't do), and the views and URL mapping handling is nice too. I can see this as being a very attractive platform to get into in the future - I'm already considering writing my Set Dance Music Database in it just to see what it can do.

So how do I feel as a Perl programmer writing Python? Pretty good too. There are obvious differences, and traps for new players, but the fact that I can dive into something and fairly quickly be fixing bugs and implementing new features is pretty nice too. Overall, I think that once you get beyond the relatively trivial details of the structure of the code and how variables work and so on, what really makes languages strong is their libraries and interfaces. This to me is where Perl stands out with its overwhelmingly successful CPAN; Python, while slightly less organised from what I've seen so far, still has a similar level of power.

About the only criticism I have is the way the command line option processing is implemented - Python has tried one way (getopt) which is clearly thinking just like a C programmer, and another (optparse) which is more object oriented but is hugely cumbersome to use in its attempt to be flexible. Neither of these holds a candle to Perl's Getopt::Long module.
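For readers who have not met either module, here is a rough side-by-side sketch of the same two options handled with getopt and with optparse. The option names (--verbose, --output) and the function names are made up for the comparison; both modules are in Python's standard library.

```python
import getopt
from optparse import OptionParser


def parse_with_getopt(argv):
    """C-style: a short-option string, a long-option list, then a manual loop."""
    opts, args = getopt.getopt(argv, "vo:", ["verbose", "output="])
    settings = {"verbose": False, "output": None}
    for flag, value in opts:
        if flag in ("-v", "--verbose"):
            settings["verbose"] = True
        elif flag in ("-o", "--output"):
            settings["output"] = value
    return settings, args


def parse_with_optparse(argv):
    """Object-style: declare each option on a parser, then read attributes."""
    parser = OptionParser()
    parser.add_option("-v", "--verbose", action="store_true",
                      dest="verbose", default=False)
    parser.add_option("-o", "--output", dest="output", default=None)
    options, args = parser.parse_args(argv)
    return {"verbose": options.verbose, "output": options.output}, args
```

Both functions take a list such as sys.argv[1:] and return the parsed settings plus the leftover positional arguments.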

posted at: 13:53 | path: /tech/web | permanent link to this entry

Sun, 15 Jun 2008

Common code in ClearSilver 001
I've been using ClearSilver as a template language for my CGI websites in earnest for about half a year now. I decided to rewrite my Set Dance Music Database in it and it's generally been a good thing. Initially, though, I had two problems: it was hard to know exactly what data had been put into the HDF object, and it was a pain to debug template rendering problems by having to upload them to the server (surprisingly, but I think justifiably, I don't run Apache and PostgreSQL on my laptop so as to have a 'production' environment at home).

I solved both problems rather neatly by getting my code to write out the HDF object to a file, rsync'ing that file back to my own machine, and then testing the template locally.

I knew that ClearSilver's Perl library had a 'readFile' method to slurp an HDF file directly into the HDF object, and a quick check of the C library said that it had an equivalent 'writeFile' call. So happily I found that they'd also provided this call in Perl. My 'site library' module provided the $hdf object and a Render function which took a template name; it was relatively simple to write to a file derived from the template name. That way I had a one-to-one correspondence between template file and data file.
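The one-to-one naming idea can be sketched independently of the ClearSilver bindings. The Python below is only an illustration under stated assumptions: it uses a plain nested dict in place of a real HDF object, writes simple "Dotted.Key = value" lines (which is the basic HDF text form, but multi-line values are not handled here), and the function names and the .hdf extension are invented for the example.

```python
import os


def flatten(data, prefix=""):
    """Yield (dotted_key, value) pairs from a nested dict."""
    for key, value in data.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, name)
        else:
            yield name, value


def data_file_for(template_name):
    """Derive the data file name from the template name, one to one."""
    base, _extension = os.path.splitext(template_name)
    return base + ".hdf"


def dump_for_template(data, template_name, directory="."):
    """Write the template's data next to it as 'Key.Sub = value' lines."""
    path = os.path.join(directory, data_file_for(template_name))
    with open(path, "w", encoding="utf-8") as handle:
        for name, value in flatten(data):
            handle.write(f"{name} = {value}\n")
    return path
```

With this naming, a template called set_view.cs would get its data dumped to set_view.hdf in the same directory, ready to be copied back and inspected.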

Then I can run ClearSilver's cstest program to test the template - it takes two parameters, the template file and the HDF file. You either get the page rendered, or a backtrace to where the syntax error in your template occurred. I can also browse through the HDF file - which is just a text file - to work out what data is being sent to the template, which solves the problem of "why isn't that data being shown" fairly quickly.

Another possibility I haven't explored is to run a test suite against the entire site using standard HDF files each time I do a change to make sure there aren't any regressions before uploading.

Hopefully I've piqued a few people's interest in ClearSilver, because I'm going to be talking more about it in upcoming posts.

posted at: 11:10 | path: /tech/web | permanent link to this entry

Tue, 18 Mar 2008

Standard Observations
Simon Rumble mentioned Joel Spolsky's post on web standards and it really is an excellent read. The fundamental point is that as a standard grows, testing any arbitrary device's compliance with it grows harder. For rendering HTML, not only do we have a couple of 'official' standards (HTML 4, XHTML, etc.), but we also have a number of 'de facto' standards - IE 5, IE 5.5, IE 6, IE 7, Firefox, Opera, and so on ad nauseam. For a long time, Microsoft has banked on their desktop monopoly to lever their own de facto standards onto us, but I think they never intended it to be because of bugs in their own software. And now the chickens are coming home to roost, and they're stuck with either being bug-for-bug compatible with their own software (i.e. making it more expensive to produce) or breaking all those old web pages (i.e. making it much more unpopular).

I wonder if there was anyone in Microsoft's Internet Explorer development team around the time they were producing 5.0 who was saying, "No, we can't ship this until it complies with the standard; that way we know we'll have less work to do in the future." If so, I feel doubly sorry for you: you've been proved right, but you're still stuck.

However, this is not a new problem to us software engineers. We've invented various test-based coding methodologies that ensure that the software probably obeys the standard, or at least can be proven to obey some standard (as opposed to being random). We've also seen the nifty XSLT macro that takes the OpenFormula specification and produces an OpenDocument Spreadsheet that tests the formula - I can't find any live links to it but I saved a copy and put it here. So it shouldn't actually be that hard to go through and implement, if not all, then a good portion of the HTML standard as rigorous tests and then use browser scripting to test its actual output. Tell me that someone isn't doing this already.

But the problem isn't really with making software obey the standard - although obviously Microsoft has had some problem with that in the past, and therefore I don't feel we can trust them in the future. The problem is that those pieces of broken software have formed a de facto standard that isn't mapped by a document. In fact, they form several inconsistent and conflicting standards. If you want another problem, it's that people writing web site code to detect browser type in the past have written something like:

if ($browser eq 'IE') {
    if ($version <= 5.0) {
        write_IE_5_0_HTML();
    } elsif ($version <= 5.5) {
        write_IE_5_5_HTML();
    } else {
        write_IE_HTML();
    }
    ...
}

When IE 7 came along and broke new stuff, they added:

    } elsif ($version <= 6.0) {
        write_IE_6_0_HTML();

It doesn't take much of a genius to work out that you can't just assume that this current version is the last version of IE, or that new versions of IE aren't necessarily going to be bug-for-bug compatible with the last version. So really the people writing the websites are to blame.
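One way to avoid that trap, sketched here in Python rather than as anything the sites in question actually did, is to list only the versions whose quirks are known and let everything else, including versions that don't exist yet, fall through to standard output. The table contents and renderer names are placeholders.

```python
# Upper version bound -> renderer for browsers with known quirks.
# Entries must be kept in ascending order of version.
KNOWN_IE_QUIRKS = [
    (5.0, "ie_5_0"),
    (5.5, "ie_5_5"),
    (6.0, "ie_6_0"),
]


def renderer_for(browser, version):
    """Pick a quirks renderer only for versions we know about."""
    if browser == "IE":
        for max_version, renderer in KNOWN_IE_QUIRKS:
            if version <= max_version:
                return renderer
    # Unknown browsers and newer versions get standards-based HTML.
    return "standard"
```

Under this arrangement a new release needs a new table entry only if it turns out to have its own quirks; it never silently inherits the previous version's workarounds.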

Joel doesn't identify Microsoft's correct response in this situation. The reason for this is that we're all small coders reading Joel's blog and we just don't have the power of Microsoft. It should be relatively easy for them to write a program that goes out and checks web sites to see whether they render correctly in IE 8, and then they should work together with the web site owners whose web sites don't render correctly to fix this. Microsoft does a big publicity campaign about how it's cleaning up the web to make sure it's all standards-compliant for its new standards-compliant browser, they call it a big win, and everyone goes back to work without an extra headache. Instead, they're carrying on like it's not their fault that the problem exists in the first place.

Microsoft's talking big about how it's this nice friendly corporate citizen that plays nice these days - let's see it start fixing up some of its past mistakes.

posted at: 22:41 | path: /tech/web | permanent link to this entry

Tue, 29 Jan 2008

Finding Sets Made Easy
I can't believe I only just thought of it. My Set Dancing Music Database has its sets and CDs referenced on the URL line by the internal database IDs. While this is unique and easy to link to, it looks pretty useless if you're sending the link to someone. I realised this when, writing my post on my experiences at Naughton's Hotel, I wanted to link to my page on the South Galway Reel Set and thought "how dull is that?"

Suddenly I realised that I should do what wikis and most other good content management systems have done for ages - make URLs which reference things by name rather than number and let the software work it out in the background. Take the name for the set, flatten it into lower case and replace spaces with underscores; it would also be easily reversible. CDs might be a bit more challenging but there are only one or two that have a repeated name, and I'd have to handle such conflicts anyway at some point.
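The flattening step and the lookup back to a record might look something like the following Python sketch. The function names are invented, and "reversing" is done here by looking the flattened name up in an index of known names rather than by undoing the lower-casing, which cannot be undone exactly. Names that flatten to the same string come back together so the caller can deal with the conflict.

```python
def to_slug(name):
    """Lower-case the name and join its words with underscores."""
    return "_".join(name.lower().split())


def build_slug_index(names):
    """Map each slug to every original name that flattens to it."""
    index = {}
    for name in names:
        index.setdefault(to_slug(name), []).append(name)
    return index


def find_by_slug(slug, index):
    """Return the matching names; more than one means a conflict."""
    return index.get(slug.lower(), [])
```

For example, to_slug("South Galway Reel Set") gives "south_galway_reel_set", which is what would appear in the URL.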

That combined with my planned rewrite of the site to use some sane HTML templating language - my current choice is ClearSilver - so that it's not all ugly HTML-in-the-code has given me another project for a good week or so of coding. Pity I'm at LCA and have to absorb all those other great ideas...

posted at: 07:32 | path: /tech/web | permanent link to this entry

Tue, 20 Nov 2007

Wiki Documentulation
In the process of writing up the new manual for LMMS, I've been asked by the lead developer to be able to render the entire manual as one large document. This he will feed into a custom C++ program written to take MediaWiki markup and turn it into TeX markup, for on-processing into a PDF. Presumably he sees a big market for a big chunk of printed document as opposed to distributing the HTML of the manual in some appropriately browsable format, and doesn't mind reinventing the wheel - his C++ program implements a good deal of Perl's string processing capabilities in order to step through the lines byte-by-byte and do something very similar to regular expressions. Although I might be mistaken in this opinion - I don't read C++ very well.

I had originally considered writing a Perl LWP [1] program that performed a request to edit the page, with my credentials, but I figured that was a ghastly kludge and would cause some sort of modern day wiki-equivalent of upsetting the bonk/oif ratio (even though MediaWiki obviously doesn't try to track who's editing what document when). But then I discovered MediaWiki's Special:Export page and realised I could hack it together with this.
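As a sketch of what "hacking it together" with Special:Export could involve, the Python below builds an export URL for a page and pulls the title and wikitext out of the XML that comes back. The wiki base URL is a placeholder, the function names are invented, and the parsing ignores the XML namespace because the export schema version differs between MediaWiki releases. Fetching the URL is left out so the example stays offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def export_url(wiki_base, page_title):
    """URL of the Special:Export page for one title (spaces become underscores)."""
    title = quote(page_title.replace(" ", "_"))
    return f"{wiki_base.rstrip('/')}/Special:Export/{title}"


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def pages_from_export(xml_text):
    """Return a list of (title, wikitext) pairs from an export document."""
    pages = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "page":
            continue
        title, text = "", ""
        for node in element.iter():
            if local_name(node.tag) == "title":
                title = node.text or ""
            elif local_name(node.tag) == "text":
                # If several revisions are present, the last one wins.
                text = node.text or ""
        pages.append((title, text))
    return pages
```

Concatenating the wikitext of each page in manual order would then give the single large document to feed into whatever converter is used.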

The question, however, really comes down to: how does one go about taking a manual written in something like MediaWiki and producing some more static, less infrastructure-dependent, page or set of pages that contains the documentation while still preserving its links and cross-referencing? What tools are there for converting Wiki manuals into other formats? I know that toby has written the one I mentioned above; the author of this ghastly piece of giving-Perl-a-bad-name obviously thought it was useful enough to have another in the same vein. CPAN even has a library specifically for wikitext conversion.

This requires more research.

[1] - There's something very odd about using a PHP script on phpman.info to get the manual of a Perl module. But it's the first one I found. And it's better than search.cpan.org, which requires you to know the author name in order to list the documentation of the module. I want something with a URL like http://search.cpan.org/modules/LWP.

posted at: 14:25 | path: /tech/web | permanent link to this entry

Fri, 09 Nov 2007

Perl, Ajax and the learning experience - part 001
AJAX as a thing I use regularly on web pages is still an unknown territory to me, a person who's still not entirely au fait with CSS and who still uses Perl's CGI module to write scripts from scratch. I understand the whole technology behind AJAX - call a server-side function and do something with the result when it comes back later - but I lacked a toolkit that could make it relatively easy for me to use. Then I discovered CGI::Ajax and a light began to dawn.

Of course, there were still obstacles. CGI::Ajax's natural way of doing things is for you to feed all your HTML in and have it check for the javascript call and handle it, or mangle the script headers to include the javascript, and spit out the result by itself. All of my scripts are written so that the HTML is output progressively by print statements. This may be primitive to some and alien to others, but I'm not going to start rewriting all my scripts to pass gigantic strings of HTML around. So I started probing.

Internally, CGI::Ajax's build_html function basically does:

if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    # Add the <script> tags into your HTML here
}

For me this equates to:

if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    print $cgi->header,
        $cgi->start_html( -script => $ajax->show_javascript ),
        # Output your HTML here
        ;
}

I had to make one change to the CGI::Ajax module, which I duly made up as a patch and sent upstream: both CGI's start_html -script handler and CGI::Ajax's show_javascript method put your javascript in a <script> tag and then a CDATA tag to protect it against being read as XML, so using one inside the other gives you two sets of tags. I added an option to the show_javascript method so that you say:

        $cgi->start_html( -script => $ajax->show_javascript({'no-script-tags' => 1}) ),

and it doesn't output a second set of tags for you.

So, a few little tricks to using this module if you're not going to do things exactly the way it expects. But it can be done, and that will probably mean, for most of us, that we don't have to extensively rewrite our scripts in order to get started with AJAX. And I can see the limitations of the CGI::Ajax module already, chief amongst them that it generates all the Javascript on the fly and puts it into every page, thus not allowing browsers to cache a javascript file. I'm going to have a further poke around and see if I can write a method for CGI::Ajax that allows you to place all the standard 'behind-the-scenes' Javascript it writes into a common file, thus cutting down on the page size and generate/transmit time. This really should only have to be done once each time you install or upgrade the CGI::Ajax module.

Now to find something actually useful to do with Ajax. The main trap to avoid, IMO, is to cause the page's URL to not display what you expect after the Javascript has been at work. For instance, if your AJAX is updating product details, then you want the URL to follow the product's page. It should always be possible to bookmark a page and come back to that exact page - if nothing else it makes it easier for people to find your pages in search engines.

posted at: 18:12 | path: /tech/web | permanent link to this entry

Wed, 11 Jul 2007

Accessing the Deep Web
IP Australia has an interesting post about the "Deep Web" - those documents which are available on the internet but only by typing in a search query on the relevant website.

On reading their article I get the impression that they think that this is both a hitherto-unknown phenomenon and one which is still baffling web developers. This puzzles me, as even a relative neophyte such as myself knows how to make these documents available to search engines: indexes. All you need is a linked-to page somewhere which then lists all of the documents available. This page doesn't have to be as obvious as my Set Dance Music Database index - it can be tucked away in a 'site map' page somewhere so that it doesn't confuse too many people into thinking that that's the correct way to get access to their documents. However, don't try to hide it so that only search engines can see it, or you'll fall afoul of the regular 'link-farming' detection and elimination mechanisms most modern search engines employ.

Of course, being a traditionalist (as you can see from both the content and design of the Set Dance Music Database) I tend to think that lists are still useful, at least if kept small. And I do need to put in some mechanisms for searching on the SDMDB, as well as a few other drill-down methods. So giving your people just a search form alone may not be catering to all the methods people employ when finding content. Wikis realised this years ago - people like interlinking. And given that these 'deep web' documents are still accessible via a simple URL, if you really need to you can assist the search engines by creating your own index page to their documents by basically scripting up a search on their website that then puts the links into your index, avoiding listing duplicates.
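The index page itself needs very little. As a minimal sketch, assuming the documents have already been gathered as (title, URL) pairs by whatever means, the Python below turns them into a plain HTML list and skips any URL it has already listed. The function name and the sample data in the usage note are invented.

```python
from html import escape


def build_index(documents):
    """Render (title, url) pairs as an HTML list, one entry per distinct URL."""
    seen = set()
    lines = ["<ul>"]
    for title, url in documents:
        if url in seen:
            continue  # avoid listing duplicates
        seen.add(url)
        lines.append(
            f'  <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        )
    lines.append("</ul>")
    return "\n".join(lines)
```

Feeding it the results of several scripted searches, for instance build_index(results_a + results_b), yields one crawlable page linking to each document once.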

So the real question is: why are the owners of these web sites not doing this? We may just need to suggest it to them if they haven't thought of it themselves. The benefits of having their documents listed on Google are many - what downsides are there? I'm sure the various criticisms of such indexing are mainly due to organisational bias and narrow-mindedness, and can either be solved or routed around.

There are two variants of this that annoy me. One is the various websites where the only way to get to what you want is by clicking - no direct link is ever provided and your entire navigation is all done through javascript, flash or unspeakable black magic. These people are making it purposefully hard for you to get straight to what you want, either because they want to show you a bunch of advertising on the way or because they want to know exactly what you're up to on their site for some insidious purpose. There is already one Irish music CD store online that I've basically had to completely ignore (except for cross-checking with material on other sites) because there is no way for me to refer people directly to a CD. I refuse outright to give instructions such as "go to http://example.com and type in the words 'Tulla Ceili Band' in the search box", because that's not good navigation.

The other type of annoyance I find ties in with this: it is the practice of making a hidden index, or a privileged level of access, available to search engines that normal people don't see. I've seen a few computing and engineering websites do this, and Experts Exchange is particularly annoying for it: you can google your query and see an excerpt from the page with the question but when you go there you find out that access to the answers requires membership and/or payment. This, as far as I'm concerned, is just a blatant money-grabbing exercise and should be anathema. Either your results are free to access, or they're not - search engines should not be privileged in that respect.

posted at: 12:21 | path: /tech/web | permanent link to this entry

Tue, 06 Mar 2007

Wiki defacement
To: abuse@ttnet.net.tr
From: paulway@mabula.net
Subject: Defacement of our wiki page by your user dsl.dynamic81213236104.ttnet.net.tr

Dear people,

On Wednesday the 28th of February, a user from your address dsl.dynamic81213236104.ttnet.net.tr made two edits to our Wiki. You can see the page as changed at http://mabula.net/rugbypilg/index.cgi?action=browse&id=HomePage&revision=18, including the above address as the editor. Your client is obviously defacing our and other sites like it, which is probably against your terms of service. In addition, they are too lame to be on the internet. Please take them off it so that they do not do any further damage to themselves and others.

We have reversed their changes and our site is back to normal.

Yours sincerely,

Paul Wayper

posted at: 08:59 | path: /tech/web | permanent link to this entry

Fri, 16 Feb 2007

Comment spam eradication, attempt 2
Dave's Web Of Lies allows people to submit new lies, a facility that is of course abused by comment spammers. These cretins seem to not notice the complete absence of any linkback generation and the proscription of any text including the magic phrase http://. Like most spammers, they don't care if 100% of their effort is blocked somewhere, because it won't be blocked somewhere else. And there's no penalty for them brutalising a server: their botnets are just trawling away spamming continuously, leaving the spammers free to exploit new markets. It is vital to understand these two factors when considering how to avoid and, ultimately, eradicate spam.

For a while now, I've done a certain amount of checking that the lie submitted meets certain sanity guidelines that also filter out a lot of comment spam. In each case, the user is greeted with a helpful yet not prescriptive error message: for instance, when the lie contains an exclamation point the user is told "Your lie is too enthusiastic". (We take lying seriously at Dave's Web Of Lies.) This should be enough for a person to read and deduce what they need to do to get a genuine lie submitted, but not enough for a spammer to work out quickly what characters to remove for their submission to get anywhere. Of course, this is violating rule 1 above: spammers don't care if any number of messages get blocked, so long as one message gets through somehow.

This still left me with a healthy chunk of spam to wade through and mark as rejected. This also fills up my database (albeit slowly), and I object to this on principle. So I implemented a suggestion from someone's blog: include a hidden field called "website" that, when filled in, indicates that it's from a spammer (since it's ordinarily impossible for a real person to fill any text in the field). Then we silently ignore the submission. No false positives? Sounds good to me.
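The hidden-field check amounts to only a few lines. Here is a hedged sketch in Python, not the site's actual code: the form is assumed to arrive as a dict of field names to strings, 'lie' and 'liar' are assumed field names, and the store is just a list standing in for the database.

```python
def is_probable_spam(form):
    """True when the hidden 'website' field, invisible to people, has content."""
    return bool(form.get("website", "").strip())


def handle_submission(form, store):
    """Store genuine submissions; drop honeypot hits without telling the sender."""
    if is_probable_spam(form):
        return False  # silently ignored: no error page, nothing stored
    store.append((form.get("lie", ""), form.get("liar", "")))
    return True
```

Because the response looks the same either way, a bot gets no signal about which field gave it away.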

Initial indications, however, were that it was having no effect. I changed the field from being hidden to having the style property "display: none", which causes any modern browser to not display it, but since this was in the stylesheet a spammer would have no real indication just by scraping the submit page that this field was not, in fact, used. This, alas, also had no effect. I surmised that this was probably because the form previously had no 'website' field and spammers were merely remembering what forms to fill in where, rather than re-scraping the form (though I have no evidence for this). Pity.

So my next step was to note that a lot of the remaining spam had a distinctive form. The 'lie' would be some random comment congratulating me on such an informative and helpful web site, the 'liar' would be a single word name, and there was a random character or two tacked on the lie to make it unlikely to be exactly the same as any previous submission. So I hand-crafted a 'badstarts.txt' file and, on lie submission, I read through this file and silently ignore the lie if it starts with a bad phrase. Since almost all of these are crafted to be such that no sane or reasonable lie could also start with the same words, this reduces the number of false positives - important (in my opinion) when we don't tell people whether their submission has succeeded or failed.
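A check of that kind could be sketched as follows. This is an illustration in Python rather than the code running on the site: the file is assumed to hold one phrase per line, matching is assumed to be case-insensitive, and the sample phrases in the usage note are invented.

```python
def load_bad_starts(lines):
    """Normalise the phrases, skipping blank lines."""
    return [line.strip().lower() for line in lines if line.strip()]


def starts_badly(lie, bad_starts):
    """True if the submitted lie begins with any known spam opening."""
    text = lie.strip().lower()
    return any(text.startswith(phrase) for phrase in bad_starts)
```

In use, the phrases would come from the file, for example with open("badstarts.txt") as handle: bad = load_bad_starts(handle), and any lie for which starts_badly(lie, bad) is true would be dropped silently.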

Sure enough, now we started getting rejected spams. The file now contains about 36 different phrases. I don't have any statistics on how many got through versus how many got blocked, but that's just a matter of time... And I'm probably reinventing some wheel somewhere, but it's a simple thing and I didn't want to use a larger, more complex but generalised solution.

I'd be willing to share the list with people, but I won't post the link in case spammers find it.

I really want to avoid a captcha system on the Web Of Lies. I like keeping Dave's original simplistic design, even if there are better, all-text designs that I could (or perhaps should) be using.

posted at: 13:37 | path: /tech/web | permanent link to this entry

Mon, 29 Jan 2007

Domain Search Squatters Must Die episode #001
It looks like the SpinServer people that I mentioned nigh on nine months ago have disappeared. That I can cope with - a pity, because I liked their designs, but businesses come and go.

What INFURIATES me beyond measure is the way the people who run the domain registers then cash in on any business's past success by installing a copy-cat templated redirector site that earns them a bit of money from the hapless people who mistake it for the real thing. They're getting good too: it was so well laid out it took me several moments to work out that there was nothing actually useful on the site. Previous attempts I've seen have been pretty much just a bunch of prepackaged searches on the keywords in your previous site listed down the page, with a generic picture of a woman holding a mouse or going windsurfing (or for the more extreme sites going windsurfing holding a mouse). Now it's getting nasty.

It's not good enough that these domain registrars take money for something they've been proven to lose, 'mistakenly' swap to another person, revoke without the slightest authority, fraudulently bill for, and that costs them nothing to generate. Then they have to leech off the popularity of any site that goes under, not only scamming a few quick thousand bucks in the process but confusing anyone who wanted just a simple page saying "this company is no longer doing business". There must be something preventing this from happening in real life - businesses registering the name of a competitor as soon as they'd closed, buying up the office space and setting up a new branch. Except that there'd be some dodgy marketing exec handing them money for every person who wandered in and asked "Is this where I get my car repaired?". This sounds criminal to me.

posted at: 07:53 | path: /tech/web | permanent link to this entry

Mon, 18 Sep 2006

They just don't care, do they?
As I've mentioned before, I run the large and well-designed site known as Dave's Web Of Lies. Amongst its thousands of features is the ability to submit new lies to the database; naturally they are intensely scrutinised for any speck of truth beforehand. Now, the site's name might seem to give the game away, but those industrious linkback spammers obviously don't have time for such niceties as checking whether their handiwork has had any effect, or is even meaningful. My favourite 'comments' left in the submission form so far have been:

Looking for information and found it in this great site... - Jimpson.

Thank you for your site. I have found here much useful information... - Jesus.

The irony is that I don't know whether to include them because they are, indeed, genuine lies. But, on principle, I reject them. It's not as if liars get linkbacks on DWOL anyway...

posted at: 22:54 | path: /tech/web | permanent link to this entry

Thu, 31 Aug 2006

Pandora opens my box
I see the National Library of Australia is now scanning my home photo gallery with a spider taken from the archive.org people. The project is called Pandora and the crawler site says that they're doing some kind of archiving for Australian pages. Well, that's certainly true of mine. But searching for the term 'Linux' on the main page produces "Linux at the Parkes Observatory", "Linux Australia", AusCERT's page, "Learning Linux" on www.active.org.au (which Pandora tells me is currently restricted for some reason), and then we go onto international sites. So I don't know what that's all about...

posted at: 18:25 | path: /tech/web | permanent link to this entry

Fri, 04 Aug 2006

Get your nearest mirror here!
It just occurred to me, as I fired up my VMWare copy of Ubuntu and searched its universe repositories, and searched my local RPM mirrors on Fedora Core, for packages of "dar", the Disk Archiver of which I am enamoured, that surely there are local Ubuntu mirrors that I can use here on the ANU campus (I'm doing this from work). I've already found the local mirrors of the various RPM repositories that I use: http://mirror.aarnet.edu.au, http://mirror.optus.net, http://mirror.pacific.net.au/, http://public.www.planetmirror.com/, and others.

I know other people on campus use Ubuntu. I know about http://debian.anu.edu.au, although I haven't configured my Ubuntu installation to use it as a source. I personally think it makes the Internet a better place to get your new and updated packages from the closest mirror you can. If your ISP has a mirror, then definitely use that because it almost certainly won't use up your download gigabytes per month quota.

So imagine if there was a system whereby users could submit and update yum and apt-get configurations based on IP ranges. Then a simple package would be able to look up which configuration would apply to their IP address, and it would automatically be installed. Instantly you'd get the fastest, most cost-effective mirrors available. You could probably do the lookup as a DNS query, too. It'd even save bandwidth for the regular mirrors and encourage ISPs to set up mirrors and maintain the configurations, knowing that this bandwidth saving would be instantly felt in their network rather than relying on their customers to find their mirrors and know how to customise their configurations to suit.
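The lookup step of such a system can be sketched with Python's standard ipaddress module. Everything specific here is an assumption: the address ranges are reserved documentation ranges rather than any real ISP's, and the pairing of ranges with mirror URLs is invented purely to show the mechanism. When several ranges contain the address, the most specific one wins.

```python
import ipaddress

# Hypothetical submissions: (network, mirror URL). The pairings are made up.
MIRROR_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), "http://mirror.aarnet.edu.au"),
    (ipaddress.ip_network("203.0.113.128/25"), "http://debian.anu.edu.au"),
    (ipaddress.ip_network("198.51.100.0/24"), "http://mirror.optus.net"),
]


def mirror_for(address, table=MIRROR_TABLE):
    """Return the mirror for the most specific range containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [(network, url) for network, url in table if ip in network]
    if not matches:
        return None  # no submitted configuration: keep the default mirrors
    network, url = max(matches, key=lambda match: match[0].prefixlen)
    return url
```

A client package would call something like mirror_for on its own public address and rewrite its yum or apt configuration from the result, falling back to the distribution defaults when nothing matches.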

Hmmmm.... Need to think about this.

posted at: 18:28 | path: /tech/web | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.