Django 101
At work I've started working on a portal written in Python using the Django
framework. And I have to say I'm pretty impressed. Django does large
quantities of magic to make the model data accessible, the templating
language is pretty spiffy (it's about on a par with ClearSilver, which I'm
more familiar with - each has bits that the other doesn't do), and the
views and URL mapping are nice too. I can see this as being a
very attractive platform to get into in the future - I'm already considering
writing my Set Dance Music Database in it just to see what it can do.
So how do I feel as a Perl programmer writing Python? Pretty good too. There are obvious differences, and traps for new players, but being able to dive into something and fairly quickly be fixing bugs and implementing new features is pretty nice. Overall, I think that once you get beyond the relatively trivial details of code structure, how variables work and so on, what really makes languages strong is their libraries and interfaces. This to me is where Perl stands out with its overwhelmingly successful CPAN; Python, while slightly less organised from what I've seen so far, still has a similar level of power.
About the only criticism
I have is the way the command line option processing is implemented - Python
has tried one way (getopt) which is clearly thinking just like a C
programmer, and another (optparse) which is more object oriented but is
hugely cumbersome to use in its attempt to be flexible. Neither of these
holds a candle to Perl's Getopt::Long module.
posted at: 13:53 | path: /tech/web | permanent link to this entry
Common code in ClearSilver 001
I've been using ClearSilver as
a template language for my CGI websites in earnest for about half a year
now. I decided to rewrite my Set Dance Music
Database in it and it's generally been a good thing. Initially,
though, I had two problems: it was hard to know exactly what data had
been put into the HDF object, and it was a pain to debug template
rendering problems by having to upload them to the server (surprisingly,
but I think justifiably, I don't run Apache and PostgreSQL on my laptop
so as to
have a 'production' environment at home).
I solved this problem rather neatly by getting my code to write out the HDF object to a file, rsync'ing that file back to my own machine, and then testing the template locally.
I knew that ClearSilver's Perl library had a 'readFile' method to slurp an HDF file directly into the HDF object, and a quick check of the C library said that it had an equivalent 'writeFile' call. So happily I found that they'd also provided this call in Perl. My 'site library' module provided the $hdf object and a Render function which took a template name; it was relatively simple to write to a file derived from the template name. That way I had a one-to-one correspondence between template file and data file.
Then I can run ClearSilver's cstest program to test the template - it takes two parameters, the template file and the HDF file. You either get the page rendered, or a backtrace to where the syntax error in your template occurred. I can also browse through the HDF file - which is just a text file - to work out what data is being sent to the template, which solves the problem of "why isn't that data being shown" fairly quickly.
Another possibility I haven't explored is to run a test suite against the entire site, using standard HDF files, each time I make a change, to make sure there aren't any regressions before uploading.
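The one-to-one pairing of template file and data file could be sketched roughly like this. It's in Python rather than the Perl my site library uses, and the function names, the hdf_dumps directory and the .cs/.hdf extensions are all assumptions made up for illustration - none of it is part of ClearSilver itself:

```python
from pathlib import Path


def hdf_dump_path(template_name, dump_dir="hdf_dumps"):
    """Return the data-file path that pairs with a template file.

    Hypothetical convention: "set_list.cs" pairs with "hdf_dumps/set_list.hdf".
    """
    return Path(dump_dir) / (Path(template_name).stem + ".hdf")


def regression_pairs(template_names, dump_dir="hdf_dumps"):
    """List the (template, data file) pairs a test run would walk through."""
    return [(Path(name), hdf_dump_path(name, dump_dir)) for name in template_names]
```

A regression run would then loop over regression_pairs() and hand each pair to whatever renders the template, comparing the output against a saved copy.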
Hopefully I've piqued a few people's interest in ClearSilver, because
I'm going to be talking more about it in upcoming posts.
posted at: 11:10 | path: /tech/web | permanent link to this entry
Standard Observations
Simon
Rumble mentioned Joel
Spolsky's post on web standards and it really is an excellent read.
The fundamental point is that as a standard grows, testing any arbitrary
device's compliance with it grows harder. For rendering
HTML, not only do we have a couple of 'official' standards: HTML 4, XHTML,
etc., but we also have a number of 'de facto' standards - IE 5, IE 5.5,
IE 6, IE 7, Firefox, Opera, etc. ad nauseam. For a long time,
Microsoft has banked on their desktop monopoly to lever their own
de facto standards onto us, but I think they never intended it to be
because of bugs in their own software. And now the chickens are coming
home to roost, and they're stuck with either being bug-for-bug compatible
with their own software (i.e. making it more expensive to produce) or
breaking all those old web pages (i.e. making it much more unpopular).
I wonder if there was anyone in the Microsoft Internet Explorer development team around the time they were producing 5.0 who was saying, "No, we can't ship this until it complies with the standard; that way we know we'll have less work to do in the future." If so, I feel doubly sorry for you: you've been proved right, but you're still stuck.
However, this is not a new problem to us software engineers. We've invented various test-based coding methodologies that ensure that the software provably obeys the standard, or at least can be proven to obey some standard (as opposed to being random). We've also seen the nifty XSLT macro that takes the OpenFormula specification and produces an OpenDocument Spreadsheet that tests the formula - I can't find any live links to it but I saved a copy and put it here. So it shouldn't actually be that hard to go through and implement, if not all, then a good portion of the HTML standard as rigorous tests and then use browser scripting to test its actual output. Tell me that someone isn't doing this already.
But the problem isn't really with making software obey the standard - although obviously Microsoft has had some problem with that in the past, and therefore I don't feel we can trust them in the future. The problem is that those pieces of broken software have formed a de facto standard that isn't mapped by a document. In fact, they form several inconsistent and conflicting standards. If you want another problem, it's that people writing web site code to detect browser type in the past have written something like:
if ($browser eq 'IE') {
    if ($version <= 5.0) {
        write_IE_5_0_HTML();
    } elsif ($version <= 5.5) {
        write_IE_5_5_HTML();
    } else {
        write_IE_HTML();
    }
    ...
}
When IE 7 came along and broke new stuff, they added:
    } elsif ($version <= 6.0) {
        write_IE_6_0_HTML();
It doesn't take much of a genius to work out that you can't just
assume that this current version is the last version of IE, or that
new versions of IE aren't necessarily going to be bug-for-bug compatible
with the last version. So really the people writing the websites are to
blame.
Joel doesn't identify Microsoft's correct response in this situation. The reason for this is that we're all small coders reading Joel's blog and we just don't have the power of Microsoft. It should be relatively easy for them to write a program that goes out and checks web sites to see whether they render correctly in IE 8, and then they should work together with the web site owners whose web sites don't render correctly to fix this. Microsoft does a big publicity campaign about how it's cleaning up the web to make sure it's all standards-compliant for its new standards-compliant browser, they call it a big win, everyone goes back to work without an extra headache. Instead, they're carrying on like it's not their fault that the problem exists in the first place.
Microsoft's talking big about how it's this nice friendly corporate
citizen that plays nice these days - let's see it start fixing up some
of its past mistakes.
posted at: 22:41 | path: /tech/web | permanent link to this entry
Finding Sets Made Easy
I can't believe I only just thought of it. My
Set Dancing Music Database has its sets and
CDs referenced on the URL line by the internal database IDs. While
this is unique and easy to link to, it looks pretty useless if you're
sending the link to someone. I realised this when writing my post on
my experiences at
Naughton's Hotel: I wanted to link to my page on the South Galway
Reel Set and thought "how dull is that?"
Suddenly I realised that I should do what wikis and most other good content management systems have done for ages - make URLs which reference things by name rather than number and let the software work it out in the background. Take the name for the set, flatten it into lower case and replace spaces with underscores; it would also be easily reversible. CDs might be a bit more challenging but there are only one or two that have a repeated name, and I'd have to handle such conflicts anyway at some point.
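The name flattening could be sketched something like this. It's in Python rather than the Perl the site is written in, and the function names and the sample IDs are made up for illustration. Mapping each flattened name back to a list of database IDs is one way the software could "work it out in the background", and it shows up any repeated names at the same time:

```python
import re


def slugify(name):
    """Lower-case a name and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip().lower())


def build_slug_index(names_by_id):
    """Map each flattened name to the database IDs that share it."""
    index = {}
    for db_id, name in names_by_id.items():
        index.setdefault(slugify(name), []).append(db_id)
    return index


# Hypothetical IDs, for illustration only.
sets = {12: "South Galway Reel Set", 40: "Plain Set"}
index = build_slug_index(sets)
print(slugify("South Galway Reel Set"))  # south_galway_reel_set
print(index["plain_set"])                # [40]
```

Any flattened name that maps to more than one ID is a conflict that needs handling separately, which is the case I'd expect to hit with a couple of the CDs.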
That combined with my planned rewrite of the site to use some sane
HTML templating language - my current choice is ClearSilver - so that
it's not all ugly HTML-in-the-code has given me another project for
a good week or so of coding. Pity I'm at LCA and have to absorb all
those other great ideas...
posted at: 07:32 | path: /tech/web | permanent link to this entry
Wiki Documentulation
In the process of writing up the
new
manual for LMMS, I've been
asked by the lead developer to be able to render the entire manual as one
large document. This he will feed into a custom C++ program written to
take MediaWiki markup and turn it into TeX markup, for on-processing into
a PDF. Presumably he sees a big market for a big chunk of printed
document as opposed to distributing the HTML of the manual in some
appropriately browsable format, and doesn't mind reinventing the wheel
- his C++ program implements a good deal of Perl's string processing
capabilities in order to step through the lines byte-by-byte and do
something very similar to regular expressions. Although I might be
mistaken in this opinion - I don't read C++ very well.
I had originally considered writing a Perl LWP [1] program that performed a request to edit the page, with my credentials, but I figured that was a ghastly kludge and would cause some sort of modern day wiki-equivalent of upsetting the bonk/oif ratio (even though MediaWiki obviously doesn't try to track who's editing what document when). But then I discovered MediaWiki's Special:Export page and realised I could hack it together with this.
The question, however, really comes down to: how does one go about taking a manual written in something like MediaWiki and producing some more static, less infrastructure-dependent, page or set of pages that contains the documentation while still preserving its links and cross-referencing? What tools are there for converting Wiki manuals into other formats? I know that toby has written the one I mentioned above; the author of this ghastly piece of giving-Perl-a-bad-name obviously thought it was useful enough to have another in the same vein. CPAN even has a library specifically for wikitext conversion.
This requires more research.
[1] - There's something
very odd about using a PHP script on phpman.info to get the manual
of a Perl module. But it's the first one I found. And it's better than
search.cpan.org, which requires you to know the author
name in order to list the documentation of
the
module. I want something with a URL like
http://search.cpan.org/modules/LWP.
posted at: 14:25 | path: /tech/web | permanent link to this entry
Perl, Ajax and the learning experience - part 001
AJAX as a thing I use regularly on web pages is still an unknown
territory to me, a person who's still not entirely au fait with
CSS and who still uses Perl's
CGI module
to write scripts from scratch. I understand the whole technology behind
AJAX - call a server-side function and do something with the result when
it comes back later - but I lacked a toolkit that could make it relatively
easy for me to use. Then I discovered
CGI::Ajax
and a light began to dawn.
Of course, there were still obstacles. CGI::Ajax's natural way of doing things is for you to feed all your HTML in and have it check for the javascript call and handle it, or mangle the script headers to include the javascript, and spit out the result by itself. All of my scripts are written so that the HTML is output progressively by print statements. This may be primitive to some and alien to others, but I'm not going to start rewriting all my scripts to pass gigantic strings of HTML around. So I started probing.
Internally this build_html function basically does:
if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    # Add the <script> tags into your HTML here
}
For me this equates to:
if ($cgi->param('fname')) {
    print $ajax->handle_request;
} else {
    print $cgi->header,
        $cgi->start_html( -script => $ajax->show_javascript ),
        # Output your HTML here
        ;
}
I had to make one change to the CGI::Ajax module, which I duly
made up as a patch and sent upstream: both CGI's start_html
-script handler and CGI::Ajax's show_javascript
method put your javascript in a <script> tag and then a CDATA
tag to protect it against being read as XML. I added an option to the
show_javascript method so that you say:
$cgi->start_html( -script => $ajax->show_javascript({'no-script-tags' => 1}) ),
and it doesn't output a second set of tags for you.
So, a few little tricks to using this module if you're not going to do things exactly the way it expects. But it can be done, and that will probably mean, for most of us, that we don't have to extensively rewrite our scripts in order to get started with AJAX. And I can see the limitations of the CGI::Ajax module already, chief amongst them that it generates all the Javascript on the fly and puts it into every page, thus not allowing browsers to cache a javascript file. I'm going to have a further poke around and see if I can write a method for CGI::Ajax that allows you to place all the standard 'behind-the-scenes' Javascript it writes into a common file, thus cutting down on the page size and generate/transmit time. This really should only have to be done once each time you install or upgrade the CGI::Ajax module.
Now to find something actually useful to do with Ajax. The main trap to
avoid, IMO, is to cause the page's URL to not display what you expect
after the Javascript has been at work. For instance, if your AJAX is
updating product details, then you want the URL to follow the product's
page. It should always be possible to bookmark a page and come back to
that exact page - if nothing else it makes it easier for people to find
your pages in search engines.
posted at: 18:12 | path: /tech/web | permanent link to this entry
Accessing the Deep Web
IP
Australia has an interesting post about the "Deep Web" - those
documents which are available on the internet but only by typing in
a search query on the relevant website.
On reading their article I get the impression that they think that this is both a hitherto-unknown phenomenon and one which is still baffling web developers. This puzzles me, as even a relative neophyte such as myself knows how to make these documents available to search engines: indexes. All you need is a linked-to page somewhere which then lists all of the documents available. This page doesn't have to be as obvious as my Set Dance Music Database index - it can be tucked away in a 'site map' page somewhere so that it doesn't confuse too many people into thinking that that's the correct way to get access to their documents. However, don't try to hide it so that only search engines can see it, or you'll fall afoul of the regular 'link-farming' detection and elimination mechanisms most modern search engines employ.
Of course, being a traditionalist (as you can see from both the content and design of the Set Dance Music Database) I tend to think that lists are still useful, at least if kept small. And I do need to put in some mechanisms for searching on the SDMDB, as well as a few other drill-down methods. So giving people a search form alone may not cater to all the methods they employ when finding content. Wikis realised this years ago - people like interlinking. And given that these 'deep web' documents are still accessible via a simple URL, if you really need to you can assist the search engines by creating your own index page to their documents: script up a search on their website and put the resulting links into your index, avoiding listing duplicates.
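As a rough sketch of that index page in Python: the function name and the (title, url) input shape are assumptions, and the step that actually scripts the searches to collect the links is left out entirely. It only shows the "list every document once" part:

```python
from html import escape


def build_index_html(documents):
    """Build a plain HTML list of links, skipping URLs already listed.

    `documents` is an iterable of (title, url) pairs, however they were
    gathered; the same document may turn up in several searches.
    """
    seen = set()
    items = []
    for title, url in documents:
        if url in seen:
            continue  # already listed, so avoid a duplicate entry
        seen.add(url)
        items.append(
            '<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title))
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The resulting list can sit on a linked-to 'site map' page where a search engine's crawler will find it.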
So the real question is: why are the owners of these web sites not doing this? We may just need to suggest it to them if they haven't thought of it themselves. The benefits of having their documents listed on Google are many - what downsides are there? I'm sure the various criticisms of such indexing are mainly due to organisational bias and narrow-mindedness, and can either be solved or routed around.
There are two variants of this that annoy me. One is the various websites where the only way to get to what you want is by clicking - no direct link is ever provided and your entire navigation is all done through javascript, flash or unspeakable black magic. These people are making it purposefully hard for you to get straight to what you want, either because they want to show you a bunch of advertising on the way or because they want to know exactly what you're up to on their site for some insidious purpose. There is already one Irish music CD store online that I've basically had to completely ignore (except for cross-checking with material on other sites) because there is no way for me to refer people directly to a CD. I refuse outright to give instructions such as "go to http://example.com and type in the words 'Tulla Ceili Band' in the search box", because that's not good navigation.
The other type of annoyance I find ties in with this: it is the practice
of making a hidden index, or a privileged level of access, available to
search engines that normal people don't see. I've seen a few computing
and engineering websites do this, and Experts Exchange is particularly
annoying for it: you can google your query and see an excerpt from the
page with the question but when you go there you find out that access to
the answers requires membership and/or payment. This, as far as I'm
concerned, is just a blatant money-grabbing exercise and should be
anathema. Either your results are free to access, or they're not -
search engines should not be privileged in that respect.
posted at: 12:21 | path: /tech/web | permanent link to this entry
Wiki defacement
To: abuse@ttnet.net.tr
From: paulway@mabula.net
Subject: Defacement of our wiki page by your user
dsl.dynamic81213236104.ttnet.net.tr
Dear people,
On Wednesday the 28th of February, a user from your address dsl.dynamic81213236104.ttnet.net.tr made two edits to our Wiki. You can see the page as changed at http://mabula.net/rugbypilg/index.cgi?action=browse&id=HomePage&revision=18, including the above address as the editor. Your client is obviously defacing our and other sites like it, which is probably against your terms of service. In addition, they are too lame to be on the internet. Please take them off it so that they do not do any further damage to themselves and others.
We have reversed their changes and our site is back to normal.
Yours sincerely,
Paul Wayper
posted at: 08:59 | path: /tech/web | permanent link to this entry
Comment spam eradication, attempt 2
Dave's Web Of Lies allows
people to submit new lies, a facility that is of course abused by comment
spammers. These cretins seem to not notice the complete absence of any
linkback generation and the proscription of any text including the
magic phrase http://. Like most spammers, they don't care if 100% of
their effort is blocked somewhere, because it won't be blocked somewhere
else. And there's no penalty for them brutalising a server: their botnets
are just trawling away spamming continuously, leaving the spammers free to
exploit new markets. It is vital to understand these two factors when
considering how to avoid and, ultimately, eradicate spam.
For a while now, I've done a certain amount of checking that the lie submitted meets certain sanity guidelines that also filter out a lot of comment spam. In each case, the user is greeted with a helpful yet not prescriptive error message: for instance, when the lie contains an exclamation point the user is told "Your lie is too enthusiastic". (We take lying seriously at Dave's Web Of Lies.) This should be enough for a person to read and deduce what they need to do to get a genuine lie submitted, but not enough for a spammer to work out quickly what characters to remove for their submission to get anywhere. Of course, this runs up against the first factor above: spammers don't care if any number of messages get blocked, so long as one message gets through somehow.
This still left me with a healthy chunk of spam to wade through and mark as rejected. This also fills up my database (albeit slowly), and I object to this on principle. So I implemented a suggestion from someone's blog: include a hidden field called "website" that, when filled in, indicates that it's from a spammer (since it's ordinarily impossible for a real person to fill any text in the field). Then we silently ignore the submission. No false positives? Sounds good to me.
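In outline the check is tiny. This is a Python sketch and not the site's actual code: the form-as-a-dictionary shape, the function names and the list standing in for the database are all assumptions, and I'm reading "silently ignore" as dropping the submission without telling the sender:

```python
def is_honeypot_triggered(form, trap_field="website"):
    """Return True if the hidden trap field has any text in it."""
    return bool(form.get(trap_field, "").strip())


def handle_submission(form, store):
    """Keep the lie unless the trap field was filled in; never say which."""
    if is_honeypot_triggered(form):
        return  # dropped silently: no error message for the spammer to learn from
    store.append((form.get("lie", ""), form.get("liar", "")))
```

A real person never sees the "website" field, so a genuine submission always arrives with it empty.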
Initial indications, however, were that it was having no effect. I changed the field from being hidden to having the style property "display: none", which causes any modern browser to not display it, but since this was in the stylesheet a spammer would have no real indication just by scraping the submit page that this field was not, in fact, used. This, alas, also had no effect. I surmised that this was probably because the form previously had no 'website' field and spammers were merely remembering what forms to fill in where, rather than re-scraping the form (though I have no evidence for this). Pity.
So my next step was to note that a lot of the remaining spam had a distinctive form. The 'lie' would be some random comment congratulating me on such an informative and helpful web site, the 'liar' would be a single word name, and there was a random character or two tacked on the lie to make it unlikely to be exactly the same as any previous submission. So I hand-crafted a 'badstarts.txt' file and, on lie submission, I read through this file and silently ignore the lie if it starts with a bad phrase. Since almost all of these are crafted to be such that no sane or reasonable lie could also start with the same words, this reduces the number of false positives - important (in my opinion) when we don't tell people whether their submission has succeeded or failed.
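The bad-starts check might look something like the following Python sketch. The function names are made up, and matching case-insensitively is my assumption here rather than a description of what the site does:

```python
def load_bad_starts(lines):
    """Normalise the phrases from a badstarts-style file, one per line."""
    return [line.strip().lower() for line in lines if line.strip()]


def starts_badly(lie, bad_starts):
    """Return True if the lie begins with any of the listed phrases."""
    text = lie.strip().lower()
    return any(text.startswith(phrase) for phrase in bad_starts)


# Illustrative phrases only, not the real list.
bad_starts = load_bad_starts(["Thank you for your site\n", "Looking for information\n"])
print(starts_badly("Thank you for your site. Much useful information...", bad_starts))  # True
print(starts_badly("The moon is made of cheese", bad_starts))  # False
```

With a real file, load_bad_starts(open("badstarts.txt")) would read the phrases once per submission, which is fine for a list of a few dozen entries.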
Sure enough, now we started getting rejected spams. The file now contains about 36 different phrases. I don't have any statistics on how many got through versus how many got blocked, but that's just a matter of time... And I'm probably reinventing some wheel somewhere, but it's a simple thing and I didn't want to use a larger, more complex but generalised solution.
I'd be willing to share the list with people, but I won't post the link in case spammers find it.
I really want to avoid a captcha system on the Web Of Lies. I like
keeping Dave's original simplistic design, even if there are better,
all-text designs that I could (or perhaps should) be using.
posted at: 13:37 | path: /tech/web | permanent link to this entry
Domain Search Squatters Must Die episode #001
It looks like the
SpinServer
people that I mentioned nigh on nine months ago have disappeared.
That I can cope with - a pity, because I liked their designs, but
businesses come and go.
What INFURIATES me beyond measure is the way the people who run the domain registrars then cash in on any business's past success by installing a copy-cat templated redirector site that earns them a bit of money from the hapless people who mistake it for the real thing. They're getting good too: it was so well laid out it took me several moments to work out that there was nothing actually useful on the site. Previous attempts I've seen have been pretty much just a bunch of prepackaged searches on the keywords in your previous site listed down the page, with a generic picture of a woman holding a mouse or going windsurfing (or for the more extreme sites going windsurfing holding a mouse). Now it's getting nasty.
It's not good enough that these domain registrars take money for
something they've been proven to
lose,
'mistakenly' swap to another person, revoke without the slightest
authority, fraudulently bill for,
and that costs them
nothing to generate. Then they have to leech off the popularity of
any site that goes under, not only scamming a few quick thousand bucks
in the process but confusing anyone who wanted just a simple page
saying "this company is no longer doing business". There must be
something preventing this from happening in real life - businesses
registering the name of a competitor as soon as they'd closed, buying
up the office space and setting up a new branch. Except that there'd
be some dodgy marketing exec handing them money for every person who
wandered in and asked "Is this where I get my car repaired?". This
sounds criminal to me.
posted at: 07:53 | path: /tech/web | permanent link to this entry
They just don't care, do they?
As I've mentioned before, I run the large and well-designed site known as
Dave's Web Of Lies. Amongst
its thousands of features is the ability to submit new lies to the database;
naturally they are intensely scrutinised for any speck of truth beforehand.
Now, the site's name might seem to give the game away, but those industrious
linkback spammers obviously don't have time for such niceties as checking
whether their handiwork has had any effect, or is even meaningful. My
favourite 'comments' left in the submission form so far have been:
Looking for information and found it in this great site... - Jimpson.
Thank you for your site. I have found here much useful information... - Jesus.
The irony is that I don't know whether to include them because they are,
indeed, genuine lies. But, on principle, I reject them. It's not as if
liars get linkbacks on DWOL anyway...
posted at: 22:54 | path: /tech/web | permanent link to this entry
Pandora opens my box
I see the National Library of Australia is now scanning my home photo gallery with
a spider taken from the archive.org people. The project is called
Pandora and the
crawler site says that they're
doing some kind of archiving for Australian pages. Well, that's certainly true
of mine. But searching for the term 'Linux' on the main page produces "Linux at
the Parkes Observatory", "Linux Australia", AusCERT's page, "Learning Linux" on
www.active.org.au (which Pandora tells me is currently restricted for some reason),
and then we go onto international sites. So I don't know what that's all about...
posted at: 18:25 | path: /tech/web | permanent link to this entry
Get your nearest mirror here!
It just occurred to me, as I fired up my VMWare copy of Ubuntu and
searched its universe repositories, and searched my local RPM
mirrors on Fedora Core, for packages of
"dar", the Disk Archiver of
which I am enamoured, that surely there are local Ubuntu mirrors that
I can use here on the ANU campus (I'm doing this from work). I've
already found the local mirrors of the various RPM repositories that
I use: http://mirror.aarnet.edu.au,
http://mirror.optus.net,
http://mirror.pacific.net.au/,
http://public.www.planetmirror.com/,
and others.
I know other people on campus use Ubuntu. I know about http://debian.anu.edu.au, although I haven't configured my Ubuntu installation to use it as a source. I personally think it makes the Internet a better place to get your new and updated packages from the closest mirror you can. If your ISP has a mirror, then definitely use that because it almost certainly won't use up your download gigabytes per month quota.
So imagine if there was a system whereby users could submit and update yum and apt-get configurations based on IP ranges. Then a simple package would be able to look up which configuration would apply to their IP address, and it would automatically be installed. Instantly you'd get the fastest, most cost-effective mirrors available. You could probably do the lookup as a DNS query, too. It'd even save bandwidth for the regular mirrors and encourage ISPs to set up mirrors and maintain the configurations, knowing that this bandwidth saving would be instantly felt in their network rather than relying on their customers to find their mirrors and know how to customise their configurations to suit.
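The lookup itself could be as simple as this Python sketch. The table of ranges, the mirror URLs and the function name are all invented for illustration (I've used private address ranges and example hostnames). Where ranges overlap it prefers the most specific one, so a campus mirror wins over an ISP-wide one:

```python
import ipaddress

# Hypothetical submissions: network range -> mirror base URL.
MIRRORS = {
    "10.0.0.0/8": "http://mirror.isp.example/",
    "10.20.0.0/16": "http://mirror.campus.example/",
}
DEFAULT_MIRROR = "http://mirror.default.example/"


def nearest_mirror(ip, mirrors=MIRRORS, default=DEFAULT_MIRROR):
    """Return the mirror for the most specific range containing `ip`."""
    address = ipaddress.ip_address(ip)
    best_net, best_url = None, default
    for cidr, url in mirrors.items():
        net = ipaddress.ip_network(cidr)
        if address in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_url = net, url
    return best_url


print(nearest_mirror("10.20.5.9"))  # http://mirror.campus.example/
print(nearest_mirror("192.0.2.7"))  # http://mirror.default.example/
```

The package on the user's machine would then write the returned URL into its yum or apt-get configuration; that part, and doing the lookup over DNS instead of against a local table, isn't shown.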
Hmmmm.... Need to think about this.
posted at: 18:28 | path: /tech/web | permanent link to this entry
Too much time, too little gain?
My 'home' home page -
http://tangram.dnsalias.net/~paulway/
- has, for a while now, had the appearance of an old greenscreen monitor
playing a text adventure. Since it's more or less just a method for me
to gather up a few bits and pieces that I can't be bothered putting up
on my regular page - http://www.mabula.net
- I'm not really worried by creating a work of art.
But, the temptation to carry things too far has always been strong within me. So, of course, the flashing cursor at the end of the page wasn't good enough on its own: I had to have an appropriate command come up when you hovered over the link. After a fair amount of javascript abuse, and reading of tutorials, I finally got it working; I even got it so that the initial text (which has to be there for the javascript to work) doesn't get displayed when the document loads.
Score one for pointless javascript!
posted at: 23:17 | path: /tech/web | permanent link to this entry
All posts licensed under the CC-BY-NC license. Author Paul Wayper.