Too Busy For Words - The PaulWay Weblog

Mon, 26 Nov 2007

Saves some typing?
I had an idea on Friday for a utility that fills a little niche that I hit regularly. The particular example was wanting to save the iptables configuration after a couple of updates. This is put (on Red Hat standard boxes) in /etc/sysconfig/iptables, and I keep copies named /etc/sysconfig/iptables.yyyymmdd (where yyyymmdd is the current year, month and day) in case a change breaks something and I need to go back to a previous version. Other people use revision control systems like Mercurial for this, with its ability to watch edits to a file that isn't in a pre-set directory. I may be old fashioned here but this method does me fine. Normally, in order to roll the configuration over to the new version you would do:

mv /etc/sysconfig/iptables /etc/sysconfig/iptables.yyyymmdd
iptables_save > /etc/sysconfig.iptables

But what if you'd already done one edit today? Then you'd use a name like /etc/sysconfig/iptables.yyyymmdd.inc, where inc is an increment number or something. And you want that number to increment up until it finds a 'free' number. The usual convention for log files is to roll each file down, so /etc/sysconfig/iptables.yyyymmdd becomes /etc/sysconfig/iptables.yyyymmdd.1, /etc/sysconfig/iptables.yyyymmdd.1 becomes /etc/sysconfig/iptables.yyyymmdd.2 and so forth; I usually end up putting the latest revision at the end of the sequence rather than the earliest.

Now, of course, it would be relatively simple to do that renaming automatically given the base file name. Cafuego coded up a Bash one-liner in half an hour or so, and Debian already has the savelog utility to do just this (a fact I found out much later, not running Debian). However, that only really does half the job. We still end up with:

savelog /etc/sysconfig/iptables
iptables_save > /etc/sysconfig.iptables

That's one repetition of that annoying path too many, with its hostile tab-unfriendly sysconfig directory, for my taste. I realised that what I wanted was something like:

iptables_save | roll /etc/sysconfig.iptables

that would both roll the log file over and then 'redirect' standard input to the target file. Again, a relatively short piece of work in Perl or bash. But do you really want to have to call up all that framework just to roll one file over? I resolved to learn a bit more and do it in C. Not only that, but I'd forswear my usual use of the talloc library and do it as raw as possible.

It took a day, but by the end of it I had coded up the first working version of the code. I presented it to the gallery on the #linux-aus IRC channel on Freenode and Cafuego pointed out that I'd only implemented the all-move-down method, not the move-to-last method. A bit more work that night added that. A bit more work with valgrind found the couple of memory leaks and odd overwrites. More work today put the command-line options processing in place, and corrected the move-to-last method to not only work, but in the process be more efficient.

So, now I release to the wider Linux and Unix community the roll command. You can find the source code at http://mabula.net/code/roll.c and check it out via Subversion through svn://tangram.dnsalias.net/roll/trunk. Comments, criticisms and suggestions as always are welcomed via paulway@mabula.net. Of course, the irony is that I could have written that mv /etc/sysconfig/iptables /etc/sysconfig/iptables.20071123 command by now...

posted at: 18:13 | path: /tech/c | permanent link to this entry

Mon, 28 May 2007

Hardware problems
I completely failed to record the CLUG meeting last Thursday night, basically because I just didn't think it through enough. I set out with good intentions, a nice stereo microphone, a five meter extension cable to allow me to monitor the recording without being up the speaker's nose, and a little impromptu microphone stand I'd fashioned out of some handy wood (the old 'if all you have is some wood' design criteria). I'd tested the set-up before and made sure the microphone was working, and got the levels right beforehand. I started recording way before the first speaker, so I could check that it was all working again.

The first minor glitch was discovering that I couldn't record and play back simultaneously. So I could watch the levels moving up and down, and note that they were at least vaguely reacting to the speaker and the audience, but I couldn't check that what was recorded was actually sensible. I tried stopping one recording (again, before the first speaker started) and then continuing recording in another session and playing back in the first session. Audacity played nothing. I tried to normalise the first while the second was recording. Audacity crashed. Don't do that again, then.

The first clue that something was going wrong was that the levels weren't following David's talking closely enough. They'd pick up loud noises in the room, but not David's pauses. The second clue was that this changed before Tridge started talking - the levels went up - but they stayed even flatter. Tridge knows when to raise his voice - we have no amplification for the CLUG meetings - and I should have seen definite peaks. And the levels weren't flat, either - so it wasn't recording silence. I thought I hadn't changed anything when the levels went up, but as it turned out I had.

The third clue had to wait until I got the recordings home: I listened to one and I could hear my comment to Tridge sitting beside me fairly clearly, but not his response. Otherwise, it was mostly buzz. Diagnosis: something was working, but the noise from the circuitry on the rather ordinary on-board audio was pretty high (no surprises there, laptops are not renowned for their onboard sownd, er, sound). And why had the levels gone up for the second recording? What had I changed? I couldn't remember. It took a sleep for it to all congeal in my head, and the next morning the ghastly truth of my elementary error flooded into my brain unbidden:

I had put the microphone in the headphone jack, and vice versa.

This is why it picked up vague noises, and could hear my comments: because headphones will act as fairly microphones in a pinch (it's just the same circuit, only being read from rather than written to, if you will). This explained why the noise had gone up but no voice had been actually recorded in the second phase: because I'd unplugged the headphones half-way through (since they were useless as monitors). From then on it was picking up uncompleted circuit noise. And what had probably done it was that the extension cord from the microphone looks a lot like the extension cord that I use at home to plug my audio output into my home-made switch board.

So: save up for a nice Edirol UA-1EX USB adapter which can do 24-bit 96KHz recording and playback and has a microphone input, and possibly a nicer microphone stand so I don't look quite like a reject from Woodworking 101 with it. And try to solve the hardware problems in my brain before doing this again.

posted at: 10:02 | path: /tech/clug | permanent link to this entry

Thu, 26 Apr 2007

CLUG meeting April 2007 - DAR, OSM, ISO
Tonight was a bit of a last-minute effort, as Chris hadn't managed to arrange any speakers. I volunteered to do a talk on DAR, since I'd recently had a think-my-data-all-gone scare, and Andrew Loughhead volunteered to do a talk on OpenStreetMap, a project he's been involved in for a while. Brad Hards also volunteered to bring along some DVDs and some ISOs of the recent Feisty Fawn release, so our trio of TLAs could be complete.

I'd also managed to combine not preparing my talk until the last minute with forgetting that I had already said I'd go to see "For The Term Of His Natural Life" at the National Library with some friends. And I didn't realise this until I was basically committed to going to the CLUG meeting. So it all felt a bit rushed and unorganised.

To cover my bit of the talk, DAR is an archiver built for the modern world: it compresses, it slices, it encrypts, it allows you to include and exclude file easily, it seeks right to where you need to be on the disk rather than reading through an entire .tar.gz file, and it saves Extended Attributes so your SELinux contexts and Beagle annotations will be saved too. I'll put the slides of the talk up soon, but it wasn't a very in-depth look at the program anyway so it's not really going to tell you any more than you'd get reading the man pages.

I'd forgot, when I was standing up, to point out the absence of my laptop case cover. This is because, on this very day, I have picked up six pieces of 0.5mm stainless steel cut to my exacting specifications by a powerful jet of water, and taken them to a place where they can be bent into the right shape to fit in my laminated laptop case covers. I had to leave the case cover with Precision Metals so they could work out how to bend both pieces to the right shape. So I'm very close to actually being able to make them. With my dining table near completion, it looks like I'm actually finishing some of my projects!

Andrew's talk on OpenStreetMap was much more interesting, and he fielded a lot more questions on it as a result. OpenStreetMap is solving the libré problem of the data on Google Maps and similar mapping sites being free to look at but not free to use in your own work. With OpenStreetMap you can correct it if you think it's wrong (leaving aside the 'easter egg' issue), add to it, analyze the map data in new ways, and so forth. There were a lot of very good questions, which I think shows that geodata is one of the current hot topics in computing these days.

Finally, I took orders for pizza (a new process to me!), Brad fired up his DVD burner, and we had the regular stand-around-and-talk that seems to be perennially popular at CLUG meetings.

posted at: 23:51 | path: /tech/clug | permanent link to this entry

Thu, 12 Apr 2007

CLUG PSIG April 2007 - Python in C in Python in C ...
At 5:00 or so I got the phone call - Bob Edwards, the speaker for the night's Programming SIG at the Canberra Linux User's Group was unwell and wouldn't be able to make it. I sent out a quick note to say that we'd be reverting to the more traditional "everyone talk in their little groups about whatever" format, which was a little unsatisfying. This may seem a contradiction from a person who loves to tell tall tales of programming history and relate episodes from the Daily WTF. But I want to learn as much as I like to teach, and I want to make sure that the CLUG PSIG doesn't turn into the "solve coding problems for Paul Wayper" SIG.

We had a good turn-out, though, and people who had seen my last-minute email had still come along. But the day was really saved when David Collett and Michael Cohen did an impromptu talk about integrating Python with C. Those of us with laptops and internet connections went to the PyFlag code repository that David and Michael and others have been working on, and followed along as David showed us how they'd progressed from using SWiG to writing entirely in C and using the Python integration library to pass data in and out and to call Python methods from within C. David knows his code well and he and Michael were able to demonstrate all the standard things you need to do to integrate the two languages, as well as why those methods were chosen. I was really impressed at their off-the-cuff presentation and it really saved the night.

And then somehow I found myself explaining all about my sequence counting program, why I'd used C instead of Perl to implement it, and what its limitations were. Though everyone was listening attentively, I was secretly fearing that it was turning into a conversation between myself and Michael, who was doing most of the questioning. And there were a couple of good ideas - things like using array-based trie systems, seeking through Clustal .ALN files, and using buffer systems to break the problem down - that he mentioned that I'll have to follow up. But I'm very annoyed with myself that it turned into exactly what I have felt all along that it should not be (q.v.).

Finally it came down to Owen, Ian, Rhys (or Ryan, I can't remember) and myself talking about esoteric things like Big Bang vs Steady State theory, the four quantum forces and their relative strengths, and the families and groups of the Periodic Table. So it ended on a good note after all.

posted at: 23:43 | path: /tech/clug | permanent link to this entry

Fri, 23 Feb 2007

CLUG February meeting - PS3 and Sushi
It's at times like this that I'm really glad I discovered the CLUG.

Firstly, the main focus of the presentation was the PS3. Hugh Blemings and Jeremy Kerr (I think) gave a talk about the heart of the PS3, and several of IBM's large blade servers, the Cell processor. I'll gloss over a lot of the technical detail because it's pretty easy to find, but the key things to me were:

Of course, writing code for the Cell's Secondary Processing Units - the eight 'sub-processors' that do most of the SIMD work - is not an easy process. The 'Hello World' example involved lots of complicated IO, but that was only because the SPU isn't the right platform to use to put words on the console. What Jeremy and Hugh are interested in is seeing various libraries - FFT, audio and video codecs, rendering libraries, all sorts of other things that require lots of brute-force computation - ported to use the Cell's SPUs. The power of these things is not to be underestimated: their classic demo is a four-dimensional jula set (!) made of glass (!) ray-tracing (!!) in real-time (!!!), and it's done with no clever OpenGL output, just blatting pixels onto a frame buffer. With so much work being done on general-purpose data processing algorithms that can be run on the Graphics Processing Unit of your modern graphics card, this offers significant performance increases if we can get these libraries in place.

Nick Piggin also gave a talk about his work on getting better performance from NUMA machines. NUMA is Non Uniform memory Access, where each processor or core has direct access only to a subsection of the total memory in the system, and has to ask for other blocks to be sent to it if the block it wants is attached to another processor. Blocks that are read-only can be duplicated in each processor's local memory: for example, the pages containing libc. Blocks that are read-write can be replicated while everyone's only reading them, and flushed out and reloaded when a write occurs. So overall this was a night for the supercomputing enthusiasts amongst us (e.g. me).

(Note to self: I need to find a good way to talk to Nick about his presentation style.)

Once most of the presenting is over, the night is given over to eating and chatting. Usually in CLUG meetings this is given over to a pizza feeding frenzy once famously compared to a gannet colony. Last night, however, we also had sushi. This was organised by myself with some assistance from Pascal Klein, and had to be arranged in advance. It was an experiment in alternative foods prompted by Hugh Fisher's talk on Open Source software communities. We had seven people, Hugh and Jeremy included, request sushi; one didn't show up, so despite me asking for an eighth person to join in we still only had seven people sharing the cost. So I got stuck with a bit of the bill, but it was worth it for the quantity of sushi. There was a good variety, in reasonably good quality, and enough wasabi to entirely destroy the sinuses of the entire CLUG attendance for the night. So I think this was a success; I'll do it next time.

I'm still casting around for other cuisines that have small, easy to eat portions that don't require cutlery and can be sourced relatively quickly and don't go off their serving temperature for the period they're stored between picking them up and dispensing them. But sushi twice in a row won't be all that bad...

Addendum: I called my brother up in Brisbane to tell him of the PS3 coolness, and ended up spending more time talking to his friend Nick who works there. Nick's a linux user in a crowd of Windows geeks, like myself, so we ended up chewing the fat over processing coolness and Vista badness for a good hour or so. I also passed on to him the news about http://www.debian-multimedia.org - i.e. that it exists. That should save him the agony of getting MythTV to compile...

posted at: 15:51 | path: /tech/clug | permanent link to this entry

Mon, 19 Feb 2007

Programming Sig February 2007 - Rusty Russell
I should have reported this a week or more back, but the weekend was filled with dead and dying firewalls and hardware purchases that didn't work. So:

At the CLUG Programming SIG for February, we had Rusty Russell speaking on his project LGuest, a Linux x86_32 hypervisor system. He talked about his aims (to develop a framework for testing virtualisation in Linux) and how he overcame the various obstacles in his way, such as glibc's way of using an segmentation calculation overflow to store information about the program code, and how he got around them. He finished up with some benchmarks and a bit of a comparison between the LGuest and the various other virtualisation and hypervisor packages out there - primarily Xen and VMware. There were at least a dozen people there, which was big for a PSIG meeting, and this included two people from ANU who had never been to any CLUG event before but had wanted to hear Rusty's presentation. The talk was very well received, and Rusty gave an excellent presentation in his usual engaging manner. He handled the constant interruptions from food arriving and plates departing neatly, and didn't miss out on his own meal either (this last is important, I think, for a speaker). My only apology to him was my brief and inelegant introduction, but since pretty much everyone knew who he was and what he was doing, I felt words were superfluous :-)

I've been working in the background trying to get various people that I know in the CLUG scene to give talks at the PSIG. There are essentially three things that I want to hear about:

That last one in particular interests me greatly. There are a number of topics I'd personally like to hear about:

The question is: how do I find people in the Canberra area to talk on these topics?

posted at: 18:43 | path: /tech/clug | permanent link to this entry

Thu, 30 Nov 2006

CLUG, Women and GPL3
On this month's CLUG meeting, Hugh Fisher led a wide-ranging and stimulating discussion on a bunch of issues facing Free and Open Source Software today. I always like these kinds of discussion - partly because it expands the area of my knowledge and partly because I think sometimes we need a devil's advocate in order to really understand why we stand for the things we do. If we never question the knowledge we have, we're no better than the unreasoning zealots who promote proprietary software.

There were two problems with the process, I felt. One was that having a discussion like this is possible up to about eight people - for a CLUG meeting of twenty or so it sometimes degenerated into a shouting match. I'm as guilty as the rest - I'd stick my hand up sometimes and wait patiently to be noticed, but then five minutes later I'd be calling out amusing comments or counterexamples with the rest of us.

The second problem was that Hugh's approach was basically to attack FOSS's dogmas and articles of faith. This often ended up with arguments coming from both sides - you can't say the Free Software Manifesto is equivalent to Marxism, and then say that there's nothing wrong with capitalism and proprietary software without ending up sounding like you're arguing about completely different things. And these are also the sorts of declarations that get Open Source practitioners somewhat riled up, which means that they want to go on the attack, which is hard if it's coming from the other side of the political arena.

Personally I don't have a problem with a lot of these statements. The Free Software Manifesto is a lot like classical Marxism - where people get confused is then thinking that it equals communism or Stalinism. Proprietary and free-but-closed-source software has a lot to teach Open Source programmers: Mark told me of two features of a Finder-replacement in Mac OS X which would make Nautilus or its KDE equivalent green with envy[1]. Sure, we can copy features from them just as they copy features from us, but it seems a curious inversion of the "it's worth nothing if it's free" mentality to say that the only software worth using has no cost. I certainly don't mind paying for a game or an application I might use; I know the hidden cost is there that I've been locked into their file formats and so forth. I just factor that into the equation. It's the everyday equivalent of reading and understanding all the EULA stuff.

One interesting topic came up right at the start: for a group of people that prides itself on 'openness' and hates companies and governments putting up barriers to participation, we're an awfully White Anglo-Saxon Programmer group. That night was especially poignant as we had no women in the twenty or so people there - and that's not uncommon either. As a contrast, even SLUG has a higher proportion of women. What are we doing wrong.

It's not that hard to deduce, when you ask the question 'where are all the beginners?' CLUG is, somewhat unashamedly, a very technical (and technical-for-the-sake-of-it) audience. Some newcomers (e.g. a former acquaintance) are driven away by the sheer technical complexity of the talks; others are driven away by the heckling random technical questions launched at the speaker from anywhere in the audience without warning. Others still, I would argue, are driven away by the way little cliques will form and gossip about geeky, technical, 'you have to have been reading the mailing list for three months to understand the joke' stuff given any opportunity - the speaker taking a breath, for instance. All of these and more drive women away - the guy at the back saying "I love women - I appreciate how they look" (or something like that) was just the tip of the iceberg.

I think there's a reasonable area between being patronising and being gruffly neutral in our attempts to encourage women to come along. I think part of the problem is that, just by there being fewer women, us guys feel a bit uncertain. Women don't automatically think that we're talking to them because we're chatting them up or trying to impress them. You can treat someone as an equal without them having to know as much as you in your little special interest fields. While I fear that the next woman who visits a CLUG meeting for the first time is going to be swamped with people trying to make her feel welcome ("have a chair!" "no, have mine!" "this one's warm!"), I think we may obsess too much over the problem to admit that the solution is just to be friendly and supportive - much as we (gender-neutral) all like being treated.

But also, we're going to try organising different meals to the standard Pizza Gannet Feeding Frenzy that CLUGgites call "a good way to wrap up a talk". My proposal is:

That way the gannets still get fed, and tnose of us (me included) who would also like something different (and possibly more nutritious) get to eat too. I don't think it's going to cause a mass influx of women in IT coming along to CLUG meetings, but at least it's a step on the way to making CLUG meetings more appealing to more people.

[1]: Feature one: tabbed panes, so you can keep multiple places in the file system and move between them easily. Feature two: a 'drop box' that you can collect files into and then drop them into your destination. Saves all that confusing control- shift- alt- left-click-with-a-fringe on tip selecting in file list windows in order to grab the files you want.

posted at: 08:36 | path: /tech/clug | permanent link to this entry

Tue, 21 Nov 2006

Jumping into a new project
(Nearly all of this was written after the PSIG meeting on the 9th of November; then I got too busy and didn't finish it off. So "tonight" is two weeks ago as of this posting.)

Tonight at the Programmer's SIG we were 'supposed' to be having a sort of round-table discussion, with people with ideas meeting up with people who know how to implement them. Or, at least, have more knowledge into the way that Linux is organised and may be able to recommend language choices, libraries to look for and people to speak to. If any of those people had actually turned up, this would have happened. But they didn't.

After the usual early round of "Hey have you seen this cool stuff / weird shit" as meals were served (amazingly quickly, this time), I tried to jump start the thing by asking what people's ideas were. Maybe it's just me - this didn't seem to get any real discussion started. Conversation kept revolving around Pascal Klein's idea for rewriting the Linux kernel in C#, and the multivarious reasons why this would be a Bad Thing. As amusing as it is to discuss bad language choices, the things we hate about customers, and what's new on Slashdot, this wasn't really doing it for me as someone who a) has ideas and b) is a programmer.

Despite the good nature of Steve Walsh's teasing, I do worry that I'm talking too much about my own ideas. I say this because we then had a long and quite spirited discussion about how to solve a problem with my backup process. It started with me noting that I'd thought of an improvement to rsync:

At the moment, rsync will only try to synchronise changes to a file if the destination directory has a file with that name. If you've renamed the file, or copied it into a new directory, then rsync (AFAIK) won't recognise that and will copy the entire file again. However, rsync already has a mechanism to recognise which files are the same - it generates a checksum for each file it encounters and only copies the checksums if the file is different. So the idea is for the receiver to check if it already has a file with that checksum somewhere else. There's more to it than this, but I'll develop that in another post.

This all supports my partner's method of backing up her PhD - every once in a while, she takes all the files so far and copies them into a directory named 'Backup date'. Separately to this, I then rsync her entire directory up to my brother's machine in Brisbane, as an off-site backup. While I'm not especially worried about the time it takes or the amount of data transferred, since rsync's principle aim is to reduce both of these I thought it would be a useful improvement to optimise for the case where a file has been renamed on the client - why transmit the whole file again if you can just copy and delete on the server?

I suppose the thing I enjoyed was the idea of co-operatively solving a problem using the tools at everyone's disposal. Several people suggested that Revision Control Systems would be better in this scenario, because they would only store the diffs and would give instant reversion to any point in time. Other people suggested automated folders that would pick up the files in a 'drop' directory, put them in an appropriately labelled directory, and then start a remote copy of the appropriate folder on the remote server. Other people suggested that having two backups was overkill - that as long as I had the remote server updated I could retrieve backup copies should anything go wrong locally. All of these were good suggestions, and despite the problem that they didn't really solve the problem the way I wanted it to be solved, I did really appreciate the new ideas and approaches.

That led me to my next question, which was: rsync is a largish and complicated piece of software. The philosophy of Open Source says that if you have an idea, you should modify the source rather than ask someone else to do it; and I can program in C so the source of rsync wouldn't be foreign to me. So where do I start? One approach suggested was to generate a tags file and start tracing through the execution of the main routine; another was to find the printed text messages that are generated at the time that I want my revision to be used, and start reading from there. A further approach was to draw a concept map - sketch out the top-down design of rsync in order to narrow down the code I had to read. All excellent suggestions, and when I have some spare time I shall try them.

Then we had some real nuts-and-bolts stuff; I showed Hugh how to do Doxygen documentation, and Daniel showed me a bit about autoconf/automake and how to integrate them into my coding. He also suggested a technique of checking for the existence of a library at runtime (e.g. libmagic) in order to determine whether we should call the libmagic routines to check file type; unfortunately I can't now remember what this magical call was. I should have been writing this nine days ago.

It started out not looking so good, but I think it was one of the better Programming SIGs I've been to.

P.S. I've also learnt tonight that, if my WiFi is connecting and then almost immediately disconnecting after showing now signal strength, unloading and reloading the kernel module (after stopping the ipw3945d service) will reset it; starting the ipw3945d service again will get things back on track. Or so it would seem from this initial test.

posted at: 08:25 | path: /tech/clug | permanent link to this entry

Wed, 11 Oct 2006

Come up and see my library.
I'm now quite experienced in using Doxygen to add documentation to code. I've written pretty much all of the documentation for the CFile and progress libraries, and pwlib is pretty small so it'll be easier to do. Now: how do I get everyone to have a look at it?

Well, step 1: put it on my own web site. That's relatively easy: those sons of fun at Ace Hosting have subversion installed; just check the source code out in a directory underneath my public HTML directory. I also added a snipped of .htaccess file to prevent people looking in the .svn directories thus created, more to stop undue confusion and anxiety than any real security threat. Step 2 was a little more work: building the Doxygen HTML documentation. I'll have to get Ace Hosting to install Doxygen, but in the meantime I compiled it and ran it from my own bin directory. LATEX is installed, so it can automatically generate the formulas as .png files and link them in automatically. Hooray!

The third step is to make this available as more than just a directory index. This probably means either kludging some Doxygen documentation together to contain it, or writing a file that can scan through the directories inside my /code directory and create a nice piece of HTML from a template that lists all the source files and links to the Doxygen documentation directly. I'll probably just write a templater; who needs another reason to reimplement the wheel? Although - maybe there's some clever way that Doxygen does this already - I should look into its recursive options...

Anyway, for now, you can see what's done so far in the /code directory. So far only the CFile and progress libraries are documented, but have a browse anyway. Genuflections, derisive laughter, offers of money, peanuts, etc. can be sent here. Of course, if you want to check out the Subversion code and play with it that way, you can retrieve it from svn://tangram.dnsalias.net/pwlib-c/.

posted at: 17:39 | path: /tech/c | permanent link to this entry

Uncompressed bzip2 file sizes the easy way
Cacheing. It's all about cacheing. If you've got some result that took some time to calculate, save it again for later use.

In this case, I'm referring to the uncompressed file size of a bzip2 file. A gzip file is easy - it's the last four bytes as an integer (signed? unsigned? I don't know). But Julian Seward, in his wisdom, didn't put such information in the bzip2 format. So the only way to determine the uncompressed size of a bzip2 file is to physically decompress it. In CFile I do this by opening a pipe to 'bzcat filename | wc -l', which is probably inefficient or insecure in some way, but I reckon does the job better than me rewriting a buffered reader in CFile for the purpose. It means that if you're reading a bzip2 file, you have this potentially long delay if you want to know the file size beforehand. (If you don't care how long the file is before reading it, then you won't incur this time overhead).

So: how do we cache this value for later? In an filesystem extended user attribute, that's how! There's just one minor problem: if you rewrite an existing file, then fopen() and its logical descendents will truncate and rewrite the file, rather than deleting the inode and then creating a new file. Which means that the extended attribute stays where it was, and now refers to the uncompressed size of the wrong file. To solve this, we write a timestamp in another attribute which determines the time that the size was calculated - if the file's modification time is later than this, then the file size attribute is out of date.

(Of course, if the file system doesn't support user extended attributes, then we bork out nicely and calculate the file size from scratch again.)

Of course, you need user extended attributes on your file system. I thought this would be already available in most modern Linux kernels, but no! You have to explicitly turn it on by putting user_xattr in the options section of the file system's mount line in /etc/fstab first. Fortunately, you can do mount -n -o remount / to remount your root file system with the new options - as is so often the case with Linux, you don't have to reboot to set low-level operating system parameters! Yay us! Ext2, Ext3, XFS and ReiserFS all support extended user attributes, too. Once you've done this, you can test it by writing a parameter with something like attr -s user.foo -V bar file to set an attribute and attr -l file to list the file's attributes. You have to use 'user.' as a prefix to specify you want the user attribute namespace.

So, now to write the actual code!

posted at: 11:56 | path: /tech/c | permanent link to this entry

Mon, 09 Oct 2006

talloc and require
BTW, I found a gotcha in recent versions of the talloc library. In order to work on all those platforms that don't have all the standard and not-so-standard utility functions that talloc and Samba are built against, Samba and talloc now have a 'replace' library that implements them all. So in order to build talloc, you now have to check out the replace library - you do:

svn co svn://svnanon.samba.org/samba/branches/SAMBA_4_0/source/lib/replace

Then in the new replace directory do ./autogen.sh, ./configure, and make. You can do a make install to install the replace library, but talloc still seems to want direct access to the replace.o and snprintf.o files in the replace library, so you have to then go back to talloc, ./autogen.sh and ./configure, then link the replace object files:

ln -s ../replace/replace.o
ln -s ../replace/snprintf.o

Then and only then can you do the make in the talloc directory to make the new talloc library. Oh, and if your code sets a destructor function, then you have to change the type of argument it takes, as it's now correctly typed (rather than taking a void *).

posted at: 18:51 | path: /tech/c | permanent link to this entry

CFile now does bzip2
Over the last couple of days, I've actually been getting to do a fair bit of actual coding. It started with me adding the basics of support for bzip2 compression into my CFile library. Then I decided to redo my subversion repository for the libraries I've written so far (cfile, progress and pwlib) into separate directories, rather than the more standard but somewhat restrictive trunk|branches|tags structure that the Subversion book recommends.

Today I've added the rest of the bzip2 support, namely being able to read lines from them. It involved me copying an implementation of fgets that I found in stdio.c and implementing my own fgetc for bzip2 using array buffers. The hard part is detecting EOF, because it seems that the BZ2_bzerror routine doesn't actually return BZ_STREAM_END when the stream is at an end, it just returns BZ_OK. But BZ2_bzread will return 0 bytes if you're at the end of file, so I detect that and return EOF accordingly.

This also gave me the impetus to correctly detect EOF in the rest of the code, something that I hadn't implemented correctly. I'm still not sure I'm obeying whatever ANSI or POSIX guidelines there are on this subject, but the test-cat program I've written reports no differences between the original uncompressed file and being fed through CFile, so I'm assuming I'm doing something right.

My experience, given that I can be reading 1.5GB uncompressed sequence files, is that compressing the inputs and outputs saves not only space, but time (to read and write the files). I noticed the other day that The Gimp also allows you to read and write files with .bz2 or .gz extensions as if they were the uncompressed images. Hopefully the CFile library will give that functionality to people who only want a dependency on Talloc, rather than on half of the Gimp, an external compression program, and enough spare filesystem space for the uncompressed file...

posted at: 17:05 | path: /tech/c | permanent link to this entry

Wed, 04 Oct 2006

Working with other people's code
I've long been a convert to the talloc library for memory allocation, after Tridge's talk mentioned it at LCA 2005. Granted, it's not a sinecure for memory problems - you can still have memory leaking out your program's ears, but at least it's not completely unfreeable. And because it's typed and it does all the things you're supposed to do with memory, like zero the memory immediately after it's allocated and nullify the pointer immediately after it's freed, it does help you avoid the worst excesses of pointer mishandling.

For my own sake, I've written a little library that allows you to treat a gzip-compressed file (with the .gz extension) as being functionally identical to a normal file - you use the same library of functions to read or write to it (although not at the same time) regardless of whether it's compressed or not. When you open the file, the file type is determined automatically and the correct lower-level functions are used transparently. It also provides a useful 'getline' routine, which reallocates your string buffer (using talloc_realloc) on the fly to make enough space for whatever line you're reading.

Now, I've just embarked on a project to read and write information in a simple dictionary section/key/value format - much like a .ini file. So I've started looking at a parser library rather than reinventing this particular wheel (I know, I know, shock horror, has Paul actually learned to do some research?). But it doesn't use talloc and it definitely hasn't heard of my CFile library.

It seems likely to me that someone's going to suggest an object oriented language like C++ here, so that I can extend the class. But as far as I can see this doesn't actually solve the problem. I don't want to add functionality to it (yet), I want to replace functionality that it relies on with other functions of my own choosing. Which means that Nicolas would have had to have written his library with a file and memory abstraction class, so that I could then extend those classes and have iniparser use that new functionality with no recompilation or reprogramming. What if he'd thought to abstract the file functions, but not the memory functions (after all, who expects to not use malloc in C?) I'd still be looking at rewriting part of his code. And, since I'm writing in C, it's all a bit of wishful thinking anyway.

So is there any way to do this? Do I have to modify Nicolas's code? I can search through the Samba codebase, because I'm sure they've implemented a .ini file reader in there, but I want to write the file too and maybe they haven't done that. And they won't be using my CFile library.

So is there a solution? Do we have to go to a new, revolutionary programming language, where somehow you not only override these and perhaps other even more basic functions, but do so knowing what other functions you're overriding and what guarantees you're providing to the code 'above'? Does such a thing exist?

Because you can bet Linus Torvalds' hair that there's no library so good, so all-encompassingly correct, that everyone will use it. talloc, for all its plusses, is not ideal in an environment where every byte may count; or at least there will be some people who will fight to the death to avoid using it. My CFile library is perhaps no-one's idea of good (or someone else would have done it way earlier, and I haven't seen one yet), but I reckon it solves a problem that I think is worth solving. It would be good to have ways of using these libraries without having to rework other people's code - it gives one the temptation to just write everything myself and save myself the trouble.

posted at: 12:29 | path: /tech/c | permanent link to this entry

Fri, 01 Sep 2006

A new approach to the problem
I've revisited a program of mine that I've written for work. The basic idea is to read a set of DNA sequences and produce different files which list all the different 'subsequences' (jargon: oligonucleotides) of different lengths, and how many we saw of each. I presented this idea at a Programmers SIG a while back and the reply came back: write them out and use sort(1) and uniq(1) -c. In my vanity I hadn't thought of this, and was instead making up great complex data structures and slab memory allocators and all sorts of junk. Then I was diverted to work on other projects; when I came back all this looked like so much gibberish.

So I wrote a simple test and showed that, for an average set of virus sequences, I get about files ranging between 24MB (six-letter words, called 'six-mers' in molecular biologist's jargon) to 77MB (21-mers). Sure enough, sort and uniq produce what I want in a not inefficient manner. But I'd like to run this as a single standalone executable, much as that's against the Unix piping ethos. For efficiency reasons I generate all the files simultaneously, and the thought of forking off fifteen separate sort | uniq -c pipes make me shudder. There must be a better way, I think.

The first obvious improvement is to keep the lists in memory and use some kind of in-memory sort function. The files contain about three and a half million words apiece, so it would be possible using qsort(3) to fill a 21MB array with six-mers (since you wouldn't have to store the null at the end). There's a setting in talloc to allow grabbing chunks of memory greater than 64MB in size, so doing even the 21-mers (77MB in size) would be possible using this method.

The problem, though, is that the way I generate the sequences is to generate all the different ranges simultaneously - doing it in one pass through the sequences. Storing all of them in arrays simultaneously would require 812MB (roughly), and this seems to not be much better than my previous dodgy tree implementation.

Then I realised: all the six-mers are just the prefixes of all the seven-mers, plus all the six-mers that didn't fit in seven-mers. This process applies right up the scale. So I could generate an array which contained the longest strings (null-terminated) available at each location (up to the maximum length required), and sort that. If I did that with fixed-length 32-byte strings (more than enough for all the things we've been doing so far) you'd use 112MB or so. That now contains an ordered list of all the strings of lengths between the minimum and maximum we're using. In order to extract all the N-mers, ignore all strings of length less than N, and take the first N characters of the remaining strings. They're still in order, so counting their frequency is a simple matter. You could even do this in parallel, in one pass across the array (which would increase cache read performance).

Consider, for a moment, though, if you can't allocate huge arrays like that. So you have to break the single array into smaller, more manageable arrays, sort each in turn, and do a merge-condense to recombine each block. Which lends itself to parallelisation: each slave is given a contiguous chunk of the sequence (i.e. one with no non-sequence characters in it - like being given a paragraph), and breaks it up into words, sorts them, and then returns the data to the master (either in single messages, or better yet in chunks) for merging.

But think the sort process through: most common sort routines recursively descend through successively smaller chunks of the file, sorting each and then combining the sorted bits back together into larger sorted chunks. Could this process not also combine elements that were the same? So we might sort records of a string and a count (e.g. 30 byte strings and a two-byte count), initially starting each count at one but as each sort finds duplicated strings they be combined and added together? This would also compact the data storage as we go, which might well mean that it might be good to read each sequence into a separate block, which is then sorted and compacted independently (giving a certain granularity to the process) and then the final merge process happens N-way rather than two-way. If that's too complex, make each merge two-way but just do as many two-way merges as necessary to get all the data into one sorted array, which would now be also compacted and contain the counts anyway.

The best part of this AFAICS it it's also neatly parallelisable. I can even write an application, which, if it finds a PVM machine set up, can distribute itself amongst the hosts neatly, but if it doesn't it can still operate in 'single thread' mode and do all the work itself. Which, in turn, means that as long as I keep in mind that at some stage I could distribute the workload, I won't end up having to substantially rewrite the program (as it was looking with the last method).

So, you see, I do do things other than rant and rave sometimes.

posted at: 16:52 | path: /tech/c | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.