"Missed Call" scam - DC Marketing
I got a call from 03 8102 9565 the other evening. When I say "got a call" I
mean "missed a call", since they seem to have tried for about 0.5
seconds to contact me. A brief bit of research turned up a
large
quantity
of
pages
about these numbers, and the antics of DC Marketing, who are based in
the UK with an office here.
Basically, they ring your number but immediately hang up. You don't recognise the number, so out of curiosity you ring back. They then give a rather deceptive spiel which asks you to ring a 1900 number to claim a valuable $40 prize. The $40 prize is actually ringtones, and while one or two might be free, if you accept the prize you're then subscribed to buy ringtones off them. So as well as suckering you for the 1900 call, the 'prize' is dubious at best and deceitful at worst.
Despite DC Marketing's legal bluster, this is almost certainly illegal, but because it's only poor punters losing their money the police aren't going to be getting involved any time soon. This has pissed a few people off enough to track down DC Marketing - one guy has even apparently hung around their post office box with a baseball bat. I can't condone the violence, but to me it's merely a sign that people feel increasingly disempowered by the law processes and by companies hiding behind legal smokescreens while unashamedly ripping people off with dodgy marketing.
Heads up, y'all...
posted at: 23:07 | path: /personal/goat | permanent link to this entry
String List Fixed, Head Screwed On
I think file formats are the bane of programmers everywhere. Certainly
a lot of my troubles with my current program stem from one particularly
bogus format - the CLUSTAL 'aligned' format (extension .ALN).
Normal nucleotide and protein sequences have a structure like this (this is FASTA, for example, although PIR is almost identical and GENBANK, while more verbose, has a similar structure):
>Sequence 1 name AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT >Sequence 2 name CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGTI.e. a sequence name with a specific prefix, then everything from there until the next name line is part of that sequence. Clustal is a program for displaying 'aligned' sequences - where special 'zero-length' gaps are inserted into sequences to make them 'line up'. The 'lining up' process is a black art unto itself, but we won't worry about that here. Clustal displays the sequences as rows and the bases aligned in columns, and I shudder to think what hideous perversion caused the programmers to think that this was a more efficient structure to retrieve their data in:
CLUSTALW 1.8.2 format header Sequence_1_name AAAAAGCTGTCAGACTGTGCGTGCTTGCTGGA Sequence_2_name CCCCCCCCCCCCCCCGCGGCGGCGCGGCCGCG Sequence_1_name CTGGTGCTGAAGAGACGTTGCGAGCGACGTGT Sequence_2_name GCGCGCCGGCGCTTGTGTGGTGTGTTGGTGGTEven with a memory limit much smaller than the size of the file, I can think of at least two ways to quickly and easily retrieve the data for any 'rectangular area' of sequence from a FASTA or PIR file. You might need to keep a list of byte offsets where each sequence's data starts in the file, but unless you're doing something really evil (i.e. calculating the offset for a particular sequence derived from the line length and the header length alone) that's what you'd have to do anyway. So you've done nothing good at the expense of sequence name readability (did I mention that it's traditional in ALN files to smash the name down to the first word, so you get lots of sequences with names like GI:37592756 and no clue about what they actually are...). Goody.
For me, it means that I have to read the entire file into memory, appending text to each sequence in turn (and I make no assumptions about whether the sequence order changes per 'block', either...). It means that my sequences are stored as a linked list of strings - i.e. the Stringlist structure has one pointer to a char, and one pointer to the next Stringlist. To make appending slightly less stupid, we keep a pointer to the first and the last Stringlist structures in the list, and therefore appending doesn't require us to traverse the list every time. That was a big speed-up from my stupid first implementation right there...
The only problem is that this method has fallen foul of the talloc block overhead. Each string added therefore incurs the talloc overhead for the linked list structure and the string copy - the former being only sixteen bytes in size (on my 64-bit machine). Not only that, but in order to use it as one string we have to allocate a new string with enough space to hold the entire thing, copy each of the list strings in turn, and then free the entire list structure (which is still handled neatly by talloc but does still mean a complete list traversal, effectively).
The solution was obvious when I looked at it in hindsight: each stringlist structure now contains an array of 1024 chars, into which strings are agglomerated. When we run out of characters in the last block we simply create a new block and put the string into it. So, OK, each string in the list may not be 100% full, but it doesn't really need to be - we're already getting a huge saving in talloc overhead, list traversal and string copying anyway.
The 'stringification' process above now doesn't reorganise the linked list. But that's OK, since we pretty much only use the stringified sequence once anyway, and the caller was getting a talloc_strdup'ed copy of the stringifed sequence anyway. So in a way the code just got simpler... And that can only be good. (And we're probably about to discard the sequence from its storage structure anyway...)
Now to work out the finer points of the Rainer structure. Block
agglomeration is good!
posted at: 13:55 | path: /tech/c | permanent link to this entry
All posts licensed under the CC-BY-NC license. Author Paul Wayper.
You can also read this blog as a syndicated RSS feed.