Too Busy For Words - the PaulWay Blog

Thu 23rd Mar, 2006

Shuffling a list - the power of factorials.

I needed to shuffle a list of DNA sequences for a work program. Having been suitably chastised by Damian Conway's excellent entry in maddog's ticket picking mini-competition, I decided that in future I'd do more searching on CPAN before re-inventing the wheel for the sixteenth time.

I feel so stupid. I downloaded Algorithm::Numeric::Shuffle, written by abigail, as I think it was the first entry that turned up in a search on CPAN. I should have looked by category under List::Util, being a utility function on lists. But anyway. abigail pointed out in the POD that the Perl rand() routine is a linear congruential generator - in other words, from a given random seed the numbers will always follow a predictable pattern. The algorithm works by going through the entire list and swapping the Nth member with another random member. Thus, the actual final ordering is dependent only on the initial RNG seed, which is a 32-bit number.

(The List::Util::shuffle routine suffers from this similarly, despite using a somewhat more bizarre method of giving a unique random ordering to the list.)

Now, 13! is 6227020800, and 232 is 4294967296, so there are more combinations for a 13-element list than a 32-bit number. Which means, if you think about it, that there are some combinations that you will never get from the shuffle routine. It's just simple information theory, like compressing data: there's no random seed that will cause that ordering of elements to come out.

Here's where once again I forgot to learn the basic rule of my programming existence: What You Want Has Probably Already Been Invented. I wrote an email to abigail suggesting that, if the user of the shuffle routine wanted a truly random shuffle, they could pass in a sub reference to their own rand() function, or one from e.g. MCrypt or OpenSSL::Rand, that does at least have more bits of random seed (or, for preference, draws on a 'true' source of randomness.) I even started writing up a revision of abigail's code to do this.

Then I asked on the #perl channel on irc.freenode.net whether the inbuilt rand() function could be passed as a sub reference (it can't apparently). Someone asked why I wanted to do such a thing. I told them the story. I spent some time convincing them that the statistics said it was possible, and that someone might actually care about being statistically correct. Then someone else said "why don't you just override the rand() function for that module?".

Why not indeed! Perl doesn't make package namespaces sacred, so it's perfectly possible to put a sub rand(!) definition in my code after a package List::Util declaration which calls the rand() function I want. Of course, I've yet to work out how to do that, and the random picks of the sequences are good enough for testing so far, so it's a bit of a moot point. But I hate it when I should have known better all along.

I'm somewhat glad that my email to abigail bounced...

Last updated: | path: tech / perl | permanent link to this entry


All posts licensed under the CC-BY-NC license. Author Paul Wayper.


Main index / tbfw/ - © 2004-2023 Paul Wayper
Valid HTML5 Valid CSS!