The shuffle the lines of a large file thread on c.l.py reminded me of Richard Papworth's superb Cookbook recipe, Retrieving a line at random from a file of unknown size. The wonderful thing about this recipe is that the file only needs to be read once, and it doesn't need to be kept in memory, but every line in the file has an equal chance of being selected.
I came up with a quick hack that allows the selection of multiple lines, but it soon occurred to me that this approach is far too pretty to be restricted to files - you can do the same with any iterator:
def random_items(iterator, items_wanted=1):
selected_items = [None] * items_wanted
for item_index, item in enumerate(iterator):
for selected_item_index in xrange(items_wanted):
if not random.randint(0, item_index):
selected_items[selected_item_index] = item
return selected_items
You can run this over a file; random_items(open(file_name)), but you can also run it over any other kind of iterator.
This is in Python of course, but I don't see any reason why you couldn't implement this in any language that has iterators.
Fuller version with kind-of-unit-tests: random_items.py
BTW, how do you unit test something that gives random results properly?
Update 10th March: Faster initial allocation of selected_items
The only problem I see with your code is that sometimes iterators are used in order not to generate the whole sequence (xrange vs range). Your code forces the generation of the sequence in memory. I've been looking for a random iterator "decorator" that can preserve that.
Posted by: max on March 9, 2005 05:41 PMI'm pretty sure that what you'd like isn't possible, Max. In order to select random items lazily, with an equal chance of each item being picked, you'd need to know in advance how many items there are from which to chose.
In the case of a file, that would mean reading the file twice, which is exactly what this fuction avoids. And some iterators can't be processed twice at all.
Posted by: Simon Brunning on March 10, 2005 09:27 AMI think you are right, although I still can't convince myself :-)
You lost me with having to read the file twice to get its size, you can do so with one single seek. Also I was assuming that the original iterator would, at least, give you the size of the sequence, and have random access, or else it definetely wouldn't work.
If you have random access, then the random "decorator" should be just a permutation on the inner iterator's sequence.
In terms of testing, you can easily test that at the end you covered all the items on the inner sequence only once. To test for randomness I guess you'd have to do some analysis of the permutation, but if you are positive that the random() function is working and you are covering all the elements ony once then you should be alright
Posted by: max on March 10, 2005 04:05 PMSeek will give you the number of characters, sure, but you need to know the number of *lines* in order to work out how many items you have. If there's a way to find that without reading the whole file, I don't know what it is.
Iterators don't support random access, nor do they give you a way of finding their size. (They may in fact be infinite.) So, as you say, it won't work. ;-)
Posted by: Simon Brunning on March 10, 2005 04:21 PMyou got me there :)
Posted by: max on March 10, 2005 05:09 PMNote that Richard Papworth's recipe doesn´t care about the number of lines; it works with iterators of unknown size. Only for infinite iterators it would fail, because it still has to read over the entire file.
Posted by: Roberto on March 12, 2005 04:35 PM