March 09, 2005
Picking random items from an iterator

The shuffle the lines of a large file thread on c.l.py reminded me of Richard Papworth's superb Cookbook recipe, Retrieving a line at random from a file of unknown size. The wonderful thing about this recipe is that the file only needs to be read once, and it doesn't need to be kept in memory, but every line in the file has an equal chance of being selected.

I came up with a quick hack that allows the selection of multiple lines, but it soon occurred to me that this approach is far too pretty to be restricted to files - you can do the same with any iterator:

def random_items(iterator, items_wanted=1):
    selected_items = [None] * items_wanted

    for item_index, item in enumerate(iterator):
        for selected_item_index in xrange(items_wanted):
            if not random.randint(0, item_index):
                selected_items[selected_item_index] = item

    return selected_items

You can run this over a file; random_items(open(file_name)), but you can also run it over any other kind of iterator.

This is in Python of course, but I don't see any reason why you couldn't implement this in any language that has iterators.

Fuller version with kind-of-unit-tests: random_items.py

BTW, how do you unit test something that gives random results properly?

Update 10th March: Faster initial allocation of selected_items

Posted to Python by Simon Brunning at March 09, 2005 01:00 PM
Comments

The only problem I see with your code is that sometimes iterators are used in order not to generate the whole sequence (xrange vs range). Your code forces the generation of the sequence in memory. I've been looking for a random iterator "decorator" that can preserve that.

Posted by: max on March 9, 2005 05:41 PM

I'm pretty sure that what you'd like isn't possible, Max. In order to select random items lazily, with an equal chance of each item being picked, you'd need to know in advance how many items there are from which to chose.

In the case of a file, that would mean reading the file twice, which is exactly what this fuction avoids. And some iterators can't be processed twice at all.

Posted by: Simon Brunning on March 10, 2005 09:27 AM

I think you are right, although I still can't convince myself :-)
You lost me with having to read the file twice to get its size, you can do so with one single seek. Also I was assuming that the original iterator would, at least, give you the size of the sequence, and have random access, or else it definetely wouldn't work.
If you have random access, then the random "decorator" should be just a permutation on the inner iterator's sequence.

In terms of testing, you can easily test that at the end you covered all the items on the inner sequence only once. To test for randomness I guess you'd have to do some analysis of the permutation, but if you are positive that the random() function is working and you are covering all the elements ony once then you should be alright

Posted by: max on March 10, 2005 04:05 PM

Seek will give you the number of characters, sure, but you need to know the number of *lines* in order to work out how many items you have. If there's a way to find that without reading the whole file, I don't know what it is.

Iterators don't support random access, nor do they give you a way of finding their size. (They may in fact be infinite.) So, as you say, it won't work. ;-)

Posted by: Simon Brunning on March 10, 2005 04:21 PM

you got me there :)

Posted by: max on March 10, 2005 05:09 PM

Note that Richard Papworth's recipe doesn´t care about the number of lines; it works with iterators of unknown size. Only for infinite iterators it would fail, because it still has to read over the entire file.

Posted by: Roberto on March 12, 2005 04:35 PM
Post a comment
Name:


Email Address:


URL:



Comments:


Remember info?