March 09, 2005
Picking random items from an iterator

The shuffle the lines of a large file thread on reminded me of Richard Papworth's superb Cookbook recipe, Retrieving a line at random from a file of unknown size. The wonderful thing about this recipe is that the file only needs to be read once, and it doesn't need to be kept in memory, but every line in the file has an equal chance of being selected.

I came up with a quick hack that allows the selection of multiple lines, but it soon occurred to me that this approach is far too pretty to be restricted to files - you can do the same with any iterator:

def random_items(iterator, items_wanted=1):
    selected_items = [None] * items_wanted

    for item_index, item in enumerate(iterator):
        for selected_item_index in xrange(items_wanted):
            if not random.randint(0, item_index):
                selected_items[selected_item_index] = item

    return selected_items

You can run this over a file; random_items(open(file_name)), but you can also run it over any other kind of iterator.

This is in Python of course, but I don't see any reason why you couldn't implement this in any language that has iterators.

Fuller version with kind-of-unit-tests:

BTW, how do you unit test something that gives random results properly?

Update 10th March: Faster initial allocation of selected_items

Posted to Python by Simon Brunning at March 09, 2005 01:00 PM
Post a comment

Email Address:



Remember info?