The shuffle the lines of a large file thread on c.l.py reminded me of Richard Papworth's superb Cookbook recipe, Retrieving a line at random from a file of unknown size. The wonderful thing about this recipe is that the file only needs to be read once, and it doesn't need to be kept in memory, but every line in the file has an equal chance of being selected.
I came up with a quick hack that allows the selection of multiple lines, but it soon occurred to me that this approach is far too pretty to be restricted to files - you can do the same with any iterator:
def random_items(iterator, items_wanted=1):
selected_items = [None] * items_wanted
for item_index, item in enumerate(iterator):
for selected_item_index in xrange(items_wanted):
if not random.randint(0, item_index):
selected_items[selected_item_index] = item
return selected_items
You can run this over a file; random_items(open(file_name))
, but you can also run it over any other kind of iterator.
This is in Python of course, but I don't see any reason why you couldn't implement this in any language that has iterators.
Fuller version with kind-of-unit-tests: random_items.py
BTW, how do you unit test something that gives random results properly?
Update 10th March: Faster initial allocation of selected_items