Selecting random elements from iterator (or list) using python with reservoir sampling algorithm


Reservoir sampling is a technique to select N elements from a collection of unknown size. You can use this algorithm to select elements from a list also, however this algorithm is recommended for an iterator of a unknown size. Remeber that is not trivial to select random elements from a list, or to be as much as close to random.

Use this algorithm only for big collections!

The code

import random

def select_k_random_elements_from(collection_iterator, k):
    result = {}
    n = 0

    for item in collection_iterator:
        n += 1
        if len(result) < k:
            result[n - 1] = item
            selected_index = int(random.random() * n)
            if selected_index < k:
                result[selected_index] = item
    return result.values()

The algorithm starts with the first k elements as a random selection candidate, this makes sound if you think that for a collection of size n=k the k random elemnts will be the whole collection. When the first k elements are selected, the algorithm starts to generate random indexes and if the condition selected_index < k was met it uses the current item in the result. As you can see the bigger the collection the better the algorithm will work. Everything inside the collection is O(1), we should take some care on the line where the result is being accessed with the operator [] but this is O(1). In the last line we get all the values from the dict, in worst case this could be O(K). The complexity of this algorithm is O(N + K), where N is the size of the collection and K is the number of random elements to select. Remeber that K < N, so the final complexity is O(N). We made this implementation using a list as a result, but other data structures could be used to obtains O(N) complexity. This implementation will run ok when K is not too big.