How to calculate the hash of big files with Python

Introduction

If you calculate the hash of a big file the straightforward way, by reading its whole content into memory at once, you can expect a memory error (MemoryError). To avoid this error we will read the file in several parts and update the hash with each part.
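
For reference, the straightforward approach might look like the sketch below. The name naive_sha1 is a hypothetical helper chosen for illustration; the point is only that read() without a size argument loads the entire file into memory before any hashing happens.

```python
import hashlib


def naive_sha1(path):
    # Hypothetical helper, for illustration only.
    # f.read() with no size argument loads the whole file into memory,
    # which is what can fail with MemoryError on very large files.
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()
```

This works fine for small files; memory usage grows with the size of the file.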

A Python function that reads the file in blocks to calculate the hash

We will use the function hash_value_for_file which takes as parameters:

  • A file object.
  • A hash object from hashlib, for example hashlib.sha1().
  • Block size in bytes, which by default is 2**20 (1 MiB).
import hashlib


def hash_value_for_file(f, hash_function, block_size=2**20):
    while True:
        # we use the read passing the size of the block to avoid
        # heavy ram usage
        data = f.read(block_size)
        if not data:
            # if we don't have any more data to read, stop.
            break
        # we partially calculate the hash
        hash_function.update(data)
    return hash_function.digest()
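
The function relies on the fact that feeding data to update() piece by piece gives the same digest as hashing all the data at once. A minimal sketch of that property, using made-up in-memory data and an arbitrary chunk size of 256 bytes (both are assumptions for the example):

```python
import hashlib

# Made-up sample data, small enough to hash in one call.
data = b'a' * 1000 + b'b' * 1000

# Hash everything at once.
one_shot = hashlib.sha1(data).digest()

# Hash the same data in 256-byte pieces.
incremental = hashlib.sha1()
for start in range(0, len(data), 256):
    incremental.update(data[start:start + 256])

# Both approaches produce the same digest.
chunks_match = incremental.digest() == one_shot
```

Because of this, the block size only affects memory usage and speed, not the resulting hash.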

How to use the function

To use the function we need to open, in binary mode, the file whose hash we want to calculate, and we will use the sha1 function from the standard hashlib module. In the next lines of code we use sha1 to calculate the hash of a big file:


with open('file.txt', 'rb') as input_file:
    sha1 = hashlib.sha1()
    digest = hash_value_for_file(input_file, sha1)

# digest is a bytes object; hex() gives the usual hexadecimal string
print(digest.hex())

In the previous example, if you replace sha1 with another hash function from hashlib you will calculate a different hash; for example, you could try md5.
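
One way to make the hash function easy to swap is to select it by name. The sketch below is an assumption-laden variation rather than the function shown earlier: hex_hash_of_path is a hypothetical helper that takes a path instead of a file object, builds the hash object with hashlib.new(), and returns the hexadecimal string.

```python
import hashlib


def hex_hash_of_path(path, algorithm='md5', block_size=2**20):
    # hex_hash_of_path is a hypothetical helper, not part of hashlib.
    # hashlib.new() creates a hash object from an algorithm name
    # such as 'md5', 'sha1' or 'sha256'.
    hash_object = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(block_size)
        # until it returns b'' at the end of the file.
        for block in iter(lambda: f.read(block_size), b''):
            hash_object.update(block)
    return hash_object.hexdigest()
```

For example, hex_hash_of_path('file.txt', 'sha256') would return the SHA-256 hash of file.txt as a hexadecimal string.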