|Title:||Secure and interchangeable hash algorithm|
|Last-Modified:||2014-03-16 14:43:06 +1000 (Sun, 16 Mar 2014)|
|Author:||Christian Heimes <christian at python.org>|
|Post-History:||06-Oct-2013, 14-Nov-2013, 20-Nov-2013|
- Requirements for a hash function
- Current implementation with modified FNV
- Examined hashing algorithms
- Small string optimization
- C API additions
- Python API addition
- Necessary modifications to C code
- Backwards Compatibility
- Alternative counter measures against hash collision DoS
This PEP proposes SipHash as default string and bytes hash algorithm to properly fix hash randomization once and for all. It also proposes modifications to Python's C code in order to unify the hash code and to make it easily interchangeable.
Despite the last attempt [issue13703] CPython is still vulnerable to hash collision DoS attacks [29c3] [issue14621] . The current hash algorithm and its randomization is not resilient against attacks. Only a proper cryptographic hash function prevents the extraction of secret randomization keys. Although no practical attack against a Python-based service has been seen yet, the weakness has to be fixed. Jean-Philippe Aumasson and Daniel J. Bernstein have already shown how the seed for the current implementation can be recovered [poc] .
Furthermore the current hash algorithm is hard-coded and implemented multiple times for bytes and three different Unicode representations UCS1, UCS2 and UCS4. This makes it impossible for embedders to replace it with a different implementation without patching and recompiling large parts of the interpreter. Embedders may want to choose a more suitable hash function.
Finally the current implementation code does not perform well. In the common case it only processes one or two bytes per cycle. On a modern 64-bit processor the code can easily be adjusted to deal with eight bytes at once.
This PEP proposes three major changes to the hash code for strings and bytes:
- SipHash [sip] is introduced as default hash algorithm. It is fast and small despite its cryptographic properties. Due to the fact that it was designed by well known security and crypto experts, it is safe to assume that its secure for the near future.
- The existing FNV code is kept for platforms without a 64-bit data type. The algorithm is optimized to process larger chunks per cycle.
- Calculation of the hash of strings and bytes is moved into a single API function instead of multiple specialized implementations in Objects/object.c and Objects/unicodeobject.c . The function takes a void pointer plus length and returns the hash for it.
- The algorithm can be selected at compile time. FNV is guaranteed to exist on all platforms. SipHash is available on the majority of modern systems.
- It MUST be able to hash arbitrarily large blocks of memory from 1 byte up to the maximum ssize_t value.
- It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits on 64-bit platforms. (Note: Larger outputs can be compressed with e.g. v ^ (v >> 32) .)
- It MUST support hashing of unaligned memory in order to support hash(memoryview).
- It is highly RECOMMENDED that the length of the input influences the outcome, so that hash(b'\00') != hash(b'\x00\x00') .
The internal interface code between the hash function and the tp_hash slots implements special cases for zero length input and a return value of -1 . An input of length 0 is mapped to hash value 0 . The output -1 is