TIL: Python hashing implementation details

I'm doing some research into Python internals as they relate to performance, because I'm currently working on a job that involves optimizing the runtime of some very performance-critical code. So I've been digging around in the CPython source again, and as with every other time I've investigated the source, I found myself going down some rabbit holes. Here are two such adventures from today:

Memory layout guarantees

Today I learned that since the implementation of PEP-393, bytes objects and Unicode strings that contain only ASCII text are guaranteed to have the same memory layout for their character data. Because of this, the hash implementation produces the same hash values even though the underlying types are different:

Python 3.3 - 3.5

>>> hash("test") == hash(b"test")
True

Python 2.6 - 2.7.12

>>> hash("test") == hash(u"test")
True

This means you can have hash collisions between distinct dictionary keys. It's likely an uncommon occurrence in production code, but I found it interesting. (If, for example, you had a class backed by bytes or string storage and you wanted distinct hashes, you could always override __hash__ on your class to make the results distinct.) In any case, it's nice to know that the shared memory layout isn't a coincidence in the standard library: under PEP-393, ASCII-only strings are stored internally one byte per character, so 7-bit ASCII text ends up with exactly the same byte layout as the equivalent bytes object.
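To see this in practice, here's a small sketch. The Label class is a hypothetical example of the __hash__ override mentioned above, not anything from the standard library:

```python
class Label:
    """Hypothetical wrapper storing its value as bytes."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def __eq__(self, other):
        return isinstance(other, Label) and self.payload == other.payload

    def __hash__(self):
        # Mix the type into the hash so a Label no longer shares a
        # hash value with the raw bytes it wraps.
        return hash((Label, self.payload))


# On CPython, an ASCII str and the equivalent bytes hash identically,
# because both hash the same underlying byte buffer:
print(hash("test") == hash(b"test"))  # True

# They are still distinct dict keys ("test" != b"test"); the shared
# hash only means they land in the same hash bucket:
d = {"test": 1, b"test": 2}
print(len(d))  # 2
```

Note that equal hashes never make keys collide *logically*; the dict still compares keys with ==, so the cost is only the extra equality check when buckets are shared.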

Hash info and which hashing function is used

Reading PEP-456 was a great introduction to some changes that have occurred in the default hashing implementation since the Python 3.4 release.

As per that PEP they added information to sys.hash_info about which algorithm is being used:

Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)

As seen here, the default hash implementation is SipHash24, and the PEP explains why it was chosen over the Fowler-Noll-Vo (FNV) hash function that was the old default. (FNV is not a cryptographic hash and has predictable collisions as a result; see this article for details on how that can lead to a DoS attack.)
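That DoS concern is also why CPython randomizes string hashes per process. One way to see it, sketched here with subprocesses, is that pinning PYTHONHASHSEED makes hash values reproducible across interpreter runs, while the default randomization makes them differ:

```python
import os
import subprocess
import sys

CODE = 'print(hash("some string"))'


def hash_in_subprocess(seed=None):
    """Run hash() in a fresh interpreter, optionally pinning the seed."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.run([sys.executable, "-c", CODE],
                         capture_output=True, text=True, env=env)
    return out.stdout.strip()


# With a pinned seed, every run agrees:
fixed = {hash_in_subprocess(seed=0) for _ in range(3)}
print(len(fixed))  # 1

# With randomization (the default), runs almost always disagree:
random_runs = {hash_in_subprocess() for _ in range(3)}
print(len(random_runs))
```

This is worth remembering when debugging: anything that accidentally depends on hash values (e.g. set iteration order) can change from run to run.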

Now the interesting thing is that CPython fast-paths a different hashing function for smaller inputs, because SipHash24 has a bit of a slow start-up time.

Looking at the CPython source for the 3.4.7 release, we see in Objects/object.c:

Py_hash_t
PyObject_HashNotImplemented(PyObject *v)
{
    PyErr_Format(PyExc_TypeError, "unhashable type: '%.200s'",
                 Py_TYPE(v)->tp_name);
    return -1;
}

Py_hash_t
PyObject_Hash(PyObject *v)
{
    PyTypeObject *tp = Py_TYPE(v);
    if (tp->tp_hash != NULL)
        return (*tp->tp_hash)(v);
    /* To keep to the general practice that inheriting
     * solely from object in C code should work without
     * an explicit call to PyType_Ready, we implicitly call
     * PyType_Ready here and then check the tp_hash slot again
     */
    if (tp->tp_dict == NULL) {
        if (PyType_Ready(tp) < 0)
            return -1;
        if (tp->tp_hash != NULL)
            return (*tp->tp_hash)(v);
    }
    /* Otherwise, the object can't be hashed */
    return PyObject_HashNotImplemented(v);
}

Essentially, a call to hash an object dispatches through the type's tp_hash slot to the appropriate hash function. When we then go to Include/pyhash.h, we start to see how the hash function itself is chosen.
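The PyObject_HashNotImplemented fallback is reachable from pure Python too: assigning __hash__ = None on a class installs that function in the tp_hash slot, which is how mutable built-ins like list opt out of hashing. A quick sketch:

```python
class Unhashable:
    # Assigning None here sets tp_hash to PyObject_HashNotImplemented,
    # producing the same TypeError that the C code above formats.
    __hash__ = None


try:
    hash(Unhashable())
except TypeError as exc:
    print(exc)  # unhashable type: 'Unhashable'
```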

First, there's a significant implementation difference between 32-bit and 64-bit machines. This is well commented in the code:

/* hash secret
 * memory layout on 64 bit systems
 *   cccccccc cccccccc cccccccc  uc -- unsigned char[24]
 *   pppppppp ssssssss ........  fnv -- two Py_hash_t
 *   k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T
 *   ........ ........ ssssssss  djbx33a -- 16 bytes padding + one Py_hash_t
 *   ........ ........ eeeeeeee  pyexpat XML hash salt
 * memory layout on 32 bit systems
 *   cccccccc cccccccc cccccccc  uc
 *   ppppssss ........ ........  fnv -- two Py_hash_t
 *   k0k0k0k0 k1k1k1k1 ........  siphash -- two PY_UINT64_T (*)
 *   ........ ........ ssss....  djbx33a -- 16 bytes padding + one Py_hash_t
 *   ........ ........ eeee....  pyexpat XML hash salt
 * (*) The siphash member may not be available on 32 bit platforms without
 *     an unsigned int64 data type.
 */
#ifndef Py_LIMITED_API
typedef union {
    /* ensure 24 bytes */
    unsigned char uc[24];
    /* two Py_hash_t for FNV */
    struct {
        Py_hash_t prefix;
        Py_hash_t suffix;
    } fnv;
#ifdef PY_UINT64_T
    /* two uint64 for SipHash24 */
    struct {
        PY_UINT64_T k0;
        PY_UINT64_T k1;
    } siphash;
#endif
    /* a different (!) Py_hash_t for small string optimization */
    struct {
        unsigned char padding[16];
        Py_hash_t suffix;
    } djbx33a;
    struct {
        unsigned char padding[16];
        Py_hash_t hashsalt;
    } expat;
} _Py_HashSecret_t;
PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
#endif /* Py_LIMITED_API */

So, as mentioned in the PEP, all the familiar hashes are here, with djbx33a used for small strings. But how small does a string have to be to be categorized as small?

We go down a few more lines and find this:

/* cutoff for small string DJBX33A optimization in range [1, cutoff).
 * About 50% of the strings in a typical Python application are smaller than
 * 6 to 7 chars. However DJBX33A is vulnerable to hash collision attacks.
 * NEVER use DJBX33A for long strings!
 * A Py_HASH_CUTOFF of 0 disables small string optimization. 32 bit platforms
 * should use a smaller cutoff because it is easier to create colliding
 * strings. A cutoff of 7 on 64bit platforms and 5 on 32bit platforms should
 * provide a decent safety margin.
 */
#ifndef Py_HASH_CUTOFF
#  define Py_HASH_CUTOFF 0
#elif (Py_HASH_CUTOFF > 7 || Py_HASH_CUTOFF < 0)
#  error Py_HASH_CUTOFF must in range 0...7.
#endif /* Py_HASH_CUTOFF */
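For reference, DJBX33A itself is Daniel J. Bernstein's classic "times 33" hash. Here's an illustrative pure-Python sketch of the core loop; CPython's inline version additionally XORs in the string length and a random suffix from _Py_HashSecret, which this sketch omits:

```python
def djbx33a(data: bytes) -> int:
    """Sketch of the DJBX33A core loop, not CPython's salted variant."""
    h = 5381  # DJBX33A's traditional starting value
    for byte in data:
        # h = h * 33 + byte, wrapped to 64 bits like Py_uhash_t
        h = (h * 33 + byte) & 0xFFFFFFFFFFFFFFFF
    return h


print(djbx33a(b"a"))  # 177670  (5381 * 33 + 97)
```

The multiply-by-33 is just a shift and an add (h << 5) + h, which is why it's so cheap for a handful of bytes compared to SipHash24's setup cost.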

This is a boundary I'll have to keep in mind when testing for performance, since the switchover could be significant. Worth noting, though: the default Py_HASH_CUTOFF is 0, which disables the small-string optimization entirely, matching the cutoff=0 in the sys.hash_info output above. I also like the warning here that the short-string implementation is not suitable for long strings, a potential pitfall I'll now be extra careful to avoid given how clearly the consequences are spelled out.
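When I do benchmark this, one gotcha to remember is that str objects cache their hash after the first call, so a naive timeit loop over the same objects mostly measures the cache. A sketch that hashes each string exactly once (the string lengths chosen here are arbitrary):

```python
import timeit


def time_hashes(n, count=20_000):
    """Time `count` first-time hashes of fresh strings of length ~n."""
    vals = [("x" * n) + format(i, "08d") for i in range(count)]
    # number=1: each string is hashed exactly once, before its hash
    # gets cached on the object by CPython.
    return timeit.timeit(lambda: [hash(v) for v in vals], number=1)


for n in (4, 8, 64, 1024):
    print(f"len~{n + 8}: {time_hashes(n):.4f}s for 20k fresh hashes")
```

The list-comprehension overhead is the same for every length, so relative differences between lengths are still meaningful even though the absolute numbers include loop cost.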