
Implementation details

CPython uses hash tables with pseudo-random probing as the underlying data structure for dictionaries. This may seem like a very deep implementation detail, but it is unlikely to change in the near future, so it is also a useful fact for the Python programmer to know.

Due to this implementation detail, only hashable objects can be used as dictionary keys. An object is hashable if it has a hash value that never changes during its lifetime and can be compared to other objects. Every immutable built-in Python type is hashable. Mutable types, such as lists, dictionaries, and sets, are not hashable, so they cannot be used as dictionary keys. The protocol that determines whether a type is hashable consists of two methods (a minimal sketch follows the list):

  • __hash__: This provides the hash value (as an integer) that is needed by the internal dict implementation. For objects that are instances of user-defined classes, it is derived from their id() by default.
  • __eq__: This compares whether two objects have the same value. By default, instances of user-defined classes compare as unequal to every object except themselves.

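The following is a minimal sketch of a hashable user-defined class (the Point class and its attributes are hypothetical, used only for illustration):

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def __eq__(self, other):
            if not isinstance(other, Point):
                return NotImplemented
            return (self.x, self.y) == (other.x, other.y)

        def __hash__(self):
            # Equal objects must have equal hashes, so the hash is derived
            # from the same attributes that __eq__ compares
            return hash((self.x, self.y))

    distances = {Point(0, 0): 0.0}
    assert Point(0, 0) in distances  # an equal key is found via its hash

    # Mutable built-in types define no __hash__ at all, so the following
    # would raise TypeError: unhashable type: 'list'
    # distances[[1, 2]] = 1.0
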
Two objects that compare as equal must have the same hash value. The reverse does not need to be true, which means that hash collisions are possible: two objects with the same hash are not necessarily equal. This is allowed, and every Python implementation must be able to resolve hash collisions; CPython uses open addressing for this purpose. The probability of collisions greatly affects dictionary performance and, if it is high, the dictionary will not benefit from its internal optimizations.
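
As a quick illustration (assuming CPython, where hash(-1) is internally reserved as an error marker and is therefore remapped to -2):

    # Equal objects must have equal hashes: 1 == 1.0, so the hashes match
    assert 1 == 1.0 and hash(1) == hash(1.0)

    # The reverse does not hold: -1 and -2 are unequal but collide in CPython
    assert -1 != -2 and hash(-1) == hash(-2)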

While the three basic operations (adding, getting, and deleting an item) have an average time complexity of O(1), their amortized worst-case complexity is considerably higher: O(n), where n is the current size of the dictionary. Additionally, if objects of user-defined classes are used as dictionary keys and they are hashed improperly (with a high risk of collisions), this has a huge negative impact on the dictionary's performance, as the sketch after the table below illustrates. The full table of CPython's time complexities for dictionaries is as follows:

Operation       Average complexity    Amortized worst-case complexity
Get item        O(1)                  O(n)
Set item        O(1)                  O(n)
Delete item     O(1)                  O(n)
Copy            O(n)                  O(n)
Iteration       O(n)                  O(n)
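
The following timing sketch shows how badly a high-collision hash degrades lookups (BadKey and GoodKey are hypothetical classes; the constant hash in BadKey is deliberately pathological):

    import timeit

    class BadKey:
        def __init__(self, value):
            self.value = value

        def __hash__(self):
            # Constant hash: every key collides, so each lookup degrades
            # to a linear scan through the colliding entries
            return 1

        def __eq__(self, other):
            return isinstance(other, BadKey) and self.value == other.value

    class GoodKey(BadKey):
        def __hash__(self):
            # Well-distributed hash derived from the compared attribute
            return hash(self.value)

    for cls in (GoodKey, BadKey):
        keys = [cls(i) for i in range(1000)]
        mapping = dict.fromkeys(keys)
        print(cls.__name__, timeit.timeit(
            lambda: keys[500] in mapping, number=1000))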

It is also important to know that the n in the worst-case complexities for copying and iterating the dictionary is the maximum size that the dictionary ever reached, rather than its size at the time of the operation. In other words, iterating over a dictionary that was once huge but has since shrunk considerably may take a surprisingly long time. Therefore, in some cases, if a shrunken dictionary has to be iterated often, it may be better to create a new dictionary object from it rather than only removing elements from it.
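
A minimal sketch of this technique (the sizes are arbitrary and chosen only to make the effect visible):

    data = {i: None for i in range(1_000_000)}

    # Delete almost all items; the internal storage still reflects the
    # peak size, so iteration remains proportional to it
    for i in range(999_000):
        del data[i]

    # Rebuilding allocates fresh storage sized for the remaining items,
    # so subsequent iterations are proportional to the current size
    data = dict(data)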