问题
I've been playing with Python's hash function. For small integers, it appears hash(n) == n
always. However this does not extend to large numbers:
>>> hash(2**100) == 2**100
False
I'm not surprised, I understand hash takes a finite range of values. What is that range?
I tried using binary search to find the smallest number hash(n) != n
>>> import codejamhelpers # pip install codejamhelpers
>>> help(codejamhelpers.binary_search)
Help on function binary_search in module codejamhelpers.binary_search:
binary_search(f, t)
Given an increasing function :math:`f`, find the greatest non-negative integer :math:`n` such that :math:`f(n) \le t`. If :math:`f(n) > t` for all :math:`n \ge 0`, return None.
>>> f = lambda n: int(hash(n) != n)
>>> n = codejamhelpers.binary_search(f, 0)
>>> hash(n)
2305843009213693950
>>> hash(n+1)
0
What's special about 2305843009213693951? I note it's less than sys.maxsize == 9223372036854775807
Edit: I'm using Python 3. I ran the same binary search on Python 2 and got a different result 2147483648, which I note is sys.maxint+1
I also played with [hash(random.random()) for i in range(10**6)]
to estimate the range of hash function. The max is consistently below n above. Comparing the min, it seems Python 3's hash is always positively valued, whereas Python 2's hash can take negative values.
回答1:
Based on python documentation in pyhash.c file:
For numeric types, the hash of a number x is based on the reduction of x modulo the prime
P = 2**_PyHASH_BITS - 1
. It's designed so thathash(x) == hash(y)
whenever x and y are numerically equal, even if x and y have different types.
So for a 64/32 bit machine, the reduction would be 2 _PyHASH_BITS - 1, but what is _PyHASH_BITS
?
You can find it in pyhash.h header file which for a 64 bit machine has been defined as 61 (you can read more explanation in pyconfig.h
file).
#if SIZEOF_VOID_P >= 8
# define _PyHASH_BITS 61
#else
# define _PyHASH_BITS 31
#endif
So first off all it's based on your platform for example in my 64bit Linux platform the reduction is 261-1, which is 2305843009213693951
:
>>> 2**61 - 1
2305843009213693951
Also You can use math.frexp
in order to get the mantissa and exponent of sys.maxint
which for a 64 bit machine shows that max int is 263:
>>> import math
>>> math.frexp(sys.maxint)
(0.5, 64)
And you can see the difference by a simple test:
>>> hash(2**62) == 2**62
True
>>> hash(2**63) == 2**63
False
Read the complete documentation about python hashing algorithm https://github.com/python/cpython/blob/master/Python/pyhash.c#L34
As mentioned in comment you can use sys.hash_info
(in python 3.X) which will give you a struct sequence of parameters used for computing
hashes.
>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)
>>>
Alongside the modulus that I've described in preceding lines, you can also get the inf
value as following:
>>> hash(float('inf'))
314159
>>> sys.hash_info.inf
314159
回答2:
2305843009213693951
is 2^61 - 1
. It's the largest Mersenne prime that fits into 64 bits.
If you have to make a hash just by taking the value mod some number, then a large Mersenne prime is a good choice -- it's easy to compute and ensures an even distribution of possibilities. (Although I personally would never make a hash this way)
It's especially convenient to compute the modulus for floating point numbers. They have an exponential component that multiplies the whole number by 2^x
. Since 2^61 = 1 mod 2^61-1
, you only need to consider the (exponent) mod 61
.
See: https://en.wikipedia.org/wiki/Mersenne_prime
回答3:
Hash function returns plain int that means that returned value is greater than -sys.maxint
and lower than sys.maxint
, which means if you pass sys.maxint + x
to it result would be -sys.maxint + (x - 2)
.
hash(sys.maxint + 1) == sys.maxint + 1 # False
hash(sys.maxint + 1) == - sys.maxint -1 # True
hash(sys.maxint + sys.maxint) == -sys.maxint + sys.maxint - 2 # True
Meanwhile 2**200
is a n
times greater than sys.maxint
- my guess is that hash would go over range -sys.maxint..+sys.maxint
n times until it stops on plain integer in that range, like in code snippets above..
So generally, for any n <= sys.maxint:
hash(sys.maxint*n) == -sys.maxint*(n%2) + 2*(n%2)*sys.maxint - n/2 - (n + 1)%2 ## True
Note: this is true for python 2.
回答4:
The implementation for the int type in cpython can be found here.
It just returns the value, except for -1
, than it returns -2
:
static long
int_hash(PyIntObject *v)
{
/* XXX If this is changed, you also need to change the way
Python's long, float and complex types are hashed. */
long x = v -> ob_ival;
if (x == -1)
x = -2;
return x;
}
来源:https://stackoverflow.com/questions/37612524/when-is-hashn-n-in-python