Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?

别跟我提以往 2021-01-18 07:58

I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?

Current dictionary keys are up to 20 characters long; would converting them to shorter keys, such as 4-digit numbers, speed up lookups?

2 Answers
  • 2021-01-18 08:25

    Python dictionaries are implemented as hash maps behind the scenes. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length. But in general the performance impact will be negligible.

    So I'd say there is little to no benefit for the added complexity.
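    If you want to verify this on your own machine, one rough way is to time lookups with a long string key against a short integer key. A minimal sketch using timeit (the key values are made up; note that a string caches its hash after the first use, so the per-lookup difference stays tiny):

    ```python
    import timeit

    long_key = "abcdefghij0123456789"   # a 20-character string key
    d_str = {long_key: 1}
    d_int = {1234: 1}

    n = 1_000_000
    # Time a million lookups with each key type.
    t_str = timeit.timeit("d[k]", globals={"d": d_str, "k": long_key}, number=n)
    t_int = timeit.timeit("d[k]", globals={"d": d_int, "k": 1234}, number=n)

    print(f"20-char str key: {t_str / n * 1e9:.0f} ns per lookup")
    print(f"4-digit int key: {t_int / n * 1e9:.0f} ns per lookup")
    ```

    On most machines the two numbers come out within a few nanoseconds of each other, which is why shortening the keys is rarely worth the trouble.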

  • 2021-01-18 08:26

    Dictionaries are hash tables, so looking up a key consists of the following steps (sketched in Python after the list):

    • Hash the key.
    • Reduce the hash to the table size.
    • Index the table with the result.
    • Compare the looked-up key with the input key.
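
    Here is a rough pure-Python sketch of those four steps. The `lookup` helper and its flat-list table are hypothetical; CPython's real implementation is in C and uses a more elaborate probe sequence:

    ```python
    # table: a list of slots, each either None or a (key, value) pair.
    def lookup(table, key):
        h = hash(key)                     # 1. hash the key
        i = h % len(table)                # 2. reduce the hash to the table size
        while True:
            slot = table[i]               # 3. index the table with the result
            if slot is None:
                raise KeyError(key)
            stored_key, value = slot
            if stored_key == key:         # 4. compare the stored key with the input key
                return value
            i = (i + 1) % len(table)      # collision: probe the next slot (simplified)
    ```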

    Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.


    Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.

    On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.

    From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
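    A rough way to reproduce that kind of micro-benchmark yourself (exact numbers will vary by machine and Python version, and the names here are only illustrative):

    ```python
    import random
    import string
    import timeit

    n = 100_000

    # Fresh 20-letter strings: each hash() call computes the hash for the
    # first time. (Building the string happens inside the loop, so this
    # somewhat overstates the pure hashing cost.)
    fresh = timeit.timeit(
        "hash(''.join(random.choices(string.ascii_letters, k=20)))",
        setup="import random, string",
        number=n,
    )

    # A small integer: hashing is essentially the value itself.
    small_int = timeit.timeit("hash(1234)", number=n)

    # The same string object every time: after the first call, str returns
    # its cached hash without re-scanning the characters.
    s = ''.join(random.choices(string.ascii_letters, k=20))
    cached = timeit.timeit("hash(s)", globals={"s": s}, number=n)

    print(f"fresh 20-char str: {fresh / n * 1e9:.0f} ns (includes building the string)")
    print(f"4-digit int:       {small_int / n * 1e9:.0f} ns")
    print(f"cached str hash:   {cached / n * 1e9:.0f} ns")
    ```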


    Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.

    Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
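
    To make the pathological case concrete, here's a hypothetical key class whose instances all hash to the same value, which forces every insert and lookup to probe through the colliding entries regardless of what the keys "look like":

    ```python
    class BadKey:
        """All instances collide, so dict operations degrade to linear scans."""
        def __init__(self, n):
            self.n = n
        def __hash__(self):
            return 42                         # every key lands in the same bucket chain
        def __eq__(self, other):
            return isinstance(other, BadKey) and self.n == other.n

    d = {BadKey(i): i for i in range(2000)}   # insertion is quadratic overall
    print(d[BadKey(1999)])                    # each lookup probes through many colliding slots
    ```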


    More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
