efficiency of long (str) keys in python dictionary

眼角桃花 2020-12-31 18:24

I'm parsing some xml (with some python 3.4 code) and want to retrieve both the text from a node and its id attribute. Example:

  <li id="...">Some text here</li>
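
To make the setup concrete, here is a minimal sketch (assumed, not from the question itself) of building such a mapping with xml.etree.ElementTree:

  import xml.etree.ElementTree as ET

  # Hypothetical input shaped like the example above.
  xml_data = '<ul><li id="1">Some text here</li><li id="2">Other text</li></ul>'
  root = ET.fromstring(xml_data)

  # The (potentially long) node texts become the keys; the ids are the values.
  text_to_id = {li.text: li.get('id') for li in root.iter('li')}
  print(text_to_id)  # {'Some text here': '1', 'Other text': '2'}
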
    3 Answers
    • 2020-12-31 18:46

      No, Python string length hardly has an impact on dictionary performance. The only influence the string length could have is on the hash() function used to map the key to a hash table slot.
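
      For intuition, a simplified sketch of that mapping step (the real CPython implementation adds open addressing with a perturbation scheme on top of this):

      key = 'a rather long string key used for illustration'
      table_size = 8                       # CPython dicts start with 8 slots
      slot = hash(key) & (table_size - 1)  # O(len(key)) once, then O(1) via the cache
      print(slot)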

      String length has very little impact on the performance of hash():

      >>> import random
      >>> from timeit import timeit
      >>> from string import ascii_letters
      >>> generate_text = lambda length: ''.join([random.choice(ascii_letters) for _ in range(length)])
      >>> for i in range(8):
      ...     length = 10 + 10 ** i
      ...     testword = generate_text(length)
      ...     timing = timeit('hash(t)', 'from __main__ import testword as t')
      ...     print('Length: {}, timing: {}'.format(length, timing))
      ... 
      Length: 11, timing: 0.061537027359
      Length: 20, timing: 0.0796310901642
      Length: 110, timing: 0.0631730556488
      Length: 1010, timing: 0.0606122016907
      Length: 10010, timing: 0.0613977909088
      Length: 100010, timing: 0.0607581138611
      Length: 1000010, timing: 0.0672461986542
      Length: 10000010, timing: 0.080118894577
      

      I stopped at generating a string of 10 million characters, because I couldn't be bothered waiting for my laptop to generate a 100 million character string.

      The timings are pretty much constant, because the value is actually cached on the string object once computed.
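
      A minimal sketch of that cache in action (assuming CPython, where a str object stores its hash after the first computation):

      from timeit import timeit

      s = 'x' * 10 ** 7  # fresh string; its hash has not been computed yet

      print(timeit(lambda: hash(s), number=1))  # first call computes the hash: O(n)
      print(timeit(lambda: hash(s), number=1))  # second call returns the cached value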

    • 2020-12-31 18:49

      The performance of hash() is indeed O(n) for strings, but the result is cached in the string object - repeated calls use the cached value. This is possible because strings are immutable. Martijn's code uses the repeating feature of timeit (1,000,000 calls by default), so you cannot see this effect: in the last case, the hash code is not actually calculated in 999,999 out of 1,000,000 calls.

      It still is O(n) underneath:

      from timeit import timeit
      
      for i in range(10):
          length = 10 ** i
          # number=1: each hash() call runs exactly once, on a fresh string,
          # so the cached value is never reused
          timing = timeit('hash(t)', 't = "a" * {}'.format(length), number=1)
          print('Length: {:10d}, timing: {:.20f}'.format(length, timing))
      
      Length:          1, timing: 0.00000437500057159923
      Length:         10, timing: 0.00000287900184048340
      Length:        100, timing: 0.00000342299972544424
      Length:       1000, timing: 0.00000459299917565659
      Length:      10000, timing: 0.00002153400055249222
      Length:     100000, timing: 0.00006719700104440562
      Length:    1000000, timing: 0.00066680999952950515
      Length:   10000000, timing: 0.00673243699930026196
      Length:  100000000, timing: 0.04393487600100343116
      Length: 1000000000, timing: 0.39340837700001429766
      

      The deviations from a strictly linear trend are due to timing noise, branch prediction, and the like.
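
      To see where that O(n) surfaces in real dictionary use, a small follow-up sketch (mine, not from the answer): a lookup with the same string object reuses the cached hash, while an equal-but-distinct string pays the full hashing cost once.

      from timeit import timeit

      key = 'a' * 10 ** 7
      d = {key: 1}  # the insertion hashes the key once

      # Same object: cached hash, and the comparison short-circuits on identity.
      print(timeit(lambda: d[key], number=1))

      # Equal but distinct object: hashed from scratch, then compared to the stored key.
      other = 'a' * 10 ** 7
      print(timeit(lambda: d[other], number=1))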

    • 2020-12-31 18:50

      For the most part, @Martijn Pieters's answer is correct - that is, in theory. In practice, however, there is a lot to consider when it comes to performance.

      I recently ran into this problem when using long strings as dictionary keys: I got a timeout error on a coding exercise, apparently just because of Python's dictionary key hashing. I knew this because solving the same exercise with a JavaScript object as a "dictionary" worked just fine - no timeout error.

      Then, since my keys were in fact long strings representing lists of numbers, I made them tuples of numbers instead (any immutable object can be a dictionary key), as sketched below. That worked perfectly as well.
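
      For illustration, a hypothetical sketch of that conversion (the exact string format is an assumption):

      # A long "list of numbers" string key, here assumed comma-separated.
      raw_key = '3,1,4,1,5,9,2,6'

      # Tuples of ints are immutable, so they are valid dictionary keys too.
      tuple_key = tuple(int(n) for n in raw_key.split(','))

      cache = {tuple_key: 'some value'}
      print(cache[(3, 1, 4, 1, 5, 9, 2, 6)])  # 'some value'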

      That being said, I timed the hashing code @Martijn Pieters wrote above, with long number-list strings as keys versus the tuple version. The tuple version took way longer on repl.it, their online Python interpreter. And I am not talking about a 0.1 difference; it was a difference between 0.02 and 12.02.

      Odd, isn't it?! :>

      Now, the point is that every environment varies, and the volume of your operations accumulates, so you CANNOT simply say whether a certain operation will be fast or slow. Even a 0.01-second operation, done just 1000 times, makes the user wait 10 seconds.

      For any production environment, you really want to optimize your algorithm where needed and always use good design. For ordinary software, it saves your users' valuable time. For cloud services, it is dollar bills we are talking about.

      Lastly, I definitely DO NOT recommend using long strings as keys, because of the inconsistent results I got across environments. You generally want to use the ids as keys and iterate through the string values to find an id when you need one; a minimal sketch follows below. But if you have to use long strings as keys, consider limiting the number of operations on the dictionary. Keeping two versions of the mapping is definitely a waste of space/RAM; the trade-off between performance and memory is another lesson.
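
      A minimal sketch of that recommended inversion (names and data hypothetical):

      # ids as keys, long texts as values.
      id_to_text = {'1': 'Some text here', '2': 'Other text'}

      def find_id(text):
          # Linear scan over the values; fine when reverse lookups are rare.
          for node_id, node_text in id_to_text.items():
              if node_text == text:
                  return node_id
          return None

      print(find_id('Some text here'))  # '1'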
