Optimizing Worst Case Time complexity to O(1) for python dicts [closed]

Submitted by 有些话、适合烂在心里 on 2019-12-18 05:17:15

Question


I have to store 500M two-digit unicode characters in memory (RAM).

The data structure I use should have:

Worst Case Space Complexity: O(n)
Worst Case Time Complexity: O(1) <-- insertion, read, update, deletion

I was thinking of choosing dict, which is Python's hash-table implementation, but the problem is that it guarantees O(1) time complexity for the required operations only in the average case, not the worst case.

I heard that if the number of entries is known, O(1) time complexity can be achieved even in worst-case scenarios.

How do I do that?

In case that's not possible in Python, can I access memory addresses and the data stored at them directly in my Python code? If yes, then how?


Answer 1:


Mostly the performance hits (usually taken on a collision) are amortized over all calls. So for most realistic use, you won't get O(n) for every call. In fact, the only case where you would incur an O(n) hit on every call is the pathological one where every key's hash collides with an existing key's hash value (i.e. the worst possible, or most unfortunate, usage of a hashtable).

If, for example, you know your set of keys beforehand and you know that they will not have hash collisions (i.e. all their hashes are unique), then you will not suffer collision cases. The other major O(n) operation is hashtable resizing, but the frequency of this depends on the implementation (expansion factor/hash function/collision resolution scheme, etc.) and it will also vary run-to-run depending on the input set.

In either case, you can avoid sudden runtime slowdowns if you can pre-populate the dict with all keys. The values can just be set to None and populated with their real values later on. The only noticeable performance hit should then come when "priming" the dict with keys initially, and future value insertion should be constant time.
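
A minimal sketch of that priming idea (the tiny key set below is illustrative, not from the question):

# prime the dict with every known key up front; values start as None
known_keys = ["ab", "cd", "ef"]         # illustrative; the real set would hold the 500M keys
table = dict.fromkeys(known_keys)       # {'ab': None, 'cd': None, 'ef': None}

# later updates hit existing slots, so no insert-triggered resize occurs
table["ab"] = 42

# reads and deletes are ordinary dict operations
value = table["ab"]
del table["cd"]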

A completely different question is how you intend to read/query the structure. Do you need to attach separate values and access them via a key? Should it be ordered? Perhaps a set would be more appropriate than a dict, since you do not really require a key:value mapping.

Update:

Based on your description in the comments, this is starting to sound more like a job for a database to do, even if you are working with a temporary set. You could use an in-memory relational database (e.g. with SQLite). Additionally, you could use an ORM like SQLAlchemy to interact with the database more pythonically and without having to write SQL.
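
For the ORM route, here is a minimal sketch assuming SQLAlchemy 1.4+; the model, table, and column names are illustrative, not taken from the question:

# Minimal SQLAlchemy sketch (assumes SQLAlchemy 1.4+); names below are illustrative.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CharRecord(Base):
    __tablename__ = "chars"                    # illustrative table name
    char_code = Column(Integer, primary_key=True)
    value = Column(String)

engine = create_engine("sqlite:///:memory:")   # in-memory SQLite database
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(CharRecord(char_code=1234, value="x"))
    session.commit()
    record = session.get(CharRecord, 1234)     # look up by primary key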

It even sounds like you might be reading the data from a database to start with, so maybe you can leverage that further?

Storing/querying/updating a massive number of typed records that are keyed uniquely is exactly what RDBMSs have been specialised for, with decades of development and research behind them. Using an in-memory version of a pre-existing relational database (such as SQLite's) will probably be the more pragmatic and sustainable choice.

Try Python's built-in sqlite3 module and use the in-memory version by providing ":memory:" as the database file path on construction:

import sqlite3
con = sqlite3.connect(":memory:")
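
A short sketch of how the in-memory database could then cover the insert/read/update/delete operations; the table and column names are illustrative:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chars (char_code INTEGER PRIMARY KEY, value TEXT)")

# insert or update (INSERT OR REPLACE upserts on the primary key)
con.execute("INSERT OR REPLACE INTO chars (char_code, value) VALUES (?, ?)", (1234, "x"))

# read a single record by its key
row = con.execute("SELECT value FROM chars WHERE char_code = ?", (1234,)).fetchone()

# delete
con.execute("DELETE FROM chars WHERE char_code = ?", (1234,))
con.commit()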



Answer 2:


The dictionary technically has a worst case of O(n), but it's highly unlikely to occur and likely won't in your case. I'd try the dictionary first and only switch to a different implementation if it turns out not to be sufficient for what you want to do.

Here is a useful thread on the subject




Answer 3:


Is there a reason you care about the worst-case performance instead of the average performance? Any reasonable hashtable will give you O(1) average performance per operation.

If you really want worst-case performance of O(1), here are two possible approaches:

  1. Have a vector of max(charCode)-min(charCode) entries and directly look up the value you want from the unicode character code. This will work well if your keys fall in a compact-enough range that you can fit it in RAM.

  2. Use a brute-force approach to choose hash functions or dictionary sizes (using a custom implementation of a dictionary that lets you control this), and keep trying new functions and/or sizes till you get one with no collisions. Expect this to take a very long time. I do not recommend this.
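
For completeness, a rough sketch of what approach 2 could look like, assuming the full key set is known up front; the salted-hash scheme and table size are illustrative assumptions, and this is still not something I'd recommend:

import random

# Rough sketch: brute-force a salt until no two known keys collide in a fixed-size table.
# The salting scheme and table size are illustrative assumptions.
def find_collision_free_salt(keys, table_size, max_tries=100000):
    keys = list(keys)
    for _ in range(max_tries):
        salt = random.getrandbits(64)
        slots = {hash((salt, key)) % table_size for key in keys}
        if len(slots) == len(keys):          # every key landed in its own slot
            return salt
    raise RuntimeError("no collision-free salt found; try a larger table_size")

keys = ["ab", "cd", "ef"]                    # tiny illustrative key set
table_size = 16
salt = find_collision_free_salt(keys, table_size)

table = [None] * table_size
table[hash((salt, "ab")) % table_size] = 42           # store
value = table[hash((salt, "ab")) % table_size]        # O(1) worst-case lookup by index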

EDIT:

Suppose that you know that the minimum character code you'll see is 1234 and the maximum you'll see is 98765. Further suppose that you have enough RAM to hold 98765-1234 elements. I'll also assume you're willing to use the numpy library or some other efficient array implementation. In that case, you can store the values in the vector like this:

import numpy

# configuration info
max_value = 98765 # replace with your number
min_value = 1234  # replace with your number
spread = (max_value - min_value)
dtype = object # replace with a primitive type if you want to store something simpler

# create the big vector
my_data = numpy.empty((spread,), dtype=dtype)

# insert elements
my_char_code              = ...
my_value_for_my_char_code = ...

assert min_value <= my_char_code < max_value
my_data[my_char_code - min_value] = my_value_for_my_char_code

# extract elements
my_char_code              = ...
assert min_value <= my_char_code < max_value
my_value_for_my_char_code = my_data[my_char_code - min_value]

This is O(1) because the lookup is implemented using pointer arithmetic and there's no dependence on the number of elements stored in the array.

This approach can be extremely wasteful of RAM if the number of elements you actually want to store is much smaller than spread. For example, if spread is 4 billion (all of UTF32) then my_data alone will consume at least 4 billion * 8 bytes / pointer = 32 GB of RAM (and probably a lot more; I don't know how big Python references are). On the other hand, if min_value is 3 billion and max_value = min_value + 100, then the memory usage will be tiny.



Source: https://stackoverflow.com/questions/15191918/optimizing-worst-case-time-complexity-to-o1-for-python-dicts
