Question
I have 400 million lines of unique key-value data that I would like to be available for quick lookups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I am not sure whether there is a way to map the dictionary to disk so that it uses little memory except during dictionary creation.
- Pickled dictionary object: not sure if this is an optimal solution for my problem.
- NoSQL-type databases: ideally I want something with minimal dependency on third-party stuff, plus the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too; maybe it will convince me.
Please let me know if anything is not clear.
Thanks! -Abhi
Answer 1:
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
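A minimal sketch of how that could look, assuming integer keys and values and a simple two-column table with the key as PRIMARY KEY (the file and table names here are just placeholders):

import sqlite3

# Open (or create) a database file on disk.
conn = sqlite3.connect("kv_store.db")
with conn:  # "with conn" commits the transaction on success
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (42, 7))
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (43, 8))

# Look a value up by key; the PRIMARY KEY index keeps this fast.
row = conn.execute("SELECT v FROM kv WHERE k = ?", (42,)).fetchone()
print(row[0] if row else None)
conn.close()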
Answer 2:
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve takes care of pickling/unpickling values. The type of database file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge, so you should do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated, so shelve may not support BDB hashes in the future.
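A rough sketch of the shelve approach, assuming the numeric keys are converted to strings (the filename is just a placeholder):

import shelve

# Create/open the persistent dictionary; the data lives in a file on disk.
with shelve.open("kv_shelf") as db:
    db["12345"] = 6789   # keys must be str; values are pickled automatically
    db["12346"] = 6790

# Reopen later and look values up without loading everything into memory.
with shelve.open("kv_shelf") as db:
    print(db.get("12345"))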
Answer 3:
No one has mentioned dbm. It is opened like a file, behaves like a dictionary, and is in the standard distribution.
From the docs: http://docs.python.org/release/3.0.1/library/dbm.html
import dbm
# Open database, creating it if necessary.
db = dbm.open('cache', 'c')
# Record some values
db[b'hello'] = b'there'
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Note that the keys are considered bytes now.
assert db[b'www.python.org'] == b'Python Website'
# Notice how the value is now in bytes.
assert db['www.cnn.com'] == b'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys() also work.
for k in db.keys():
    print(k, '\t', db[k])
# Storing a non-string key or value will raise an exception (most
# likely a TypeError), for example:
# db['www.yahoo.com'] = 4
# Close when done.
db.close()
I would try this before any of the more exotic options; a plain pickled dict, by contrast, pulls the whole thing into memory when it is loaded.
Cheers
Tim
Answer 4:
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
- Install redis-server
- Start redis server
- Install the redis Python package (pip install redis)
- Profit.
import redis
ds = redis.Redis(host="localhost", port=6379)
with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify the insertion script to your needs.
import redis
ds = redis.Redis(host="localhost", port=6379)
# Do your code that needs to do look ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis.
- Configurable persistence options
- Blazingly fast
- Offers more than just key / value pairs (other data types)
- @antirez
Answer 5:
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
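As a sketch of how the load could be done without holding everything in memory, assuming the same key=value text format shown in another answer and integer keys/values (the file name, table name, and batch size are placeholders):

import sqlite3
import itertools

def pairs(path):
    # Yield (key, value) tuples from a "key=value" text file.
    with open(path) as fh:
        for line in fh:
            k, _, v = line.strip().partition("=")
            yield int(k), int(v)

conn = sqlite3.connect("kv_store.db")
with conn:
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

# Stream the file and insert in batches so memory use stays flat.
it = pairs("your_text_file.txt")
while True:
    batch = list(itertools.islice(it, 100000))
    if not batch:
        break
    with conn:  # one transaction per batch
        conn.executemany("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", batch)
conn.close()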
Answer 6:
I personally use LMDB and its Python binding for a DB with a few million records. It is extremely fast, even for a database larger than RAM. It is embedded in the process, so no server is needed, and dependencies are managed with pip.
The only downside is that you have to specify the maximum size of the DB up front. LMDB will mmap a file of this size. If it is too small, inserting new data will raise an error; if it is too large, you create a sparse file.
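A small sketch of what that looks like with the py-lmdb binding (the path, map_size, and keys here are placeholders; keys and values must be bytes):

import lmdb

# map_size is the maximum size the database may grow to (here ~10 GB).
env = lmdb.open("kv.lmdb", map_size=10 * 1024**3)

# Write some entries inside a write transaction.
with env.begin(write=True) as txn:
    txn.put(b"12345", b"6789")
    txn.put(b"12346", b"6790")

# Lookups go through a read transaction and the OS page cache.
with env.begin() as txn:
    print(txn.get(b"12345"))

env.close()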
Source: https://stackoverflow.com/questions/11837229/large-python-dictionary-with-persistence-storage-for-quick-look-ups