Search of Dictionary Keys python

被撕碎了的回忆 2021-02-20 11:46

I want to know how I could perform some kind of index on keys from a python dictionary. The dictionary holds approx. 400,000 items, so I am trying to avoid a linear search.

6 Answers
  • 2021-02-20 11:47

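    Perhaps using has_key would solve this too. Note that has_key exists only in Python 2; Python 3 uses the in operator instead. Either way, this tests for an exact key, not a partial match.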

    http://docs.python.org/release/2.5.2/lib/typesmapping.html
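
    As a minimal sketch of the exact-match lookup mentioned above (the dictionary contents here are made up for illustration), the in operator is a hash lookup and does not scan the keys:

```python
# Hypothetical sample data, for illustration only.
my_dict = {"stack": 1, "star": 2}

# Membership tests use the hash table, so they do not iterate over the keys.
has_exact = "stack" in my_dict    # an exact key
has_partial = "sta" in my_dict    # only a prefix of some keys, not a key itself
print(has_exact, has_partial)
```

    The second test is False because a dictionary can only look up whole keys; partial matches need one of the approaches in the other answers.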

  • 2021-02-20 11:54

    You could join all the keys into one long string with a suitable separator character and use the find method of the string. That is pretty fast.

    Perhaps this code is helpful to you. The search method returns a list of dictionary values whose keys contain the substring key.

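    class DictLookupBySubstr(object):
        def __init__(self, dictionary, separator='\n'):
            # The separator must not occur in any key or in a search string.
            self.dic = dictionary
            self.sep = separator
            # All keys joined into one string, each followed by the separator.
            self.txt = separator.join(dictionary.keys())+separator
    
        def search(self, key):
            if not key:
                # Every key contains the empty string.
                return list(self.dic.values())
            res = []
            i = self.txt.find(key)
            while i >= 0:
                # Expand the hit to the enclosing separators to recover the full key.
                left = self.txt.rfind(self.sep, 0, i) + 1
                right = self.txt.find(self.sep, i)
                dic_key = self.txt[left:right]
                res.append(self.dic[dic_key])
                # Resume after this key so each key is reported at most once.
                i = self.txt.find(key, right+1)
            return res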
    
  • 2021-02-20 11:59

    If you only need to find keys that start with a prefix then you can use a binary search. Something like this will do the job:

    import bisect

    words = sorted("""
    a b c stack stacey stackoverflow stacked star stare x y z
    """.split())
    n = len(words)
    print(n, "words")
    print(words)
    print()
    tests = sorted("""
    r s ss st sta stack star stare stop su t
    """.split())
    for test in tests:
        # bisect_left returns the index of the first word >= test,
        # so any words starting with test begin at that index.
        i = bisect.bisect_left(words, test)
        print(test, i)
        while i < n and words[i].startswith(test):
            print(i, words[i])
            i += 1
    

    Output:

    12 words
    ['a', 'b', 'c', 'stacey', 'stack', 'stacked', 'stackoverflow', 'star', 'stare',
    'x', 'y', 'z']
    
    r 3
    s 3
    3 stacey
    4 stack
    5 stacked
    6 stackoverflow
    7 star
    8 stare
    ss 3
    st 3
    3 stacey
    4 stack
    5 stacked
    6 stackoverflow
    7 star
    8 stare
    sta 3
    3 stacey
    4 stack
    5 stacked
    6 stackoverflow
    7 star
    8 stare
    stack 4
    4 stack
    5 stacked
    6 stackoverflow
    star 7
    7 star
    8 stare
    stare 8
    8 stare
    stop 9
    su 9
    t 9
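
    Applied to a dictionary, the same idea might look like the sketch below. The function name keys_with_prefix and the sample data are assumptions made for illustration; the sorted key list is built once and must be rebuilt (or kept sorted) whenever the dictionary's keys change.

```python
import bisect

def keys_with_prefix(sorted_keys, prefix):
    """Return the keys in sorted_keys (a sorted list of str) that start with prefix."""
    # Index of the first key >= prefix; all matching keys are contiguous from here.
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical sample data, for illustration only.
my_dict = {"stack": 1, "stacey": 2, "star": 3, "x": 4}
sorted_keys = sorted(my_dict)  # one-time sorting cost

for key in keys_with_prefix(sorted_keys, "stac"):
    print(key, my_dict[key])
```

    Each lookup costs a binary search plus one step per matching key, rather than a pass over every key.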
    
  • 2021-02-20 11:59

    No. The only way of searching for a string in dictionary keys is to look in each key. Something like what you've suggested is the only way of doing it with a dictionary.

    However, if you have 400,000 records and you want to speed up your search, I'd suggest using an SQLite database. Then you can just run a query such as SELECT * FROM TABLE_NAME WHERE COLUMN_NAME LIKE '%userinput%'; to find the matching rows. See the documentation for Python's sqlite3 module for details.
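
    A minimal sketch of that query using the standard sqlite3 module follows. The table name entries, the column names, the helper search_keys, and the sample data are all assumptions for illustration. The user input is passed as a bound parameter rather than pasted into the SQL string.

```python
import sqlite3

# Hypothetical sample data; a real table would hold the 400,000 keys.
my_dict = {"syslog": 1, "logfile": 2, "readme": 3}

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
conn.execute("CREATE TABLE entries (key TEXT PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO entries VALUES (?, ?)", my_dict.items())
conn.commit()

def search_keys(conn, user_input):
    """Return (key, value) rows whose key contains user_input."""
    # % is the LIKE wildcard; any % or _ inside user_input also acts as a wildcard.
    pattern = "%" + user_input + "%"
    cursor = conn.execute(
        "SELECT key, value FROM entries WHERE key LIKE ? ORDER BY key",
        (pattern,),
    )
    return cursor.fetchall()

matches = search_keys(conn, "log")
print(matches)
```

    Keep in mind that a pattern with a leading wildcard cannot use the index on the key column, so SQLite still examines every row for this kind of query.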

    Another option is to use a generator expression. It is still a linear scan over all keys, but it yields matches lazily and is often (though not always, see the benchmark below) faster than the equivalent for loop.

    filteredKeys = (key for key in myDict if userInput in key)
    for key in filteredKeys:
        doSomething(key)
    

    EDIT: If, as you say, you don't care about one-time costs, use a database. SQLite should do what you want damn near perfectly.

    I did some benchmarks, and to my surprise, the naive algorithm is actually about 1.5 times as fast as a version using a list comprehension and about 4.5 times as fast as a SQLite-driven version (see the timings below). In light of these results, I'd have to go with @Mark Byers and recommend a Trie. I've posted the benchmark below, in case someone wants to give it a go.

    import os
    import random
    import sqlite3
    import string
    import time
    
    def buildDict(numElements):
        # Random 6-letter keys, plus 10 keys that start with 'log' as search targets.
        aDict = {}
        for i in range(numElements-10):
            aDict[''.join(random.sample(string.ascii_letters, 6))] = 0
    
        for i in range(10):
            aDict['log'+''.join(random.sample(string.ascii_letters, 3))] = 0
    
        return aDict
    
    def naiveLCSearch(aDict, searchString):
        filteredKeys = [key for key in aDict.keys() if searchString in key]
        return filteredKeys
    
    def naiveSearch(aDict, searchString):
        filteredKeys = []
        for key in aDict:
            if searchString in key: 
                filteredKeys.append(key)
        return filteredKeys
    
    def insertIntoDB(aDict):
        conn = sqlite3.connect('/tmp/dictdb')
        c = conn.cursor()
        c.execute('DROP TABLE IF EXISTS BLAH')
        c.execute('CREATE TABLE BLAH (KEY TEXT PRIMARY KEY, VALUE TEXT)')
        for key in aDict:
            c.execute('INSERT INTO BLAH VALUES(?,?)',(key, aDict[key]))
        conn.commit()
        return conn
    
    def dbSearch(conn):
        cursor = conn.cursor()
        # GLOB is case-sensitive, matching the behaviour of the in operator above.
        cursor.execute("SELECT KEY FROM BLAH WHERE KEY GLOB '*log*'")
        return [record[0] for record in cursor]
    
    if __name__ == '__main__':
        aDict = buildDict(400000)
        conn = insertIntoDB(aDict)
        startTimeNaive = time.time()
        for i in range(3):
            naiveResults = naiveSearch(aDict, 'log')
        endTimeNaive = time.time()
        print('Time taken for 3 iterations of naive search was', (endTimeNaive-startTimeNaive), 'and the average time per run was', (endTimeNaive-startTimeNaive)/3.0)
    
        startTimeNaiveLC = time.time()
        for i in range(3):
            naiveLCResults = naiveLCSearch(aDict, 'log')
        endTimeNaiveLC = time.time()
        print('Time taken for 3 iterations of naive search with list comprehensions was', (endTimeNaiveLC-startTimeNaiveLC), 'and the average time per run was', (endTimeNaiveLC-startTimeNaiveLC)/3.0)
    
        startTimeDB = time.time()
        for i in range(3):
            dbResults = dbSearch(conn)
        endTimeDB = time.time()
        print('Time taken for 3 iterations of DB search was', (endTimeDB-startTimeDB), 'and the average time per run was', (endTimeDB-startTimeDB)/3.0)
    
        conn.close()
        os.remove('/tmp/dictdb')
    

    For the record, my results were:

    Time taken for 3 iterations of naive search was 0.264658927917 and the average time per run was 0.0882196426392
    Time taken for 3 iterations of naive search with list comprehensions was 0.403481960297 and the average time per run was 0.134493986766
    Time taken for 3 iterations of DB search was 1.19464492798 and the average time per run was 0.398214975993
    

    All times are in seconds.

  • 2021-02-20 12:02

    dpath can solve this for you easily.

    http://github.com/akesterson/dpath-python

    $ pip install dpath
    >>> import dpath.util
    >>> for (path, value) in dpath.util.search(MY_DICT, "glob/to/start/{}".format(userinput), yielded=True):
    ...     pass  # (do something with the path and value)
    

    You can pass an eglob ('path//to//something/[0-9a-z]') for advanced searching.

  • 2021-02-20 12:10

    If you only need to find keys that start with a prefix then you can use a trie. More complex data structures exist for finding keys that contain a substring anywhere within them, but they take up a lot more space to store so it's a space-time trade-off.
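
    For the prefix case, a trie could be sketched roughly as follows. This is a minimal illustration rather than a tuned implementation; the class name Trie, its methods, and the sample data are assumptions, and a plain Python version like this uses considerably more memory than the dictionary itself.

```python
class Trie:
    """A minimal character trie that stores string keys."""

    def __init__(self):
        self.children = {}   # maps one character to a child Trie node
        self.is_key = False  # True if a stored key ends at this node

    def insert(self, key):
        node = self
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Trie()
            node = child
        node.is_key = True

    def keys_with_prefix(self, prefix):
        # Walk down to the node that represents the prefix, if it exists.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every stored key in the subtree below that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_key:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return results

# Hypothetical sample data, for illustration only.
my_dict = {"stack": 1, "stacked": 2, "star": 3, "x": 4}
trie = Trie()
for key in my_dict:
    trie.insert(key)

for key in sorted(trie.keys_with_prefix("stac")):
    print(key, my_dict[key])
```

    The walk to the prefix node costs one step per character of the prefix, independent of how many keys are stored; the remaining work is proportional to the size of the matching subtree.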
