I have a large dictionary constructed like so:
programs['New York'] = 'some values...'
programs['Port Authority of New York'] = 'some values...'
I'd like to find all values whose key contains a given substring such as 'new york', matching case-insensitively. Is there a better way than looping over every key?
You should use the brute force method given by mensi until it proves to be too slow.
Here's something that duplicates the data to give a speedier lookup. It only works if you search for whole words, i.e. you'll never need to match on "New Yorks Best Bagels", because "york" and "yorks" are different words.
words = {}
for key in programs.keys():
    for w in key.split():
        w = w.lower()
        if w not in words:
            words[w] = set()
        words[w].add(key)
def lookup(search_string, words, programs):
    result_keys = None
    for w in search_string.split():
        w = w.lower()
        if w not in words:
            return []
        result_keys = words[w] if result_keys is None else result_keys.intersection(words[w])
    return [programs[k] for k in result_keys]
If the words have to be in sequence (i.e. "York New" shouldn't match), you can apply the brute-force method to the short list of result_keys.
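For example, here is the index and lookup from above run end to end (the dictionary values are placeholders from the question):

```python
# Build the word -> keys index described above, then query it.
programs = {
    'New York': 'some values...',
    'Port Authority of New York': 'some more values...',
    'New York City': 'lots more values...',
}

words = {}
for key in programs:
    for w in key.split():
        words.setdefault(w.lower(), set()).add(key)

def lookup(search_string, words, programs):
    result_keys = None
    for w in search_string.split():
        w = w.lower()
        if w not in words:
            return []
        # Keep only keys that contain every search word.
        result_keys = words[w] if result_keys is None else result_keys & words[w]
    return [programs[k] for k in result_keys]

print(sorted(lookup('new york', words, programs)))  # matches all three keys
print(lookup('authority', words, programs))         # matches only the Port Authority key
```

Each query touches only the sets for its search words, so it stays fast even when programs has many keys.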
In Python 2, iteritems and a generator expression will do this:
d = {'New York': 'some values',
     'Port Authority of New York': 'some more values',
     'New York City': 'lots more values'}
print list(v for k, v in d.iteritems() if 'new york' in k.lower())
Output:
['lots more values', 'some more values', 'some values']
[value for key, value in programs.items() if 'new york' in key.lower()]
This is usually called a relaxed dictionary and it can be implemented efficiently using a suffix tree.
The memory used by this approach is linear in the total length of the keys, which is optimal, and the search time is linear in the length of the substring you are searching for (plus the number of matches), which is also optimal.
I have found a Python library that implements this:
https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/
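If you'd rather avoid a third-party dependency, the same idea can be approximated in pure Python with a sorted list of suffixes (a simple suffix array). This is only a sketch of the technique, not the linked library's API, and storing the suffix strings themselves costs quadratic space where a real suffix tree stays linear:

```python
import bisect

def build_suffix_index(keys):
    # One (suffix, original_key) entry per suffix of every lowercased key.
    # Sorting makes all suffixes that begin with a given substring contiguous.
    index = []
    for key in keys:
        low = key.lower()
        for i in range(len(low)):
            index.append((low[i:], key))
    index.sort()
    return index

def find_keys(index, substring):
    # Binary-search for the contiguous block of suffixes starting with `substring`.
    substring = substring.lower()
    lo = bisect.bisect_left(index, (substring,))
    hi = bisect.bisect_right(index, (substring + '\uffff',))
    return {key for _, key in index[lo:hi]}

index = build_suffix_index(['New York', 'Port Authority of New York', 'New York City'])
print(find_keys(index, 'new york'))   # all three keys
print(find_keys(index, 'authority'))  # only the Port Authority key
```

Each query is two binary searches plus one set comprehension over the matches, so lookups stay fast even for substrings that are not whole words.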
You could generate all substrings ahead of time, and map them to their respective keys.
# generates all substrings of s
def genSubstrings(s):
    # yield all substrings that contain the first character of the string
    for i in range(1, len(s)+1):
        yield s[:i]
    # yield all substrings that don't contain the first character
    if len(s) > 1:
        for j in genSubstrings(s[1:]):
            yield j
keys = ["New York", "Port Authority of New York", "New York City"]
substrings = {}
for key in keys:
    for substring in genSubstrings(key):
        if substring not in substrings:
            substrings[substring] = []
        substrings[substring].append(key)
Then you can query substrings to get the keys that contain that substring:
>>> substrings["New York"]
['New York', 'Port Authority of New York', 'New York City']
>>> substrings["of New York"]
['Port Authority of New York']
Pros:

- Querying substrings is a single dictionary lookup, so it is as fast as an ordinary key access no matter how many keys programs contains.

Cons:

- Building substrings incurs a one-time cost at the beginning of your program, taking time proportional to the number of keys in programs.
- substrings will grow approximately linearly with the number of keys in programs, increasing the memory usage of your script.
- genSubstrings has O(n^2) performance in relation to the size of your key. For example, "Port Authority of New York" generates 351 substrings.
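That quadratic count is easy to verify: a key of length n yields n(n+1)/2 substrings, so the 26-character key above yields 26 * 27 / 2 = 351. Using the generator from the answer:

```python
# Same generator as in the answer above.
def genSubstrings(s):
    # yield all substrings that contain the first character
    for i in range(1, len(s)+1):
        yield s[:i]
    # then recurse on the rest of the string
    if len(s) > 1:
        for j in genSubstrings(s[1:]):
            yield j

key = "Port Authority of New York"
print(sum(1 for _ in genSubstrings(key)))  # 26 * 27 // 2 == 351
```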