I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, keyed by document ID.
def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a]  # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}
def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[terms[0]]  # any of the terms would work
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}
In [0]: index = {'a': {1: 'a b'},
                 'b': {1: 'a b'}}
In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}
In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}
I would recommend changing your index from
index = {term: {doc_id: doc}}
to two indexes: one for the terms, and a separate one to hold the values
term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}
That way you don't store multiple copies of the same data.
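As a rough sketch of that layout: the helper names build_indexes and search_all are my own, not from the question, and I'm assuming terms are whitespace-separated words.

```python
def build_indexes(docs):
    """Split {doc_id: text} into a term index and a document store."""
    term_index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            # Each term maps to the set of IDs of documents containing it.
            term_index.setdefault(term, set()).add(doc_id)
    # The document store holds each document exactly once.
    doc_store = dict(docs)
    return term_index, doc_store


def search_all(terms, term_index, doc_store):
    """Return {doc_id: doc} for documents containing every term."""
    if not terms:
        return {}
    # A term missing from the index matches no documents.
    id_sets = [term_index.get(term, set()) for term in terms]
    doc_ids = set.intersection(*id_sets)
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


term_index, doc_store = build_indexes({1: 'a b', 2: 'b c'})
print(search_all(['a', 'b'], term_index, doc_store))  # {1: 'a b'}
```

The intersection now works on plain sets of IDs, and the document text is only looked up once the matching IDs are known.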
Your question isn't precise enough to give a single answer.
If you want to intersect the IDs of the posts (credits to James), do:
common_ids = p1.keys() & p2.keys()
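For instance, with two small made-up posts (the contents of p1 and p2 below are illustrative only, not taken from the question), the keys views behave like sets:

```python
# Hypothetical posts: each maps a document ID to its text.
p1 = {1: 'a b', 2: 'a c'}
p2 = {1: 'a b', 3: 'b d'}

# dict.keys() returns a set-like view, so & works directly and yields a set.
common_ids = p1.keys() & p2.keys()
print(common_ids)  # {1}
```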
However, if you want to iterate over the documents, you have to consider which post has priority; I assume it's p1. To iterate over the documents for common_ids, collections.ChainMap will be most useful:
from collections import ChainMap
# ChainMap looks keys up in p1 first, so p1's document wins for a shared ID
intersection = {id: document
                for id, document in ChainMap(p1, p2).items()
                if id in common_ids}
for id, document in intersection.items():
    ...
Or, if you don't want to create a separate intersection dictionary:
from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]
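To make the priority point concrete, here is a small sketch with made-up posts in which the same ID maps to different text. ChainMap searches its mappings left to right, so the first one wins:

```python
from collections import ChainMap

# Hypothetical posts where ID 1 carries different text in each.
p1 = {1: 'from p1', 2: 'only in p1'}
p2 = {1: 'from p2', 3: 'only in p2'}

common_ids = p1.keys() & p2.keys()
posts = ChainMap(p1, p2)  # lookups try p1 first, then p2

for doc_id in common_ids:
    print(doc_id, posts[doc_id])  # 1 from p1
```

Since every ID in common_ids is present in p1 anyway, a plain p1[doc_id] gives the same result here; ChainMap only matters once you look up IDs that may live in just one of the posts.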
If you want to intersect the items of both posts, which means matching both IDs and documents, use the code below (credits to DCPY). However, this is only useful if you're looking for duplicates in terms.
duplicates = dict(p1.items() & p2.items())
for id, document in duplicates.items():
    ...
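A small sketch of how this differs from the key-only intersection, again with made-up posts. It relies on the documents being hashable (strings are), which is my assumption about your data:

```python
# Hypothetical posts: ID 1 is identical in both, ID 2 differs in text.
p1 = {1: 'a b', 2: 'a c'}
p2 = {1: 'a b', 2: 'a d'}

# Key intersection matches on ID only.
print(sorted(p1.keys() & p2.keys()))  # [1, 2]

# Item intersection requires both the ID and the document to be equal.
duplicates = dict(p1.items() & p2.items())
print(duplicates)  # {1: 'a b'}
```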
If by "'AND' search" (p1 'AND' p2) and using iter you meant searching both posts, then again collections.ChainMap is the best way to iterate over (almost) all items in multiple posts:
from collections import ChainMap
for id, document in ChainMap(p1, p2).items():
    ...
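The "(almost)" matters when an ID appears in both posts: it is yielded only once, with the document from the first mapping. A minimal sketch with made-up posts:

```python
from collections import ChainMap

# Hypothetical posts sharing ID 1 with different text.
p1 = {1: 'from p1', 2: 'only in p1'}
p2 = {1: 'from p2', 3: 'only in p2'}

# Each ID appears once; for the shared ID 1, the p1 document is used
# and the p2 document is hidden.
merged = dict(ChainMap(p1, p2).items())
for doc_id in sorted(merged):
    print(doc_id, merged[doc_id])
```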