Intersecting two dictionaries in Python

借酒劲吻你 2020-11-27 18:26

I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short document

8 Answers
  • 2020-11-27 18:50
    def two_keys(term_a, term_b, index):
        doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
        doc_store = index[term_a] # index[term_b] would work also
        return {doc_id: doc_store[doc_id] for doc_id in doc_ids}
    
    def n_keys(terms, index):
        doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
        doc_store = index[terms[0]]  # any of the terms' postings would work
        return {doc_id: doc_store[doc_id] for doc_id in doc_ids}
    
    In [0]: index = {'a': {1: 'a b'}, 
                     'b': {1: 'a b'}}
    
    In [1]: two_keys('a','b', index)
    Out[1]: {1: 'a b'}
    
    In [2]: n_keys(['a','b'], index)
    Out[2]: {1: 'a b'}
    

    I would recommend changing your index from

    index = {term: {doc_id: doc}}
    

    to two indexes: one for the terms, and a separate one to hold the values

    term_index = {term: set([doc_id])}
    doc_store = {doc_id: doc}
    

    That way you don't store multiple copies of the same document.
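
A minimal sketch of how an AND search could look on top of the two-index layout suggested above. The sample documents, the whitespace tokenization, and the helper name `and_search` are assumptions made for illustration; they are not part of the original answer.

```python
# Hypothetical sample data: doc_id -> document text.
doc_store = {1: "a b", 2: "a c", 3: "b c"}

# Build term_index: each term maps to the set of doc_ids that contain it.
# Splitting on whitespace is an assumption about how documents are tokenized.
term_index = {}
for doc_id, doc in doc_store.items():
    for term in doc.split():
        term_index.setdefault(term, set()).add(doc_id)


def and_search(terms, term_index, doc_store):
    """Return {doc_id: doc} for the documents that contain every term."""
    if not terms:
        return {}
    # A term that is missing from the index matches no documents.
    postings = [term_index.get(term, set()) for term in terms]
    doc_ids = set.intersection(*postings)
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}


result = and_search(["a", "b"], term_index, doc_store)
print(result)
```

Each document is stored once in `doc_store`, and the intersection runs over plain sets of IDs rather than over nested dictionaries.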

  • 2020-11-27 18:58

    Your question isn't precise enough to give a single answer.

    1. Key Intersection

    If you want to intersect IDs from posts (credits to James) do:

    common_ids = p1.keys() & p2.keys()
    

    However, if you want to iterate over the documents, you have to decide which post takes priority when an ID appears in both; I assume it's p1. To iterate over the documents for common_ids, collections.ChainMap will be most useful:

    from collections import ChainMap
    intersection = {id: document
                    for id, document in ChainMap(p1, p2).items()
                    if id in common_ids}
    for id, document in intersection.items():
        ...
    

    Or if you don't want to create separate intersection dictionary:

    from collections import ChainMap
    posts = ChainMap(p1, p2)
    for id in common_ids:
        document = posts[id]
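
To make the priority rule concrete, here is a small sketch with made-up postings. The contents of `p1` and `p2` are hypothetical; only `ChainMap` and the dictionary key views come from the answer.

```python
from collections import ChainMap

# Hypothetical postings: doc_id -> document text.
p1 = {1: "from p1", 2: "only p1"}
p2 = {1: "from p2", 3: "only p2"}

common_ids = p1.keys() & p2.keys()  # set of IDs present in both
posts = ChainMap(p1, p2)            # lookups try p1 first, then p2

# For an ID present in both mappings, the ChainMap returns p1's document.
common_docs = {doc_id: posts[doc_id] for doc_id in common_ids}
print(common_docs)
```

Because every common ID is in p1 by definition, `p1[doc_id]` would return the same documents here. The ChainMap only makes a difference for IDs that may be missing from p1.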
    

    2. Items Intersection

    If you want to intersect the items of both posts, that is, to match on both IDs and documents, use the code below (credits to DCPY). However, this is only useful if you're looking for duplicates across terms.

    duplicates = dict(p1.items() & p2.items())
    for id, document in duplicates.items():
        ...
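
The difference between intersecting keys and intersecting items can be seen in a short sketch. The sample postings are invented for illustration.

```python
# Hypothetical postings: the same doc_id 2 maps to different text in each.
p1 = {1: "a b", 2: "a c"}
p2 = {1: "a b", 2: "b c"}

# Key intersection: IDs present in both, regardless of the stored documents.
common_ids = p1.keys() & p2.keys()

# Item intersection: only (id, document) pairs that are identical in both.
duplicates = dict(p1.items() & p2.items())

print(common_ids)
print(duplicates)
```

Note that set operations on `items()` require the values to be hashable. Strings work, but if the documents were lists or dictionaries this would raise a `TypeError`.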
    

    3. Iterate over p1 'AND' p2.

    If by an "'AND' search" using iter you meant searching both posts, then collections.ChainMap is again the best way to iterate over (almost) all items in multiple posts. It is "almost" all because, when an ID appears in both, only p1's document is returned:

    from collections import ChainMap
    for id, document in ChainMap(p1, p2).items():
        ...
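
As a rough check of what this kind of iteration visits, the sketch below collects the items of a ChainMap into a plain dictionary. The sample postings are hypothetical.

```python
from collections import ChainMap

# Hypothetical postings: doc_id 1 appears in both.
p1 = {1: "from p1", 2: "only p1"}
p2 = {1: "from p2", 3: "only p2"}

# Iterating the ChainMap's items visits each distinct ID once,
# taking the document from the first mapping that has it.
merged = dict(ChainMap(p1, p2).items())
print(merged)
```

The result covers the union of the IDs rather than their intersection, so for a strict AND over document IDs the key intersection from section 1 is the relevant operation.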
    