How to find the most frequently occurring word-pairs in a file using Python?


I have a data set as follows:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen","Narnia","Stat
2 Answers
  • 2021-01-03 02:54

    You might start with something like this, depending on how large your corpus is:

    >>> from itertools import combinations
    >>> from collections import Counter

    >>> def collect_pairs(lines):
    ...     pair_counter = Counter()
    ...     for line in lines:
    ...         # dedupe within a line and sort so each pair appears in one canonical order
    ...         unique_tokens = sorted(set(line))
    ...         combos = combinations(unique_tokens, 2)
    ...         pair_counter += Counter(combos)
    ...     return pair_counter
    

    The result:

    >>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
    >>> pairs = collect_pairs(t2)
    >>> pairs.most_common(3)
    [(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
    

    Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.

    EDIT: Working with a file object

    The function you posted as your first attempt is very close to working. The only change needed is to turn each line (which is a string) into a tuple or list. Assuming your data looks exactly like the sample you posted (quotation marks around each term and commas separating the terms), every line is already a valid Python tuple literal, so a simple fix is ast.literal_eval. (Otherwise, you might need a regular expression of some kind.) See below for a modified version using ast.literal_eval:

    from itertools import combinations
    from collections import Counter
    import ast
    
    def collect_pairs(file_name):
        pair_counter = Counter()
        with open(file_name) as f:
            for line in f:  # each line is one long string; literal_eval parses it into a tuple
                unique_tokens = sorted(set(ast.literal_eval(line)))
                combos = combinations(unique_tokens, 2)
                pair_counter += Counter(combos)
        return pair_counter  # return the actual Counter object
    

    Now you can test it like this:

    file_name = 'myfileComb.txt'
    p = collect_pairs(file_name)
    print(p.most_common(10))  # for example
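    If the quoting in your file ever deviates from valid Python literal syntax, the standard csv module is a more forgiving alternative to ast.literal_eval. A sketch, assuming the same comma-separated, double-quoted format (the function name collect_pairs_csv is just illustrative):

```python
import csv
from collections import Counter
from itertools import combinations

def collect_pairs_csv(file_name):
    pair_counter = Counter()
    with open(file_name, newline='') as f:
        # skipinitialspace tolerates spaces after commas, as in the sample data
        for row in csv.reader(f, skipinitialspace=True):
            unique_tokens = sorted(set(row))  # dedupe and canonically order within the line
            pair_counter += Counter(combinations(unique_tokens, 2))
    return pair_counter
```

    Unlike literal_eval, csv.reader does not require every field to be quoted, so a stray unquoted token will not raise an exception.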
    
  • 2021-01-03 03:21

    There is not much you can do except count all pairs.

    Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a, b) where a < b (in your example, count either (narnia, statistics) or (statistics, narnia), but not both!).
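    The a < b trick can be sketched like this (the lowercasing step stands in for the normalization mentioned above; it is one possible choice, not something the question requires):

```python
from collections import Counter
from itertools import combinations

def count_ordered_pairs(lines):
    counts = Counter()
    for tokens in lines:
        # dedupe within the line, then sort so each pair is counted in one canonical order
        normalized = sorted({t.lower() for t in tokens})
        for a, b in combinations(normalized, 2):
            counts[(a, b)] += 1  # only (a, b) with a < b is ever produced
    return counts
```

    Because combinations() of a sorted sequence only emits ordered pairs, (narnia, statistics) and (statistics, narnia) collapse into a single key.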

    If you run out of memory, perform two passes. In the first pass, use one or more hash functions to build a candidate filter; in the second pass, count only the pairs that pass this filter (MinHash / LSH style filtering).
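    A minimal sketch of the two-pass idea, using a single hash function over a fixed-size bucket array as the candidate filter (a PCY-style variant of the approach described above; the input must be iterable twice, and bucket collisions can only let extra candidates through, never drop a truly frequent pair):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs_two_pass(lines, min_count=2, num_buckets=1 << 20):
    # Pass 1: hash every pair into a small array of counters (the candidate filter).
    buckets = [0] * num_buckets
    for tokens in lines:
        for pair in combinations(sorted(set(tokens)), 2):
            buckets[hash(pair) % num_buckets] += 1
    # Pass 2: count exactly, but only pairs whose bucket met the threshold.
    counts = Counter()
    for tokens in lines:
        for pair in combinations(sorted(set(tokens)), 2):
            if buckets[hash(pair) % num_buckets] >= min_count:
                counts[pair] += 1
    return counts
```

    The array of small integers is far cheaper than a dict keyed by every distinct pair, which is what makes the first pass fit in memory.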

    It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or machines.
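    One way to exploit that, sketched with the standard concurrent.futures module: split the input into chunks, count each chunk independently, and merge the partial Counters. (A thread pool is shown for brevity; for CPU-bound pure-Python counting, a ProcessPoolExecutor with the same map interface would parallelize better. The chunking scheme and worker count are illustrative choices.)

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_chunk(lines):
    # Count canonical pairs for one independent slice of the input.
    c = Counter()
    for tokens in lines:
        c.update(combinations(sorted(set(tokens)), 2))
    return c

def parallel_pair_counts(lines, workers=4):
    chunks = [lines[i::workers] for i in range(workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:  # merging Counters is just element-wise addition
        total += partial
    return total
```

    The merge step works because Counter addition is associative, which is also what makes the same scheme distributable across machines (map, then reduce).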
