How to find the set of most frequently occurring word pairs in a file using Python?


I have a data set as follows:

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Stat


        
2 Answers
  •  清酒与你
    2021-01-03 03:21

    There is not that much you can do, except counting all pairs.

    Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to count only pairs (a,b) where a < b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
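A minimal sketch of the counting step with the a < b rule, using only the standard library. It assumes (the question is cut off, so this is a guess) that each quoted field is one token, that the first field is a record ID to be skipped, and that a "pair" means two tokens appearing on the same line. The names `parse_rows` and `count_pairs` are made up for illustration; stemming and synonym merging are left out.

```python
import csv
from collections import Counter
from itertools import combinations


def parse_rows(lines):
    """Parse quoted, comma-separated lines into lists of tokens.

    Assumption: the first field is a record ID and is dropped.
    """
    # skipinitialspace=True lets a quote that follows ", " still open a field.
    for row in csv.reader(lines, skipinitialspace=True):
        yield row[1:]


def count_pairs(rows):
    """Count unordered token pairs that co-occur in the same row."""
    counts = Counter()
    for row in rows:
        # Lowercase, drop empty fields, and remove duplicates within the row.
        tokens = sorted({field.strip().lower() for field in row if field.strip()})
        # combinations() over a sorted list yields each pair once, with a < b.
        counts.update(combinations(tokens, 2))
    return counts
```

With a file object opened as `f`, `count_pairs(parse_rows(f)).most_common(10)` would then give the ten most frequent pairs.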

    If you run out of memory, perform two passes. In the first pass, use one or more hash functions to obtain a candidate filter. In the second pass, only count words that pass this filter (MinHash / LSH style filtering).
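One possible reading of the two-pass idea, sketched under assumptions: pass one hashes every pair into a fixed-size array of bucket counters, and pass two keeps exact counts only for pairs whose bucket reached the threshold. This is plain bucket counting with a single hash function (similar in spirit to the PCY algorithm), not an actual MinHash or LSH implementation. The function names, `min_count`, and `n_buckets` are hypothetical, and `make_rows` is assumed to be a callable that returns a fresh iterable of token lists each time, for example by reopening the file.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket(pair, n_buckets):
    """Map a pair to a bucket index with a hash that is stable across runs."""
    # zlib.crc32 is used because built-in hash() for str is randomized per process.
    return zlib.crc32("\x00".join(pair).encode("utf-8")) % n_buckets


def frequent_pairs_two_pass(make_rows, min_count, n_buckets=1 << 20):
    """Return {pair: count} for pairs seen at least min_count times."""
    # Pass 1: count per bucket only; memory is bounded by n_buckets.
    bucket_counts = [0] * n_buckets
    for row in make_rows():
        for pair in combinations(sorted(set(row)), 2):
            bucket_counts[bucket(pair, n_buckets)] += 1

    # Pass 2: count exactly, but only pairs whose bucket is frequent enough.
    # A pair in a rare bucket cannot itself be frequent, so it is skipped.
    counts = Counter()
    for row in make_rows():
        for pair in combinations(sorted(set(row)), 2):
            if bucket_counts[bucket(pair, n_buckets)] >= min_count:
                counts[pair] += 1

    # Hash collisions can let rare pairs through, so filter once more.
    return {pair: c for pair, c in counts.items() if c >= min_count}
```

The filter never discards a truly frequent pair; collisions only cost some extra memory in the second pass.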

    It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or machines.
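A rough sketch of how the work could be split: each worker counts the pairs in its own chunk of rows, and the partial counters are then added together. The names `count_chunk` and `count_pairs_parallel` are illustrative, and rows are assumed to be lists of tokens already. Threads are used here only to keep the example simple; in CPython this workload is CPU-bound, so passing `ProcessPoolExecutor` as `executor_cls` (from a script guarded by `if __name__ == "__main__":`) is the more likely way to get a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_chunk(rows):
    """Count unordered pairs (a < b) within one chunk of rows."""
    counts = Counter()
    for row in rows:
        counts.update(combinations(sorted(set(row)), 2))
    return counts


def count_pairs_parallel(rows, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Count pairs per chunk in parallel, then merge the partial counts."""
    rows = list(rows)
    # Deal the rows out round-robin into n_workers chunks.
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    total = Counter()
    with executor_cls(max_workers=n_workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            # Counter.update adds counts instead of replacing them.
            total.update(partial)
    return total
```

Because the merge is just addition, the same pattern carries over to several machines: each one writes out its partial counts, and a final step sums them.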
