I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Stat
There is not that much you can do, except counting all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a,b) where a < b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
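The optimizations above can be sketched in a few lines of Python. This is a minimal sketch, assuming each record is already tokenized into a list of strings (the function and variable names are mine, not from the question): `set()` removes duplicates within a record, and `sorted()` canonicalizes the order so each pair is counted only once.

```python
from itertools import combinations
from collections import Counter

def count_pairs(records):
    """Count co-occurring word pairs across records.

    set() removes duplicate tokens within a record; sorted() gives a
    canonical order so only the pair (a, b) with a < b is counted,
    never (b, a).
    """
    counts = Counter()
    for record in records:
        tokens = sorted(set(record))            # dedupe, canonical order
        counts.update(combinations(tokens, 2))  # all pairs with a < b
    return counts

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
print(count_pairs(data)[("Narnia", "Statistics")])  # 2
```

Stemming and synonym mapping would slot in as a normalization step before the `set()` call.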
If you run out of memory, perform two passes. In the first pass, use one or more hash functions to build a candidate filter. In the second pass, count only the pairs that pass this filter (MinHash / LSH style filtering).
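Here is a sketch of the two-pass idea with a single hash function (PCY-style; a MinHash/LSH setup would use several hashes, and all names and the `width`/`threshold` parameters are illustrative assumptions). Pass 1 counts pair hashes in a fixed-size array, so memory stays bounded; pass 2 keeps exact counts only for pairs whose bucket met the threshold. Collisions can admit false candidates, but a frequent pair is never dropped.

```python
from itertools import combinations
from collections import Counter

def two_pass_count(records, width=1_000_003, threshold=2):
    """Pair counting in bounded memory via a hashed candidate filter.

    Pass 1: approximate counts in `width` hash buckets.
    Pass 2: exact counts, restricted to pairs whose bucket count
    reached `threshold` in pass 1.
    """
    buckets = [0] * width
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            buckets[hash(pair) % width] += 1

    counts = Counter()
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            if buckets[hash(pair) % width] >= threshold:
                counts[pair] += 1
    return counts

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
```

With `threshold=2`, pairs that occur only once never enter the second-pass counter, which is where the memory saving comes from.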
It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or computers.
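The distribution pattern is plain map-and-merge: count pairs per chunk, then sum the counters. A minimal sketch with threads (names are mine; for CPU-bound counting in Python you would use processes or separate machines, but the merge pattern is identical):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_chunk(records):
    """Exact pair counts for one chunk of records."""
    counts = Counter()
    for record in records:
        counts.update(combinations(sorted(set(record)), 2))
    return counts

def parallel_count(records, workers=4):
    """Map each chunk to a worker, then merge the partial counters."""
    chunks = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:
        total += partial
    return total

data = [
    ["Statistics", "Estimation", "Narnia", "Two and half men"],
    ["I like Sheen", "Narnia", "Statistics"],
]
```

Because `Counter` addition merges by key, no coordination between workers is needed beyond the final reduce.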