I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Stat
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         # deduplicate tokens within a line and sort so each pair has a canonical order
...         unique_tokens = sorted(set(line))
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:
from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    with open(file_name) as f:
        for line in f:  # each line is simply one long string; you need a list or tuple
            # literal_eval converts each line into a tuple before it is turned into a set
            unique_tokens = sorted(set(ast.literal_eval(line)))
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print(p.most_common(10))  # for example
There is not much you can do except count all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a, b) where a < b (in your example, count either statistics,narnia or narnia,statistics, but not both!).
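The a < b convention just means picking one canonical ordering per pair before using it as a key. A minimal sketch (the helper name is illustrative):

```python
def canonical(a, b):
    # Sort the two tokens so both orderings map to the same key.
    return (a, b) if a < b else (b, a)

print(canonical('statistics', 'narnia'))  # ('narnia', 'statistics')
print(canonical('narnia', 'statistics'))  # ('narnia', 'statistics')
```

This is exactly what `sorted(set(line))` before `combinations` achieves: `combinations` of a sorted sequence only ever emits pairs in one order.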
If you run out of memory, perform two passes. In the first pass, use one or multiple hash functions to obtain a candidate filter. In the second pass, only count pairs that pass this filter (MinHash / LSH style filtering).
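One way the two-pass idea can be sketched is with a single fixed-size bucket array, in the spirit of the PCY algorithm (the bucket count and threshold below are made-up parameters, not tuned values):

```python
from collections import Counter
from itertools import combinations

def two_pass_pairs(lines, num_buckets=1 << 20, min_count=2):
    # Pass 1: hash every pair into a fixed-size array of counters.
    # This uses O(num_buckets) memory regardless of how many distinct pairs exist.
    bucket_counts = [0] * num_buckets
    for tokens in lines:
        for pair in combinations(sorted(set(tokens)), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    # Pass 2: keep exact counts only for pairs whose bucket met the threshold.
    # A bucket count is always >= the true pair count, so no frequent pair is lost;
    # hash collisions can only let a few infrequent pairs through.
    counts = Counter()
    for tokens in lines:
        for pair in combinations(sorted(set(tokens)), 2):
            if bucket_counts[hash(pair) % num_buckets] >= min_count:
                counts[pair] += 1
    return counts
```

The second pass still counts exactly, so the filter only reduces memory, never changes the counts of the pairs it lets through.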
It's an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or computers.
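A minimal sketch of that distribution using only the standard library (the round-robin chunking and worker count are illustrative choices): each process counts its own chunk, and the partial Counters are merged at the end.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def count_chunk(chunk):
    """Count canonical (sorted) pairs within one chunk of tokenized lines."""
    counts = Counter()
    for tokens in chunk:
        counts.update(combinations(sorted(set(tokens)), 2))
    return counts

def parallel_pairs(lines, workers=4):
    # Split the input round-robin, count each chunk in its own process,
    # then merge the partial Counters (Counter addition sums per-key counts).
    chunks = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, chunks)
    return sum(partials, Counter())
```

Because pair counting is associative, the same merge step works unchanged if the chunks are counted on different machines instead of different processes.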