I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find out the most frequently occurring word-pairs e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other
Can someone suggest a possible solution in python? This is a very large data set.
Any suggestion is highly appreciated
So this is what I tried after suggestions from @275365
@275365 I tried the following with input read from a file
def collect_pairs(file):
pair_counter = Counter()
for line in open(file):
unique_tokens = sorted(set(line))
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
print pair_counter
file = ('myfileComb.txt')
p=collect_pairs(file)
text file has same number of lines as the original one but has only unique tokens in a particular line. I don't know what am I doing wrong since when I run this it splits the words in letters rather than giving output as combinations of words. When I run this file it outputs split letters rather than combinations of words as expected. I dont know where I am making a mistake.
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
pair_counter = Counter()
for line in lines:
unique_tokens = sorted(set(line)) # exclude duplicates in same line and sort to ensure one word is always before other
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval
. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval
:
from itertools import combinations
from collections import Counter
import ast
def collect_pairs(file_name):
pair_counter = Counter()
for line in open(file_name): # these lines are each simply one long string; you need a list or tuple
unique_tokens = sorted(set(ast.literal_eval(line))) # eval will convert each line into a tuple before converting the tuple to a set
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
return pair_counter # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10) # for example
There is not that much you can do, except counting all pairs.
Obvious optimizations are to early remove duplicate words and synonyms, perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a,b)
where a<b
(in your example, only either count statistics,narnia
, or narnia,statistics
, but not both!).
If you run out of memory, perform two passes. In the first pass, use one or multiple hash functions to obtain a candidate filter. In the second pass, only count words that pass this filter (MinHash / LSH style filtering).
It's a naive parallel problem, therefore this is also easy to distribute to multiple threads or computers.
来源:https://stackoverflow.com/questions/21297740/how-to-find-set-of-most-frequently-occurring-word-pairs-in-a-file-using-python