Question
An update to my previous post, with some changes:
Say that I have 100 tweets.
In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach type (drink or food) and an id-number (each item has a unique id) for each extraction.
I already have a lexicon with names, type and id-number:
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
Tweet example:
After various processing of "tweet_1" I have these sentences:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
My requested output (it can be a type other than a list):

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_123"]],
  [["coca cola"], ["drink", "d_234"]],
  [["banana split"], ["food", "f_567"]],
  [["ice cream"], ["food", "f_789"]]],
 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["banana"], ["food", "f_456"]]]]
It's important that the output should NOT extract unigrams that are part of n-grams (n>1), i.e. NOT like this:

["tweet_id_1",
 [[["dr pepper"], ["drink", "d_123"]],
  [["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana split"], ["food", "f_567"]],
  [["banana"], ["food", "f_456"]],
  [["ice cream"], ["food", "f_789"]],
  [["cream"], ["food", "f_678"]]],
 "tweet_id_1",
 [[["coca cola"], ["drink", "d_234"]],
  [["cola"], ["drink", "d_345"]],
  [["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. The problem is that with this regexp solution, doing so splits all the words into unigrams, or generates one unigram plus one bigram from the string "coca cola", which produces the unwanted output shown above. The ideal output (again, the type of the output is not important):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_124"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
Answer 1:
This may not be the most efficient solution, but it will definitely get you started:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

# Sort the phrases so that longer phrases are matched first.
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            # Remove the matched phrase so its unigrams cannot match again.
            sentence = sentence.replace(lex, '')
print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
Explanation

lexicon_list = list(lexicon.keys())

takes the list of phrases that need to be searched; the subsequent sort orders them by length so that bigger chunks are found first. The output is a list of dicts, where each dict has a list as its value.
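If you also need the tweet id and the per-sentence grouping from the question, a minimal wrapper around the same loop could look like this (a sketch; extract_chunks is a hypothetical helper that reuses the longest-first lexicon_list from above):

def extract_chunks(tweet_id, sentences, lexicon):
    result = [tweet_id]
    for sentence in sentences:
        found = []
        for lex in lexicon_list:  # longest phrases first
            if lex in sentence:
                found.append([[lex], [lexicon[lex]['type'], lexicon[lex]['id']]])
                sentence = sentence.replace(lex, '')  # block unigrams inside n-grams
        result.append(found)
    return result

print(extract_chunks('tweet_id_1', sentences, lexicon))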
Answer 2:
Unfortunately I cannot make comments due to my low reputation, but Vivek's answer could be improved through 1) a regex, 2) including pos_tag tokens such as NN, and 3) a dictionary structure from which you can select a tweet's results by its id:
import re
import nltk
from collections import OrderedDict

tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream',
                      'coca cola and banana is not a good combo']}

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

# A regex will be much faster than the "in" operator.
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)

# Here we make the dictionary of our phrases and their tagged equivalents.
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# If you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than ("banana", "NN") and ("split", "NN"), you could use the following:
# lexicon_pos_tag = {word: nltk.pos_tag([word]) for word in lexicon_list}

# chunks registers the tweets as the keys.
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
Finally, the output is:
OrderedDict([('tweet_1',
              [OrderedDict([('dr pepper',
                             [[('dr', 'NN'), ('pepper', 'NN')],
                              ['drink', 'd_123']]),
                            ('coca cola',
                             [[('coca', 'NN'), ('cola', 'NN')],
                              ['drink', 'd_234']]),
                            ('banana split',
                             [[('banana', 'NN'), ('split', 'NN')],
                              ['food', 'f_567']]),
                            ('ice cream',
                             [[('ice', 'NN'), ('cream', 'NN')],
                              ['food', 'f_789']])]),
               OrderedDict([('coca cola',
                             [[('coca', 'NN'), ('cola', 'NN')],
                              ['drink', 'd_234']]),
                            ('banana',
                             [[('banana', 'NN')], ['food', 'f_456']])])])])
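One caveat: the alternation pattern above has no word boundaries, so a short entry like 'cola' could also match inside an unrelated longer word. A hedged refinement using the standard re.escape and \b:

# Escape each phrase and anchor it on word boundaries; longest-first
# alternation still lets 'coca cola' win over plain 'cola'.
pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in lexicon_list) + r")\b")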
Answer 3:
I would use a for loop to filter.
Use if statements to find the string in the keys. If you wish to include unigrams, delete

len(k.split()) > 1

If you wish to only include unigrams, then change it to:

len(k.split()) == 1
filtered_list = ['tweet_id_1']
for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
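For reference, against the sample sentences and lexicon this should print roughly the following (assuming insertion-ordered dicts, i.e. Python 3.7+); note that the len(k.split()) > 1 filter drops all unigrams, including the standalone 'banana' in the second sentence:

['tweet_id_1',
 'dr pepper', {'type': 'drink', 'id': 'd_123'},
 'coca cola', {'type': 'drink', 'id': 'd_234'},
 'coca cola', {'type': 'drink', 'id': 'd_234'},
 'banana split', {'type': 'food', 'id': 'f_567'},
 'ice cream', {'type': 'food', 'id': 'f_789'}]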
Source: https://stackoverflow.com/questions/49091931/n-grams-from-text-in-python