Forming Bigrams of words in list of sentences with Python


I have a list of sentences:

text = ['cant railway station', 'citadel hotel', ' police stn'].

I need to form bigram pairs and store the


10 Answers

耶瑟儿～

2020-12-24 02:40

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]


别跟我提以往

2020-12-24 02:40

The best way is to use the "zip" function to generate the n-grams. The 2 passed to the range function is the number of grams (n).

test = [1,2,3,4,5,6,7,8,9]
print(test[0:])
print(test[1:])
print(list(zip(test[0:],test[1:])))
%timeit list(zip(*[test[i:] for i in range(2)]))

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9]  
[2, 3, 4, 5, 6, 7, 8, 9]  
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]  
1000000 loops, best of 3: 1.34 µs per loop
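
As a sketch of how the same zip idea could be applied to the sentences in the question: the helper name `zip_ngrams` below is made up for illustration and is not part of any library, and splitting each sentence with `split()` is an assumption about the desired tokenization.

```python
def zip_ngrams(tokens, n=2):
    """Return the n-grams of a token list as tuples, using zip over shifted slices."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped together;
    # zip stops at the shortest slice, so no partial n-grams are produced.
    return list(zip(*[tokens[i:] for i in range(n)]))


text = ['cant railway station', 'citadel hotel', ' police stn']

# Build the bigrams sentence by sentence so that pairs never span two sentences.
bigrams = [pair for line in text for pair in zip_ngrams(line.split(), 2)]
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Changing the second argument to 3 would give trigrams instead; a sentence shorter than n simply contributes nothing.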


一生所求

2020-12-24 02:40
I think the best and most general way to do it is the following:
```
n      = 2
ngrams = []

# L is assumed to be a list of sequences, e.g. each sentence split into words
for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])
```
or in other words:
```
ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]
```
This should work for any n and any sequence l. If there are no ngrams of length n it returns the empty list.
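
For instance, assuming `L` holds the question's sentences already split into words (that preprocessing is an assumption, since the answer does not define `L`), a quick check of the slicing approach might look like this. The wrapper name `slice_ngrams` is hypothetical.

```python
text = ['cant railway station', 'citadel hotel', ' police stn']
L = [line.split() for line in text]  # assumed input: one list of words per sentence


def slice_ngrams(L, n):
    # Hypothetical wrapper around the comprehension above:
    # each n-gram is the slice of length n that ends at index i.
    return [l[i - n:i] for l in L for i in range(n, len(l) + 1)]


print(slice_ngrams(L, 2))
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
print(slice_ngrams(L, 3))
# [['cant', 'railway', 'station']]
print(slice_ngrams(L, 4))
# []
```

Note that each n-gram comes back as a list slice rather than a tuple, and that passing the raw strings instead of word lists would produce character n-grams.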

耶瑟儿～

2020-12-24 02:42

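Import the libraries (the aliases pd and nk are assumed from how the code below uses them)

import pandas as pd
import nltk as nk

Read the dataset

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")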

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

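# .iloc selects by position; plain [0] on a Series indexed by month labels is deprecated in recent pandas
month1 = pd.DataFrame(data = count_bigrams.iloc[0], index= bigrams.iloc[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams.iloc[1], index= bigrams.iloc[1], columns= ["Count"])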
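
The `x.count(item)` step rescans the whole list for every bigram, and a repeated bigram shows up once per occurrence in the result. If that becomes a concern, one alternative is `collections.Counter` from the standard library. The sketch below uses a small made-up token list rather than the answer's dataset, which is not shown here.

```python
from collections import Counter

# Made-up tokens for illustration only; in the answer these would be one month's tokens.
tokens = "the cat sat on the cat mat".split()

# Pair each token with its successor to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# Counter tallies each distinct bigram in a single pass.
counts = Counter(bigrams)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The resulting counter has one entry per distinct bigram, which could then be turned into a DataFrame if a tabular view is needed.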


难免孤独

2020-12-24 02:46

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
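
One caveat when applying this to the question's data: `split(" ")` keeps the empty strings produced by leading or repeated spaces, so `' police stn'` would yield a bigram that starts with `''`. Calling `split()` with no argument avoids that. A small sketch of the difference:

```python
line = ' police stn'  # note the leading space, as in the question's list

with_space = line.split(" ")  # ['', 'police', 'stn']
no_arg = line.split()         # ['police', 'stn']

# Splitting on a single space leaves an empty token at the front.
print(list(zip(with_space[:-1], with_space[1:])))  # [('', 'police'), ('police', 'stn')]

# Splitting on any whitespace run drops it.
print(list(zip(no_arg[:-1], no_arg[1:])))          # [('police', 'stn')]
```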


猫巷女王i

2020-12-24 02:50

There are a number of ways to solve it, but I solved it this way:

>>> from nltk import bigrams  # assumed import; the answer does not show where bigrams comes from
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # A list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
