Forming Bigrams of words in list of sentences with Python

遇见更好的自我 2020-12-24 02:16

I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store the

10 Answers
  • 2020-12-24 02:40

    Without nltk:

    ans = []
    text = ['cant railway station','citadel hotel',' police stn']
    for line in text:
        arr = line.split()
        for i in range(len(arr)-1):
            ans.append([[arr[i]], [arr[i+1]]])
    
    
    print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
    
  • 2020-12-24 02:40

    The best way is to use the "zip" function to generate the n-grams. In the last line, the 2 passed to range is the size of the n-gram.

    test = [1,2,3,4,5,6,7,8,9]
    print(test[0:])
    print(test[1:])
    print(list(zip(test[0:],test[1:])))
    %timeit list(zip(*[test[i:] for i in range(2)]))
    

    Output:

    [1, 2, 3, 4, 5, 6, 7, 8, 9]  
    [2, 3, 4, 5, 6, 7, 8, 9]  
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]  
    1000000 loops, best of 3: 1.34 µs per loop  
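
The same zip idea can be wrapped in a small helper and applied to the sentences from the question. This is only a minimal sketch: the function name `ngrams_zip` is hypothetical, and splitting words with `str.split()` is an assumption about how the sentences should be tokenized.

```python
def ngrams_zip(tokens, n=2):
    """Return the n-grams of a token list as tuples, built with zip."""
    # tokens[i:] is the list shifted left by i; zip stops at the shortest slice.
    return list(zip(*[tokens[i:] for i in range(n)]))


text = ['cant railway station', 'citadel hotel', ' police stn']

# str.split() with no argument ignores leading and repeated whitespace.
bigrams = [pair for line in text for pair in ngrams_zip(line.split(), 2)]
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Building the pairs per sentence, rather than on one concatenated word list, avoids bigrams that straddle two sentences.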
    
  • 2020-12-24 02:40

    I think the best and most general way to do it is the following:

    n      = 2
    ngrams = []
    
    for l in L:
        for i in range(n,len(l)+1):
            ngrams.append(l[i-n:i])
    

    or in other words:

    ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]
    

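    This should work for any n and any sliceable sequence l, where L is the collection of such sequences (for example, each sentence split into a list of words). A sequence shorter than n contributes nothing, so if there are no n-grams of length n the result is the empty list.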
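
To make the slicing approach concrete, here is a sketch applied to the question's data. The answer never defines `L`, so treating it as the sentences split into word lists is an assumption.

```python
text = ['cant railway station', 'citadel hotel', ' police stn']

# Assumption: L is a list of token lists, one per sentence.
L = [line.split() for line in text]

n = 2
ngrams = [l[i - n:i] for l in L for i in range(n, len(l) + 1)]
print(ngrams)
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]

# A sentence shorter than n contributes nothing.
short = [l[i - 3:i] for l in [['citadel', 'hotel']] for i in range(3, len(l) + 1)]
print(short)  # []
```

If `L` held the raw strings instead of word lists, the same slices would produce character n-grams, because a string is itself a sequence of characters.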

  • 2020-12-24 02:42

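    Read the dataset (this assumes pandas is imported as pd, and that nk is nltk imported under that alias, which the answer does not show)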

    df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")
    

    Collect all available months

    df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])
    

    Create tokens of all tweets per month

    tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))
    

    Create bigrams per month

    bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))
    

    Count bigrams per month

    count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))
    

    Wrap up the result in neat dataframes

    # Use .iloc for positional access, since the Series are indexed by month label
    month1 = pd.DataFrame(data=count_bigrams.iloc[0], index=bigrams.iloc[0], columns=["Count"])
    month2 = pd.DataFrame(data=count_bigrams.iloc[1], index=bigrams.iloc[1], columns=["Count"])
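
Counting with `list.count` inside a loop rescans the whole list for every item. As an alternative sketch (not the method used in this answer), `collections.Counter` from the standard library tallies each distinct bigram in a single pass. The sample sentence is made up for illustration.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()

# Pair each token with its successor to form bigrams.
bigram_list = list(zip(tokens, tokens[1:]))

# Counter maps each distinct bigram to its number of occurrences.
bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(1))
# [(('the', 'cat'), 2)]
```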
    
  • 2020-12-24 02:46

    Using list comprehensions and zip:

    >>> text = ["this is a sentence", "so is this one"]
    >>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
    >>> print(bigrams)
    [('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
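
One detail worth checking against the question's data: `split(" ")` keeps the empty strings produced by leading or doubled spaces, whereas `split()` drops them. The sketch below shows the difference on the question's entry `' police stn'`.

```python
line = ' police stn'

# Splitting on a single space yields an empty first token.
words_space = line.split(" ")
print(list(zip(words_space[:-1], words_space[1:])))
# [('', 'police'), ('police', 'stn')]

# Splitting on any whitespace discards the leading space.
words_any = line.split()
print(list(zip(words_any[:-1], words_any[1:])))
# [('police', 'stn')]
```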
    
  • 2020-12-24 02:50

    There are a number of ways to solve it, but I solved it this way:

    >>> from nltk import bigrams
    >>> text = ['cant railway station','citadel hotel',' police stn']
    >>> text2 = [[word for word in line.split()] for line in text]
    >>> text2
    [['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
    >>> output = []
    >>> for i in range(len(text2)):
    ...     output = output + list(bigrams(text2[i]))
    ...
    >>> # A list comprehension would also work here
    >>> output
    [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
    