Computing N Grams using Python

情歌与酒 2020-11-28 06:02

I need to compute the unigrams, bigrams, and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US a

8 answers
  •  有刺的猬
    2020-11-28 06:19

    Although this post is old, I am adding my answer here so that most of the n-gram creation approaches are collected in one place.

    Python has a library called TextBlob. It creates n-grams very easily, much like NLTK does.

    Below is a code snippet with its output.

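    from textblob import TextBlob

    sent = "This is to show the usage of Text Blob in Python"
    blob = TextBlob(sent)
    # ngrams(n) returns a list of WordList objects, one per window of n words
    unigrams = blob.ngrams(n=1)
    bigrams = blob.ngrams(n=2)
    trigrams = blob.ngrams(n=3)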
    

    The output is:

    unigrams
    [WordList(['This']),
     WordList(['is']),
     WordList(['to']),
     WordList(['show']),
     WordList(['the']),
     WordList(['usage']),
     WordList(['of']),
     WordList(['Text']),
     WordList(['Blob']),
     WordList(['in']),
     WordList(['Python'])]
    
    bigrams
    [WordList(['This', 'is']),
     WordList(['is', 'to']),
     WordList(['to', 'show']),
     WordList(['show', 'the']),
     WordList(['the', 'usage']),
     WordList(['usage', 'of']),
     WordList(['of', 'Text']),
     WordList(['Text', 'Blob']),
     WordList(['Blob', 'in']),
     WordList(['in', 'Python'])]
    
    trigrams
    [WordList(['This', 'is', 'to']),
     WordList(['is', 'to', 'show']),
     WordList(['to', 'show', 'the']),
     WordList(['show', 'the', 'usage']),
     WordList(['the', 'usage', 'of']),
     WordList(['usage', 'of', 'Text']),
     WordList(['of', 'Text', 'Blob']),
     WordList(['Text', 'Blob', 'in']),
     WordList(['Blob', 'in', 'Python'])]
    

    As simple as that.
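
The sliding window behind the output above can also be sketched in plain Python, which may help when TextBlob is not installed. This is only an illustration: the helper name `ngrams` and the whitespace tokenization are assumptions, not TextBlob's own tokenizer, and the results are plain tuples rather than `WordList` objects.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip n staggered views of the token list; zip stops at the shortest view,
    # so the result holds len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))

sent = "This is to show the usage of Text Blob in Python"
tokens = sent.split()  # assumption: naive whitespace tokenization

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:2])  # [('This', 'is'), ('is', 'to')]
```

For this sentence the windows match the TextBlob output above word for word; text with punctuation would likely differ, because `str.split` keeps punctuation attached to words.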

    TextBlob does much more than this. See the documentation for details: https://textblob.readthedocs.io/en/dev/
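
If the goal is to count how often each n-gram occurs in a text file, which the question does not state explicitly, a standard-library sketch might look like the following. The function name `ngram_counts`, the regular-expression tokenizer, and the sample sentence are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count how often each n-gram of lowercased word tokens occurs in text."""
    # Hypothetical tokenizer: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of width n across the tokens and tally each window.
    return Counter(zip(*(tokens[i:] for i in range(n))))

sample = "the cat sat on the mat and the cat slept"
bigram_counts = ngram_counts(sample, 2)
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

To apply this to a file, read its contents into a string first (for example with `open(path, encoding="utf-8").read()`, where `path` is a placeholder for your own file) and pass that string as `text`.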
