How to tokenize natural English text in an input file in Python?

不知归路 2021-01-03 05:26

I want to tokenize an input file in Python. Please suggest how to do it; I am a new Python user.

I have read something about regular expressions, but still some con

3 Answers
  • 2021-01-03 05:58
    from nltk.tokenize import sent_tokenize

    with open("file.txt", "r") as f1:
        data = f1.read()  # read the whole file as one string
        sentences = sent_tokenize(data)
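
    Note that sent_tokenize splits text into sentences, not words. As a rough illustration of what sentence-level splitting means, here is a standard-library sketch. It assumes a sentence ends with ., ! or ? followed by whitespace, and the helper name naive_sent_split is made up for this example. NLTK's pre-trained Punkt model copes with abbreviations and similar cases far better than this.

```python
import re


def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, e.g. when the input itself is empty.
    return [p for p in parts if p]


print(naive_sent_split("Hello there. How are you? Fine!"))
# The naive rule also splits after abbreviations such as "Dr.":
print(naive_sent_split("Dr. Smith left."))
```

    The second call shows the weakness of the assumption: "Dr." is treated as the end of a sentence.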
    
  • 2021-01-03 06:03

    Using NLTK

    If your file is small:

    • Open the file with the context manager with open(...) as x,
    • then do a .read() and tokenize it with word_tokenize()

    [code]:

    from nltk.tokenize import word_tokenize
    with open('myfile.txt') as fin:
        tokens = word_tokenize(fin.read())
    

    If your file is larger:

    • Open the file with the context manager with open(...) as x,
    • read the file line by line with a for-loop
    • tokenize the line with word_tokenize()
    • output to your desired format (with the write flag set)

    [code]:

    from __future__ import print_function
    from nltk.tokenize import word_tokenize
    with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
        for line in fin:
            tokens = word_tokenize(line)
            print(' '.join(tokens), end='\n', file=fout)
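
    Since the question mentions regular expressions: the same line-by-line loop can be sketched with only the standard library's re module. This is a crude approximation, not a replacement for word_tokenize, which treats contractions and other cases differently. The pattern and the names regex_tokenize and tokenize_lines are assumptions made for this example.

```python
import re

# Assumed pattern: a run of word characters (optionally with one internal
# apostrophe part, e.g. "Don't"), or any single non-space punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def regex_tokenize(line):
    """Return the tokens found in one line of text (hypothetical helper)."""
    return TOKEN_RE.findall(line)


def tokenize_lines(lines):
    """Yield one space-joined token string per input line."""
    for line in lines:
        yield " ".join(regex_tokenize(line))


if __name__ == "__main__":
    sample = ["Hello, world!\n", "Don't panic.\n"]
    for out in tokenize_lines(sample):
        print(out)
```

    Because an open file iterates line by line, tokenize_lines(fin) can be given the file object directly in place of the sample list.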
    

    Using SpaCy

    from __future__ import print_function
    from spacy.tokenizer import Tokenizer
    from spacy.lang.en import English

    nlp = English()
    tokenizer = Tokenizer(nlp.vocab)

    # The output file must be opened in write mode.
    with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
        for line in fin:
            # Calling the tokenizer returns a Doc; take each token's text.
            tokens = [token.text for token in tokenizer(line.strip())]
            print(' '.join(tokens), end='\n', file=fout)
    
  • 2021-01-03 06:15

    Try something like this:

    import nltk
    # Use a context manager so the file is closed after reading.
    with open("myfile.txt") as f:
        file_content = f.read()
    tokens = nltk.word_tokenize(file_content)
    print(tokens)
    

    The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
