How to tokenize natural English text in an input file in python?


不知归路

I want to tokenize an input file in Python. Please suggest how to do this; I am new to Python.

I read something about regular expressions but still some con

Using `NLTK`

If your file is small:

Open the file with the context manager with open(...) as x,
then do a .read() and tokenize it with word_tokenize()

[code]:

from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:

Open the file with the context manager with open(...) as x,
read the file line by line with a for-loop,
tokenize each line with word_tokenize(),
and write the tokens to an output file opened in write mode ('w').

[code]:

from __future__ import print_function
from nltk.tokenize import word_tokenize
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
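
The line-by-line steps above can also be wrapped in a small helper. This is only a sketch: the name `tokenize_file` and its `tokenize` parameter are hypothetical, and the default `str.split` (plain whitespace splitting) is a stand-in so the code runs without NLTK installed. It assumes Python 3 and UTF-8 text.

```python
def tokenize_file(in_path, out_path, tokenize=str.split):
    """Write one line of space-separated tokens per input line.

    `tokenize` is any callable that maps a string to a list of strings.
    The default, str.split, is a placeholder (whitespace only), not NLTK.
    Returns the total number of tokens written.
    """
    total = 0
    with open(in_path, encoding='utf-8') as fin, \
            open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:  # only one line is held in memory at a time
            tokens = tokenize(line)
            total += len(tokens)
            fout.write(' '.join(tokens) + '\n')
    return total
```

With NLTK installed, the call would presumably look like `tokenize_file('myfile.txt', 'tokens.txt', tokenize=word_tokenize)`.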

Using SpaCy

from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # Calling the tokenizer returns a Doc; each token's string is in .text
        tokens = tokenizer(line.strip())
        print(' '.join(token.text for token in tokens), end='\n', file=fout)
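
Since the question mentions regular expressions: as a rough, standard-library-only sketch (not the algorithm NLTK or spaCy use), a single pattern can separate words from punctuation. The function name `regex_tokenize` and the pattern itself are assumptions chosen for illustration. Unlike `word_tokenize`, it keeps contractions such as "Don't" as one token.

```python
import re

# A word, optionally followed by an apostrophe part (e.g. "it's"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def regex_tokenize(text):
    """Return a list of word and punctuation tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Don't stop, it's fine."))
```

This is usually good enough for quick experiments; for real English text the library tokenizers handle many more edge cases (abbreviations, quotes, hyphenation).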


时光说笑

2021-01-03 06:15
Try something like this:
```
import nltk
# Read the whole file; the with-block closes it automatically
with open("myfile.txt") as f:
    file_content = f.read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
```
The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html