Extracting tuples with nltk?

Reading the NLTK documentation, I found that it is possible to extract tuples with str2tuple(). As an example, assume I have the following sentence (clearly is a much

2 Answers

我寻月下人不归

2021-01-24 12:35
Since you're presumably interested in using your corpus with NLTK: assuming your file is stored in this format, you should read it in, parse it (using str2tuple or other, simpler methods) and load it with TaggedCorpusReader. Then you can use all the standard NLTK corpus functions with it. You basically have two types of tags: part of speech and (presumably) word lemma. If this is what you're after, I can add more specific information to this answer.
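
As a rough sketch of the parsing step, assuming one whitespace-separated word/lemma/tag triple per line: the snippet below rewrites the triples as slash-joined word/TAG tokens, which is the form str2tuple splits apart and, as far as I know, the form TaggedCorpusReader reads by default. The helper name triples_to_tagged_line is made up for illustration, and the lemma column is simply dropped here.
```python
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0"""


def triples_to_tagged_line(text, sep="/"):
    """Turn 'word lemma TAG' lines into one 'word/TAG word/TAG ...' line.

    Hypothetical helper, not part of NLTK. Lines that do not have
    exactly three fields are skipped.
    """
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        word, lemma, pos = fields
        tokens.append(f"{word}{sep}{pos}")
    return " ".join(tokens)


tagged_line = triples_to_tagged_line(sent)
print(tagged_line)
```
Writing one such line per sentence to a text file should give something a tagged-corpus reader can load. If the lemmas matter, they would need to be kept in a separate structure.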

Assuming your string actually includes a newline after each triple, the easy way to parse it into a list of tuples is like this:
```
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi  mi DP1CSS
madre madre NCFS000"""

tuples = [ line.split() for line in sent.splitlines() ]
```
A detail: split() actually returns a list, not a tuple. If you need to use them as dictionary keys, replace line.split() with tuple(line.split()).
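
To illustrate the dictionary-key point, here is a small sketch that counts how often each triple occurs. The sample text with a repeated "que" line is invented for the example.
```python
from collections import Counter

# Invented sample: the "que" triple appears twice on purpose.
sent = """pero pero CC
tan tan RG
que que CS
que que CS"""

# tuple(...) makes each triple hashable; a plain list used as a
# dictionary key would raise TypeError.
triples = [tuple(line.split()) for line in sent.splitlines()]

counts = Counter(triples)
print(counts[("que", "que", "CS")])
```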

你的背包

2021-01-24 12:36

I think what you have is a verticalized text file, also known as .vrt; see CWB's notes on encoding corpora.

I guess the first column is the surface form of the word, the second is some sort of lemma, and the third is the part-of-speech tag.

First, take a look at the csv module. I find this tutorial helpful: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/

Let's say you have a tab-delimited file as such:

```
pero     pero     CC
tan      tan      RG
antigua  antiguo  AQ0FS0
que      que      CS
según    según    SPS00
mi       mi       DP1CSS
madre    madre    NCFS000
```

To read the file (sometimes called "parsing" the file):

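```
import csv

# Read the tab-delimited file one row at a time (Python 3).
with open('test.txt', 'r', encoding='utf-8', newline='') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line  # each row has three columns
        print(word, lemma, pos)
```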

To get the (word, pos) tuple structure for the sentence, try:

```
import csv

sentences = []
with open('test.txt', 'r', encoding='utf-8', newline='') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line
        sentences.append((word, pos))  # keep only the word and its POS tag

print(sentences)
```

[out]:

```
[('pero', 'CC'), ('tan', 'RG'), ('antigua', 'AQ0FS0'), ('que', 'CS'), ('según', 'SPS00'), ('mi', 'DP1CSS'), ('madre', 'NCFS000')]
```
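
If you want to keep the lemma around as well, one option is to store each row as a named tuple and derive the pairs you need afterwards. This is only a sketch: the Token class and read_tokens function are names I made up, and an in-memory string stands in for test.txt so the snippet runs without a file on disk.
```python
import csv
import io
from typing import NamedTuple


class Token(NamedTuple):
    """One row of the file (hypothetical container, not an NLTK type)."""
    word: str
    lemma: str
    pos: str


# In-memory stand-in for test.txt; columns are separated by tabs.
sample = "pero\tpero\tCC\nantigua\tantiguo\tAQ0FS0\nsegún\tsegún\tSPS00\n"


def read_tokens(fin):
    """Read tab-delimited rows, skipping any row without three columns."""
    reader = csv.reader(fin, delimiter="\t")
    return [Token(*row) for row in reader if len(row) == 3]


tokens = read_tokens(io.StringIO(sample))
word_pos = [(t.word, t.pos) for t in tokens]
word_lemma = [(t.word, t.lemma) for t in tokens]
print(word_pos)
print(word_lemma)
```
With a real file, you would pass the handle from open('test.txt', encoding='utf-8', newline='') to read_tokens instead of the StringIO object.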
