Extracting tuples with nltk?

执笔经年 2021-01-24 11:59

Reading the NLTK documentation, I found that it is possible to extract tuples with str2tuple(). As an example, assume I have the following sentence (clearly it is a much

2 Answers
  • 2021-01-24 12:35

    Since you're presumably interested in using your corpus with NLTK: assuming your file is stored in this format, you should read it in, parse it (using str2tuple or other, simpler methods), and load it with TaggedCorpusReader. You can then use all the standard NLTK corpus functions with it. You basically have two types of tags: part of speech and (presumably) word lemma. If this is what you're after, I can add more specific information to this answer.
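As a rough sketch of the "parse it" step: NLTK's str2tuple() splits strings of the form word/TAG (for example, nltk.tag.str2tuple("pero/CC") gives ('pero', 'CC')), and TaggedCorpusReader by default expects whitespace-separated word/TAG tokens. The three-column lines therefore have to be rewritten first. The helper name to_tagged_tokens is hypothetical, and dropping the lemma column is an assumption made only for this illustration.

```python
# Three-column sample from the question: surface form, lemma, POS tag.
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0"""


def to_tagged_tokens(text, sep="/"):
    """Turn 'word lemma TAG' lines into 'word/TAG' tokens (hypothetical helper)."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word, _lemma, pos = fields  # the lemma is dropped in this sketch
        tokens.append(f"{word}{sep}{pos}")
    return " ".join(tokens)


tagged = to_tagged_tokens(sent)
print(tagged)  # pero/CC tan/RG antigua/AQ0FS0
```

Each token in the resulting string is in the shape str2tuple() parses, so the string could be written to a file and read back as a tagged corpus.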

    Assuming your string actually includes a newline after each triple, the easy way to parse it into a list of tuples is like this:

    sent = """pero pero CC
    tan tan RG
    antigua antiguo AQ0FS0
    que que CS
    según según SPS00
    mi  mi DP1CSS
    madre madre NCFS000"""
    
    tuples = [line.split() for line in sent.splitlines()]
    

    A detail: split() actually returns a list, not a tuple. If you need to use them as dictionary keys, replace line.split() with tuple(line.split()).
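A small sketch of that last point, using a shortened copy of the sample data. Counting with collections.Counter is just one illustration of why hashable tuples matter; it is not part of the original answer.

```python
from collections import Counter

sent = """pero pero CC
antigua antiguo AQ0FS0
mi  mi DP1CSS"""

# tuple(...) makes each triple hashable, so it can serve as a dict key.
triples = [tuple(line.split()) for line in sent.splitlines()]

# Counter is a dict subclass, so its keys must be hashable.
counts = Counter(triples)
print(counts[("antigua", "antiguo", "AQ0FS0")])  # 1

# A list in the same position raises TypeError because lists are unhashable.
try:
    Counter([line.split() for line in sent.splitlines()])
except TypeError as err:
    print("lists cannot be keys:", err)
```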

  • 2021-01-24 12:36

    I think what you have is a verticalized text file, also known as .vrt (see CWB encoding Corpus).

    I guess the first column is the surface form of the word, the second is some sort of lemma, and the third is the part-of-speech tag.

    First, take a look at the csv module. I find this tutorial helpful: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/

    Let's say you have a tab-delimited file as such:

    pero    pero    CC
    tan     tan     RG
    antigua antiguo AQ0FS0
    que     que     CS
    según   según   SPS00
    mi      mi      DP1CSS
    madre   madre   NCFS000
    

    To read the file (sometimes called "parsing" the file):

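    import csv

    # newline='' is what the csv docs recommend; utf-8 keeps 'según' intact
    with open('test.txt', 'r', encoding='utf-8', newline='') as fin:
        reader = csv.reader(fin, delimiter='\t')
        for line in reader:
            word, lemma, pos = line  # surface form, lemma, POS tag
            print(word, lemma, pos)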
    

    To get the (word,pos) tuple structure for the sentence, try:

    import csv

    sentences = []
    with open('test.txt', 'r', encoding='utf-8', newline='') as fin:
        reader = csv.reader(fin, delimiter='\t')
        for line in reader:
            word, lemma, pos = line
            sentences.append((word, pos))  # keep only the word and its POS tag
    
    print(sentences)
    

    [out]:

    [('pero', 'CC'), ('tan', 'RG'), ('antigua', 'AQ0FS0'), ('que', 'CS'), ('según', 'SPS00'), ('mi', 'DP1CSS'), ('madre', 'NCFS000')]
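For experimenting without a test.txt on disk, the same reading step can be sketched against an in-memory string. The Token namedtuple and the shortened sample data are assumptions made for illustration; keeping all three columns means either pairing can be derived later. csv.QUOTE_NONE is used here on the assumption that quote characters in a corpus are ordinary tokens rather than field delimiters.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type naming the three columns.
Token = namedtuple("Token", ["word", "lemma", "pos"])

# Stand-in for the tab-delimited file (shortened sample).
data = "pero\tpero\tCC\nantigua\tantiguo\tAQ0FS0\nsegún\tsegún\tSPS00\n"

with io.StringIO(data, newline="") as fin:
    reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows that do not have exactly three fields (e.g. blank lines) are skipped.
    tokens = [Token(*row) for row in reader if len(row) == 3]

word_pos = [(t.word, t.pos) for t in tokens]
word_lemma = [(t.word, t.lemma) for t in tokens]
print(word_pos)
print(word_lemma)
```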
    