NLTK Tagging Spanish words using a corpus

無奈伤痛 2020-11-27 17:13

I am trying to learn how to tag Spanish words using NLTK.

From the NLTK book, it is quite easy to tag English words using their example. Because I am new to NLTK and to language processing in general, I am not sure how to do the same for Spanish.

4 Answers
  • 2020-11-27 17:34

    First you need to read the tagged sentences from a corpus. NLTK provides a nice interface so that you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .

    Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.

    Then you can use the tagger to tag the sentences you want. Here's some example code:

    from nltk.corpus import cess_esp as cess
    from nltk import UnigramTagger as ut
    from nltk import BigramTagger as bt
    
    # Read the corpus into a list, 
    # each entry in the list is one sentence.
    cess_sents = cess.tagged_sents()
    
    # Train the unigram tagger
    uni_tag = ut(cess_sents)
    
    sentence = "Hola , esta foo bar ."
    
    # Tagger reads a list of tokens.
    uni_tag.tag(sentence.split(" "))
    
    # Split corpus into training and testing set.
    train = int(len(cess_sents)*90/100) # 90%
    
    # Train a bigram tagger with only training data.
    bi_tag = bt(cess_sents[:train])
    
    # Evaluate on the remaining 10% as testing data.
    bi_tag.evaluate(cess_sents[train:])
    
    # Using the tagger.
    bi_tag.tag(sentence.split(" "))
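
    A bigram tagger trained on its own falls back to None for contexts it has not seen. A common refinement (a sketch, not part of the original answer) is to chain it to a unigram tagger through the backoff parameter:

    # Bigram tagger that backs off to a unigram tagger
    # for contexts unseen during training.
    uni_backoff = ut(cess_sents[:train])
    bi_backoff_tag = bt(cess_sents[:train], backoff=uni_backoff)
    bi_backoff_tag.evaluate(cess_sents[train:])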
    

    Training a tagger on a large corpus can take a significant amount of time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger to a file for later re-use.

    Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
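
    For example, a trained tagger can be pickled to disk and loaded back later (a minimal sketch; the file name unigram_spanish.pickle is just an illustration):

    import pickle

    # Save the trained unigram tagger from above to disk.
    with open('unigram_spanish.pickle', 'wb') as f:
        pickle.dump(uni_tag, f)

    # Later, load it back instead of retraining.
    with open('unigram_spanish.pickle', 'rb') as f:
        uni_tag = pickle.load(f)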

  • 2020-11-27 17:46

    The following script gives you a quick way to get a "bag of words" from Spanish sentences. Note that if you want to do it correctly you must tokenize the sentences before tagging, so that 'religiosas.' is split into the two tokens 'religiosas' and '.' (see the sketch after the output below).

    #-*- coding: utf8 -*-
    
    # about the tagger: http://nlp.stanford.edu/software/tagger.shtml 
    # about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html
    
    import nltk
    
    from nltk.tag.stanford import POSTagger
    
    spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar', encoding='utf8')
    
    sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.','Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']
    
    for sent in sentences:
    
        words = sent.split()
        tagged_words = spanish_postagger.tag(words)
    
        nouns = []
    
        for (word, tag) in tagged_words:
    
            print(word + ' ' + tag)
            # Noun tags in this tagset start with 'n' (e.g. nc0s000, np00000).
            if tag.startswith('n'): nouns.append(word)
    
        print(nouns)
    

    Gives:

    El da0000
    copal nc0s000
    se p0000000
    usa vmip000
    principalmente rg
    para sp000
    sahumar vmn0000
    en sp000
    distintas di0000
    ocasiones nc0p000
    como cs
    lo pp000000
    son vsip000
    las da0000
    fiestas nc0p000
    religiosas. np00000
    [u'copal', u'ocasiones', u'fiestas', u'religiosas.']
    Las da0000
    flores, np00000
    hojas nc0p000
    y cc
    frutos nc0p000
    se p0000000
    usan vmip000
    para sp000
    aliviar vmn0000
    la da0000
    tos nc0s000
    y cc
    también rg
    se p0000000
    emplea vmip000
    como cs
    sedante. nc0s000
    [u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
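
    As noted above, splitting on whitespace leaves punctuation glued to the words ('religiosas.', 'flores,'). A minimal sketch of tokenizing first with nltk.word_tokenize (assuming the punkt tokenizer data is downloaded), reusing the spanish_postagger from the snippet above:

    import nltk

    sent = 'El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.'
    # word_tokenize splits punctuation into separate tokens,
    # so 'religiosas.' becomes 'religiosas' and '.'.
    words = nltk.word_tokenize(sent, language='spanish')
    tagged_words = spanish_postagger.tag(words)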
    
  • 2020-11-27 17:59

    I ended up here searching for POS taggers for languages other than English. Another option for your problem is the spaCy library, which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.

    From the spaCy documentation:

    import es_core_news_sm
    nlp = es_core_news_sm.load()
    doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
    print([(w.text, w.pos_) for w in doc])
    

    leads to:

    [('El', 'DET'), ('copal', 'NOUN'), ('se', 'PRON'), ('usa', 'VERB'), ('principalmente', 'ADV'), ('para', 'ADP'), ('sahumar', 'VERB'), ('en', 'ADP'), ('distintas', 'DET'), ('ocasiones', 'NOUN'), ('como', 'SCONJ'), ('lo', 'PRON'), ('son', 'AUX'), ('las', 'DET'), ('fiestas', 'NOUN'), ('religiosas', 'ADJ'), ('.', 'PUNCT')]
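
    Since the part of speech is exposed on each token, the noun filtering from the Stanford answer above becomes a one-liner (a small sketch using the doc object from the snippet above):

    # Collect the tokens tagged as nouns.
    nouns = [w.text for w in doc if w.pos_ == 'NOUN']
    print(nouns)  # ['copal', 'ocasiones', 'fiestas']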

    and to visualize in a notebook:

    from spacy import displacy
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})
    

  • 2020-11-27 18:01

    Given the tutorial in the previous answer, here's a more object-oriented approach from spaghetti tagger: https://github.com/alvations/spaghetti-tagger

    #-*- coding: utf8 -*-
    
    from pickle import dump, load
    
    from nltk import UnigramTagger as ut
    from nltk import BigramTagger as bt
    
    def loadtagger(taggerfilename):
        # Load a pickled tagger from disk.
        with open(taggerfilename, 'rb') as infile:
            tagger = load(infile)
        return tagger
    
    def traintag(corpusname, corpus):
        # Helper to save a tagger to disk.
        def savetagger(tagfilename, tagger):
            with open(tagfilename, 'wb') as outfile:
                dump(tagger, outfile, -1)
        # Training UnigramTagger.
        uni_tag = ut(corpus)
        savetagger(corpusname + '_unigram.tagger', uni_tag)
        # Training BigramTagger.
        bi_tag = bt(corpus)
        savetagger(corpusname + '_bigram.tagger', bi_tag)
        print("Tagger trained with", corpusname,
              "using UnigramTagger and BigramTagger.")
    
    # Function to unchunk the corpus, i.e. split multi-word
    # expressions (joined with '_') back into single tokens.
    def unchunk(corpus):
        nomwe_corpus = []
        for i in corpus:
            nomwe = " ".join([j[0].replace("_", " ") for j in i])
            nomwe_corpus.append(nomwe.split())
        return nomwe_corpus
    
    class cesstag():
        def __init__(self, mwe=True):
            self.mwe = mwe
            # Train the taggers if they are used for the first time.
            try:
                loadtagger('cess_unigram.tagger').tag(['estoy'])
                loadtagger('cess_bigram.tagger').tag(['estoy'])
            except IOError:
                print("*** First-time use of cess tagger ***")
                print("Training tagger ...")
                from nltk.corpus import cess_esp as cess
                cess_sents = cess.tagged_sents()
                traintag('cess', cess_sents)
                # Train the taggers on the corpus with MWEs split up.
                cess_nomwe = unchunk(cess.tagged_sents())
                tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
                traintag('cess_nomwe', tagged_cess_nomwe)
                print()
            # Load the taggers.
            if self.mwe:
                self.uni = loadtagger('cess_unigram.tagger')
                self.bi = loadtagger('cess_bigram.tagger')
            else:
                self.uni = loadtagger('cess_nomwe_unigram.tagger')
                self.bi = loadtagger('cess_nomwe_bigram.tagger')
    
    def pos_tag(tokens, mwe=True):
        tagger = cesstag(mwe)
        return tagger.uni.tag(tokens)
    
    def batch_pos_tag(sentences, mwe=True):
        tagger = cesstag(mwe)
        # tag_sents tags a list of tokenized sentences in one call.
        return tagger.uni.tag_sents(sentences)
    
    tagger = cesstag()
    print(tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split()))
    