Setting up NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish

天涯浪人 2021-02-09 07:54

The NLTK documentation covers this integration rather poorly. The steps I followed were:

  • Download http://nlp.stanford.edu/software/stanford-postagger-

3 answers
  •  轻奢々 (original poster)
    2021-02-09 08:12

    POS Tagger

    To use the StanfordPOSTagger for Spanish from Python, you have to install a Stanford tagger distribution that includes a Spanish model.

    In this example I download the tagger to the /content folder:

    cd /content
    wget https://nlp.stanford.edu/software/stanford-tagger-4.1.0.zip
    unzip stanford-tagger-4.1.0.zip
    

    After unzipping, I have a folder stanford-postagger-full-2020-08-06 in /content, so I can use the tagger with:

    from nltk.tag.stanford import StanfordPOSTagger
    
    stanford_dir = '/content/stanford-postagger-full-2020-08-06'
    modelfile = f'{stanford_dir}/models/spanish-ud.tagger'
    jarfile = f'{stanford_dir}/stanford-postagger.jar'
    
    st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
    

    To check that everything works fine, we can do:

    >>> st.tag(["Juan", "Medina", "es", "un", "ingeniero"])
    [('Juan', 'PROPN'),
     ('Medina', 'PROPN'),
     ('es', 'AUX'),
     ('un', 'DET'),
     ('ingeniero', 'NOUN')]
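
    If the call fails instead, two things worth ruling out first are a mistyped path and a missing Java runtime, since NLTK's Stanford wrappers run the tagger through the java command. The following is a minimal sketch of such a check; missing_files and java_available are hypothetical helper names, not part of NLTK, and the paths are the ones from this example.

```python
import os
import shutil


def missing_files(*paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    # Paths taken from the example in this answer; adjust to your layout.
    stanford_dir = "/content/stanford-postagger-full-2020-08-06"
    modelfile = f"{stanford_dir}/models/spanish-ud.tagger"
    jarfile = f"{stanford_dir}/stanford-postagger.jar"

    for path in missing_files(modelfile, jarfile):
        print(f"Not found: {path}")
    if not java_available():
        print("No `java` executable on PATH")
```

    The same check applies to the NER jar and classifier file further down; only the paths change.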
    

    NER Tagger

    In this case it is necessary to download the NER core and the Spanish models separately:

    cd /content
    # Download the NER core
    wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
    unzip stanford-ner-4.0.0.zip
    # Download the Spanish models
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
    unzip stanford-spanish-corenlp-2018-02-27-models.jar -d stanford-spanish
    # Copy only the necessary files
    cp stanford-spanish/edu/stanford/nlp/models/ner/* stanford-ner-4.0.0/classifiers/
    rm -rf stanford-spanish stanford-ner-4.0.0.zip stanford-spanish-corenlp-2018-02-27-models.jar
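
    A jar file is a zip archive, so the unzip and cp steps can also be done from Python without unpacking the whole jar. This is only a sketch under that assumption: extract_ner_models is a hypothetical helper name, and the prefix is the folder used in the cp command above.

```python
import os
import shutil
import zipfile

# Folder inside the models jar that holds the NER classifiers.
NER_PREFIX = "edu/stanford/nlp/models/ner/"


def extract_ner_models(jar_path, dest_dir, prefix=NER_PREFIX):
    """Copy the files directly under `prefix` in the jar into dest_dir.

    Returns the list of file names written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            name = info.filename
            # Skip directories and anything outside the NER model folder.
            if info.is_dir() or not name.startswith(prefix):
                continue
            relative = name[len(prefix):]
            # Mirror `cp dir/*`: take only files directly in the folder.
            if "/" in relative:
                continue
            target = os.path.join(dest_dir, relative)
            with jar.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(relative)
    return written
```

    With the files from this example, the call would be extract_ner_models('stanford-spanish-corenlp-2018-02-27-models.jar', 'stanford-ner-4.0.0/classifiers').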
    

    To use it from Python:

    from nltk.tag.stanford import StanfordNERTagger
    stanford_dir = '/content/stanford-ner-4.0.0'
    jarfile = f'{stanford_dir}/stanford-ner.jar'
    modelfile = f'{stanford_dir}/classifiers/spanish.ancora.distsim.s512.crf.ser.gz'
    
    st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
    

    To check that everything works fine, we can do:

    >>> st.tag(["Juan", "Medina", "es", "un", "ingeniero"])
    [('Juan', 'PERS'),
     ('Medina', 'PERS'),
     ('es', 'O'),
     ('un', 'O'),
     ('ingeniero', 'O')]
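
    The tagger labels each token separately, so a full name comes back as several PERS tokens in a row. One possible way to merge them into a single entity is sketched below; group_entities is a hypothetical helper, not part of NLTK, and it assumes 'O' marks tokens outside any entity, as in the output above.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge consecutive tokens with the same non-outside tag into spans."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label == outside:
            continue
        entities.append((" ".join(token for token, _ in run), label))
    return entities


tagged = [("Juan", "PERS"), ("Medina", "PERS"), ("es", "O"),
          ("un", "O"), ("ingeniero", "O")]
print(group_entities(tagged))  # [('Juan Medina', 'PERS')]
```

    Note that this simple grouping also merges two different entities of the same type when they are directly adjacent, because the tags carry no begin/inside markers.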
    
