Setting up NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish

天涯浪人 2021-02-09 07:54

The NLTK documentation on this integration is rather poor. The steps I followed were:

  • Download http://nlp.stanford.edu/software/stanford-postagger-

3 Answers
  • 2021-02-09 08:10

    Try:

    # StanfordPOSTagger
    from nltk.tag.stanford import StanfordPOSTagger
    stanford_dir = '/home/me/stanford/stanford-postagger-full-2015-04-20/'
    modelfile = stanford_dir + 'models/english-bidirectional-distsim.tagger'
    jarfile = stanford_dir + 'stanford-postagger.jar'
    
    st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
    
    
    # StanfordNERTagger
    from nltk.tag.stanford import StanfordNERTagger
    stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
    jarfile = stanford_dir + 'stanford-ner.jar'
    modelfile = stanford_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'
    
    st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
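
If either path is wrong, the problem only surfaces when the tagger is constructed. A minimal sketch for checking both paths first, using only the standard library. `resolve_stanford_paths` is a hypothetical helper name (not part of NLTK), and the directory layout is assumed to match the one shown above:

```python
from pathlib import Path


def resolve_stanford_paths(stanford_dir, model_rel, jar_name):
    """Return (model_path, jar_path) as strings, or raise if either is missing.

    Hypothetical helper, not part of NLTK: it only inspects the file system.
    """
    base = Path(stanford_dir)
    model = base / model_rel
    jar = base / jar_name
    # Collect every missing file so the error names all of them at once.
    missing = [str(p) for p in (model, jar) if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing Stanford files: " + ", ".join(missing))
    return str(model), str(jar)
```

The returned pair can then be passed as `model_filename` and `path_to_jar` to `StanfordPOSTagger` or `StanfordNERTagger`.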
    

    For detailed information on NLTK API with Stanford tools, take a look at: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#stanford-tagger-ner-tokenizer-and-parser

    Note: The NLTK APIs wrap the individual Stanford tools. If you're using Stanford CoreNLP, it's best to follow @dimazest's instructions at http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html


    EDITED

    As for Spanish NER tagging, I strongly suggest that you use Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) instead of the Stanford NER package (http://nlp.stanford.edu/software/CRF-NER.shtml), and follow @dimazest's solution for reading the JSON file.

    Alternatively, if you must use the NER package, you can try following the instructions from https://github.com/alvations/nltk_cli (disclaimer: this repo is not officially affiliated with NLTK). Run the following on the Unix command line:

    cd $HOME
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar
    unzip stanford-spanish-corenlp-2015-01-08-models.jar -d stanford-spanish
    cp stanford-spanish/edu/stanford/nlp/models/ner/* /home/me/stanford/stanford-ner-2015-04-20/classifiers/
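
Since a .jar file is a ZIP archive, the unzip and copy steps above can also be done from Python. This is only a sketch under that assumption: `extract_ner_models` is a hypothetical helper name, and the internal folder `edu/stanford/nlp/models/ner/` is taken from the `cp` command above rather than verified against every models release.

```python
import zipfile
from pathlib import Path

# Folder inside the models jar that holds the NER classifiers (from the cp command above).
NER_PREFIX = "edu/stanford/nlp/models/ner/"


def extract_ner_models(jar_path, dest_dir):
    """Copy every file directly under NER_PREFIX in the jar into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            name = info.filename
            if info.is_dir() or not name.startswith(NER_PREFIX):
                continue
            relative = name[len(NER_PREFIX):]
            if "/" in relative:
                continue  # skip nested folders, like `cp dir/*` without -r
            target = dest / relative
            target.write_bytes(jar.read(info))
            written.append(target.name)
    return sorted(written)
```

Calling it with the downloaded models jar and the `classifiers/` folder of the NER package would return the names of the copied classifier files.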
    

    Then in python:

    # StanfordNERTagger
    from nltk.tag.stanford import StanfordNERTagger
    stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
    jarfile = stanford_dir + 'stanford-ner.jar'
    modelfile = stanford_dir + 'classifiers/spanish.ancora.distsim.s512.crf.ser.gz'
    
    st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
    
  • 2021-02-09 08:11

    The error lies in the arguments passed to the StanfordNERTagger constructor.

    The first argument should be the model (classifier) file you are using, which you can find inside the Stanford zip file. For example:

        st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20/classifier/tagger.ser.gz', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')
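
A quick way to catch swapped or mistaken arguments before calling the constructor is to look at the file extensions. The sketch below is a heuristic only: `check_tagger_args` is a hypothetical helper, and it assumes (based on the file names used in this thread) that the jar argument ends in `.jar` and the model argument does not.

```python
def check_tagger_args(model_filename, path_to_jar):
    """Return a list of warnings if the two arguments look swapped or wrong.

    Heuristic only: it inspects file extensions, not file contents.
    """
    problems = []
    if model_filename.endswith(".jar"):
        problems.append("model_filename points at a .jar; expected a model or classifier file")
    if not path_to_jar.endswith(".jar"):
        problems.append("path_to_jar does not end in .jar")
    return problems
```

An empty list means nothing looked suspicious; it does not guarantee that the files exist or are compatible with each other.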
    
  • 2021-02-09 08:12

    POS Tagger

    To use the StanfordPOSTagger for Spanish from Python, you have to install a Stanford tagger release that includes a Spanish model.

    In this example I download the tagger into the /content folder:

    cd /content
    wget https://nlp.stanford.edu/software/stanford-tagger-4.1.0.zip
    unzip stanford-tagger-4.1.0.zip
    

    After unzipping, I have a folder stanford-postagger-full-2020-08-06 in /content, so I can use the tagger with:

    from nltk.tag.stanford import StanfordPOSTagger
    
    stanford_dir = '/content/stanford-postagger-full-2020-08-06'
    modelfile = f'{stanford_dir}/models/spanish-ud.tagger'
    jarfile = f'{stanford_dir}/stanford-postagger.jar'
    
    st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)
    

    To check that everything works fine, we can do:

    >>> st.tag(["Juan","Medina","es","un","ingeniero"])
    [('Juan', 'PROPN'),
     ('Medina', 'PROPN'),
     ('es', 'AUX'),
     ('un', 'DET'),
     ('ingeniero', 'NOUN')]
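
The tagger returns a plain list of (token, tag) pairs, so it can be post-processed with ordinary Python. As an illustration, this sketch groups the tokens by tag; the `tagged` list is copied from the output above so that the example runs without Java or the Stanford files, and `tokens_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Output copied from the st.tag(...) call above.
tagged = [("Juan", "PROPN"), ("Medina", "PROPN"), ("es", "AUX"),
          ("un", "DET"), ("ingeniero", "NOUN")]


def tokens_by_tag(tagged_tokens):
    """Map each tag to the list of tokens that received it, in sentence order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


print(tokens_by_tag(tagged))
# {'PROPN': ['Juan', 'Medina'], 'AUX': ['es'], 'DET': ['un'], 'NOUN': ['ingeniero']}
```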
    

    NER Tagger

    In this case it is necessary to download the NER core and the Spanish models separately.

    cd /content
    #download NER core
    wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
    unzip stanford-ner-4.0.0.zip
    #download spanish models
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
    unzip stanford-spanish-corenlp-2018-02-27-models.jar -d stanford-spanish
    #copy only the necessary files
    cp stanford-spanish/edu/stanford/nlp/models/ner/* stanford-ner-4.0.0/classifiers/
    rm -rf stanford-spanish stanford-ner-4.0.0.zip stanford-spanish-corenlp-2018-02-27-models.jar
    

    To use it in Python:

    from nltk.tag.stanford import StanfordNERTagger
    stanford_dir = '/content/stanford-ner-4.0.0'
    jarfile = f'{stanford_dir}/stanford-ner.jar'
    modelfile = f'{stanford_dir}/classifiers/spanish.ancora.distsim.s512.crf.ser.gz'
    
    st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
    

    To check that everything works fine, we can do:

    >>> st.tag(["Juan","Medina","es","un","ingeniero"])
    [('Juan', 'PERS'),
     ('Medina', 'PERS'),
     ('es', 'O'),
     ('un', 'O'),
     ('ingeniero', 'O')]
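
The NER tagger labels each token separately, so a multi-word name such as "Juan Medina" comes back as two PERS tokens. A minimal sketch for joining adjacent tokens that share a tag, assuming (as in the output above) that `O` marks tokens outside any entity; `merge_entities` is a hypothetical helper and the `tagged` list is copied from that output:

```python
from itertools import groupby

# Output copied from the st.tag(...) call above.
tagged = [("Juan", "PERS"), ("Medina", "PERS"), ("es", "O"),
          ("un", "O"), ("ingeniero", "O")]


def merge_entities(tagged_tokens, outside="O"):
    """Join runs of adjacent tokens that share the same non-outside tag."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


print(merge_entities(tagged))  # [('Juan Medina', 'PERS')]
```

Note that this simple grouping would also merge two different entities of the same type if they appear right next to each other.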
    