Setting up NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish


天涯浪人 2021-02-09 07:54

The NLTK documentation on this integration is rather poor. The steps I followed were:

Download http://nlp.stanford.edu/software/stanford-postagger-

3 answers

轻奢々 (original poster)

2021-02-09 08:12

POS Tagger

To use the StanfordPOSTagger for Spanish from Python, you have to install the Stanford tagger distribution that includes a Spanish model.

In this example I download the tagger into the /content folder:

cd /content
wget https://nlp.stanford.edu/software/stanford-tagger-4.1.0.zip
unzip stanford-tagger-4.1.0.zip

After unzipping, I have a folder stanford-postagger-full-2020-08-06 in /content, so I can use the tagger with:

from nltk.tag.stanford import StanfordPOSTagger

# folder created by unzipping the archive
stanford_dir = '/content/stanford-postagger-full-2020-08-06'
# Spanish model (Universal Dependencies tagset) and the tagger jar
modelfile = f'{stanford_dir}/models/spanish-ud.tagger'
jarfile = f'{stanford_dir}/stanford-postagger.jar'

st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)

To check that everything works fine, we can do:

>>> st.tag(["Juan","Medina","es","un","ingeniero"])
[('Juan', 'PROPN'),
 ('Medina', 'PROPN'),
 ('es', 'AUX'),
 ('un', 'DET'),
 ('ingeniero', 'NOUN')]
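
If the constructor or the first tag() call fails, the usual suspects are a wrong path or a missing Java runtime, since NLTK runs the Stanford tagger in a Java subprocess. Below is a minimal sketch of a pre-flight check, assuming the paths used above; missing_requirements is a hypothetical helper name of my own, not part of NLTK.

```python
import shutil
from pathlib import Path


def missing_requirements(jarfile, modelfile):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not an NLTK function.
    """
    problems = []
    # The tagger is executed through Java, so `java` must be on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    # Both files must exist at the exact paths handed to the constructor.
    for label, path in (("jar", jarfile), ("model", modelfile)):
        if not Path(path).is_file():
            problems.append(f"{label} file not found: {path}")
    return problems


if __name__ == "__main__":
    # Paths assumed from the example above; adjust to your own folder.
    stanford_dir = "/content/stanford-postagger-full-2020-08-06"
    for problem in missing_requirements(
        f"{stanford_dir}/stanford-postagger.jar",
        f"{stanford_dir}/models/spanish-ud.tagger",
    ):
        print(problem)
```

The same check should work for the NER tagger if you pass it the NER jar and classifier paths instead.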

NER Tagger

In this case it is necessary to download the NER core and the Spanish model separately.

cd /content
# download the NER core
wget https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
unzip stanford-ner-4.0.0.zip
# download the Spanish models
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
unzip stanford-spanish-corenlp-2018-02-27-models.jar -d stanford-spanish
# copy only the necessary files
cp stanford-spanish/edu/stanford/nlp/models/ner/* stanford-ner-4.0.0/classifiers/
rm -rf stanford-spanish stanford-ner-4.0.0.zip stanford-spanish-corenlp-2018-02-27-models.jar
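
Since a .jar is just a zip archive, the unzip and cp steps can also be done from Python, which is handy in a notebook. This is only a sketch under the same assumptions as the shell commands above (the same jar and the edu/stanford/nlp/models/ner/ layout inside it); extract_ner_models is a hypothetical helper, not something NLTK or Stanford provides.

```python
import shutil
import zipfile
from pathlib import Path


def extract_ner_models(jar_path, dest_dir, prefix="edu/stanford/nlp/models/ner/"):
    """Copy the files stored directly under `prefix` in the jar into dest_dir.

    Hypothetical helper for illustration. Returns the sorted file names copied.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(jar_path) as jar:  # a .jar is a zip archive
        for info in jar.infolist():
            if not info.filename.startswith(prefix):
                continue
            relative = info.filename[len(prefix):]
            # Skip the directory entry itself and anything in subfolders,
            # mirroring what `cp .../ner/*` does without -r.
            if not relative or "/" in relative:
                continue
            with jar.open(info) as src, open(dest / relative, "wb") as out:
                shutil.copyfileobj(src, out)
            copied.append(relative)
    return sorted(copied)
```

For the setup above it would be called as extract_ner_models('/content/stanford-spanish-corenlp-2018-02-27-models.jar', '/content/stanford-ner-4.0.0/classifiers').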

To use it from Python:

from nltk.tag.stanford import StanfordNERTagger

# folder created by unzipping the NER core
stanford_dir = '/content/stanford-ner-4.0.0'
jarfile = f'{stanford_dir}/stanford-ner.jar'
modelfile = f'{stanford_dir}/classifiers/spanish.ancora.distsim.s512.crf.ser.gz'

st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)

To check that everything works fine, we can do:

>>> st.tag(["Juan","Medina","es","un","ingeniero"])
[('Juan', 'PERS'),
 ('Medina', 'PERS'),
 ('es', 'O'),
 ('un', 'O'),
 ('ingeniero', 'O')]
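
The tagger labels each token separately, so a multi-word name comes back as several PERS tokens. Here is a small sketch of how adjacent tokens with the same label could be merged into one entity. group_entities is a hypothetical helper, and it assumes 'O' marks non-entity tokens, as in the output above.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens that share the same non-'O' tag.

    Hypothetical helper for illustration, not part of NLTK.
    """
    entities = []
    # groupby splits the list into consecutive runs with an identical tag.
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


# Sample taken from the tagger output shown above.
tagged = [("Juan", "PERS"), ("Medina", "PERS"), ("es", "O"),
          ("un", "O"), ("ingeniero", "O")]
print(group_entities(tagged))  # [('Juan Medina', 'PERS')]
```

One limitation of this approach: two different entities of the same type that sit directly next to each other would be merged into one.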
