NB: I am not a developer of the Stanford tools, nor an NLP expert, just an ordinary user who needed this information at some point. Also note that part of the information below comes from the official FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
Here are the steps I followed to train my own NER:
Create a train/test sample. It must be a .tsv file with one token per line: the word in the first column and its label in the second, tab-separated, for example:
Venez O
découvrir O
lundi DAY
le O
nouvel O
espace O
de O
vente O
ODHOJS ORGANISATION
Depending on the original format of your text, you can create this sample with SQL statements or other NLP tools. The labelling is the most complicated part, as I don't know of a better way than doing it by hand.
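If you start from raw text, one way to bootstrap the .tsv (adapted from the FAQ linked above; the file names are only examples) is to tokenize with the PTBTokenizer shipped in stanford-ner.jar and append a default O label, which you then correct by hand:

java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.process.PTBTokenizer raw.txt > raw.tok
perl -ne 'chomp; print "$_\tO\n"' raw.tok > train.tsv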
Train the model with this command:
java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop prop.txt
where prop.txt
holds the training properties (it is described in the FAQ linked above).
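For reference, here is a minimal prop.txt along the lines of the one given in the FAQ; trainFile and serializeTo are just example names, and the feature flags can be tuned:

# training data and where to save the resulting model
trainFile = train.tsv
serializeTo = ner-model.ser.gz
# column layout of the .tsv: word in column 0, label in column 1
map = word=0,answer=1

# features used by the CRF
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true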
This should create a serialized model file, named by the serializeTo property (ner-model.ser.gz above), containing the newly trained model.
Test the model's performance:
java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile test.tsv > test.res
The input test.tsv
has the same format as the train.tsv
file. The output in test.res
has an extra column containing the predicted NER class. The last lines also show a summary of precision, recall and F1.
Finally, you can use your NER on real data:
java -cp "stanford-ner.jar:lib/*" -mx5g edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt -outputFormat inlineXML > test.res
Hope it helps.