Question
I am new to NLP, and to sentiment analysis in particular. My goal is to train the Stanford CoreNLP sentiment model. I am aware that the sentences provided as training data should be given as sentiment-labeled parse trees in the following format.
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
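Every node in the tree carries a sentiment label from 0 (very negative) to 4 (very positive), and the root label gives the sentiment of the whole sentence. As a sanity check on the format, here is a minimal sketch (my own, using CoreNLP's Tree class and a shortened, made-up example sentence) that parses one such line and prints the per-word labels:

import edu.stanford.nlp.trees.Tree;

public class InspectLabeledTree {
  public static void main(String[] args) {
    // A short, made-up tree in the same bracketed format as the training data.
    String line = "(3 (2 (2 The) (2 Rock)) (4 (3 is) (4 great)))";
    Tree tree = Tree.valueOf(line);                                  // parse the bracketed string
    System.out.println("Sentence label: " + tree.label().value());  // root sentiment, 0-4
    for (Tree node : tree) {                                         // Tree iterates over all of its subtrees
      if (node.isPreTerminal()) {                                    // nodes directly above a word
        System.out.println(node.firstChild().value() + " -> " + node.label().value());
      }
    }
  }
}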
I am also aware that I can create the sentiment training model with my own training data using the following command.
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
My question is: do I have access to the training data set that was used to train the distributed model, and if so, where can I find it? Also, is there a way to append new sentences to the original training data and then train a model on the combined set?
Answer 1:
The data is available here: http://nlp.stanford.edu/sentiment/
If you create a new data set in the same format, you can put the files in a directory and point -trainPath at that directory; it will load all files from that directory and train on them.
Sample command:
java -Xmx8g edu.stanford.nlp.sentiment.SentimentTraining -train -numHid 25 -trainPath trees/training-data/ -model model.ser.gz
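If you also want to keep training on the original sentences, one way (a sketch; the directory and file names below are placeholders) is to copy the distributed train.txt into that training directory next to a file of your own labeled trees, one tree per line, before running the command above:

mkdir -p trees/training-data/
cp path/to/downloaded/train.txt trees/training-data/   # original Sentiment Treebank training trees
cp my-extra-trees.txt trees/training-data/             # your new sentences in the same labeled-tree format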
Source: https://stackoverflow.com/questions/42550092/stanford-corenlp-sentiment-training-set