Managing Train/Develop Splits with the spaCy command line trainer

Submitted by 狂风中的少年 on 2020-03-03 07:34:29

Question


I am training an NER model using the python -m spacy train command line tool. I use gold.docs_to_json to convert my annotated documents to the JSON-serializable format.

The command line training tool uses both a training set and a development set. I'm not sure how much assistance the command line tools give me for managing train/dev splits.

  1. Is there a command line tool to create train/dev splits from a single set of data?
  2. Will the spaCy training command do cross-validation for me instead of making me create a dev set?
  3. When it comes time to train the production model on all the data, what do I use as the dev set?

I think the answer to both questions (1) and (2) is "no", but I want to double-check.

From experimenting, it appears that you always have to pass in a non-empty dev set, even when you are training a production model for a fixed number of iterations. For now I just pass in a copy of my training data, but that seems odd, so I'm wondering whether there is some other procedure I'm missing.

The spaCy documentation on training mostly discusses writing your own iteration loops. I've done enough of that that I'm sure I could make any of the above work if I wrote my own code, but for these basic training operations I'd rather not write code and just use the command line tools for everything.
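
For reference, the kind of code I would rather not have to maintain looks roughly like the following. It is only a minimal sketch using the Python standard library, and it assumes the annotated data is already held as a Python list of JSON-serializable items. The names split_train_dev, k_fold_splits, and write_json are hypothetical helpers made up for illustration; they are not part of spaCy.

```python
import json
import random


def split_train_dev(items, dev_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, dev) lists.

    Hypothetical helper, not a spaCy function.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one item for the dev set.
    n_dev = max(1, int(round(len(shuffled) * dev_fraction)))
    return shuffled[n_dev:], shuffled[:n_dev]


def k_fold_splits(items, k=5, seed=0):
    """Yield k (train, dev) pairs; each item lands in exactly one dev fold.

    Hypothetical helper, not a spaCy function.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    for i in range(k):
        # Every k-th item, starting at offset i, forms the dev fold.
        dev = shuffled[i::k]
        train = [x for j, x in enumerate(shuffled) if j % k != i]
        yield train, dev


def write_json(path, items):
    """Write a list of JSON-serializable items to path."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(items, f)
```

Each (train, dev) pair would then be written out with write_json and the two files passed to the trainer as the training and development paths, with one training run per fold in the cross-validation case.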

Source: https://stackoverflow.com/questions/59921513/managing-train-develop-splits-with-the-spacy-command-line-trainer
