Strategy for parsing natural language descriptions into structured data

烈酒焚心 submitted on 2019-12-04 00:53:10

Short answer. Use GATE.

Long answer. You need a tool for pattern recognition in text, something that can catch patterns like:

{Number}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}{"("}{Value}{")"}
...

Here {Number} is a number, {Ingredient} is taken from a dictionary of ingredients, {Measure} from a dictionary of measures, and so on.
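To make the idea concrete outside of any particular tool, here is a minimal sketch of the same patterns using plain regular expressions in Python. This is not GATE or JAPE; the tiny MEASURES and INGREDIENTS lists and the parse_line function are made up for illustration, and a real dictionary would be much larger.

```python
import re

# Hypothetical mini-dictionaries; a real system would load much larger lists.
MEASURES = ["cups", "cup", "tablespoons", "tablespoon", "grams", "gram", "g"]
INGREDIENTS = ["chicken breasts", "chicken", "flour", "sugar", "eggs", "egg"]


def alternation(words):
    # Longest entries first so "chicken breasts" is tried before "chicken".
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)


NUMBER = r"(?P<number>\d+(?:\.\d+)?)"
MEASURE = rf"(?P<measure>{alternation(MEASURES)})"
INGREDIENT = rf"(?P<ingredient>{alternation(INGREDIENTS)})"
# Optional parenthesised value, e.g. "(diced)".
VALUE = r"(?:\s*\((?P<value>[^)]*)\))?"

# Patterns are tried in order, most specific first.
PATTERNS = [
    re.compile(rf"{NUMBER}\s+{MEASURE}\s+of\s+{INGREDIENT}{VALUE}", re.IGNORECASE),
    re.compile(rf"{NUMBER}\s+{INGREDIENT}", re.IGNORECASE),
]


def parse_line(text):
    """Return a dict of labeled parts, or None if no pattern matches the whole line."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text.strip())
        if match:
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None
```

Lines that return None are the ones that need a new pattern or a new dictionary entry.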

The patterns I described are very similar to GATE's JAPE rules. With them you catch text that matches a pattern and assign a label to each part of the pattern (number, ingredient, measure, etc.). Then you extract the labeled text and put it into a single table.

The dictionaries I mentioned can be represented by Gazetteers in GATE.

So, GATE covers all your needs. It's not the easiest way to start, since you will have to learn at least GATE's basics, JAPE rules and Gazetteers, but with this approach you will be able to get really good results.

This is basically natural language parsing. (You have already done stemming: chicken[s].) So in essence it is a translation process. Fortunately, the context is very restricted.

You need a translation process with tool support, where you can add dictionary entries, adapt the grammar rules, and retry.
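One way to support that add-and-retry loop is to report which words the dictionaries do not cover yet, so you know what to add next. A rough sketch in Python, where unknown_words and its arguments are hypothetical names chosen for this example:

```python
import re
from collections import Counter


def unknown_words(lines, vocabulary):
    """Count words that no dictionary covers, most frequent first."""
    vocab = {w.lower() for w in vocabulary}
    counts = Counter()
    for line in lines:
        # Only alphabetic words are checked; numbers and punctuation are ignored.
        for word in re.findall(r"[A-Za-z]+", line.lower()):
            if word not in vocab:
                counts[word] += 1
    return counts.most_common()
```

Running this after each change to the dictionaries gives a shrinking to-do list, with the most frequent gaps at the top.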

An easy process/workflow in this case is much more important than the algorithms. I am interested in both aspects.

If you need a programming hand for an initial prototype, feel free to contact me. I can see you are already working in quite a structured way.

Unfortunately, I do not know of any suitable frameworks. You are doing something similar to what Mathematica wants to do with its Alpha (natural language commands yielding results). Data mining? But simple natural language parsing with a manual adaptation process should give fast and easy results.

You can also try Gexp. Then you have to write the rules as Java classes, such as

seq(Number, opt(Measure), Ingredient, opt(seq(token("("), Number, Measure, token(")"))))

Then you have to add groups to capture (group(String name, Matcher m)) and extract the parts of the pattern, and store this information in a table. For Number and Measure you should use similar Gexp patterns, or I would recommend some shallow parsing for noun phrase detection with words from Ingredients.
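To show how seq, opt, token and group style combinators fit together, here is a small sketch written from scratch in Python. It is not Gexp's API; every function below, the word lists, and the capture names (alt_number, alt_measure) are assumptions made for this example. The matchers are greedy and do not backtrack.

```python
import re


def tokenize(text):
    # Split into numbers, words, and single punctuation characters.
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", text)


# A matcher takes (tokens, pos, caps) and returns the new position, or None.
def token(literal):
    def match(tokens, pos, caps):
        if pos < len(tokens) and tokens[pos].lower() == literal:
            return pos + 1
        return None
    return match


def one_of(words):
    vocab = {w.lower() for w in words}
    def match(tokens, pos, caps):
        if pos < len(tokens) and tokens[pos].lower() in vocab:
            return pos + 1
        return None
    return match


def number(tokens, pos, caps):
    if pos < len(tokens) and re.fullmatch(r"\d+(?:\.\d+)?", tokens[pos]):
        return pos + 1
    return None


def seq(*matchers):
    def match(tokens, pos, caps):
        local = dict(caps)  # captures are committed only if the whole sequence matches
        for m in matchers:
            pos = m(tokens, pos, local)
            if pos is None:
                return None
        caps.update(local)
        return pos
    return match


def opt(matcher):
    def match(tokens, pos, caps):
        end = matcher(tokens, pos, caps)
        return pos if end is None else end
    return match


def group(name, matcher):
    def match(tokens, pos, caps):
        end = matcher(tokens, pos, caps)
        if end is not None:
            caps[name] = " ".join(tokens[pos:end])
        return end
    return match


MEASURE = one_of(["cup", "cups", "g", "grams"])    # hypothetical word list
INGREDIENT = one_of(["flour", "sugar", "chicken"])  # hypothetical word list

RULE = seq(
    group("number", number),
    opt(group("measure", MEASURE)),
    group("ingredient", INGREDIENT),
    opt(seq(token("("), group("alt_number", number),
            group("alt_measure", MEASURE), token(")"))),
)


def parse(text):
    tokens = tokenize(text)
    caps = {}
    end = RULE(tokens, 0, caps)
    return caps if end == len(tokens) else None
```

The captured names map directly onto table columns, which is the step described above.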

If you don't want to be exposed to the nitty-gritty of NLP and machine learning, there are a few hosted services that do this for you:

If you are interested in the nitty-gritty, the New York Times wrote about how they parsed their ingredient archive. They open-sourced their code, but abandoned it soon after. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.

Do you have access to a tagged corpus for training a statistical model? That is probably the most fruitful avenue here. You could build one up using epicurious.com; scrape a lot of their recipe ingredients lists, which are in the kind of prose form you need to parse, and then use their helpful "print a shopping list" feature, which provides the same ingredients in a tabular format. You can use this data to train a statistical language model, since you will have both the raw untagged data, and the expected parse results for a large number of examples.
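As a rough idea of how the prose lines and the tabular rows could be turned into tagged training examples, here is a sketch in Python. The field names (quantity, unit, name) and the shape of the row are assumptions; I do not know what the scraped shopping list actually looks like, so treat label_tokens as a starting point only.

```python
import re


def tokenize(text):
    # Lowercased numbers, words, and single punctuation characters.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text.lower())


def label_tokens(raw_line, row):
    """Tag each token of raw_line with the field of `row` it belongs to, else "OTHER"."""
    tokens = tokenize(raw_line)
    labels = ["OTHER"] * len(tokens)
    for field, value in row.items():
        target = tokenize(value)
        if not target:
            continue
        # Find the first unlabeled occurrence of the field's tokens in the line.
        for start in range(len(tokens) - len(target) + 1):
            window = range(start, start + len(target))
            if (tokens[start:start + len(target)] == target
                    and all(labels[i] == "OTHER" for i in window)):
                for i in window:
                    labels[i] = field
                break
    return list(zip(tokens, labels))
```

Each (token, label) sequence is one training example for a sequence tagger. Lines where a field value cannot be found in the raw text are worth inspecting by hand before training on them.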

This might be a bigger project than you have in mind, but I think in the end it will produce better results than a structured top-down parsing approach will.
