Identify an English word as a thing or product?

Posted by 对着背影说爱祢 on 2019-12-13 07:42:41

Question


I want to write a program that can identify whether a word or phrase represents a thing/product. For example: 1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <- it should identify glove as a thing/product. 2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- it should identify regulator as a thing. Doing this tells me that the text is talking about a thing/product. By contrast, the following text describes a process rather than a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."

I have millions of such texts, so doing this manually is not feasible. So far, using NLTK and Python, I have been able to identify some specific cases that use very similar keywords, but I have not been able to handle cases like the examples above. Any help will be appreciated!


Answer 1:


What you want to do is actually pretty difficult. It is a (very specific) kind of semantic labelling task. The possible solutions are:

  • create your own labelling algorithm, create training data, test, evaluate, and finally tag your data
  • use an existing knowledge base (lexicon) to extract semantic labels for each target word

The first option is a complex research project in itself. Do it if you have the time and resources.

The second option will only give you the labels that are available in the knowledge base, and these might not match your needs. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
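A minimal sketch of the hypernym idea, under loud assumptions: it treats the WordNet synset `artifact.n.01` (a man-made object) as a stand-in for "thing/product", and the names `THING_ROOTS`, `wordnet_hypernym_names` and `looks_like_thing` are made up for illustration. It needs NLTK with the `wordnet` corpus downloaded.

```python
from typing import Callable, FrozenSet, Set

# Assumption: "artifact.n.01" (a man-made object) approximates "thing/product".
THING_ROOTS: FrozenSet[str] = frozenset({"artifact.n.01"})


def wordnet_hypernym_names(word: str) -> Set[str]:
    """Collect the names of every noun synset of `word` plus all their hypernyms."""
    # Imported lazily: requires `pip install nltk` and nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    names: Set[str] = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        names.add(synset.name())
        # closure() follows the hypernym relation transitively towards the root.
        for hypernym in synset.closure(lambda s: s.hypernyms()):
            names.add(hypernym.name())
    return names


def looks_like_thing(
    word: str,
    lookup: Callable[[str], Set[str]] = wordnet_hypernym_names,
    roots: FrozenSet[str] = THING_ROOTS,
) -> bool:
    """True if any noun sense of `word` has one of `roots` among its hypernyms."""
    return not roots.isdisjoint(lookup(word))
```

You would call it as `looks_like_thing("glove")`. Note that it accepts a word if any of its senses is an artifact, so polysemous words can match through an unrelated sense, and you still have to pick the head noun (for example "regulator" in "window regulator") before looking it up.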




Answer 2:


This task is called the named entity recognition (NER) problem.

EDIT: There is no clean definition of NER in the NLP community, so one can say this is not an NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.

Out of the box, Stanford NLP can only recognize the following types:

Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities

so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can readily be found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.

Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field (CRF) with some basic features (here is the documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect, of course (no system will be perfect, but some solutions are better than others).

EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an (example) step-by-step instruction for how to do this using an open source tool:

  • Download CRF++ and look at the provided examples; they are in a simple text format
  • Annotate your data in a similar manner, with one token and its label per line:
    a OTHER 
    glove PRODUCT 
    comprising OTHER
    ... 

and so on.

Split your annotated data into two files: train (80%) and dev (20%).
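The annotation format and the 80/20 split could be scripted roughly as follows. This is only a sketch: it assumes your annotations are already in memory as lists of (token, label) pairs, that tokens contain no whitespace, and that sentences are separated by a blank line as in the CRF++ examples. The helper names `to_crfpp`, `split_train_dev` and `write_split` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # [(token, label), ...]


def to_crfpp(sentences: Sequence[Sentence]) -> str:
    """Render sentences as 'token label' lines with a blank line between sentences."""
    blocks = ["\n".join(f"{tok} {lab}" for tok, lab in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def split_train_dev(
    sentences: Sequence[Sentence], dev_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sentence], List[Sentence]]:
    """Shuffle whole sentences (never single tokens) and split them into train/dev."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_split(sentences: Sequence[Sentence]) -> None:
    """Write train.txt and dev.txt, the file names used by the commands below."""
    train, dev = split_train_dev(sentences)
    for path, part in (("train.txt", train), ("dev.txt", dev)):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(to_crfpp(part))
```

Splitting by sentence rather than by line matters here, because a CRF uses neighbouring tokens as context and a sentence cut in half would corrupt both files.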

  1. Use the following baseline feature templates (paste them into a file named template):

    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]

  2. Run:

    crf_learn template train.txt model
    crf_test -m model dev.txt > result.txt

  3. Look at result.txt. The last two columns contain your hand-made labels and the machine-predicted labels, respectively. You can then compare them, compute accuracy, and so on. After that, you can feed new unlabeled data into crf_test to get labels for it.
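Comparing the two label columns can be sketched like this. It assumes crf_test was run without verbose options, so every non-blank line of result.txt ends with the hand-made label followed by the predicted label. The function names and the PRODUCT default are illustrative, not part of CRF++.

```python
from typing import Iterable, List, Tuple


def read_gold_and_predicted(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (gold, predicted) label pairs from crf_test output lines."""
    pairs = []
    for line in lines:
        cols = line.split()
        if len(cols) < 3:  # blank sentence separators or malformed lines
            continue
        pairs.append((cols[-2], cols[-1]))
    return pairs


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of tokens whose predicted label equals the gold label."""
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


def precision_recall(
    pairs: List[Tuple[str, str]], label: str = "PRODUCT"
) -> Tuple[float, float]:
    """Token-level precision and recall for a single label."""
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)
    actual = sum(1 for g, _ in pairs if g == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

With `open("result.txt", encoding="utf-8")` as the input, precision and recall on PRODUCT are usually more telling than plain accuracy, since most tokens are OTHER and a model that never predicts PRODUCT would still score high on accuracy.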

As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.

ENDNOTE: This ignores many things and some best practices for solving such tasks, won't be good enough for academic research, and is not 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.



Source: https://stackoverflow.com/questions/28574183/identify-an-english-word-as-a-thing-or-product
