Methods for automated synonym detection

忘了有多久 2021-02-03 14:04

I am currently working on a neural-network-based approach to short-document classification, and since the documents I am working with are usually around ten words long, the standard s

4 answers
  • 2021-02-03 14:33

    The code here (http://ronan.collobert.com/senna/) implements a neural network to perform a variety of NLP tasks. The page also links to a paper that describes one of the most successful approaches so far to applying convolutional neural nets to NLP tasks.

    It is possible to modify their code to use the trained networks that they provide to classify sentences, but this may take more work than you were hoping for, and it can be tricky to correctly train neural networks.

    I had a lot of success using a similar technique to classify biological sequences, but, in contrast to English language sentences, my sequences had only 20 possible symbols per position rather than 50-100k.

    One interesting feature of their network that may be useful to you is their word embeddings. Word embeddings map individual words (each of which can be considered an indicator vector of length 100k) to real-valued vectors of length 50. Euclidean distance between the embedded vectors should reflect semantic distance between words, so this could help you detect synonyms.
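
    As a minimal sketch of the idea, the following ranks words by Euclidean distance between their embedding vectors. The three-dimensional vectors and the names `EMBEDDINGS` and `nearest_words` are invented for illustration; real embeddings such as the 50-dimensional ones mentioned above would be loaded from a file instead.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "big":   [0.90, 0.10, 0.00],
    "large": [0.85, 0.15, 0.05],
    "small": [-0.80, 0.10, 0.00],
    "apple": [0.00, 0.90, 0.30],
    "pear":  [0.05, 0.85, 0.35],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_words(word, embeddings, k=1):
    """Return the k words whose vectors are closest to the vector of `word`."""
    target = embeddings[word]
    ranked = sorted(
        (euclidean(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    )
    return [other for _, other in ranked[:k]]
```

    Note that nearness in an embedding space reflects similar usage rather than strict synonymy, so candidates found this way (antonyms are a common false positive) would still need filtering.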

    For a simpler approach, WordNet (http://wordnet.princeton.edu/) provides lists of synonyms, but I have never used it myself.

  • 2021-02-03 14:35

    There is an unsupervised bootstrapping approach to this that was once explained to me.

    There are different ways of applying this approach, and variants, but here's a simplified version.

    Concept:

    Start by assuming that if two words are synonyms, then they will appear in similar settings in your corpus (eating grapes, eating sandwich, etc.).

    (In this variant I will use co-occurrence as the setting.)

    Bootstrapping Algorithm:

    We have two lists,

    • one list will contain the words that co-occur with food items
    • one list will contain the words that are food items

    Supervised Part

    Start by seeding one of the lists, for instance I might write the word Apple on the food items list.

    Now let the computer take over.

    Unsupervised Parts

    It will first find all words in the corpus that appear just before Apple, and sort them by frequency, most frequent first.

    Take the top two (or however many you want) and add them into the co-occur with food items list. For example, perhaps "eating" and "Delicious" are the top two.

    Now use that list to find the next two top food words by ranking the words that appear to the right of each word in the list.

    Continue this process expanding each list until you are happy with the results.

    Once that's done, you have your two lists (you may need to manually remove clearly wrong entries from the lists as you go).
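
    The loop described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: sentences are tokenized by splitting on whitespace, "setting" means the single word immediately before an item, and the function name `bootstrap` and its parameters are made up for this example.

```python
from collections import Counter

def bootstrap(sentences, seed_items, rounds=2, top_n=2):
    """Alternately grow a context-word list and an item list from seed items."""
    tokens = [s.lower().split() for s in sentences]
    items = list(seed_items)   # e.g. food items, seeded by hand
    contexts = []              # words that co-occur with the items
    for _ in range(rounds):
        # Rank new words that appear just before a known item.
        before = Counter(
            sent[i - 1]
            for sent in tokens
            for i in range(1, len(sent))
            if sent[i] in items and sent[i - 1] not in contexts
        )
        contexts += [w for w, _ in before.most_common(top_n)]
        # Rank new words that appear just after a known context word.
        after = Counter(
            sent[i + 1]
            for sent in tokens
            for i in range(len(sent) - 1)
            if sent[i] in contexts and sent[i + 1] not in items
        )
        items += [w for w, _ in after.most_common(top_n)]
    return contexts, items
```

    Calling `bootstrap(corpus, ["apple"])` on a list of sentences returns the two grown lists. With a real corpus, very frequent words such as "the" would quickly dominate the context list, so some stop-word filtering or the manual pruning mentioned above would be needed.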

    Variants

    This procedure can be made quite effective if you take into account the grammatical setting of the keywords.

    Subj ate NounPhrase
    NounPhrase are/is Moldy
    
    The workers harvested the Apples. 
       subj       verb     Apples 
    
    That might imply harvested is an important verb for distinguishing foods.
    
    Then look for other occurrences of subj harvested nounPhrase
    

    You can expand this process to sort words into several categories at each step, instead of a single category.
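
    For the grammatical-setting variant, a real system would use a parser to find the subject, verb and noun phrase. As a crude stand-in, the sketch below uses a regular expression to collect the word that follows a given verb (optionally after an article). The function name `objects_of_verb` is hypothetical, and the regex is an assumption made for illustration, not how the Utah system worked.

```python
import re
from collections import Counter

def objects_of_verb(sentences, verb):
    """Count the word following '<verb> [the|a|an]' as a rough proxy for its object."""
    pattern = re.compile(
        r"\b" + re.escape(verb) + r"\s+(?:(?:the|a|an)\s+)?(\w+)",
        re.IGNORECASE,
    )
    counts = Counter()
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts
```

    Words that turn up often after "harvested" become candidate food items, and the verbs that precede known food items become candidate patterns in turn.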

    My Source

    This approach was used in a system developed at the University of Utah a few years back, which was successful at compiling a decent list of weapon words, victim words, and place words just by looking at news articles.

    It is an interesting approach, and it had good results.

    Not a neural network approach, but an intriguing methodology.

    Edit:

    The system at the University of Utah was called AutoSlog-TS, and a short slide about it can be seen here towards the end of the presentation. A paper about it is linked here.

  • 2021-02-03 14:40

    You could try LDA, which is unsupervised. There is also a supervised version of LDA, but I can't remember the name. Stanford parser will have the algorithm, which you can play around with. I understand it's not the NN approach you are looking for, but if you just want to group information together, LDA would seem appropriate, especially if you are looking for 'topics'.

  • 2021-02-03 14:44

    I'm not sure if I misunderstand your question. Do you require the system to be able to reason based on your input data alone, or would it be acceptable to refer to an external dictionary?

    If it is acceptable, I would recommend taking a look at http://wordnet.princeton.edu/ which is a database of English word relationships. (It also exists for a few other languages.) These relationships include synonyms, antonyms, hyperonyms (which is what you really seem to be looking for, rather than synonyms), hyponyms, etc.

    The hyperonym / hyponym relationship links more generic terms to more specific ones (see http://en.wikipedia.org/wiki/Hyponymy). The words "banana" and "orange" are hyponyms of "fruit", and "fruit" is a hyperonym of both. Of course, "orange" is ambiguous, and is also a hyponym of "color".

    You asked for a method, but I can only point you to data. Even if this turns out to be useful, you will obviously need quite a bit of work to use it for your particular application. For one thing, how do you know when you have reached a suitable level of abstraction? Unless your input is heavily normalized, you will have a mix of generic and specific terms. Do you stop at "citrus", "fruit", "plant", "animate", "concrete", or "noun"? (Sorry, I just made up this particular hierarchy.) Still, I hope this helps.
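
    To make the abstraction-level question concrete, here is a small sketch that walks up a hyperonym chain until it reaches a label you have chosen to stop at. The `HYPERNYM` table is a made-up toy, just like the hierarchy above, and `generalize` is a hypothetical helper; WordNet's real hierarchy is larger and lets a word have several parents.

```python
# Toy hyperonym links, invented for illustration; not WordNet's real hierarchy.
HYPERNYM = {
    "orange": "citrus",
    "lemon": "citrus",
    "citrus": "fruit",
    "banana": "fruit",
    "fruit": "plant",
}

def hypernym_chain(word, hypernym):
    """Return the word followed by its successively more generic terms."""
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def generalize(word, hypernym, stop_labels):
    """Replace a word by the first term on its chain found in stop_labels."""
    for term in hypernym_chain(word, hypernym):
        if term in stop_labels:
            return term
    return word  # no chosen label on the chain: leave the word unchanged
```

    With `stop_labels={"fruit"}`, both "lemon" and "banana" map to "fruit", while unknown words pass through untouched. Because each word has a single parent here, the sketch ignores ambiguity such as "orange" the color.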
