For example, "parrots do not swim." Here the main verb is "swim". How can we extract that by language processing? Are there any known algorithms for this purpose?
You can run a dependency parsing algorithm on the sentence and then find the dependent of the root relation. For example, running the sentence "Parrots do not swim" through the Stanford Parser online demo, I get the following dependencies:
nsubj(swim-4, Parrots-1)
aux(swim-4, do-2)
neg(swim-4, not-3)
root(ROOT-0, swim-4)
Each of these lines provides information about a different grammatical relation between two words in the sentence (see below). You need the last line, which says that swim is the root of the sentence, i.e. the main verb. So to extract the main verb, perform dependency parsing first and find the dependency that reads root(ROOT-0, X). X will be the main verb.
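As a minimal sketch of that last step, assuming you already have the dependencies as plain-text lines in the format shown above (the function name find_root and the regular expression are my own illustration, not part of any parser's API):

```python
import re

# Matches lines such as "root(ROOT-0, swim-4)" and captures the dependent
# word and its position. Assumption: the dependencies arrive as plain-text
# lines in the typed-dependency format shown above.
ROOT_PATTERN = re.compile(r"^root\(ROOT-0,\s*(.+)-(\d+)\)$")


def find_root(dependency_lines):
    """Return (word, position) for the root dependency, or None if absent."""
    for line in dependency_lines:
        match = ROOT_PATTERN.match(line.strip())
        if match:
            return match.group(1), int(match.group(2))
    return None


deps = [
    "nsubj(swim-4, Parrots-1)",
    "aux(swim-4, do-2)",
    "neg(swim-4, not-3)",
    "root(ROOT-0, swim-4)",
]
print(find_root(deps))  # ('swim', 4)
```

If your parser gives you structured output instead of these text lines, the same idea applies: look for the one relation whose head is the artificial ROOT node.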
There are several readily available dependency parsers, such as the one that ships with Stanford CoreNLP, or MaltParser. I prefer Stanford because its accuracy is comparable, but it has better documentation and supports multithreaded parsing (useful if you have lots of text). The Stanford parser outputs XML, so you will have to parse that to get the dependency information above.
For the sake of completeness, here is a brief explanation of the rest of the output. The first line says that Parrots, the first word in the sentence, is the subject of swim, the 4th word. The second line says that do is an auxiliary verb related to swim, and the third says that not negates swim. For a more detailed explanation of the meaning of each dependency, see the Stanford typed dependency manual.
Edit:
Depending on how you define "main verb", some sentences may have more than one main verb, e.g. "I like cats and hate snakes". The dependency parse for this contains the dependencies:
root(ROOT-0, like-2)
conj(like-2, hate-5)
which together say that, according to the parser, the main verb is like, but hate is conjoined to it. For your purposes you might want to consider both like and hate to be main.
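A rough sketch of how you could collect both, again assuming plain-text dependency lines (main_verbs and the pattern are hypothetical names of mine; the prefix check on "conj" is an assumption to also cover collapsed labels such as conj_and):

```python
import re

# Generic pattern for "relation(head-i, dependent-j)" lines.
# Assumption: input is the plain-text typed-dependency format shown above.
DEP_PATTERN = re.compile(r"^([\w:]+)\((.+)-(\d+),\s*(.+)-(\d+)\)$")


def main_verbs(dependency_lines):
    """Return the root word followed by any words conjoined directly to it."""
    parsed = []
    for line in dependency_lines:
        m = DEP_PATTERN.match(line.strip())
        if m:
            rel, head, head_idx, dep, dep_idx = m.groups()
            parsed.append((rel, head, int(head_idx), dep, int(dep_idx)))

    roots = [(dep, idx) for rel, _, _, dep, idx in parsed if rel == "root"]
    if not roots:
        return []
    root_word, root_idx = roots[0]

    # Keep only conjuncts whose head is the root itself.
    conjuncts = [
        dep
        for rel, _, head_idx, dep, _ in parsed
        if rel.startswith("conj") and head_idx == root_idx
    ]
    return [root_word] + conjuncts


deps = [
    "root(ROOT-0, like-2)",
    "conj(like-2, hate-5)",
]
print(main_verbs(deps))  # ['like', 'hate']
```

This only follows conjunctions attached directly to the root; whether that matches your definition of main verb is up to you.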
To get the verb (or any other part of speech), there are many supervised and unsupervised approaches available, such as the Viterbi algorithm, hidden Markov models, the Brill tagger, Constraint Grammar, etc. There are also libraries such as NLTK (Natural Language Toolkit) for Python, and similar ones for Java, which already implement these algorithms. Annotating POS in a document or sentence is a complex job, especially when you want high accuracy, and it requires in-depth knowledge of the field. Begin with the very basics; with continued effort you might develop an algorithm that performs better than the prevailing ones.
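To give a feel for what such a tagger does, here is a toy sketch of Viterbi decoding over a hidden Markov model. Everything in it is an assumption for illustration: the four-tag set and all the probabilities are hand-picked, whereas a real tagger estimates them from an annotated corpus.

```python
import math

TAGS = ["NOUN", "AUX", "ADV", "VERB"]

# Hypothetical, hand-picked probabilities for illustration only.
START = {"NOUN": 0.7, "AUX": 0.1, "ADV": 0.1, "VERB": 0.1}
TRANS = {
    "NOUN": {"NOUN": 0.1, "AUX": 0.5, "ADV": 0.1, "VERB": 0.3},
    "AUX": {"NOUN": 0.1, "AUX": 0.1, "ADV": 0.4, "VERB": 0.4},
    "ADV": {"NOUN": 0.1, "AUX": 0.1, "ADV": 0.1, "VERB": 0.7},
    "VERB": {"NOUN": 0.4, "AUX": 0.1, "ADV": 0.4, "VERB": 0.1},
}
EMIT = {
    "NOUN": {"parrots": 0.9, "swim": 0.1},
    "AUX": {"do": 1.0},
    "ADV": {"not": 1.0},
    "VERB": {"swim": 0.7, "do": 0.3},
}
SMALL = 1e-6  # crude smoothing for unseen (tag, word) pairs


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # best[t][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        p = START[tag] * EMIT[tag].get(words[0], SMALL)
        best[0][tag] = (math.log(p), None)

    for t in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[t], SMALL))
            score, prev = max(
                (best[t - 1][p][0] + math.log(TRANS[p][tag]) + emit, p)
                for p in TAGS
            )
            best[t][tag] = (score, prev)

    # Backtrack from the best final tag.
    path = [max(TAGS, key=lambda tag: best[-1][tag][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path


words = "parrots do not swim".split()
tags = viterbi(words)
print(list(zip(words, tags)))
print([w for w, t in zip(words, tags) if t == "VERB"])  # ['swim']
```

With these made-up numbers, do comes out as an auxiliary and swim as the only verb. In practice you would use a trained tagger from a library such as NLTK rather than writing the tables by hand.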