I'm working on a non-English parser with Unicode characters. For that, I decided to use NLTK.
But it requires a predefined context-free grammar as below:
NLTK doesn't let you write rules like that out of the box, but there are workarounds.
One option is to transcribe your sentence into word-informative labels and write your grammar rules against those labels.
For example (using the POS tag as the label):
Dogs eat bones.
becomes:
NN V NN.
And an example of a terminal rule in the grammar:
V -> 'V'
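A minimal sketch of this trick, assuming NLTK 3 and a toy grammar made up for illustration (the rules `S`, `NP` and `VP` are not from the original post). The terminals are the POS labels themselves, so the grammar never has to list actual words:

```python
import nltk

# Terminals are POS labels, not words, so the grammar stays small
# regardless of how large (or non-English) the vocabulary is.
# The phrase-structure rules here are illustrative assumptions.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> NN
NN -> 'NN'
V -> 'V'
""")

# "Dogs eat bones" transcribed to its labels (done by hand here).
labels = "NN V NN".split()

parser = nltk.ChartParser(grammar)
trees = list(parser.parse(labels))
for tree in trees:
    print(tree)
```

The resulting tree contains labels rather than words, so you would still need to map the leaves back to the original tokens by position.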
If that's not enough, you should look for an implementation of a more flexible formalism.
If you are creating a parser, then you have to add a step of pos-tagging before the actual parsing -- there is no way to successfully determine the POS-tag of a word out of context. For example, 'closed' can be an adjective or a verb; a POS-tagger will find out the correct tag for you from the context of the word. Then you can use the output of the POS-tagger to create your CFG.
You can use one of the many existing POS-taggers. In NLTK, you can simply do something like:
import nltk
input_sentence = "Dogs chase cats"
text = nltk.word_tokenize(input_sentence)
list_of_tokens = nltk.pos_tag(text)
print(list_of_tokens)
The output will be something like this (the exact tags depend on the tagger model):
[('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]
which you can use to create a grammar string and feed it to nltk.parse_cfg() (renamed nltk.CFG.fromstring() in NLTK 3).
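One possible way to turn tagger output into the lexical part of a grammar string is sketched below. The helper name `lexical_rules` and the single `S` rule are hypothetical, chosen only for illustration. The code is plain Python and only builds the string:

```python
def lexical_rules(tagged_tokens):
    """Build CFG lexical rules (TAG -> "word" | ...) from (word, tag) pairs."""
    words_by_tag = {}
    for word, tag in tagged_tokens:
        words = words_by_tag.setdefault(tag, [])
        if word not in words:  # avoid duplicate alternatives
            words.append(word)
    lines = []
    for tag, words in words_by_tag.items():
        alternatives = " | ".join('"{}"'.format(w) for w in words)
        lines.append("{} -> {}".format(tag, alternatives))
    return "\n".join(lines)


# Tagger output as shown above; the S rule is a made-up example.
tagged = [('Dogs', 'NN'), ('chase', 'VB'), ('cats', 'NN')]
grammar_string = "S -> NN VB NN\n" + lexical_rules(tagged)
print(grammar_string)
# S -> NN VB NN
# NN -> "Dogs" | "cats"
# VB -> "chase"
```

The resulting string can then be passed to the grammar constructor. Words that contain a double quote, or tags with characters such as `$`, may need escaping or renaming first.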
Maybe you're looking for CFG.fromstring() (formerly parse_cfg())?
From Chapter 8 of the NLTK book (updated to NLTK 3.0):
>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... S -> NP VP
... VP -> V NP | V NP PP
... V -> "saw" | "ate"
... NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
... Det -> "a" | "an" | "the" | "my"
... N -> "dog" | "cat" | "cookie" | "park"
... PP -> P NP
... P -> "in" | "on" | "by" | "with"
... """)
>>> sent = "Mary saw Bob".split()
>>> rd_parser = nltk.RecursiveDescentParser(grammar)
>>> for p in rd_parser.parse(sent):
...     print(p)
(S (NP Mary) (VP (V saw) (NP Bob)))
You can use NLTK's RegexpTagger, which assigns a tag to each token based on regular expressions. This is exactly what you need in your case: tokens ending in 'ing' will be tagged as gerunds, and tokens ending in 'ed' will be tagged as past-tense verbs. See the example below.
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*es$', 'VBZ'),    # 3rd singular present
    (r'.*ould$', 'MD'),   # modals
    (r'.*\'s$', 'NN$'),   # possessive nouns
    (r'.*s$', 'NNS')      # plural nouns
]
Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence. After this step, it is correct about a fifth of the time.
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(your_sent)  # your_sent must be a list of tokens, not a raw string
You can also combine several taggers in sequence (see "Combining Taggers" in the NLTK book), so that each tagger falls back on the next one when it cannot tag a token.
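As a rough sketch of combining taggers, the regular-expression tagger can be given a backoff tagger that handles every token no pattern matches. The shortened pattern list, the sample sentence and the choice of 'NN' as the default tag are assumptions made for illustration:

```python
import nltk

# A shortened pattern list; the first matching pattern wins.
patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*s$', 'NNS'),    # plural nouns
]

# Tokens that match no pattern are passed to the backoff tagger,
# which here tags everything as 'NN' (an arbitrary default).
default_tagger = nltk.DefaultTagger('NN')
regexp_tagger = nltk.RegexpTagger(patterns, backoff=default_tagger)

tagged = regexp_tagger.tag("dogs chased the running cat".split())
print(tagged)
```

Without the backoff, unmatched tokens such as "the" would be tagged None instead of 'NN'.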