If I want to define a grammar in which one of the tokens will match an integer, how can I achieve it using nltk's string CFG?
For example -
S -> SK S
A simple solution is to define a function that builds a parser from the sentence and the grammar. It handles integers by augmenting the grammar on each call with an INT production for every integer token in the sentence. Here is an example function:
import nltk

def name_parser(G, sent):
    # Collect the integer tokens that appear in this sentence
    ints = [i for i in sent if i.isdigit()]
    # Copy the grammar's productions and add one INT -> '<n>' rule per integer
    lproductions = list(G.productions())
    lproductions.extend([nltk.grammar.Production(nltk.grammar.Nonterminal('INT'), [i]) for i in ints])
    lgrammar = nltk.grammar.CFG(G.start(), lproductions)
    parser = nltk.ChartParser(lgrammar)
    for tree in parser.parse(sent):
        print(tree)
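Since the question asks about the string form of the grammar, the same idea can also be sketched at the string level: build an INT rule from the integers in the sentence and append it to the grammar text before parsing. This is only a minimal sketch, assuming the grammar refers to a nonterminal named INT; `int_rule` and `base_grammar` are hypothetical names used for illustration.

```python
def int_rule(sent, nonterminal="INT"):
    """Build a CFG rule string listing every integer token in `sent`."""
    # Deduplicate while keeping first-seen order
    ints = list(dict.fromkeys(t for t in sent if t.isdigit()))
    if not ints:
        return ""  # no integers, so there is no rule to add
    return nonterminal + " -> " + " | ".join("'" + t + "'" for t in ints)

# Hypothetical grammar text that uses INT but does not define it
base_grammar = """
S -> NP VP
NP -> 'I' | INT N
VP -> V NP
N -> 'elephants'
V -> 'shot'
"""

sent = 'I shot 3 elephants'.split()
grammar_text = base_grammar + int_rule(sent) + "\n"
print(grammar_text)
```

The resulting text can then be passed to nltk.CFG.fromstring, and the grammar to nltk.ChartParser, in the same way as in the function above.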
Create a number phrase like so:
import nltk
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")
sent = 'I shot 3 elephants'.split()
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)
[out]:
(S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))
But note that this can only handle the numbers listed explicitly in the grammar (here, 0 through 10). So let's try collapsing all integers into a single token type, e.g. '#NUM#':
import nltk
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")
sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in sent]
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)
[out]:
(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))
To put the numbers back, try:
import nltk
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")
original_sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
numbers = [i for i in original_sent if i.isdigit()]
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    treestr = str(tree)
    for n in numbers:
        treestr = treestr.replace('#NUM#', n, 1)
    print(treestr)
[out]:
(S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
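When a sentence contains more than one number, the replacement above relies on the placeholders appearing in the tree string in the same order as the numbers in the original sentence. The masking and restoring steps can be pulled out into two small helpers to make that explicit. This is a rough sketch in plain Python; `mask_numbers` and `restore_numbers` are hypothetical names, and the tree string below is written by hand for illustration rather than produced by a parser.

```python
def mask_numbers(tokens, placeholder="#NUM#"):
    """Return (masked_tokens, numbers), with numbers in sentence order."""
    numbers = [t for t in tokens if t.isdigit()]
    masked = [placeholder if t.isdigit() else t for t in tokens]
    return masked, numbers

def restore_numbers(text, numbers, placeholder="#NUM#"):
    """Replace placeholders in `text` left to right with the saved numbers."""
    for n in numbers:
        # count=1 swaps only the first remaining placeholder
        text = text.replace(placeholder, n, 1)
    return text

masked, numbers = mask_numbers('I shot 333 elephants in 2 pajamas'.split())
print(masked)

# Hand-written stand-in for str(tree), containing two placeholders
treestr = "(NP (NUM #NUM#) (N elephants)) (NP (NUM #NUM#) (N pajamas))"
print(restore_numbers(treestr, numbers))
```

In the parsing code, `masked` would be passed to parser.parse and `restore_numbers(str(tree), numbers)` would replace the inner loop.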