How to match integers in NLTK CFG?

醉话见心 2021-01-23 07:13

If I want to define a grammar in which one of the tokens matches an integer, how can I achieve that using NLTK's string CFG?

For example -

S -> SK S         


        
2 Answers
  • 2021-01-23 07:34

    A simple solution is to define a function that builds a parser from the grammar and the sentence. It handles integers by augmenting the grammar on each call with an INT production for every integer token in the sentence. Here is an example function:

    import nltk

    def name_parser(G, sent):
        # Integer tokens found in the (already tokenized) sentence.
        ints = [i for i in sent if i.isdigit()]
        # Copy the grammar's productions and add INT -> '<token>' for each integer.
        lproductions = list(G.productions())
        lproductions.extend([nltk.grammar.Production(nltk.grammar.Nonterminal('INT'), [i]) for i in ints])
        lgrammar = nltk.grammar.CFG(G.start(), lproductions)
        parser = nltk.ChartParser(lgrammar)
        for tree in parser.parse(sent):
            print(tree)
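
As a usage sketch of the idea above: the one-rule toy grammar, the sentence, and the helper name `parse_with_ints` are assumptions made up for illustration and do not come from the question. The base grammar refers to `INT` without defining it, and the helper returns the trees instead of printing them.

```python
import nltk
from nltk.grammar import CFG, Nonterminal, Production

def parse_with_ints(grammar, tokens):
    """Parse `tokens` after adding INT -> '<token>' for each integer token."""
    # Deduplicate so a repeated integer adds only one production.
    ints = sorted({t for t in tokens if t.isdigit()})
    productions = list(grammar.productions())
    productions.extend(Production(Nonterminal('INT'), [t]) for t in ints)
    augmented = CFG(grammar.start(), productions)
    return list(nltk.ChartParser(augmented).parse(tokens))

# Hypothetical toy grammar: INT has no productions until the helper adds them.
base = CFG.fromstring("""
S -> 'add' INT 'and' INT
""")

for tree in parse_with_ints(base, 'add 3 and 42'.split()):
    print(tree)
```

Because the grammar is rebuilt on every call, this suits occasional parsing; for many sentences, a fixed placeholder token avoids the rebuild.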
    
    
  • 2021-01-23 07:45

    Add a number category (NUM) to the grammar, like this:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
    """)
    
    sent = 'I shot 3 elephants'.split()
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        print(tree)
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))
    

    But note that this grammar only handles the numbers it lists explicitly ('0' through '10'). To cover arbitrary integers, try collapsing every integer into a single placeholder token, e.g. '#NUM#', before parsing:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '#NUM#'
    """)
    
    sent = 'I shot 333 elephants'.split()
    sent = ['#NUM#' if i.isdigit() else i for i in sent]
    
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        print(tree)
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))
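
One caveat with `str.isdigit()` is that it rejects signed tokens such as '-3'. A minimal sketch of a regex-based masking step follows, assuming signed ASCII integers should also count as numbers; the helper name `mask_integers` and the pattern are illustrative choices, not part of NLTK.

```python
import re

# Optional sign followed by one or more ASCII digits (an assumed definition of "integer").
INT_RE = re.compile(r'[+-]?[0-9]+')

def mask_integers(tokens, placeholder='#NUM#'):
    """Return (masked_tokens, numbers), where numbers keeps the originals in order."""
    numbers = [t for t in tokens if INT_RE.fullmatch(t)]
    masked = [placeholder if INT_RE.fullmatch(t) else t for t in tokens]
    return masked, numbers

masked, numbers = mask_integers('I shot -3 elephants in 2021'.split())
print(masked)
print(numbers)
```

The masked list is what goes to the parser; the second list keeps the original values so they can be restored afterwards.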
    

    To put the original numbers back into the printed tree, replace the placeholders one at a time, in sentence order:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '#NUM#'
    """)
    
    original_sent = 'I shot 333 elephants'.split()
    sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
    numbers = [i for i in original_sent if i.isdigit()]
    
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        treestr = str(tree)
        for n in numbers:
            treestr = treestr.replace('#NUM#', n, 1)
        print(treestr)
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
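
The string replacement above returns text rather than a tree. As an alternative sketch, the original tokens can be written back into the tree's leaves so the result is still an `nltk.Tree`. The trimmed grammar and the helper name `parse_masked` are assumptions for illustration; the approach relies on the leaves appearing in the same left-to-right order as the input tokens.

```python
import nltk

# Trimmed version of the grammar above, kept small for the example.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | NUM N
VP -> V NP
N -> 'elephants'
V -> 'shot'
NUM -> '#NUM#'
""")

def parse_masked(grammar, tokens, placeholder='#NUM#'):
    """Parse with integers masked, then restore the original tokens as leaves."""
    masked = [placeholder if t.isdigit() else t for t in tokens]
    restored = []
    for tree in nltk.ChartParser(grammar).parse(masked):
        tree = tree.copy(deep=True)  # avoid mutating subtrees shared between parses
        # Leaf positions are listed left to right, so they line up with `tokens`.
        for pos, token in zip(tree.treepositions('leaves'), tokens):
            tree[pos] = token
        restored.append(tree)
    return restored

trees = parse_masked(grammar, 'I shot 333 elephants'.split())
for tree in trees:
    print(tree)
```

Keeping the result as a tree means the restored numbers remain available to later processing, for example via `tree.leaves()` or `tree.subtrees()`.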
    