How to match integers in NLTK CFG?

醉话见心 2021-01-23 07:13

If I want to define a grammar in which one of the tokens matches an integer, how can I achieve that using NLTK's string CFG?

For example:

    S -> SK S

2 Answers
  •  旧巷少年郎
    2021-01-23 07:45

    Add a NUM nonterminal that lists the integers as terminals, and use it inside a noun phrase:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
    """)
    
    sent = 'I shot 3 elephants'.split()
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        print(tree)
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))
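
When the integers fall in a known, bounded range, the NUM alternatives do not have to be typed by hand. The following is a minimal sketch, assuming an upper bound of 999 and a trimmed-down version of the grammar above; `num_rule` is a hypothetical helper, not part of NLTK.

```python
import nltk

def num_rule(max_value):
    """Build a 'NUM -> ...' production listing every integer from 0 to max_value."""
    alternatives = " | ".join(f"'{i}'" for i in range(max_value + 1))
    return f"NUM -> {alternatives}"

# A reduced grammar for illustration; the NUM production is appended as text.
grammar_text = """
S -> NP VP
NP -> 'I' | NUM N
VP -> V NP
N -> 'elephants'
V -> 'shot'
""" + num_rule(999) + "\n"

grammar = nltk.CFG.fromstring(grammar_text)
parser = nltk.ChartParser(grammar)
trees = list(parser.parse('I shot 333 elephants'.split()))
for tree in trees:
    print(tree)
```

This keeps the real number in the tree, but the grammar grows with the range, so it only suits small, fixed ranges.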
    

    But note that this grammar only accepts the integers listed explicitly under NUM (here, 0 through 10). To handle arbitrary integers, collapse every integer into a single placeholder token, e.g. '#NUM#', before parsing:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '#NUM#'
    """)
    
    sent = 'I shot 333 elephants'.split()
    # Replace every integer token with the placeholder terminal.
    sent = ['#NUM#' if i.isdigit() else i for i in sent]
    
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        print(tree)
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))
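
One caveat with `str.isdigit()`: it rejects signed tokens such as '-12', and it accepts some non-ASCII digit characters such as '²'. If that matters, a stricter check can be sketched with a regular expression. The names `INT_RE` and `mask_integers` below are hypothetical, and the pattern assumes an integer is an optional sign followed by ASCII digits.

```python
import re

# Assumption: an integer token is an optional sign followed by ASCII digits.
INT_RE = re.compile(r"[+-]?[0-9]+")

def mask_integers(tokens, placeholder="#NUM#"):
    """Return (masked tokens, original integer tokens in order)."""
    masked, numbers = [], []
    for tok in tokens:
        if INT_RE.fullmatch(tok):
            masked.append(placeholder)
            numbers.append(tok)
        else:
            masked.append(tok)
    return masked, numbers

masked, numbers = mask_integers("I shot -12 elephants in 2021".split())
print(masked)
print(numbers)
```

The masked list can be handed to the parser as before, and the second list keeps the original values for later.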
    

    To put the original numbers back into the parse, keep them in a list and substitute them for the placeholders in order:

    import nltk
    
    groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I' | NUM N
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas' | 'elephants'
    V -> 'shot'
    P -> 'in'
    NUM -> '#NUM#'
    """)
    
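    original_sent = 'I shot 333 elephants'.split()
    # Mask every integer token so it matches the single '#NUM#' terminal.
    sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
    # Remember the masked integers in their original order.
    numbers = [i for i in original_sent if i.isdigit()]
    
    parser = nltk.ChartParser(groucho_grammar)
    for tree in parser.parse(sent):
        treestr = str(tree)
        # Replace the placeholders left to right, one number at a time.
        for n in numbers:
            treestr = treestr.replace('#NUM#', n, 1)
        print(treestr)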
    

    [out]:

    (S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
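
The string replacement above yields text rather than a tree. If a `Tree` object is needed afterwards, the placeholders can instead be swapped on the leaves themselves. This is a rough sketch, not the answer's method: `restore_numbers` is a hypothetical helper, and the tree is built with `Tree.fromstring` here only to keep the example short.

```python
from nltk import Tree

def restore_numbers(tree, numbers, placeholder="#NUM#"):
    """Return a copy of tree whose placeholder leaves hold the original numbers."""
    restored = tree.copy(deep=True)
    remaining = iter(numbers)
    # Leaf positions come back left to right, matching the order of `numbers`.
    for pos in restored.treepositions("leaves"):
        if restored[pos] == placeholder:
            restored[pos] = next(remaining)
    return restored

tree = Tree.fromstring("(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))")
restored = restore_numbers(tree, ["333"])
print(restored)
```

Working on leaves also avoids accidentally replacing a '#NUM#' substring that appears anywhere else in the printed tree.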
    
