What does “fragment” mean in ANTLR?

后端 未结 4 1442
太阳男子
太阳男子 2020-11-27 04:21

What does fragment mean in ANTLR?

I\'ve seen both rules:

fragment DIGIT : \'0\'..\'9\';

and

DIGIT : \'0\'         


        
相关标签:
4条回答
  • 2020-11-27 04:23

    A fragment is somewhat akin to an inline function: It makes the grammar more readable and easier to maintain.

    A fragment will never be counted as a token, it only serves to simplify a grammar.

    Consider:

    NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
    fragment DIGITS: '1'..'9' '0'..'9'*;
    fragment OCTAL_DIGITS: '0' '0'..'7'+;
    fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
    

    In this example, matching a NUMBER will always return a NUMBER to the lexer, regardless of if it matched "1234", "0xab12", or "0777".

    See item 3

    0 讨论(0)
  • 2020-11-27 04:23

    This blog post has a very clear example where fragment makes a significant difference:

    grammar number;  
    
    number: INT;  
    DIGIT : '0'..'9';  
    INT   :  DIGIT+;
    

    The grammar will recognize '42' but not '7'. You can fix it by making digit a fragment (or moving DIGIT after INT).

    0 讨论(0)
  • 2020-11-27 04:24

    The Definitive ANTLR 4 Reference (Page 106):

    Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.


    Abstract Concepts:

    Case1: ( if I need the RULE1, RULE2, RULE3 entities or group info )

    rule0 : RULE1 | RULE2 | RULE3 ;
    RULE1 : [A-C]+ ;
    RULE2 : [DEF]+ ;
    RULE3 : ('G'|'H'|'I')+ ;
    


    Case2: ( if I don't care RULE1, RULE2, RULE3, I just focus on RULE0 )

    RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
    // RULE0 is a terminal node. 
    // You can't name it 'rule0', or you will get syntax errors:
    // 'A-C' came as a complete surprise to me while matching alternative
    // 'DEF' came as a complete surprise to me while matching alternative
    


    Case3: ( is equivalent to Case2, making it more readable than Case2)

    RULE0 : RULE1 | RULE2 | RULE3 ;
    fragment RULE1 : [A-C]+ ;
    fragment RULE2 : [DEF]+ ;
    fragment RULE3 : ('G'|'H'|'I')+ ;
    // You can't name it 'rule0', or you will get warnings:
    // warning(125): implicit definition of token RULE1 in parser
    // warning(125): implicit definition of token RULE2 in parser
    // warning(125): implicit definition of token RULE3 in parser
    // and failed to capture rule0 content (?)
    


    Differences between Case1 and Case2/3 ?

    1. The lexer rules are equivalent
    2. Each of RULE1/2/3 in Case1 is a capturing group, similar to Regex:(X)
    3. Each of RULE1/2/3 in Case3 is a non-capturing group, similar to Regex:(?:X)



    Let's see a concrete example.

    Goal: identify [ABC]+, [DEF]+, [GHI]+ tokens

    input.txt

    ABBCCCDDDDEEEEE ABCDE
    FFGGHHIIJJKK FGHIJK
    ABCDEFGHIJKL
    


    Main.py

    import sys
    from antlr4 import *
    from AlphabetLexer import AlphabetLexer
    from AlphabetParser import AlphabetParser
    from AlphabetListener import AlphabetListener
    
    class MyListener(AlphabetListener):
        # Exit a parse tree produced by AlphabetParser#content.
        def exitContent(self, ctx:AlphabetParser.ContentContext):
            pass
    
        # (For Case1 Only) enable it when testing Case1
        # Exit a parse tree produced by AlphabetParser#rule0.
        def exitRule0(self, ctx:AlphabetParser.Rule0Context):
            print(ctx.getText())
    # end-of-class
    
    def main():
        file_name = sys.argv[1]
        input = FileStream(file_name)
        lexer = AlphabetLexer(input)
        stream = CommonTokenStream(lexer)
        parser = AlphabetParser(stream)
        tree = parser.content()
        print(tree.toStringTree(recog=parser))
    
        listener = MyListener()
        walker = ParseTreeWalker()
        walker.walk(listener, tree)
    # end-of-def
    
    main()
    


    Case1 and results:

    Alphabet.g4 (Case1)

    grammar Alphabet;
    
    content : (rule0|ANYCHAR)* EOF;
    
    rule0 : RULE1 | RULE2 | RULE3 ;
    RULE1 : [A-C]+ ;
    RULE2 : [DEF]+ ;
    RULE3 : ('G'|'H'|'I')+ ;
    
    ANYCHAR : . -> skip;
    

    Result:

    # Input data (for reference)
    # ABBCCCDDDDEEEEE ABCDE
    # FFGGHHIIJJKK FGHIJK
    # ABCDEFGHIJKL
    
    $ python3 Main.py input.txt 
    (content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
    ABBCCC
    DDDDEEEEE
    ABC
    DE
    FF
    GGHHII
    F
    GHI
    ABC
    DEF
    GHI
    


    Case2/3 and results:

    Alphabet.g4 (Case2)

    grammar Alphabet;
    
    content : (RULE0|ANYCHAR)* EOF;
    
    RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
    
    ANYCHAR : . -> skip;
    

    Alphabet.g4 (Case3)

    grammar Alphabet;
    
    content : (RULE0|ANYCHAR)* EOF;
    
    RULE0 : RULE1 | RULE2 | RULE3 ;
    fragment RULE1 : [A-C]+ ;
    fragment RULE2 : [DEF]+ ;
    fragment RULE3 : ('G'|'H'|'I')+ ;
    
    ANYCHAR : . -> skip;
    

    Result:

    # Input data (for reference)
    # ABBCCCDDDDEEEEE ABCDE
    # FFGGHHIIJJKK FGHIJK
    # ABCDEFGHIJKL
    
    $ python3 Main.py input.txt 
    (content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)
    

    Did you see "capturing groups" and "non-capturing groups" parts?




    Let's see the concrete example2.

    Goal: identify octal / decimal / hexadecimal numbers

    input.txt

    0
    123
     1~9999
     001~077
    0xFF, 0x01, 0xabc123
    


    Number.g4

    grammar Number;
    
    content
        : (number|ANY_CHAR)* EOF
        ;
    
    number
        : DECIMAL_NUMBER
        | OCTAL_NUMBER
        | HEXADECIMAL_NUMBER
        ;
    
    DECIMAL_NUMBER
        : [1-9][0-9]*
        | '0'
        ;
    
    OCTAL_NUMBER
        : '0' '0'..'9'+
        ;
    
    HEXADECIMAL_NUMBER
        : '0x'[0-9A-Fa-f]+
        ;
    
    ANY_CHAR
        : .
        ;
    


    Main.py

    import sys
    from antlr4 import *
    from NumberLexer import NumberLexer
    from NumberParser import NumberParser
    from NumberListener import NumberListener
    
    class Listener(NumberListener):
        # Exit a parse tree produced by NumberParser#Number.
        def exitNumber(self, ctx:NumberParser.NumberContext):
            print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
                ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
        # end-of-def
    # end-of-class
    
    def main():
        input = FileStream(sys.argv[1])
        lexer = NumberLexer(input)
        stream = CommonTokenStream(lexer)
        parser = NumberParser(stream)
        tree = parser.content()
        print(tree.toStringTree(recog=parser))
    
        listener = Listener()
        walker = ParseTreeWalker()
        walker.walk(listener, tree)
    # end-of-def
    
    main()
    


    Result:

    # Input data (for reference)
    # 0
    # 123
    #  1~9999
    #  001~077
    # 0xFF, 0x01, 0xabc123
    
    $ python3 Main.py input.txt 
    (content (number 0) \n (number 123) \n   (number 1) ~ (number 9999) \n   (number 001) ~ (number 077) \n (number 0xFF) ,   (number 0x01) ,   (number 0xabc123) \n <EOF>)
           0, dec: 0       , oct: None    , hex: None    
         123, dec: 123     , oct: None    , hex: None    
           1, dec: 1       , oct: None    , hex: None    
        9999, dec: 9999    , oct: None    , hex: None    
         001, dec: None    , oct: 001     , hex: None    
         077, dec: None    , oct: 077     , hex: None    
        0xFF, dec: None    , oct: None    , hex: 0xFF    
        0x01, dec: None    , oct: None    , hex: 0x01    
    0xabc123, dec: None    , oct: None    , hex: 0xabc123
    

    If you add the modifier 'fragment' to DECIMAL_NUMBER, OCTAL_NUMBER, HEXADECIMAL_NUMBER, you won't be able to capture the number entities (since they are not tokens anymore). And the result will be:

    $ python3 Main.py input.txt 
    (content 0 \n 1 2 3 \n   1 ~ 9 9 9 9 \n   0 0 1 ~ 0 7 7 \n 0 x F F ,   0 x 0 1 ,   0 x a b c 1 2 3 \n <EOF>)
    
    0 讨论(0)
  • 2020-11-27 04:25

    According to the Definitive Antlr4 references book :

    Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.

    actually they'll improve readability of your grammars.

    look at this example :

    STRING : '"' (ESC | ~["\\])* '"' ;
    fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
    fragment UNICODE : 'u' HEX HEX HEX HEX ;
    fragment HEX : [0-9a-fA-F] ;
    

    STRING is a lexer using fragment rule like ESC .Unicode is used in Esc rule and Hex is used in Unicode fragment rule. ESC and UNICODE and HEX rules can't be used explicitly.

    0 讨论(0)
提交回复
热议问题