In Parsec, is there a way to prevent lexeme from consuming newlines?

生来不讨喜 2021-02-13 12:36

All of the parsers in Text.Parsec.Token politely use lexeme to eat whitespace after a token. Unfortunately for me, whitespace includes new lines, whic

4 Answers
  • 2021-02-13 12:44

    No, there is not. Here is the relevant code.

    From Text.Parsec.Token:

    lexeme p
        = do{ x <- p; whiteSpace; return x  }
    
    
    --whiteSpace
    whiteSpace
        | noLine && noMulti  = skipMany (simpleSpace <?> "")
        | noLine             = skipMany (simpleSpace <|> multiLineComment <?> "")
        | noMulti            = skipMany (simpleSpace <|> oneLineComment <?> "")
        | otherwise          = skipMany (simpleSpace <|> oneLineComment <|> multiLineComment <?> "")
        where
          noLine  = null (commentLine languageDef)
          noMulti = null (commentStart languageDef)
    

    Note that the where clause of whiteSpace only looks at the comment settings; nothing in the language definition controls which characters count as white space. The lexeme function calls whiteSpace, and lexeme is used liberally throughout the rest of Text.Parsec.Token.
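
    Since whiteSpace cannot be configured, one workaround is to bypass Text.Parsec.Token for the affected tokens and write a small lexeme of your own. The following is a minimal sketch, not part of Parsec: hspace, lexemeH, number and the other names are hypothetical, and it assumes spaces and tabs are the only white space you want skipped.

```haskell
import Text.Parsec (ParseError, endOfLine, eof, many1, digit, oneOf, parse, sepEndBy, skipMany)
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for whiteSpace: skips spaces and tabs, never newlines.
hspace :: Parser ()
hspace = skipMany (oneOf " \t")

-- Hypothetical stand-in for lexeme, built on hspace instead of whiteSpace.
lexemeH :: Parser a -> Parser a
lexemeH p = p <* hspace

-- A token parser that leaves the newline in the input.
number :: Parser Integer
number = lexemeH (read <$> many1 digit)

-- Each line is a run of numbers; newlines separate (or terminate) lines.
numberLines :: Parser [[Integer]]
numberLines = hspace *> (many1 number `sepEndBy` endOfLine) <* eof

parseLines :: String -> Either ParseError [[Integer]]
parseLines = parse numberLines "<input>"
```

    With this sketch, parseLines "1 2\n3  4\n" yields Right [[1,2],[3,4]], because the newline survives each token and can act as a terminator.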


    Update Sept. 28, 2015

    The ultimate solution for me was to use a proper lexical analyser (Alex). Parsec does a very good job as a parsing library, and it is a credit to its design that it can be bent into doing lexical analysis, but for all but small and simple projects that quickly becomes unwieldy. I now use Alex to produce a linear stream of tokens, and Parsec then turns those tokens into an AST.
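
    As a rough illustration of the second stage (Parsec consuming tokens rather than characters), here is a minimal sketch. The Token type and every parser name in it are hypothetical, and the token list is assumed to come from a separate lexer such as one generated by Alex; source positions are not tracked, to keep it short.

```haskell
import Text.Parsec (ParseError, Parsec, eof, many1, parse, sepEndBy, tokenPrim)

-- Hypothetical token type; a real lexer would produce something richer.
data Token = TNum Integer | TNewline
  deriving (Eq, Show)

type TokParser = Parsec [Token] ()

-- Accept one token when the function returns Just.
-- The position update is a no-op here purely for brevity.
matchToken :: (Token -> Maybe a) -> TokParser a
matchToken = tokenPrim show (\pos _ _ -> pos)

num :: TokParser Integer
num = matchToken $ \t -> case t of
  TNum n -> Just n
  _      -> Nothing

newlineTok :: TokParser ()
newlineTok = matchToken $ \t -> case t of
  TNewline -> Just ()
  _        -> Nothing

-- Newlines are ordinary tokens, so they can terminate a statement.
statements :: TokParser [[Integer]]
statements = (many1 num `sepEndBy` newlineTok) <* eof

parseTokens :: [Token] -> Either ParseError [[Integer]]
parseTokens = parse statements "<tokens>"
```

    Because the lexer decides what is white space, the parser only ever sees the newlines it is meant to see.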

  • 2021-02-13 12:57

    Well, not all parsers in Text.Parsec.Token use lexeme, although all of them should. Worst of all, it is not documented which of them consume white space and which do not. Some consume trailing white space, some don't, and some consume leading white space as well. If you want full control over the situation, read the existing issues on the GitHub issue tracker.

    In particular:

    • the decimal, hexadecimal, and octal parsers do not consume trailing white space (see the source and this issue);

    • integer consumes leading white space as well (see this issue);

    • the rest of them probably consume trailing white space and thus newlines. This is difficult to tell for sure, however, because Parsec's code is particularly hairy (IMHO) and the project has no real test suite. There are only 3 tests, which check that already fixed bugs do not show up again; that is not enough to prevent regressions, so any change in the source may break your code in the next release of Parsec.
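
    The difference between the first group and the lexeme-based parsers can be checked directly. This is a small sketch that returns whatever input is left after one token; restAfterDecimal and restAfterNatural are hypothetical helper names, and the behaviour described is what the Parsec source shown in this thread implies, so it is worth re-checking against the version you have installed.

```haskell
import Text.Parsec (ParseError, anyChar, many, parse)
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as Tok

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Input remaining after Tok.decimal, which is not wrapped in lexeme.
restAfterDecimal :: String -> Either ParseError String
restAfterDecimal = parse (Tok.decimal lexer *> many anyChar) "<input>"

-- Input remaining after Tok.natural, which is wrapped in lexeme.
restAfterNatural :: String -> Either ParseError String
restAfterNatural = parse (Tok.natural lexer *> many anyChar) "<input>"
```

    For the input "42 \nx", restAfterDecimal leaves " \nx" untouched, while restAfterNatural leaves only "x": the space and the newline have both been eaten.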

    There are various proposals for making the definition of white space configurable, but for some reason none of them has been merged or even commented on.

    The real problem, though, lies in the design of Text.Parsec.Token, which locks the user into the solutions built by makeTokenParser. This design is particularly inflexible. There are many cases where the only solution is to copy the entire module and edit it as needed.

    If you want a modern and consistent Parsec, one option is to switch to Megaparsec, where this problem (and many others) does not exist.
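
    In Megaparsec the space consumer is an ordinary argument to lexeme, so you choose what it skips. The following is a minimal sketch, assuming a reasonably recent megaparsec and that only spaces and tabs should be skipped; sc, integer, rows and parseRows are names chosen for this example.

```haskell
import Control.Applicative (empty)
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, parseMaybe, sepEndBy, some, takeWhile1P)
import Text.Megaparsec.Char (eol)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Space consumer: spaces and tabs only, no comments, no newlines.
sc :: Parser ()
sc = L.space (void (takeWhile1P Nothing isHSpace)) empty empty
  where
    isHSpace c = c == ' ' || c == '\t'

-- lexeme takes the space consumer explicitly.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

integer :: Parser Integer
integer = lexeme L.decimal

-- Newlines stay in the input and separate the rows.
rows :: Parser [[Integer]]
rows = sc *> (some integer `sepEndBy` eol) <* eof

parseRows :: String -> Maybe [[Integer]]
parseRows = parseMaybe rows
```

    Here parseRows "1 2\n3\t4\n" gives Just [[1,2],[3,4]].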


    Disclosure: I'm one of the authors of Megaparsec.

  • 2021-02-13 12:58

    If newlines are your expression terminators, maybe it would make sense to split the input at each newline and parse each line on its own.
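
    A minimal sketch of that idea, assuming each line is simply a run of natural numbers (lineParser and parseEachLine are hypothetical names). Since every line is parsed separately, the token parsers never see a newline, so it no longer matters that lexeme would consume one.

```haskell
import Text.Parsec (ParseError, eof, many1, parse)
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Parser for a single line; the standard lexeme-based parsers are safe here.
lineParser :: Parser [Integer]
lineParser = Tok.whiteSpace lexer *> many1 (Tok.natural lexer) <* eof

-- Split on newlines first, then parse each line on its own.
-- The line number is used as the source name so errors stay readable.
parseEachLine :: String -> Either ParseError [[Integer]]
parseEachLine = sequence . zipWith parseOne [1 :: Int ..] . lines
  where
    parseOne n = parse lineParser ("line " ++ show n)
```

    The trade-off is that an expression can never span several lines, and blank lines need separate handling (in this sketch they are a parse error).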

  • 2021-02-13 13:08

    Although the other answers are correct that this is not possible, I would like to point out that the char parsers do not use a lexeme parser.
    I use Parsec to analyse some HTML mustache templates, and whitespace is significant in that analysis. What I did was simply parse the ">" and "}}" strings with Text.Parsec.Char.string.
    Since I am interested in the whitespace between tags and not inside them, I can still use the reserved operators to parse "<" and "{{" etc., because the lexeme parser only consumes trailing whitespace.
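
    Roughly, the approach looks like the sketch below. It is not the actual template code: tag is a hypothetical parser for a single "{{ name }}" tag, and Tok.symbol stands in for whichever lexeme-based parser is used on the opening delimiter.

```haskell
import Text.Parsec (ParseError, anyChar, many, parse, string)
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Parses one tag and returns its name together with the raw text after it.
tag :: Parser (String, String)
tag = do
  _    <- Tok.symbol lexer "{{"   -- lexeme-based: skips whitespace inside the tag
  name <- Tok.identifier lexer    -- lexeme-based: skips whitespace before "}}"
  _    <- string "}}"             -- plain char parser: consumes nothing after "}}"
  rest <- many anyChar            -- whitespace between tags is still there
  return (name, rest)

parseTag :: String -> Either ParseError (String, String)
parseTag = parse tag "<template>"
```

    For the input "{{ name }}\n  text", parseTag returns Right ("name", "\n  text"), so the newline and indentation after the closing delimiter are preserved.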
