All of the parsers in Text.Parsec.Token politely use lexeme to eat whitespace after a token. Unfortunately for me, that whitespace includes newlines, which I am using as expression terminators. Is it possible to stop lexeme from consuming newlines?
No, it is not. Here is the relevant code, from Text.Parsec.Token:
    lexeme p
        = do{ x <- p; whiteSpace; return x }

    whiteSpace
        | noLine && noMulti  = skipMany (simpleSpace <?> "")
        | noLine             = skipMany (simpleSpace <|> multiLineComment <?> "")
        | noMulti            = skipMany (simpleSpace <|> oneLineComment <?> "")
        | otherwise          = skipMany (simpleSpace <|> oneLineComment <|> multiLineComment <?> "")
        where
          noLine  = null (commentLine languageDef)
          noMulti = null (commentStart languageDef)
Notice that in the where clause of whiteSpace, the only configurable options concern comments; plain white space, newlines included, is always skipped. The lexeme function uses whiteSpace, and lexeme is used liberally throughout the rest of Text.Parsec.Token.
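One workaround, sketched below, is to bypass makeTokenParser where it matters and roll your own lexeme on top of a whitespace parser that skips only spaces and tabs. The names lineSpace and lexeme' are mine, not part of Parsec:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Skip spaces and tabs, but never newlines.
    lineSpace :: Parser ()
    lineSpace = skipMany (oneOf " \t")

    -- A lexeme combinator built on it, mirroring the definition above.
    lexeme' :: Parser a -> Parser a
    lexeme' p = do { x <- p; lineSpace; return x }

    -- A newline can now be matched explicitly as an expression terminator.
    terminator :: Parser ()
    terminator = lexeme' (char '\n') >> return ()

This only helps for parsers you write yourself; everything built by makeTokenParser still uses the stock whiteSpace.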
The ultimate solution for me was to use a proper lexical analyser (Alex). Parsec does a very good job as a parsing library, and it is a credit to its design that it can be bent into doing lexical analysis, but for all but small and simple projects it quickly becomes unwieldy. I now use Alex to produce a linear stream of tokens and then have Parsec turn them into an AST.
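In that setup, newlines simply become ordinary tokens. A minimal sketch of the Parsec side, assuming a hypothetical token type Tok produced by the Alex lexer (real tokens would normally carry source positions as well):

    import Text.Parsec

    -- Hypothetical token type emitted by the Alex lexer.
    data Tok = TInt Integer | TPlus | TNewline
      deriving (Show, Eq)

    type TokParser a = Parsec [Tok] () a

    -- Match a single token with tokenPrim; position handling is simplified here.
    tok :: (Tok -> Maybe a) -> TokParser a
    tok = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)

    int :: TokParser Integer
    int = tok (\t -> case t of { TInt n -> Just n; _ -> Nothing })

    -- Newlines are ordinary tokens, so they make natural terminators.
    stmt :: TokParser Integer
    stmt = int <* tok (\t -> case t of { TNewline -> Just (); _ -> Nothing })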
Well, not all parsers in Text.Parsec.Token use lexeme, although all of them should. Worse, it is not documented which of them consume white space and which do not: some of the parsers in Text.Parsec.Token consume trailing white space, some of them don't, and some consume leading white space as well. You should read the existing issues on the GitHub issue tracker if you want to control the situation fully.
In particular:

- the decimal, hexadecimal, and octal parsers do not consume trailing white space (though you can wrap them in lexeme yourself, as sketched after this list); see the source, and this issue;

- integer consumes leading white space as well; see this issue;

- the rest of them probably consume trailing white space, and thus newlines. This is, however, difficult to tell for sure, because Parsec's code is particularly hairy (IMHO) and the project has no test suite (except for three tests that check that already-fixed bugs do not show up again; that is not enough to prevent regressions, and any change to the source may break your code in the next release of Parsec).
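A sketch of such a wrapper, assuming the stock emptyDef language definition; lexer and decimalLexeme are my names:

    import Text.Parsec
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (emptyDef)

    lexer :: P.TokenParser ()
    lexer = P.makeTokenParser emptyDef

    -- decimal does not eat trailing white space on its own, so wrap it
    -- explicitly in the record's own lexeme field.
    decimalLexeme :: Parsec String () Integer
    decimalLexeme = P.lexeme lexer (P.decimal lexer)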
There are various proposals for how to make it configurable (that is, what should be considered white space), but for some reason none of them has been merged or even commented on.
But the real problem is the design of Text.Parsec.Token itself, which locks the user into the solutions built by makeTokenParser. This design is particularly inflexible. In many cases the only solution is to copy the entire module and edit it as needed.
But if you want a modern and consistent alternative to Parsec, there is the option of switching to Megaparsec, where this problem (and many others) does not exist.
Disclosure: I'm one of the authors of Megaparsec.
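For comparison, a minimal sketch of a Megaparsec space consumer that never eats newlines (assuming a recent megaparsec, version 7 or later; the comment syntax below is just an example):

    import Control.Monad (void)
    import Data.Void (Void)
    import Text.Megaparsec
    import qualified Text.Megaparsec.Char.Lexer as L

    type Parser = Parsec Void String

    -- Space consumer: spaces, tabs and comments, but never newlines.
    sc :: Parser ()
    sc = L.space (void (some (oneOf " \t")))
                 (L.skipLineComment "--")
                 (L.skipBlockComment "{-" "-}")

    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme sc

Every parser wrapped in this lexeme then consumes trailing spaces, tabs and comments, but leaves newlines for you to treat as terminators.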
If newlines are your expression terminators, maybe it would make sense to split the input at each newline and parse each line on its own.
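A minimal sketch of that approach (parseLines is my name; note that the original line numbers are lost in the error messages):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Run the expression parser once per line; each line must be
    -- a complete expression.
    parseLines :: Parser a -> String -> Either ParseError [a]
    parseLines p = traverse (parse (p <* eof) "<line>") . lines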
Although the other answers are correct that this is not possible in general, I would like to point out that the parsers in Text.Parsec.Char do not use lexeme.

I use Parsec to analyse some HTML mustache templates, and white space is significant in that analysis. What I did was simply parse the ">" and "}}" strings with Text.Parsec.Char.string.

Since I am interested in the white space between tags and not inside them, I can still use the reserved operators to parse "<" and "{{" etc., because a lexeme parser only consumes trailing white space.
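A sketch of that mix, using emptyDef and the symbol lexeme in place of a reserved operator (lexer, openTag and closeTag are my names):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as P
    import Text.Parsec.Language (emptyDef)

    lexer :: P.TokenParser ()
    lexer = P.makeTokenParser emptyDef

    -- White space after "{{" is not significant, so a lexeme is fine here.
    openTag :: Parser String
    openTag = P.symbol lexer "{{"

    -- White space after "}}" matters, so use the raw Char-level parser.
    closeTag :: Parser String
    closeTag = string "}}"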