Can parser combinators be made efficient?

北恋 2020-12-22 22:35

Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently rev

4 Answers
  • 2020-12-22 23:05

    In a nutshell, parser combinators are slow for lexing.

    There was a Haskell combinator library for building lexers (see "Lazy Lexing is Fast" by Manuel M. T. Chakravarty). Because the tables were generated at runtime, there was no code-generation step to deal with. The library saw some use: it was initially used in one of the FFI preprocessors, but I don't think it was ever uploaded to Hackage, so maybe it was a little too inconvenient for regular use.

    In the OCaml code above, the parser matches directly on char lists, so it can be as fast as list destructuring is in the host language (it would be much faster than Parsec if it were re-implemented in Haskell). Christian Lindig had an OCaml library with a set of parser combinators and a set of lexer combinators. The lexer combinators were certainly much simpler than Manuel Chakravarty's, and it might be worthwhile tracking down this library and benchmarking it before writing a lexer generator.
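
    As a rough illustration of what "directly matching on char lists" means, here is a minimal Haskell sketch of lexer-style combinators over a plain String. The names Parser, char, number, orElse and andThen are made up for this example; this is neither Christian Lindig's library nor the OCaml code from the question, and nothing here has been benchmarked.

    ```haskell
    import Data.Char (isDigit)

    -- A parser takes the remaining input as a plain char list and, on success,
    -- returns a value together with the unconsumed tail.
    -- (Hypothetical type, invented for this sketch.)
    newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

    -- Match one expected character by destructuring the head of the list.
    char :: Char -> Parser Char
    char c = Parser $ \s -> case s of
      (x : xs) | x == c -> Just (c, xs)
      _                 -> Nothing

    -- Lex a non-empty run of digits into an Int.
    number :: Parser Int
    number = Parser $ \s -> case span isDigit s of
      ([], _)    -> Nothing
      (ds, rest) -> Just (read ds, rest)

    -- Ordered choice: try the first parser, fall back to the second on failure.
    orElse :: Parser a -> Parser a -> Parser a
    orElse p q = Parser $ \s -> case runParser p s of
      Nothing -> runParser q s
      ok      -> ok

    -- Sequence two parsers and keep both results.
    andThen :: Parser a -> Parser b -> Parser (a, b)
    andThen p q = Parser $ \s -> do
      (a, s1) <- runParser p s
      (b, s2) <- runParser q s1
      Just ((a, b), s2)
    ```

    For instance, runParser (number `andThen` char '+') "12+3" yields Just ((12,'+'),"3"). Every primitive is just a pattern match on the list, which is why the cost is dominated by list destructuring.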

  • 2020-12-22 23:12

    I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.

    [Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]

    I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).

    I measured the following numbers (updated numbers for v. 0.9 in parens):

    • Jon's hand-rolled parser: ~230ms
    • FParsec parser #1: ~270ms (~235ms)
    • FParsec parser #2: ~110ms (~102ms)

    In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec

    • automatically generates highly readable error messages,
    • supports very large files as input (with arbitrary backtracking), and
    • comes with a declarative, runtime-configurable operator-precedence parser module.

    Here's the code for the two FParsec implementations:

    Parser #1 (Translation of djahandarie's parser):

    open FParsec
    
    let str s = pstring s
    let expr, exprRef = createParserForwardedToRef()
    
    let fact = pint32 <|> between (str "(") (str ")") expr
    let term =   chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
    do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))
    
    let parse str = run expr str
    

    Parser #2 (Idiomatic FParsec implementation):

    open FParsec
    
    let opp = new OperatorPrecedenceParser<_,_,_>()
    type Assoc = Associativity
    
    let str s = pstring s
    let noWS = preturn () // dummy whitespace parser
    
    opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
    opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
    opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
    opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))
    
    let expr = opp.ExpressionParser
    let term = pint32 <|> between (str "(") (str ")") expr
    opp.TermParser <- term
    
    let parse str = run expr str
    
  • 2020-12-22 23:13

    I've come up with a Haskell solution that is 30× faster than the Haskell solution you posted (with my concocted test expression).

    Major changes:

    1. Change Parsec/String to Attoparsec/ByteString
    2. In the fact function, change read & many1 digit to decimal
    3. Make the chainl1 recursion strict (remove $! for the lazier version).

    I tried to keep everything else you had as similar as possible.

    import Control.Applicative
    import Data.Attoparsec
    import Data.Attoparsec.Char8
    import qualified Data.ByteString.Char8 as B
    
    expr :: Parser Int
    expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
    
    term :: Parser Int
    term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
    
    fact :: Parser Int
    fact = decimal <|> char '(' *> expr <* char ')'
    
    eval :: B.ByteString -> Int
    eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')
    
    chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
    chainl1 p op = p >>= rest where
      rest x = do f <- op
                  y <- p
                  rest $! (f x y)
               <|> pure x
    
    main :: IO ()
    main = B.readFile "expr" >>= (print . eval)
    

    What I conclude from this is that most of the slowdown came from the parser combinator sitting on an inefficient base (Parsec over String), not from its being a parser combinator per se.

    I imagine with more time and profiling this could go faster, as I stopped when I went past the 25× mark.

    I don't know if this would be faster than the precedence climbing parser ported to Haskell. Maybe that would be an interesting test?
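
    For anyone who wants to try that comparison, a precedence-climbing evaluator over a plain String might look roughly like the sketch below. It uses only base; binOp, atom, climb and evalExpr are names invented here, it assumes whitespace has already been stripped, and it has not been benchmarked against the Attoparsec version above.

    ```haskell
    import Data.Char (isDigit)

    -- Precedence and meaning of each binary operator (all left-associative).
    binOp :: Char -> Maybe (Int, Int -> Int -> Int)
    binOp '+' = Just (1, (+))
    binOp '-' = Just (1, (-))
    binOp '*' = Just (2, (*))
    binOp '/' = Just (2, div)
    binOp _   = Nothing

    -- Parse an atom: a decimal literal or a parenthesised expression.
    atom :: String -> Maybe (Int, String)
    atom ('(' : s) = case climb 1 s of
      Just (v, ')' : rest) -> Just (v, rest)
      _                    -> Nothing
    atom s = case span isDigit s of
      ([], _)    -> Nothing
      (ds, rest) -> Just (read ds, rest)

    -- Parse an expression whose operators all have precedence >= minPrec.
    climb :: Int -> String -> Maybe (Int, String)
    climb minPrec s = atom s >>= uncurry loop
      where
        -- Keep absorbing operators that bind at least as tightly as minPrec.
        loop lhs (c : rest)
          | Just (prec, f) <- binOp c
          , prec >= minPrec = do
              -- prec + 1 on the right-hand side gives left associativity.
              (rhs, rest') <- climb (prec + 1) rest
              loop (f lhs rhs) rest'
        -- Otherwise stop and hand the remaining input back to the caller.
        loop lhs rest = Just (lhs, rest)

    -- Evaluate an expression, requiring the whole input to be consumed.
    evalExpr :: String -> Maybe Int
    evalExpr s = case climb 1 s of
      Just (v, "") -> Just v
      _            -> Nothing
    ```

    With these definitions, evalExpr "1+2*3" gives Just 7 and evalExpr "10-4-3" gives Just 3. Switching the input to a ByteString would be the obvious next step before timing it.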

  • 2020-12-22 23:13

    Have you tried one of the known fast parser libraries? Parsec's aims have never really been speed, but ease of use and clarity. Comparing against something like attoparsec may be a fairer comparison, especially because the string types are likely to be more comparable (ByteString instead of String).

    I also wonder which compile flags were used. This being another trolling post by the infamous Jon Harrop, it would not surprise me if no optimisations were used at all for the Haskell code.
