Parsing grammars using OCaml

余生分开走 2020-12-28 10:33

I have a task to write a (toy) parser for a (toy) grammar in OCaml, and I'm not sure how to start on (or proceed with) this problem.

Here's a sample Awk grammar:

<
3 Answers
  • 2020-12-28 11:06

    Here is a rough sketch: descend into the grammar straightforwardly and try each branch in order. A possible optimization is tail recursion when a branch consists of a single non-terminal.

    exception Backtrack

    (* Returns the unconsumed tokens and the symbols visited, in order. *)
    let parse l =
      let rules = snd awksub_grammar in
      (* Try each branch of non-terminal [gram] in order; the first success wins. *)
      let rec descend gram l =
        let rec loop = function
          | [] -> raise Backtrack
          | x::xs -> try attempt x l with Backtrack -> loop xs
        in
        loop (rules gram)
      (* Match the symbols of one branch against the tokens, left to right. *)
      and attempt branch (path,tokens) =
        match branch, tokens with
        | T x :: branch' , h::tokens' when h = x ->
            attempt branch' ((T x :: path),tokens')
        | N n :: branch' , _ ->
            let (path',tokens) = descend n ((N n :: path),tokens) in
            attempt branch' (path', tokens)
        | [], _ -> path,tokens
        | _, _ -> raise Backtrack
      in
      let (path,tail) = descend (fst awksub_grammar) ([],l) in
      tail, List.rev path
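
    To try the sketch above without the grammar from the question, a small stand-in can be plugged in. The block below is only an illustration under assumptions: it takes the grammar to be a pair of a start symbol and a function from a non-terminal to its list of branches, with symbols tagged `N` (non-terminal) or `T` (terminal). The `symbol` type, `toy_grammar`, and the non-terminal names are hypothetical and are not the `awksub_grammar` of the question. `parse` is the same function, except that it takes the grammar as an argument.

```ocaml
(* Hypothetical symbol type: N = non-terminal, T = terminal. *)
type ('n, 't) symbol = N of 'n | T of 't

(* Hypothetical non-terminals for a stand-in grammar. *)
type nonterm = Expr | Term | Binop | Num

(* Assumed grammar shape: (start symbol, rules function). *)
let toy_grammar =
  ( Expr,
    function
    | Expr -> [ [ N Term; N Binop; N Expr ]; [ N Term ] ]
    | Term -> [ [ N Num ]; [ T "("; N Expr; T ")" ] ]
    | Binop -> [ [ T "+" ]; [ T "-" ] ]
    | Num -> [ [ T "0" ]; [ T "1" ]; [ T "2" ]; [ T "3" ] ] )

exception Backtrack

let parse grammar l =
  let rules = snd grammar in
  (* Try each branch of [gram] in order; the first success wins. *)
  let rec descend gram l =
    let rec loop = function
      | [] -> raise Backtrack
      | x :: xs -> (try attempt x l with Backtrack -> loop xs)
    in
    loop (rules gram)
  (* Match one branch against the tokens, recording visited symbols. *)
  and attempt branch (path, tokens) =
    match branch, tokens with
    | T x :: branch', h :: tokens' when h = x ->
        attempt branch' (T x :: path, tokens')
    | N n :: branch', _ ->
        let path', tokens = descend n (N n :: path, tokens) in
        attempt branch' (path', tokens)
    | [], _ -> (path, tokens)
    | _, _ -> raise Backtrack
  in
  let path, tail = descend (fst grammar) ([], l) in
  (tail, List.rev path)
```

    Calling `parse toy_grammar ["1"; "+"; "2"]` returns the unconsumed tokens paired with the symbols visited in order. Note that once a non-terminal has been matched, a later failure does not reopen its alternatives, so the order of branches matters.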
    
  • 2020-12-28 11:06

    Ok, so the first thing you should do is write a lexical analyser. That's the function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"], and splits it into a list of tokens (that is, representations of terminal symbols).

    You can define a token to be

    type token =
        | TokInt of int         (* an integer *)
        | TokBinOp of binop     (* a binary operator *)
        | TokOParen             (* an opening parenthesis *) 
        | TokCParen             (* a closing parenthesis *)     
    and binop = Plus | Minus 
    

    The type of the lexer function would be string list -> token list and the output of

    lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]
    

    would be something like

    [   TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
        TokBinOp Plus; TokInt 2; TokCParen   ]
    

    This will make the job of writing the parser easier, because you won't have to worry about recognising what is an integer, what is an operator, etc.

    This is a first, not too difficult step because the tokens are already separated. All the lexer has to do is identify them.
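
    A minimal sketch of that first lexer, assuming the token type above and that integers are written in a form `int_of_string_opt` accepts. The helper name `token_of_string` and the choice to raise `Failure` on unknown input are assumptions made for illustration:

```ocaml
type token =
  | TokInt of int         (* an integer *)
  | TokBinOp of binop     (* a binary operator *)
  | TokOParen             (* an opening parenthesis *)
  | TokCParen             (* a closing parenthesis *)
and binop = Plus | Minus

(* Classify one pre-split piece of input. Raises Failure on anything
   that is not a parenthesis, "+", "-", or accepted by int_of_string_opt. *)
let token_of_string = function
  | "(" -> TokOParen
  | ")" -> TokCParen
  | "+" -> TokBinOp Plus
  | "-" -> TokBinOp Minus
  | s ->
      (match int_of_string_opt s with
       | Some n -> TokInt n
       | None -> failwith ("lexer: unexpected input " ^ s))

(* The pieces are already separated, so lexing is a simple map. *)
let lexer (input : string list) : token list =
  List.map token_of_string input
```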

    When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes an actual raw input, such as "3-(4+2)", and turns it into a token list.
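
    One way the string -> token list version might look is a hand-written scanner over the characters. This is a sketch under assumptions: whitespace is skipped, integers are unsigned runs of decimal digits, and any other character is an error; `is_digit` and `go` are hypothetical helper names.

```ocaml
type token =
  | TokInt of int
  | TokBinOp of binop
  | TokOParen
  | TokCParen
and binop = Plus | Minus

let is_digit c = '0' <= c && c <= '9'

(* Scan the string left to right, accumulating tokens in reverse. *)
let lexer (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '(' -> go (i + 1) (TokOParen :: acc)
      | ')' -> go (i + 1) (TokCParen :: acc)
      | '+' -> go (i + 1) (TokBinOp Plus :: acc)
      | '-' -> go (i + 1) (TokBinOp Minus :: acc)
      | c when is_digit c ->
          (* Consume the longest run of digits starting at i. *)
          let j = ref i in
          while !j < n && is_digit s.[!j] do incr j done;
          let lit = String.sub s i (!j - i) in
          go !j (TokInt (int_of_string lit) :: acc)
      | c ->
          failwith (Printf.sprintf "lexer: unexpected character %C at %d" c i)
  in
  go 0 []
```

    Because "-" is always lexed as a binary operator here, negative literals would have to be handled by the parser rather than the lexer.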

  • 2020-12-28 11:06

    I'm not sure if you specifically require the derivation tree, or if this is just a first step in parsing. I'm assuming the latter.

    You could start by defining the structure of the resulting abstract syntax tree by defining types. It could be something like this:

    type expr =
        | Operation of term * binop * term
        | Term of term
    and term =
        | Num of num
        | Lvalue of expr
        | Incrop of incrop * expr
    and incrop = Incr | Decr
    and binop = Plus | Minus
    and num = int
    

    Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...

    By the way, there's a small example about arithmetic expressions in the OCaml documentation here.
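
    As an illustration of the recursive descent approach mentioned above, here is a small sketch over a token list. It does not implement the grammar from the question: it assumes a reduced grammar, expr ::= term binop term | term and term ::= INT | "(" expr ")", and a reduced AST in which the `Paren` constructor, the `Parse_error` exception, and the function names are hypothetical.

```ocaml
type token = TokInt of int | TokBinOp of binop | TokOParen | TokCParen
and binop = Plus | Minus

(* Reduced AST, assumed for this sketch only. *)
type expr =
  | Operation of term * binop * term
  | Term of term
and term =
  | Num of int
  | Paren of expr   (* hypothetical: a parenthesised expression *)

exception Parse_error of string

(* Each function takes the remaining tokens and returns the parsed node
   together with the tokens it did not consume. *)
let rec parse_expr tokens =
  let lhs, rest = parse_term tokens in
  match rest with
  | TokBinOp op :: rest' ->
      let rhs, rest'' = parse_term rest' in
      (Operation (lhs, op, rhs), rest'')
  | _ -> (Term lhs, rest)

and parse_term = function
  | TokInt n :: rest -> (Num n, rest)
  | TokOParen :: rest ->
      (match parse_expr rest with
       | e, TokCParen :: rest' -> (Paren e, rest')
       | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "expected a number or '('")

(* Top-level entry point: the whole input must be consumed. *)
let parse tokens =
  match parse_expr tokens with
  | e, [] -> e
  | _ -> raise (Parse_error "trailing tokens")
```

    There is one function per non-terminal, and each alternative is chosen by looking at the next token, so no backtracking is needed for this reduced grammar. Longer chains such as 3 - 4 + 2 are rejected with "trailing tokens", since an `Operation` holds exactly two terms.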
