Parse indentation level with PEG.js

Posted by 雨燕双飞 on 2019-12-03 05:16:10

Question


I have essentially the same question as PEG for Python style indentation, but I'd like to get a little more direction regarding this answer.

The answer successfully generates an array of strings that are each line of input with 'INDENT' and 'DEDENT' between lines. It seems like he's pretty much used PEG.js to tokenize, but no real parsing is happening.

So how can I extend his example to do some actual parsing?

As an example, how can I change this grammar:

start = obj
obj = id:id children:(indent obj* outdent)?
    {
        if (children) {
            let o = {}; o[id] = children[1];
            return o;
        } else {
            return id;
        }
    }
id = [a-z]
indent = '{'
outdent = '}'

to use indentation instead of braces to delineate blocks, and still get the same output?

(Use http://pegjs.majda.cz/online to test that grammar with the following input: a{bcd{zyx{}}})


Answer 1:


Parser:

// do not use result cache, nor line and column tracking

{ var indentStack = [], indent = ""; }

start
  = INDENT? l:line
    { return l; }

line
  = SAMEDENT line:(!EOL c:. { return c; })+ EOL?
    children:( INDENT c:line* DEDENT { return c; })?
    { var id = line.join(""), o = {}; o[id] = children; return children ? o : id; }

EOL
  = "\r\n" / "\n" / "\r"

SAMEDENT
  = i:[ \t]* &{ return i.join("") === indent; }

INDENT
  = &(i:[ \t]+ &{ return i.length > indent.length; }
      { indentStack.push(indent); indent = i.join(""); pos = offset; })

DEDENT
  = { indent = indentStack.pop(); }

Input:

a
  b
  c
  d
    z
    y
    x

Output:

{
   "a": [
      "b",
      "c",
      {
         "d": [
            "z",
            "y",
            "x"
         ]
      }
   ]
}

It cannot parse an empty object (the last x), but that should be easy to solve. The trick here is the SAMEDENT rule, which succeeds when the indentation level has not changed. INDENT and DEDENT change the current indentation level without changing the position in the text (pos = offset).
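
The bookkeeping behind SAMEDENT, INDENT and DEDENT can also be sketched in plain JavaScript, without PEG.js. This is only a hand-written illustration of the indentation-stack idea, not the grammar above: parseIndented and its helper names are hypothetical, and blank lines are simply skipped, which is an assumption of this sketch.

```javascript
// Hand-written sketch of the SAMEDENT / INDENT / DEDENT bookkeeping.
// parseIndented and its helpers are illustrative names, not PEG.js API.
function parseIndented(text) {
  // Assumption of this sketch: blank lines are ignored.
  const lines = text.split(/\r\n|\n|\r/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return null;

  const indentStack = []; // indentation strings of the enclosing blocks
  let indent = "";        // indentation string of the current block
  let pos = 0;            // index of the next unread line

  const leading = (line) => line.match(/^[ \t]*/)[0];

  // SAMEDENT: the next line is indented exactly like the current block.
  const sameDent = () => pos < lines.length && leading(lines[pos]) === indent;

  // INDENT: the next line is indented deeper, so remember the outer level.
  // Like the grammar's lookahead, this does not consume any input.
  function tryIndent() {
    if (pos >= lines.length) return false;
    const next = leading(lines[pos]);
    if (next.length <= indent.length) return false;
    indentStack.push(indent);
    indent = next;
    return true;
  }

  // DEDENT: restore the enclosing level; it always succeeds.
  const dedent = () => { indent = indentStack.pop(); };

  function parseLine() {
    const id = lines[pos++].slice(indent.length);
    if (!tryIndent()) return id; // no deeper block follows: plain leaf
    const children = [];
    while (sameDent()) children.push(parseLine());
    dedent();
    return { [id]: children };
  }

  tryIndent(); // mirrors "start = INDENT? line"
  return parseLine();
}

const tree = parseIndented("a\n  b\n  c\n  d\n    z\n    y\n    x\n");
console.log(JSON.stringify(tree));
// prints {"a":["b","c",{"d":["z","y","x"]}]}
```

As in the grammar, the indentation state only changes in tryIndent and dedent, while the read position only advances when a line is actually consumed.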




Answer 2:


Here is a fix for Jakub Kulhan's grammar that works in PEG.js v0.10.0. The last line needs to be changed to = &{ indent = indentStack.pop(); return true;} because PEG.js no longer allows standalone actions ({...}) in a grammar. The line is now a predicate (&{...}) that always succeeds (return true;).

I also removed the pos = offset; statement because it raises the error "offset is not defined". Jakub was probably referring to a global variable that was available in older versions of PEG.js. PEG.js now provides the location() function, which returns an object that contains the offset and other position information.

// do not use result cache, nor line and column tracking

{ var indentStack = [], indent = ""; }

start
  = INDENT? l:line
    { return l; }

line
  = SAMEDENT line:(!EOL c:. { return c; })+ EOL?
    children:( INDENT c:line* DEDENT { return c; })?
    { var id = line.join(""), o = {}; o[id] = children; return children ? o : id; }

EOL
  = "\r\n" / "\n" / "\r"

SAMEDENT
  = i:[ \t]* &{ return i.join("") === indent; }

INDENT
  = &(i:[ \t]+ &{ return i.length > indent.length; }
      { indentStack.push(indent); indent = i.join(""); })

DEDENT
  = &{ indent = indentStack.pop(); return true;}

Starting with v0.11.0, PEG.js also supports the value plucking operator (@), which would make this grammar even simpler to write, but since it is not currently available in the online parser, I will refrain from adding it to this example.
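
For readers whose parser build does accept plucking, the line rule might be sketched as follows. This is an untested sketch: it assumes that @ in front of an element of a parenthesized sequence makes that element the value of the sequence, so the two inline { return c; } helper actions are no longer needed. The final action builds the object key with line.join("") so that multi-character lines produce a plain string key.

```pegjs
line
  = SAMEDENT line:(!EOL @.)+ EOL?
    children:( INDENT @line* DEDENT )?
    { var id = line.join(""), o = {}; o[id] = children; return children ? o : id; }
```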




Answer 3:


This example uses a colon (:) to distinguish an object from a simple letter. That way the input can also end with an object, but the colon is required. Like the example in the question, it does not handle ignorable whitespace (e.g. before a colon). It is based on Jakub Kulhan's example:

// do not use result cache, nor line and column tracking

{ var indentStack = [], indent = ""; }

Start = Object

Object = Block / Letterline

Block = Samedent id:Letter ':' childs:(
    Newline Indent childs:Object* Dedent {return childs;}
)* {
    if (childs) {
        var o = {}; o[id] = childs.flat().flat();
        return o;
    } else {
        return id;
    }
}

Letterline = Samedent letters:Letter+ Newline? {return letters;}

Letter = [a-z]

Newline = "\r\n" / "\n" / "\r"

Indent = &(
    i:[ ]+ &{
        return i.length > indent.length;
    } {
        indentStack.push(indent);
        indent = i.join("");
    }
)

Samedent = i:[ ]* &{ return i.join("") === indent; }

Dedent = &{ indent = indentStack.pop(); return true; }

The grammar will produce the desired output for the following input:

a:
  bc
  d:
    zy
    x:
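
The double childs.flat().flat() in the Block rule is easier to follow with the nesting written out. The snippet below is a hand-built sketch of the shape childs would plausibly have for the block a: in the input above. It is an assumption about the match structure, not captured parser output, and once and twice are illustrative names.

```javascript
// Shape of `childs` for the block "a:", written out by hand.
const childs = [                    // one entry per match of the ( ... )* group
  [                                 // result of Object* inside that group
    ["b", "c"],                     // Letterline returns an array of letters
    { d: ["z", "y", { x: [] }] },   // nested Block, already flattened
  ],
];

const once = childs.flat();  // removes the wrapper added by the ( ... )* group
const twice = once.flat();   // spreads the Letterline arrays into the list

console.log(JSON.stringify({ a: twice }));
// prints {"a":["b","c",{"d":["z","y",{"x":[]}]}]}
```

An object with no children, such as x:, ends up with an empty array because the group matches zero times and flattening an empty array leaves it empty.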


Source: https://stackoverflow.com/questions/11659095/parse-indentation-level-with-peg-js
