I want to parse a file with lines of the following content:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
so now explaining the parts (not all spaces are described, because of the markup here):
simple word
is one or several words (phrase) separated by the whitespaceabbr -
is a fixed part of the string (never changes)8
- optional number.
- always includedword, simple phrase, one another phrase
- one or several words or phrases separated by comma- (
- fixed part, always includedsimply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!;
- (optional) one or several phrases separated by;
"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose"
- (optional) one or several phrases with quotation marks"
separated by;
) This is the end of the line
- always included
In the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1
type) or with them (phrase2
type).
So the phrases are Natural Language sentences (with all the punctuation possible)...
BUT:
- the internal content is irrelevant (i.e. I do not need to parse the Natural Language itself in the NLP meaning)
- it is just required to mark it as a
phrase1
orphrase2
types:- those without and with quotation marks, i.e., if the phrase, which is placed between
(
and;
or;
and;
or;
and)
or even between(
and)
is augmented with quotation marks, then it isphrase2
type - otherwise, if the phrase either begins or ends without quotation marks, though it could have all the marks within the phrase, it is the
phrase1
type
- those without and with quotation marks, i.e., if the phrase, which is placed between
Since to write a Regex (PCRE) for such an input is an overkill, so I looked to the parsing approach (EBNF or similar). I ended up with a PEG.js parser generator. I created a basic grammar variants (even not handling the part with the different phrases in the clause):
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
or (the difference is only in " abbr -"
and "_ "abbr" _ "-""
):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
- etc.
So it looks the problem is in the ambiguity: "abbr"
is consumed withing the term
as a word
token. Although I defined the rule !(" abbr -")
, which I thought has a meaning, that the next word
token will be only consumed, if the next substring is not of " abbr -"
kind.
I didn't found any good examples explaining the following expressions of PEG.js, which seems to me as a possible solution of the aforementioned problem [from: http://pegjs.majda.cz/documentation]:
& expression
! expression
$ expression
& { predicate }
! { predicate }
TL;DR:
related to PEG.js:
are there any examples of applying the rules:
& expression
! expression
$ expression
& { predicate }
! { predicate }
general question:
- what is the possible approach to handle such complex strings with intuitive ambiguous grammars? This is still not a Natural Language and it looks like it has some formal structure, just with several optional parts. One of the ideas is to split the strings by preprocessing (with the help of Regular Expressions, in the places of fixed elements, i.e. "abbr -" ") This is the end of the line") and then create for each splited part a separate grammar. But it seems to have have performance issues and scalability problems (i.e. - what if the fixed elements will change a bit - e.g. there will be no
-
char anymore.)
Update1:
I found the rule which solves the problem with matching the "abbr -"
ambiguity:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}
but the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
if removing the predicate: term = term:(word (!" abbr -" _? word))+
:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
The expression is grouped, so why is it separated in so many nesting levels and even undefined
is included in the output? Are there any general rules to fold the result based on the expression in the rule?
Update2:
I created the grammar so that it parses as desired, though I didn't yet identified the clear process of such a grammar creation:
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
It could be tested either on the PEG.js author's site: [http://pegjs.majda.cz/online] or on the PEG.js Web-IDE: [http://peg.arcanis.fr/]
If somebody has answers for the previous questions (i.e. general approach for disambiguation of the grammar, examples to the expressions available in PEG.js) as well as improvement advices to the grammar itself (this is I think far away from an ideal grammar now), I would very appreciate!
so why is it separated in so many nesting levels and even undefined is included in the output?
If you look at the documentation for PEG.js, you'll see almost every operator collects the results of its operands into an array. undefined
is returned by the !
operator.
The $
operator bypasses all this nesting and just gives you the actual string that matches, eg: [a-z]+
will give an array of letters, but $[a-z]+
will give a string of letters.
I think most of the parsing here follows the pattern: "give me everything until I see this string". You should express this in PEG by first using !
to make sure you haven't hit the terminating string, and then just taking the next character. For example, to get everything up to " abbr -":
(!" abbr -" .)+
If the terminating string is a single character, you can use [^]
as a short form of this, eg: [^x]+
is a shorter way of saying (!"x" .)+
.
Parsing comma/semicolon separated phrases rather than comma/semicolon terminated phrases is slightly annoying, but treating them as optional terminators seems to work (with some trim
ing).
start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)"
phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
$.*
phrase_comma = p:$[^-,]+ [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+ [; ]* { return p.trim() }
q_phrase_semi = '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+
gives
[
"simple word",
" abbr -",
"8",
".",
" ",
"(012)",
[
"word",
"simple phrase",
"one another phrase"
],
"- (",
[
"simply dummy text of the printing",
"Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
],
[
"It is a long established!",
"Sometimes by accident, sometimes on purpose (injected humour and the like)",
"sometimes on purpose"
],
")",
" This is the end of the line"
]
来源:https://stackoverflow.com/questions/24659684/problems-with-an-ambiguous-grammar-and-peg-js-no-examples-found