问题
Recently I've discovered the python module pyparsing
, a wonderful tool for parsing data by writing the grammar, not the parser. I'm new to the idea of context-free grammars, so please correct any false assumptions in this question.
Pyparsing can implement a BNF (Backus–Naur Form) context-free grammar. This grammar can be recursive, but can it have a forward look-ahead? I've been wondering about the answer to this since I stumbled across this question. Let me give you a concrete example. Consider the string:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Let the grammar look something like:
<number> :: __<digit>__
<block> :: <number>(x) (<number> <number> <number> .... x times)
i.e. read the first number token, save it as x
and then consume the next x
numbers and group them together. The parsed line should look like:
[[1, 2], [3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14], [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
I've written a simple python MWE not using pyparsing so it's clear what I'm trying to do:
A = range(1,31)
B, sub_b = [], []
consume = 0
for a in A:
if consume:
sub_b.append(a)
consume -= 1
else:
if sub_b: B.append(sub_b)
sub_b = [a,]
consume = a
B.append(sub_b)
print B
Two (related) questions: Can this be done with a BNF context-free grammar? If yes/no how can I do it with pyparsing
?
回答1:
There is no such thing as a parametrized size in a context-free grammar-- or in a regular grammar, for that matter. Numerical parameters like your consume
are not part of the CFG model, and it can be proved that getting the effect in a different way is impossible. You can write a production for any fixed length, though, so you can write productions for a block of length 1, length 2, length 3, etc.:
<block3> :: <number> <number> <number>
or, similarly, matching the length as a prefix or even as a post-fix:
<block3a> :: 3 <number> <number> <number>
<block3b> :: <number> <number> <number> 3
etc. So to do what you want, you just need a grammar containing rules of the kind, for all N
you might need.
A given CFG will only include a finite number of these productions. It is mathematically impossible to write a CFG (in BNF or any other form) that can handle unlimited parametrized sizes, whether as a prefix or as a postfix. In practice, you might be able to update your CFG on the fly with new productions as needed. E.g., read a number N
and create a rule blockN
for your grammar if it's not already present. But there's no single CFG that can capture unbounded parametrized sizes.
Edit, since you also asked about context-sensitive grammars: That still won't do it. The problem is the use of integer arithmetic, not the class of the grammar. Any formal language on the Chomsky hierarchy is defined in terms of a finite number of symbols (tokens), and since are infinitely many integers, they cannot be assigned distinct meanings (note that your parsing procedure relies on integer arithmetic).
If you were to pre-process the length into a sequence of as many stars (* * * 4 7 10
), the CFG parser is trivial:
<group> :: * <group> <number>
<group> :: * <number>
This is just the so-called a^n b^n
language. You could also have symbols that mean "ten", etc. But without pre-processing, the only solution (and what your procedure or a Turing machine does in practice) is to interpret numeral notation in your grammar. E.g., parse "21" as ten ten one. I doubt this can be done in a CFG (the problem is handling arbitrarily long numerals without separate rules for millions, billions etc.), but I'm not really sure. Either way it is only interesting as an academic exercise, since using real integers is so much easier. I'm sure people have studied the properties of formal languages with integers, but I can't say anything about that.
回答2:
Pyparsing includes the helper countedArray
which does exactly what you ask. It takes a single argument expr
, and will parse an integer followed by 'n' instances of expr. For your input string, you could write:
from pyparsing import *
from pprint import pprint
# make a long string of integers, starting at 0
source = ' '.join(map(str, range(50)))
# define an integer as a 'word' of numeric digits
integer = Word(nums)
# add a parse action to convert the parsed strings to integers, at parse time
integer.setParseAction(lambda t:int(t[0]))
# parse the source string into counted strings of integers, and print out
# with pprint
lists = OneOrMore(countedArray(integer)).parseString(source)
pprint(lists.asList())
prints out:
[[],
[2],
[4, 5, 6],
[8, 9, 10, 11, 12, 13, 14],
[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
The parse action attached to
integer
converts the numeric strings back to integers (no quotes around the numbers in the lists).Note the opening empty list, which was created by the leading '0' of the source string.
Even though the source string contains more numbers past 30, there aren't enough for another counted array.
countedArray
achieves this trick by creating an expression that matches a leading integer, followed by a captive undefined Forward expression. A parse action attached to the leading integer injects a expr*n
expression into the captive Forward, which then parses the following 'n' exprs. You could as easily write countedArray(Word(alphas))
and parse "4 the quick brown fox"
to get ['the', 'quick', 'brown', 'fox']
. As @Aaron points out in his answer, there is no need to preserve the leading counter, as you can easily get it by getting the len of the returned list.
pyparsing also supports more traditional lookaheads using the FollowedBy and NotAny constructs (NotAny can be abbreviated by using the '~' operator). In this string, "Four score and seven years ago..." you can pick out the strings that are followed by words that start with vowels using Word(alphas) + FollowedBy(Word('aeiou',alphas))
, which would match 'score' and 'years'; or words that are not followed by a period using Word(alphas) + ~Literal('.')
, which would match every word but 'ago'. In this case, trying to search thru an input string for matches, you would use searchString
or scanString
instead of parseString
.
回答3:
Not really.
The whole point of having a grammar is that it allows to define data in a human understandable way. Your demo string is readable but why does 1
mean something else than the 3
?
The correct input in your case would be:
[[2], [4, 5, 6], [8, 9, 10, 11, 12, 13, 14], [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
and the grammar would look like so:
<model> :: <term>
<list> :: <term> | <term> <opt-whitespace> <list>
<term> :: '[' <list> ']'
You can then recover the missing count element by looking at the length of the list.
来源:https://stackoverflow.com/questions/10030526/can-a-bnf-handle-forward-consumption