Question
I am trying to write a grammar for a set of sentences and am using pyparsing to parse them. The sentences describe what to search for in a text file and how, and I need to convert each one into the corresponding regex search. However, some elements are not really context-free, so I am finding it difficult to write production rules for them.
Some examples of context-sensitive elements found in these sentences -
LINE_CONTAINS phrase1 BEFORE {phrase2 AND phrase3}
means that, in the line, phrase1 can come anywhere before phrase2 and phrase3. Similarly for AFTER.
LINE_CONTAINS abc JOIN xyz
means search for abc xyz and abc-xyz and abcxyz.
LINE_CONTAINS abcd AND xyzw
means the line should contain both abcd and xyzw.
Example - LINE_CONTAINS we transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore
This should be converted to re.search(r'(^!Therefore.*?we transfected.*?sirna)') and re.search(r'(gene)')
(A better regex can be made I believe)
I had begun writing grammar for the sentences as -
Beginner = LINE_CONTAINS|LINE_STARTSWITH|other line beginners...
Phrase = word+
sentence = Beginner + phrase + AND + Beginner + phrase
Any of these motifs/elements can occur in any line and can be in combination too. Like
LINE_CONTAINS {x AND y} BEFORE {a letter AND b letter} AND zoo people AND LINE_STARTSWITH dfg
So my question is -
How do I write grammar rules that can handle such context-sensitive elements, given that any sentence can contain them (most sentences won't contain more than one, but some will)? Should I write rules for many kinds of sentences, each containing one of these kinds of elements? Or should I write a single rule that contains all such elements and make them optional?
I do understand that these elements may not exactly be context-sensitive, but my problem lies in not being able to write independent production rules for elements like BEFORE, JOIN, etc. How do I best define their function in the production rules?
Edit - The phrases can be multi-word
Answer 1:
Making some guesses about your grammar, here is a rough stab. Notice how I separately define the line expressions from the phrase expressions:
from pyparsing import (CaselessKeyword, Word, alphas, MatchFirst, quotedString,
                       infixNotation, opAssoc, Suppress, Group)

# Line directives, boolean operators, and phrase operators
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(
    CaselessKeyword, "LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())
keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR,
                      BEFORE, AFTER, JOIN])

# A phrase word is any word that is not one of the keywords
phrase_word = ~keyword + Word(alphas + '_')
phrase_term = phrase_word | quotedString

# Phrase expressions: operators listed from highest to lowest precedence
phrase_expr = infixNotation(phrase_term,
    [
        ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT),
        (NOT, 1, opAssoc.RIGHT),
        (AND, 2, opAssoc.LEFT),
        (OR, 2, opAssoc.LEFT),
    ],
    lpar=Suppress('{'), rpar=Suppress('}')
)

# A line term is a line directive followed by a phrase expression
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
                  Group(phrase_expr)("phrase"))

# Line expressions combine line terms with NOT, AND and OR
line_contents_expr = infixNotation(line_term,
    [
        (NOT, 1, opAssoc.RIGHT),
        (AND, 2, opAssoc.LEFT),
        (OR, 2, opAssoc.LEFT),
    ]
)

sample = """
LINE_CONTAINS transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore
"""
line_contents_expr.runTests(sample)
parses your sample as:
LINE_CONTAINS transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore
[[['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]], 'AND', ['LINE_STARTSWITH', ['Therefore']]]]
[0]:
  [['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]], 'AND', ['LINE_STARTSWITH', ['Therefore']]]
  [0]:
    ['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]
      [0]:
        [['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']
        [0]:
          ['transfected', 'BEFORE', 'sirna']
        [1]:
          AND
        [2]:
          gene
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', ['Therefore']]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: ['Therefore']
The phrase_word starts with a negative lookahead, to avoid accidentally treating strings like 'LINE_STARTSWITH' as phrase words. I also added quoted strings as valid phrase words, since you never know when your search will have to actually include the string "LINE_STARTSWITH".
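The effect of that negative lookahead can be illustrated outside pyparsing. The sketch below uses the standard re module as a stand-in; it is not how pyparsing implements ~keyword, and the names KEYWORDS, PHRASE_WORD and is_phrase_word are made up for this illustration.

```python
import re

# The keywords from the grammar above.
KEYWORDS = ["LINE_CONTAINS", "LINE_STARTSWITH", "LINE_ENDSWITH",
            "NOT", "AND", "OR", "BEFORE", "AFTER", "JOIN"]

# (?!...) is a negative lookahead: refuse to match if a whole keyword
# starts here. \b makes sure only complete keywords are rejected.
PHRASE_WORD = re.compile(
    r"(?!(?:{})\b)[A-Za-z_]+".format("|".join(KEYWORDS)),
    re.IGNORECASE,  # keywords are caseless in the grammar
)

def is_phrase_word(token):
    """True if token is a plain word rather than a keyword."""
    return PHRASE_WORD.fullmatch(token) is not None

print(is_phrase_word("gene"))     # ordinary word
print(is_phrase_word("AND"))      # keyword
print(is_phrase_word("ANDROID"))  # only starts with a keyword
```

A word that merely begins with a keyword, such as ANDROID, still counts as a phrase word, because the lookahead only rejects complete keywords.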
You use {}s for grouping in your phrase expressions; infixNotation has optional lpar and rpar arguments to override the defaults of ( and ).
From here, you can look at other infixNotation examples (such as SimpleBool.py on the pyparsing wiki examples page) to convert this into your respective regex-generating code.
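As a rough idea of what that conversion step might look like, here is a minimal sketch that walks the nested-list form of the parse results (the shape shown in the dump above, for example from asList()) and tests a line against it. Everything in it is an assumption for illustration: the names ANCHORS, fragment, phrase_matches and line_matches, the regex chosen for each operator, and the restriction that BEFORE, AFTER and JOIN only have plain words or each other as operands. Quoted strings and a braced AND group on the right of BEFORE are not handled.

```python
import re

# Regex template for each line directive (an assumption of this sketch).
ANCHORS = {
    "LINE_CONTAINS": "(?:{})",
    "LINE_STARTSWITH": "^(?:{})",
    "LINE_ENDSWITH": "(?:{})$",
}

def fragment(node):
    """Build a regex fragment from a word or a BEFORE/AFTER/JOIN chain."""
    if isinstance(node, str):
        return re.escape(node)
    frag = fragment(node[0])
    # A left-associative chain looks like [a, op, b, op, c, ...]
    for op, rhs in zip(node[1::2], node[2::2]):
        right = fragment(rhs)
        if op == "BEFORE":
            frag = frag + ".*?" + right
        elif op == "AFTER":
            frag = right + ".*?" + frag
        elif op == "JOIN":
            frag = frag + "[ -]?" + right   # abc xyz, abc-xyz, abcxyz
        else:
            raise ValueError("unsupported phrase operator: %r" % (op,))
    return frag

def phrase_matches(node, line, template):
    """Evaluate a phrase tree; NOT/AND/OR are combined in Python."""
    if isinstance(node, list):
        if len(node) == 1:
            return phrase_matches(node[0], line, template)
        if node[0] == "NOT":
            return not phrase_matches(node[1], line, template)
        if node[1] in ("AND", "OR"):
            results = [phrase_matches(n, line, template) for n in node[0::2]]
            return all(results) if node[1] == "AND" else any(results)
    return re.search(template.format(fragment(node)), line) is not None

def line_matches(node, line):
    """Evaluate a line-level tree of [directive, phrase] terms."""
    if isinstance(node[0], str) and node[0] in ANCHORS:
        return phrase_matches(node[1], line, ANCHORS[node[0]])
    if len(node) == 1:
        return line_matches(node[0], line)
    if node[0] == "NOT":
        return not line_matches(node[1], line)
    results = [line_matches(n, line) for n in node[0::2]]
    return all(results) if node[1] == "AND" else any(results)

# The tree printed by runTests above, written out as plain lists.
tree = [[["LINE_CONTAINS", [[["transfected", "BEFORE", "sirna"], "AND", "gene"]]],
         "AND",
         ["LINE_STARTSWITH", ["Therefore"]]]]
print(line_matches(tree, "Therefore we transfected the sirna into the gene construct"))
```

Evaluating AND, OR and NOT in Python rather than folding them into one pattern keeps each generated regex small; a single combined regex would need lookaheads.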
Answer 2:
This seems to me to be a very simplistic grammar. I think you are "overthinking" the problem.
Looking at your examples, I see this:
a JOIN b
a BEFORE b
a AND b
a OR b
STARTSWITH a
Those are simply operators. Unary operators (STARTSWITH) are like ~x or -x in Python. Binary operators (JOIN, BEFORE, AND, OR) are like x + y or x in y in Python.
I don't think CONTAINS is an operator so much as a placeholder. Pretty much everything except STARTSWITH is implicitly a contains. So that's kind of like the unary-plus operator: defined, understood, allowed, but useless.
Anyway, figure out what the operators are (make a list). Figure out whether they are unary (startswith) or binary (and). Then figure out what their precedence and associativity are.
Once you know that information, you can build your parser: you will know the key words, and know how to arrange the key words in a grammar.
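To make that concrete, here is a small sketch of an operator table driving a precedence-climbing parser, written with the standard library only. The precedence numbers are guesses for illustration, not something the question specifies; operands are limited to single words, and all binary operators are assumed to be left-associative. The names BINARY, UNARY and parse are made up for this example.

```python
# Operator tables: higher number binds tighter (values are assumptions).
BINARY = {"JOIN": 4, "BEFORE": 4, "AND": 2, "OR": 1}
UNARY = {"STARTSWITH": 3}

def parse(tokens, min_prec=1):
    """Precedence climbing over a list of tokens (consumed in place)."""
    tok = tokens.pop(0)
    if tok in UNARY:
        # A unary operator applies to whatever binds at least as tightly.
        left = [tok, parse(tokens, UNARY[tok])]
    else:
        left = tok
    while tokens and tokens[0] in BINARY and BINARY[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # +1 makes the operator left-associative.
        right = parse(tokens, BINARY[op] + 1)
        left = [left, op, right]
    return left

print(parse("a JOIN b AND STARTSWITH c OR d".split()))
```

Changing the grammar then mostly means editing the two tables, which is the same information that infixNotation takes as its list of operator tuples.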
Source: https://stackoverflow.com/questions/42415837/writing-grammar-rules-for-context-sensitive-elements-using-pyparsing