How best to parse a simple grammar?


OK, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a questi

5 answers

天命终不由人

2020-12-07 15:25

I don't pretend to know much about parsing a grammar, and for your case the solution by unutbu is all you'll need. But I learnt a fair bit about parsing from Eric Lippert in his recent series of blog posts.

http://blogs.msdn.com/b/ericlippert/archive/2010/04/26/every-program-there-is-part-one.aspx

It's a seven-part series that goes through creating and parsing a grammar, then optimizing the grammar to make parsing easier and faster. He writes C# code to generate all combinations of particular grammars, but it shouldn't be too much of a stretch to convert that into Python to parse a fairly simple grammar of your own.


旧时难觅i

2020-12-07 15:26

I know that this question is about a decade old and has certainly been answered by now. I am mainly posting this answer to prove to myself that I have understood PEG parsers at last. I'm using the fantastic parsimonious module here.
That being said, you could write a parsing grammar, build an AST, and visit it to obtain the desired structure:

from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
from itertools import groupby

grammar = Grammar(
    r"""
    term            = course (operator course)*
    course          = coursename? ws coursenumber
    coursename      = ~"[A-Z]+"
    coursenumber    = ~"\d+"
    operator        = ws (and / or / comma) ws
    and             = "and"
    or              = (comma ws)? "or"
    comma           = ","
    ws              = ~"\s*"
    """
)

class CourseVisitor(NodeVisitor):
    def __init__(self):
        self.current = None
        self.courses = []
        self.listnum = 1

    def generic_visit(self, node, children):
        pass

    def visit_coursename(self, node, children):
        if node.text:
            self.current = node.text

    def visit_coursenumber(self, node, children):
        course = (self.current, int(node.text), self.listnum)
        self.courses.append(course)

    def visit_or(self, node, children):
        self.listnum += 1

courses = ["CS 2110", "CS 2110 and INFO 3300",
           "CS 2110, INFO 3300", "CS 2110, 3300, 3140",
           "CS 2110 or INFO 3300", "MATH 2210, 2230, 2310, or 2940"]

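for course in courses:
    tree = grammar.parse(course)
    cv = CourseVisitor()
    cv.visit(tree)
    # Group the (name, number, listnum) tuples by listnum; a separate name
    # avoids rebinding the list that the loop is iterating over.
    grouped = [list(v) for _, v in groupby(cv.courses, lambda x: x[2])]
    print(grouped)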

Here, we work our way from the bottom up, starting with building blocks such as whitespace and the operators "or", "and", and ",", which eventually lead to the course and finally the term. The visitor class builds the desired structure (well, almost: the last tuple element still has to be removed).
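
Removing that last tuple element could look roughly like the sketch below. The helper name strip_group_index is made up for illustration, and the input is written out by hand in the (name, number, listnum) shape the visitor collects, so parsimonious is not needed to run it.

```python
from itertools import groupby


def strip_group_index(flat):
    """Group (dept, number, listnum) tuples by listnum, then drop listnum."""
    return [
        [(dept, number) for dept, number, _ in group]
        for _, group in groupby(flat, key=lambda course: course[2])
    ]


# Hand-written stand-in for cv.courses after visiting
# "MATH 2210, 2230, 2310, or 2940".
flat = [("MATH", 2210, 1), ("MATH", 2230, 1), ("MATH", 2310, 1), ("MATH", 2940, 2)]
print(strip_group_index(flat))
```

groupby only merges adjacent items, which is enough here because listnum never decreases while the tree is visited.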


谎友^

2020-12-07 15:30

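def parse(astr):
    # Commas and the word "and" only separate courses, so discard them.
    # Filtering whole tokens (rather than substrings) avoids mangling any
    # token that merely contains "and" and keeps "2110,3300" as two numbers.
    tokens = [tok for tok in astr.replace(',', ' ').split() if tok != 'and']
    dept = None
    result = []
    option = []
    for tok in tokens:
        if tok == 'or':
            # "or" closes the current group and starts a new one.
            result.append(option)
            option = []
        elif tok.isalpha():
            # A department name applies to every number that follows it.
            dept = tok
        elif dept is not None:
            option.append((dept, int(tok)))
    if option:
        result.append(option)
    return result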

if __name__=='__main__':
    tests=[ ("CS 2110" , [[("CS", 2110)]]),
            ("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),
            ("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),
            ("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]

    for test,answer in tests:
        result=parse(test)
        if result==answer:
            print('GOOD: {0} => {1}'.format(test,answer))
        else:
            print('ERROR: {0} => {1} != {2}'.format(test,result,answer))
            break

yields

GOOD: CS 2110 => [[('CS', 2110)]]
GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]]
GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]]
GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]


余生分开走

2020-12-07 15:36

For simple grammars I really like Parsing Expression Grammars (PEGs), which amount to a disciplined, structured way of writing a recursive-descent parser. In a dynamically typed language like Python you can do useful things without having a separate "parser generator". That means no nonsense with reduce-reduce conflicts or other arcana of LR parsing.

I did a little searching and pyPEG appears to be a nice library for Python.
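
To illustrate the "disciplined recursive descent" point without any library, here is a rough hand-written sketch, not pyPEG code. The class and function names are invented for this example, and the grammar in the comments is my own guess at the course-list syntax discussed in this thread. Each method corresponds to one rule, and alternatives are tried in order, as in a PEG.

```python
import re


class CourseParser:
    """Hand-written recursive-descent parser; each method mirrors one rule."""

    def __init__(self, text):
        self.text = text.strip()
        self.pos = 0
        self.dept = None      # most recently seen department name
        self.groups = [[]]    # an "or" starts a new inner list

    def _match(self, pattern):
        """Match a regex at the current position; consume and return it, else None."""
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(0)

    def term(self):
        # term = course (operator course)*
        self.course()
        while self.pos < len(self.text):
            if self.operator() == "or":
                self.groups.append([])
            self.course()
        return self.groups

    def course(self):
        # course = coursename? ws coursenumber
        name = self._match(r"[A-Z]+")
        if name is not None:
            self.dept = name
        self._match(r"\s*")
        number = self._match(r"\d+")
        if number is None or self.dept is None:
            raise ValueError(f"expected a course at position {self.pos}")
        self.groups[-1].append((self.dept, int(number)))

    def operator(self):
        # operator = ws ("and" / ","? ws "or" / ",") ws, tried in that order
        self._match(r"\s*")
        if self._match(r"and\b") is not None:
            kind = "and"
        elif self._match(r"(?:,\s*)?or\b") is not None:
            kind = "or"
        elif self._match(r",") is not None:
            kind = ","
        else:
            raise ValueError(f"expected an operator at position {self.pos}")
        self._match(r"\s*")
        return kind


def parse_courses(text):
    return CourseParser(text).term()


print(parse_courses("MATH 2210, 2230, 2310, or 2940"))
```

Because every _match call either consumes its whole pattern or nothing, a failed alternative leaves the position untouched, which is what makes the ordered choice safe here.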

南旧

2020-12-07 15:51

If you get reduce/reduce conflicts, you need to specify the precedence of "or" and "and". I'm guessing "and" binds tightest, so that "CS 101 and CS 102 or CS 201" means [[CS 101, CS 102], [CS 201]].

If you can find examples of both readings, then the grammar is ambiguous and you are out of luck. However, you might be able to leave this ambiguity unresolved, depending on what you are going to do with the results.
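
As a rough sketch of the "and binds tightest" reading, one can split on "or" first (the loosest operator) and only then on "and" or a comma. The function name group_by_precedence is made up for illustration, and the sketch assumes this reading is the intended one.

```python
import re


def group_by_precedence(text):
    """Split on "or" first (loosest), then on "and" or "," (tightest)."""
    alternatives = re.split(r",?\s*\bor\b\s*", text.strip())
    return [
        [part.strip() for part in re.split(r"\s*(?:\band\b|,)\s*", alt) if part.strip()]
        for alt in alternatives
    ]


print(group_by_precedence("CS 101 and CS 102 or CS 201"))
# [['CS 101', 'CS 102'], ['CS 201']]
```

Under the opposite precedence, the same string would group as CS 101 and (CS 102 or CS 201), which this flat list-of-lists shape cannot express.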

PS: It looks like the language is regular, so you could consider a DFA.
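
A minimal sketch of what such a DFA might look like, working on tokens rather than characters. The state names, token kinds, and the exact set of allowed transitions are my own assumptions, modelled on the example strings in this thread. It only recognizes valid input; it does not build the result lists.

```python
import re

# Transition table: (state, token kind) -> next state.
TRANSITIONS = {
    ("start", "DEPT"): "dept",
    ("dept", "NUM"): "num",
    ("num", "AND"): "sep",
    ("num", "OR"): "sep",
    ("num", "COMMA"): "comma",
    ("comma", "OR"): "sep",      # allows ", or"
    ("comma", "DEPT"): "dept",
    ("comma", "NUM"): "num",
    ("sep", "DEPT"): "dept",
    ("sep", "NUM"): "num",
}
ACCEPTING = {"num"}  # input must end right after a course number


def classify(token):
    """Map a raw token to a token kind, or None if it is not recognized."""
    if token == "and":
        return "AND"
    if token == "or":
        return "OR"
    if token == ",":
        return "COMMA"
    if token.isdigit():
        return "NUM"
    if token.isalpha() and token.isupper():
        return "DEPT"
    return None


def accepts(text):
    """Return True if the token sequence drives the DFA to an accepting state."""
    state = "start"
    # Words, digit runs, and any other single non-space character become tokens.
    for token in re.findall(r"[A-Za-z]+|\d+|\S", text):
        state = TRANSITIONS.get((state, classify(token)))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("MATH 2210, 2230, 2310, or 2940"))  # True
print(accepts("CS 2110 or"))                      # False
```

Attaching an action to each transition (remember the department, append a course, start a new group) would turn this recognizer into a parser.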
