Simply using parsec in python

Question

I'm looking at this library, which has little documentation: https://pythonhosted.org/parsec/#examples

I understand there are alternatives, but I'd like to use this library.

I have the following string I'd like to parse:

mystr = """
<kv>
  key1: "string"
  key2: 1.00005
  key3: [1,2,3]
</kv>
<csv>
date,windspeed,direction
20190805,22,NNW
20190805,23,NW
20190805,20,NE
</csv>"""

While I'd like to parse the whole thing, I'd settle for just grabbing the <tags>. I have:

>>> import parsec
>>> tag_start = parsec.Parser(lambda x: x == "<")
>>> tag_end = parsec.Parser(lambda x: x == ">")
>>> tag_name = parsec.Parser(parsec.Parser.compose(parsec.many1, parsec.letter))
>>> tag_open = parsec.Parser(parsec.Parser.joint(tag_start, tag_name, tag_end))

OK, looks good. Now to use it:

>>> tag_open.parse(mystr)
Traceback (most recent call last):
...
TypeError: <lambda>() takes 1 positional argument but 2 were given

This fails. I don't understand why the error says my lambda expression was given two arguments when it clearly takes one. How can I proceed?

My optimal desired output for all the bonus points is:

[
{"type": "tag", 
 "name" : "kv",
 "values"  : [
    {"key1" : "string"},
    {"key2" : 1.00005},
    {"key3" : [1,2,3]}
  ]
},
{"type" : "tag",
"name" : "csv", 
"values" : [
    {"date" : 20190805, "windspeed" : 22, "direction": "NNW"},
    {"date" : 20190805, "windspeed" : 23, "direction": "NW"},
    {"date" : 20190805, "windspeed" : 20, "direction": "NE"}
  ]
}
]

The output I'd settle for in this question, generated with functions like the start- and end-tag parsers described above, is:

[
  {"tag": "kv"},
  {"tag" : "csv"}
]

I'd also like to simply be able to parse arbitrary XML-like tags out of messy mixed text.

Answer 1:

I encourage you to define your own parser using those combinators, rather than construct the Parser directly.

If you do want to construct a Parser by wrapping a function then, as the documentation states, the function fn must accept two arguments: the text and the current position. That is why your one-argument lambda raises the TypeError. fn must also return a Value, created with Value.success or Value.failure, rather than a boolean. You can grep for @Parser in parsec/__init__.py in this package to find more examples of how it works.

For your case in the description, you could define the parser as follows:

import re

from parsec import *

spaces = regex(r'\s*', re.MULTILINE)
name = regex(r'[_a-zA-Z][_a-zA-Z0-9]*')

tag_start = spaces >> string('<') >> name << string('>') << spaces
tag_stop = spaces >> string('</') >> name << string('>') << spaces

@generate
def header_kv():
    key = yield spaces >> name << spaces
    yield string(':')
    value = yield spaces >> regex('[^\n]+')
    return {key: value}

@generate
def header():
    tag_name = yield tag_start
    values = yield sepBy(header_kv, string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

@generate
def body():
    tag_name = yield tag_start
    values = yield sepBy(sepBy1(regex(r'[^\n<,]+'), string(',')), string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

parser = header + body

If you run parser.parse(mystr), it yields

({'type': 'tag',
  'name': 'kv',
  'values': [{'key1': '"string"'},
             {'key2': '1.00005'},
             {'key3': '[1,2,3]'}]},
 {'type': 'tag',
  'name': 'csv',
  'values': [['date', 'windspeed', 'direction'],
             ['20190805', '22', 'NNW'],
             ['20190805', '23', 'NW'],
             ['20190805', '20', 'NE']]}
)

You can refine the definition of values in the above code to get the result in the exact form you want.
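
One way to do that refinement is to post-process the tuple that parser.parse(mystr) returns. The sketch below is plain Python and does not call parsec at all. It assumes the raw result has exactly the shape shown above, uses ast.literal_eval to turn the value strings into Python objects, and pairs each CSV row with the header row. The helper names to_python, refine_kv and refine_csv are made up for this illustration.

```python
import ast

def to_python(text):
    """Turn '1.00005', '"string"' or '[1,2,3]' into a Python value; keep other text as a string."""
    try:
        return ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        return text.strip()

def refine_kv(tag):
    """Convert the string values of a key/value tag."""
    values = [{k: to_python(v) for k, v in pair.items()} for pair in tag['values']]
    return {**tag, 'values': values}

def refine_csv(tag):
    """Treat the first row as the header and turn every other row into a dict."""
    header, *rows = tag['values']
    values = [{col: to_python(cell) for col, cell in zip(header, row)} for row in rows]
    return {**tag, 'values': values}

# Raw result copied from the output shown above.
raw = (
    {'type': 'tag', 'name': 'kv',
     'values': [{'key1': '"string"'}, {'key2': '1.00005'}, {'key3': '[1,2,3]'}]},
    {'type': 'tag', 'name': 'csv',
     'values': [['date', 'windspeed', 'direction'],
                ['20190805', '22', 'NNW'],
                ['20190805', '23', 'NW'],
                ['20190805', '20', 'NE']]},
)
header_raw, body_raw = raw
refined = [refine_kv(header_raw), refine_csv(body_raw)]
print(refined)
```

Doing the conversion after parsing keeps the grammar small. The same conversions could probably be moved into the @generate functions instead, by converting each value before it is returned.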

Answer 2:

According to the tests, the proper way to parse your string would be the following:

from parsec import *

possible_chars = letter() | space() | one_of('/.,:"[]') | digit()
parser = many(many(possible_chars) + string("<") >> mark(many(possible_chars)) << string(">"))

parser.parse(mystr)
# [((1, 1), ['k', 'v'], (1, 3)), ((5, 1), ['/', 'k', 'v'], (5, 4)), ((6, 1), ['c', 's', 'v'], (6, 4)), ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]

The construction of the parser:

For the sake of convenience, we first define the characters we wish to match. parsec provides many types:

letter(): matches any alphabetic character,
string(str): matches any specified string str,
space(): matches any whitespace character,
spaces(): matches multiple whitespace characters,
digit(): matches any digit,
eof(): matches the end of the input string,
regex(pattern): matches a provided regex pattern,
one_of(str): matches any character from the provided string,
none_of(str): matches any character that is not in the provided string.

We can combine them with operators, according to the docs:

|: Implements choice. The parser p | q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. NOTICE: there is no backtracking,
+: Joins two or more parsers into one and returns the aggregate of their results,
^: Choice with backtracking, used whenever arbitrary lookahead is needed. The parser p ^ q first applies p. If it succeeds, the value of p is returned. If p fails, it pretends that it hasn't consumed any input, and then parser q is tried,
<<: Ends with a specified parser: the right-hand parser must match and its input is consumed, but only the left-hand result is kept,
<: Ends with a specified parser: the right-hand parser must match, but it does not consume any input,
>>: Sequentially composes two parsers, discarding the value produced by the first,
mark(p): Marks the line and column information of the result of the parser p.
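
The difference between | and ^ matters as soon as two alternatives share a prefix, such as an opening "<" and a closing "</". The sketch below is a toy model written with the standard library only. It is not parsec's own code, and the names Value, string, choice and try_choice here are local stand-ins. It assumes that a failing literal parser reports the index of the first mismatching character, which is how "consumed input" is modelled.

```python
from collections import namedtuple

# Toy stand-in for a parse result; not the library's own class.
Value = namedtuple('Value', 'status index value expected')

def string(s):
    """Match the literal s; on failure, report the index of the first mismatch."""
    def parse(text, index):
        if text.startswith(s, index):
            return Value(True, index + len(s), s, None)
        matched = 0
        while (matched < len(s) and index + matched < len(text)
               and text[index + matched] == s[matched]):
            matched += 1
        return Value(False, index + matched, None, s)
    return parse

def choice(p, q):
    """Model of p | q: q is tried only if p failed without consuming input."""
    def parse(text, index):
        res = p(text, index)
        if res.status or res.index != index:
            return res
        return q(text, index)
    return parse

def try_choice(p, q):
    """Model of p ^ q: when p fails, q is tried from the original index."""
    def parse(text, index):
        res = p(text, index)
        return res if res.status else q(text, index)
    return parse

close_or_open = choice(string('</'), string('<'))
close_or_open_bt = try_choice(string('</'), string('<'))

# string('</') matches the '<' and then fails on 'k', so | gives up here.
no_backtrack = close_or_open('<kv>', 0)
# ^ rewinds to index 0 and lets string('<') succeed.
with_backtrack = close_or_open_bt('<kv>', 0)
print(no_backtrack.status, with_backtrack.status)
```

In this model the first alternative has already moved past the "<" when it fails, so only the backtracking choice reaches the second alternative.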

Then there are multiple "combinators":

times(p, mint, maxt=None): Repeats parser p from mint to maxt times,
count(p, n): Repeats parser p n times. If n is less than or equal to zero, the parser returns an empty list,
optional(p, default_value=None): Makes a parser optional. On success, returns the result; otherwise silently returns default_value (None if not provided) without raising an exception,
many(p): Repeats parser p zero or more times,
many1(p): Repeats parser p at least once,
separated(p, sep, mint, maxt=None, end=None): Repeats parser p, separated by sep, from mint to maxt times; end controls whether a trailing separator is optional (None), required (True), or not allowed (False),
sepBy(p, sep): parses zero or more occurrences of parser p, separated by delimiter sep,
sepBy1(p, sep): parses at least one occurrence of parser p, separated by delimiter sep,
endBy(p, sep): parses zero or more occurrences of p, separated and ended by sep,
endBy1(p, sep): parses at least one occurrence of p, separated and ended by sep,
sepEndBy(p, sep): parses zero or more occurrences of p, separated and optionally ended by sep,
sepEndBy1(p, sep): parses at least one occurrence of p, separated and optionally ended by sep.

Using all of that, we have a parser which matches many occurrences of many possible_chars, followed by a <, then we mark the many occurrences of possible_chars up until >.
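
To get from that marked result to the simpler [{"tag": ...}] list asked for in the question, the character lists can be joined and the closing tags dropped. This is a minimal sketch in plain Python that starts from the result printed above rather than calling parsec; opening_tags is a hypothetical helper name.

```python
# Result copied from the parser.parse(mystr) call above:
# each item is (start position, list of matched characters, end position).
marked = [
    ((1, 1), ['k', 'v'], (1, 3)),
    ((5, 1), ['/', 'k', 'v'], (5, 4)),
    ((6, 1), ['c', 's', 'v'], (6, 4)),
    ((11, 1), ['/', 'c', 's', 'v'], (11, 5)),
]

def opening_tags(marked_items):
    """Join each character list into a name and drop closing tags such as '/kv'."""
    names = (''.join(chars) for _start, chars, _end in marked_items)
    return [{'tag': name} for name in names if not name.startswith('/')]

tags = opening_tags(marked)
print(tags)  # [{'tag': 'kv'}, {'tag': 'csv'}]
```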

Answer 3:

Since the Parser requires a function that has two alternative results (and two parameters), you may want to write that function as a separate named definition rather than trying to squeeze it into an inline lambda.

A Parser is an object that wraps a function to do the parsing work. The arguments of the function should be the string to be parsed and the index at which to begin parsing. The function should return either Value.success(next_index, value) if parsing succeeds, or Value.failure(index, expected) if it fails.

But if you want to use a lambda expression anyway, you can give it both required parameters, perhaps like this (I have not read closely through the docs on how Value.success and Value.failure are meant to be used, so treat it as untested):

lambda x, y: Value.success(y + 1, x[y]) if x[y:y + 1] == "<" else Value.failure(y, "<")

Answer 4:

As others noted, the parse function needs to accept two arguments.
The syntax for multiple input args is: lambda x, y: ...

Unfortunately lambda is not suitable for building a parsec Parser this way since you need to return a parsec.Value type not a boolean, so it will quickly lose its terseness.

The design of parsec requires a Parser to act independently on an input stream without knowledge of any other Parser. To do this effectively, a Parser must manage an index position in the input string: it receives the starting index and returns the next position after consuming some tokens. This is why the wrapped function takes an input index along with the input string, and why it returns a parsec.Value, which carries a success flag and the next index together with the parsed value (on success) or a description of what was expected (on failure).

Here's a basic example consuming a < token, to illustrate:

import parsec

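def parse_start_tag(stream, index):
    # Look at the current position, not at the start of the stream.
    if index < len(stream) and stream[index] == '<':
        # Consume one character and return it as the parsed value.
        return parsec.Value.success(index + 1, stream[index])
    else:
        # Nothing consumed; report which token was expected.
        return parsec.Value.failure(index, '<')

tag_open = parsec.Parser(parse_start_tag)
print(tag_open.parse("<tag>")) # prints: <
print(tag_open.parse("tag>"))  # raises parsec.ParseError: a '<' was expected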
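
The index handling described above can also be modelled without the library. The following is a toy sketch using only the standard library, under the assumption that a parser is just a function from (text, index) to a result record. Value, char, letters and sequence are local names invented for this illustration, not parsec's implementation.

```python
from collections import namedtuple

# Toy result record: success flag, next index, parsed value, expected text.
Value = namedtuple('Value', 'status index value expected')

def char(c):
    """Parser for one literal character c."""
    def parse(text, index):
        if index < len(text) and text[index] == c:
            return Value(True, index + 1, c, None)
        return Value(False, index, None, c)
    return parse

def letters(text, index):
    """Parser for one or more alphabetic characters."""
    end = index
    while end < len(text) and text[end].isalpha():
        end += 1
    if end == index:
        return Value(False, index, None, 'a letter')
    return Value(True, end, text[index:end], None)

def sequence(*parsers):
    """Run parsers in order, feeding each one the index returned by the previous one."""
    def parse(text, index):
        values = []
        for p in parsers:
            res = p(text, index)
            if not res.status:
                return res  # stop at the first failure
            values.append(res.value)
            index = res.index
        return Value(True, index, values, None)
    return parse

tag_open = sequence(char('<'), letters, char('>'))
ok = tag_open('<kv> rest', 0)   # succeeds and stops at index 4
bad = tag_open('kv>', 0)        # fails immediately: '<' was expected
print(ok)
print(bad)
```

Because every parser only sees the text and an index, none of them needs to know which parser ran before it, which is the independence property described above.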

Source: https://stackoverflow.com/questions/57368870/simply-using-parsec-in-python

Tags

python

parsec

parser-combinators