I'm working with a service that provides data as a Lisp-like S-Expression string. This data is arriving thick and fast, and I want to churn through it as quickly as possible, ideally directly on the byte stream (it's only single-byte characters) without any backtracking. These strings can be quite lengthy and I don't want the GC churn of allocating a string for the whole message.
My current implementation uses Coco/R with a grammar, but it has a few problems. Because of the backtracking, it copies the whole stream into a string. It's also a bit fiddly for users of my code to change if they have to, so I'd rather have a pure C# solution. Coco/R also does not allow parser/scanner objects to be reused, so I have to recreate them for each message.
Conceptually the data stream can be thought of as a sequence of S-Expressions:
(item 1 apple)(item 2 banana)(item 3 chainsaw)
Parsing this sequence would create three objects. The type of each object can be determined by the first value in the list, in the above case "item". The schema/grammar of the incoming stream is well known.
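To illustrate the "single pass, no backtracking" idea on raw bytes, here is a minimal sketch in Java (chosen only because the recursive-descent answer in this thread is written in Java; a C# version would be structurally the same). The class and method names `TopLevelScan`, `split` and `head` are hypothetical, and the sketch assumes single-byte ASCII input with no string literals containing parentheses.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TopLevelScan {
    // Returns {start, endExclusive} offsets of each top-level expression in buf.
    // One forward pass; only the paren depth is tracked, nothing is copied.
    static List<int[]> split(byte[] buf) {
        List<int[]> spans = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == '(') {
                if (depth++ == 0) start = i;
            } else if (buf[i] == ')') {
                if (depth == 0) throw new IllegalArgumentException("unbalanced ')' at " + i);
                if (--depth == 0) spans.add(new int[] { start, i + 1 });
            }
        }
        if (depth != 0) throw new IllegalArgumentException("unclosed '('");
        return spans;
    }

    // Head symbol of an expression: the bytes right after '(' up to the
    // first whitespace or parenthesis. Only this short slice becomes a String.
    static String head(byte[] buf, int[] span) {
        int from = span[0] + 1;
        int to = from;
        while (to < span[1] && buf[to] > ' ' && buf[to] != '(' && buf[to] != ')') to++;
        return new String(buf, from, to - from, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = "(item 1 apple)(item 2 banana)(item 3 chainsaw)"
                .getBytes(StandardCharsets.US_ASCII);
        for (int[] s : split(data)) {
            System.out.println(head(data, s) + " @ " + s[0] + ".." + s[1]);
        }
    }
}
```

The head symbol is the only part materialised as a string here; the rest of each expression can be parsed in place from the recorded offsets.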
Before I start coding I'd like to know if there are libraries out there that do this already. I'm sure I'm not the first person to have this problem.
EDIT
Here's a little more detail on what I want as I think the original question may have been a little vague.
Given some SExpressions, such as:
(Hear 12.3 HelloWorld)
(HJ LAJ1 -0.42)
(FRP lf (pos 2.3 1.7 0.4))
I want a list of objects equivalent to this:
{
new HearPerceptorState(12.3, "HelloWorld"),
new HingeJointState("LAJ1", -0.42),
new ForceResistancePerceptorState("lf", new Polar(2.3, 1.7, 0.4))
}
The actual data set I'm working on is a list of perceptors from a robot model in the RoboCup 3D simulated soccer league. I may potentially also need to deserialise another set of related data with a more complex structure.
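The mapping described above amounts to a dispatch on the head symbol of each list. A minimal Java sketch follows (Java only to match the recursive-descent answer in this thread). It assumes the expression has already been parsed into nested lists of `String` and `Double`; the `build` method and the field names `value`, `ident`, `a`, `b`, `c` are hypothetical, and only the class names come from the example above.

```java
import java.util.Arrays;
import java.util.List;

class PerceptorDispatch {
    static final class Polar {
        final double a, b, c;
        Polar(double a, double b, double c) { this.a = a; this.b = b; this.c = c; }
    }

    static final class HearPerceptorState {
        final double value; final String ident;
        HearPerceptorState(double value, String ident) { this.value = value; this.ident = ident; }
    }

    static final class HingeJointState {
        final String ident; final double value;
        HingeJointState(String ident, double value) { this.ident = ident; this.value = value; }
    }

    static final class ForceResistancePerceptorState {
        final String ident; final Polar polar;
        ForceResistancePerceptorState(String ident, Polar polar) { this.ident = ident; this.polar = polar; }
    }

    // Builds a typed object from one parsed expression, keyed on its head symbol.
    static Object build(List<?> expr) {
        String head = (String) expr.get(0);
        switch (head) {
            case "Hear":
                return new HearPerceptorState((Double) expr.get(1), (String) expr.get(2));
            case "HJ":
                return new HingeJointState((String) expr.get(1), (Double) expr.get(2));
            case "FRP": {
                List<?> pos = (List<?>) expr.get(2); // nested (pos x y z) list
                return new ForceResistancePerceptorState((String) expr.get(1),
                        new Polar((Double) pos.get(1), (Double) pos.get(2), (Double) pos.get(3)));
            }
            default:
                throw new IllegalArgumentException("unknown expression type: " + head);
        }
    }

    public static void main(String[] args) {
        List<Object> frp = Arrays.asList("FRP", "lf", Arrays.asList("pos", 2.3, 1.7, 0.4));
        ForceResistancePerceptorState s = (ForceResistancePerceptorState) build(frp);
        System.out.println(s.ident + " " + s.polar.a);
    }
}
```

Because the schema is known in advance, a plain `switch` is enough; a map from head symbol to factory would do the same job if the set of types needs to be extensible.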
In my opinion a parser generator is unnecessary for simple S-expressions consisting only of lists, numbers and symbols. A hand-written recursive descent parser is probably simpler and at least as fast. The general pattern would look like this (in Java; C# should be very similar):
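// Assumes: import java.io.*; import java.util.*;
// The helpers isNumber, isSymbolStart, isSymbol, isWhiteSpace, skipWhiteSpace,
// readNumber and error are left to the reader; error() is expected to throw.
Object readDatum(PushbackReader in) throws IOException {
    int ch = in.read();
    return readDatum(in, ch);
}

// Dispatches on the first character of the datum.
Object readDatum(PushbackReader in, int ch) throws IOException {
    if (ch == '(') {
        return readList(in, ch);
    } else if (isNumber(ch)) {
        return readNumber(in, ch);
    } else if (isSymbolStart(ch)) {
        return readSymbol(in, ch);
    } else {
        error(ch);
        return null; // not reached when error() throws
    }
}

// Reads list elements until the matching ')'; lookAhead is the already-consumed '('.
List<Object> readList(PushbackReader in, int lookAhead) throws IOException {
    if (lookAhead != '(') {
        error(lookAhead);
    }
    List<Object> result = new ArrayList<>();
    while (true) {
        int ch = in.read();
        if (ch == ')') {
            break;
        } else if (isWhiteSpace(ch)) {
            skipWhiteSpace(in);
        } else {
            result.add(readDatum(in, ch)); // end of input (-1) ends up in error()
        }
    }
    return result;
}

// Accumulates symbol characters; the terminating delimiter is pushed back.
String readSymbol(PushbackReader in, int ch) throws IOException {
    StringBuilder result = new StringBuilder();
    result.append((char) ch);
    while (true) {
        int ch2 = in.read();
        if (isSymbol(ch2)) {
            result.append((char) ch2);
        } else if (isWhiteSpace(ch2) || ch2 == ')') {
            in.unread(ch2);
            break;
        } else if (ch2 == -1) {
            break;
        } else {
            error(ch2);
        }
    }
    return result.toString();
}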
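Since the pattern above leaves its helpers undefined, here is a self-contained variant that can be run as-is. It is a sketch, not the answer's exact design: instead of `isNumber`/`readNumber` it reads each atom up to the next delimiter and then classifies it, treating text of the form `-?digits[.digits]` as a double and everything else as a symbol. The names `SExprReader`, `readAtom`, `isDelimiter` and `parse` are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SExprReader {
    static boolean isWhiteSpace(int ch) {
        return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r';
    }

    static boolean isDelimiter(int ch) {
        return ch == -1 || ch == '(' || ch == ')' || isWhiteSpace(ch);
    }

    // Reads one datum, skipping leading whitespace; returns null at end of input.
    static Object readDatum(PushbackReader in) throws IOException {
        int ch = in.read();
        while (isWhiteSpace(ch)) ch = in.read();
        if (ch == -1) return null;
        if (ch == ')') throw new IOException("unexpected ')'");
        if (ch == '(') return readList(in);
        in.unread(ch);
        return readAtom(in);
    }

    // Called after '(' has been consumed; reads elements up to the matching ')'.
    static List<Object> readList(PushbackReader in) throws IOException {
        List<Object> result = new ArrayList<>();
        while (true) {
            int ch = in.read();
            while (isWhiteSpace(ch)) ch = in.read();
            if (ch == -1) throw new IOException("unexpected end of input inside list");
            if (ch == ')') return result;
            in.unread(ch); // one character of pushback is all that is ever needed
            result.add(readDatum(in));
        }
    }

    // Reads characters up to the next delimiter, then classifies the text.
    static Object readAtom(PushbackReader in) throws IOException {
        StringBuilder text = new StringBuilder();
        int ch;
        while (!isDelimiter(ch = in.read())) text.append((char) ch);
        if (ch != -1) in.unread(ch);
        String s = text.toString();
        return s.matches("-?\\d+(\\.\\d+)?") ? (Object) Double.valueOf(s) : s;
    }

    // Convenience wrapper for in-memory input.
    static Object parse(String text) {
        try {
            return readDatum(new PushbackReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(
                new StringReader("(Hear 12.3 HelloWorld)(FRP lf (pos 2.3 1.7 0.4))"));
        for (Object d; (d = readDatum(in)) != null; ) {
            System.out.println(d);
        }
    }
}
```

Calling `readDatum` repeatedly on the same reader walks a sequence of top-level expressions, which matches the stream shape in the question. The regular expression is used for brevity; a hot path would likely classify the characters by hand.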
I wrote an S-Expression parser in C# using OMeta#. It can parse the kind of S-Expressions you give in your examples; you just need to add decimal numbers to the parser.
The code is available as SExpression.NET on github and a related article is available here. As an alternative, I suggest taking a look at the YaYAML YAML parser for .NET, which is also written using OMeta#.
Drew, perhaps you should add some context to the question, otherwise this answer will make no sense to other users, but try this:
CHARACTERS
letter = 'A'..'Z' + 'a'..'z' .
digit = "0123456789" .
messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .
TOKENS
double = ['-'] digit { digit } [ '.' digit { digit } ] .
ident = letter { letter | digit | '_' } .
message = messageChar { messageChar } CONTEXT (")") .
Oh, I have to point out that '\u0020' is the Unicode SPACE, which you are subsequently removing with "- ' '". Oh, and you can use CONTEXT (')') if you don't need more than one character of lookahead.
FWIW: CONTEXT does not consume the enclosed sequence; you must still consume it in your production.
EDIT:
Ok, this seems to work. Really, I mean it this time :)
CHARACTERS
letter = 'A'..'Z' + 'a'..'z' .
digit = "0123456789" .
// messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .
TOKENS
double = ['-'] digit { digit } [ '.' digit { digit } ] .
ident = letter { letter | digit | '_' } .
// message = letter { messageChar } CONTEXT (')') .
// MessageText<out string m> = message (. m = t.val; .)
// .
HearExpr<out HeardMessage message> = (. TimeSpan time; Angle direction = Angle.NaN; string messageText; .)
"(hear"
TimeSpan<out time>
( "self" | AngleInDegrees<out direction> )
// MessageText<out messageText> // REMOVED
{ ANY } (. messageText = t.val; .) // MOD
')' (. message = new HeardMessage(time, direction, new Message(messageText)); .)
.
Here's a relatively simple (and hopefully, easy to extend) solution:
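using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
public delegate object Acceptor(Token token, string match);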
public class Symbol
{
public Symbol(string id) { Id = id ?? Guid.NewGuid().ToString("P"); }
public override string ToString() => Id;
public string Id { get; private set; }
}
public class Token : Symbol
{
internal Token(string id) : base(id) { }
public Token(string pattern, Acceptor acceptor) : base(pattern) { Regex = new Regex(string.Format("^({0})", !string.IsNullOrEmpty(Pattern = pattern) ? Pattern : ".*"), RegexOptions.Compiled); ValueOf = acceptor; }
public string Pattern { get; private set; }
public Regex Regex { get; private set; }
public Acceptor ValueOf { get; private set; }
}
public class SExpressionSyntax
{
private readonly Token Space = Token("\\s+", Echo);
private readonly Token Open = Token("\\(", Echo);
private readonly Token Close = Token("\\)", Echo);
private readonly Token Quote = Token("\\'", Echo);
private Token comment;
private static Exception Error(string message, params object[] arguments) => new Exception(string.Format(message, arguments));
private static object Echo(Token token, string match) => new Token(token.Id);
private static object Quoting(Token token, string match) => NewSymbol(token, match);
private Tuple<Token, string, object> Read(ref string input)
{
if (!string.IsNullOrEmpty(input))
{
var found = null as Match;
var sofar = input;
var tuple = Lexicon.FirstOrDefault(current => (found = current.Item2.Regex.Match(sofar)).Success && (found.Length > 0));
var token = tuple != null ? tuple.Item2 : null;
var match = token != null ? found.Value : null;
input = match != null ? input.Substring(match.Length) : input;
return token != null ? Tuple.Create(token, match, token.ValueOf(token, match)) : null;
}
return null;
}
private Tuple<Token, string, object> Next(ref string input)
{
Tuple<Token, string, object> read;
while (((read = Read(ref input)) != null) && ((read.Item1 == Comment) || (read.Item1 == Space))) ;
return read;
}
public object Parse(ref string input, Tuple<Token, string, object> next)
{
var value = null as object;
if (next != null)
{
var token = next.Item1;
if (token == Open)
{
var list = new List<object>();
while (((next = Next(ref input)) != null) && (next.Item1 != Close))
{
list.Add(Parse(ref input, next));
}
if (next == null)
{
throw Error("unexpected EOF");
}
value = list.ToArray();
}
else if (token == Quote)
{
var quote = next.Item3;
next = Next(ref input);
value = new[] { quote, Parse(ref input, next) };
}
else
{
value = next.Item3;
}
}
else
{
throw Error("unexpected EOF");
}
return value;
}
protected Token TokenOf(Acceptor acceptor)
{
var found = Lexicon.FirstOrDefault(pair => pair.Item2.ValueOf == acceptor);
var token = found != null ? found.Item2 : null;
if ((token == null) && (acceptor != Commenting))
{
throw Error("missing required token definition: {0}", acceptor.Method.Name);
}
return token;
}
protected IList<Tuple<string, Token>> Lexicon { get; private set; }
protected Token Comment { get { return comment = comment ?? TokenOf(Commenting); } }
public static Token Token(string pattern, Acceptor acceptor) => new Token(pattern, acceptor);
public static object Commenting(Token token, string match) => Echo(token, match);
public static object NewSymbol(Token token, string match) => new Symbol(match);
public static Symbol Symbol(object value) => value as Symbol;
public static string Moniker(object value) => Symbol(value) != null ? Symbol(value).Id : null;
public static string ToString(object value)
{
return
value is object[] ?
(
((object[])value).Length > 0 ?
((object[])value).Aggregate(new StringBuilder("("), (result, obj) => result.AppendFormat(" {0}", ToString(obj))).Append(" )").ToString()
:
"( )"
)
:
(value != null ? (value is string ? string.Concat('"', (string)value, '"') : (value is bool ? value.ToString().ToLower() : value.ToString())).Replace("\\\r\n", "\r\n").Replace("\\\n", "\n").Replace("\\t", "\t").Replace("\\n", "\n").Replace("\\r", "\r").Replace("\\\"", "\"") : null) ?? "(null)";
}
public SExpressionSyntax()
{
Lexicon = new List<Tuple<string, Token>>();
Include(Space, Open, Close, Quote);
}
public SExpressionSyntax Include(params Token[] tokens)
{
foreach (var token in tokens)
{
Lexicon.Add(new Tuple<string, Token>(token.Id, token));
}
return this;
}
public object Parse(string input)
{
var next = Next(ref input);
var value = Parse(ref input, next);
if ((next = Next(ref input)) != null)
{
throw Error("unexpected {0}", next.Item1);
}
return value;
}
}
public class CustomSExpressionSyntax : SExpressionSyntax
{
public CustomSExpressionSyntax()
: base()
{
Include
(
// "//" comments
Token("\\/\\/.*", SExpressionSyntax.Commenting),
// Obvious
Token("false", (token, match) => false),
Token("true", (token, match) => true),
Token("null", (token, match) => null),
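// Parse with the invariant culture so "12.3" is read the same way on every locale
Token("\\-?[0-9]+\\.[0-9]+", (token, match) => double.Parse(match, System.Globalization.CultureInfo.InvariantCulture)),
Token("\\-?[0-9]+", (token, match) => int.Parse(match, System.Globalization.CultureInfo.InvariantCulture)),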
// String literals
Token("\\\"(\\\\\\n|\\\\t|\\\\n|\\\\r|\\\\\\\"|[^\\\"])*\\\"", (token, match) => match.Substring(1, match.Length - 2)),
// Identifiers
Token("[_A-Za-z][_0-9A-Za-z]*", NewSymbol)
);
}
}
public class Node { }
public class HearPerceptorState : Node
{
public string Ident { get; set; }
public double Value { get; set; }
}
public class HingeJointState : Node
{
public string Ident { get; set; }
public double Value { get; set; }
}
public class Polar : Tuple<double, double, double>
{
public Polar(double a, double b, double c) : base(a, b, c) { }
}
public class ForceResistancePerceptorState : Node
{
public string Ident { get; set; }
public Polar Polar { get; set; }
}
public class Test
{
public static void Main()
{
var input = @"
(
(Hear 12.3 HelloWorld)
(HJ LAJ1 -0.42)
(FRP lf (pos 2.3 1.7 0.4))
)
";
// visit DRY helpers
Func<object, object[]> asRecord = value => (object[])value;
Func<object, Symbol> symbol = value => SExpressionSyntax.Symbol(value);
Func<object, string> identifier = value => symbol(value).Id;
// the SExpr visit, proper
Func<object[], Node[]> visitAll = null;
Func<object[], Node> visitHear = null;
Func<object[], Node> visitHJ = null;
Func<object[], Node> visitFRP = null;
visitAll =
all =>
all.
Select
(
item =>
symbol(asRecord(item)[0]).Id != "Hear" ?
(
symbol(asRecord(item)[0]).Id != "HJ" ?
visitFRP(asRecord(item))
:
visitHJ(asRecord(item))
)
:
visitHear(asRecord(item))
).
ToArray();
visitHear =
item =>
new HearPerceptorState { Value = (double)asRecord(item)[1], Ident = identifier(asRecord(item)[2]) };
visitHJ =
item =>
new HingeJointState { Ident = identifier(asRecord(item)[1]), Value = (double)asRecord(item)[2] };
visitFRP =
item =>
new ForceResistancePerceptorState
{
Ident = identifier(asRecord(item)[1]),
Polar =
new Polar
(
(double)asRecord(asRecord(item)[2])[1],
(double)asRecord(asRecord(item)[2])[2],
(double)asRecord(asRecord(item)[2])[3]
)
};
var syntax = new CustomSExpressionSyntax();
var sexpr = syntax.Parse(input);
var nodes = visitAll(asRecord(sexpr));
Console.WriteLine("SO_3051254");
Console.WriteLine();
Console.WriteLine(nodes.Length == 3);
Console.WriteLine(nodes[0] is HearPerceptorState);
Console.WriteLine(nodes[1] is HingeJointState);
Console.WriteLine(nodes[2] is ForceResistancePerceptorState);
}
}
Testable here:
HTH,
Source: https://stackoverflow.com/questions/3051254/parsing-lisp-s-expressions-with-known-schema-in-c-sharp