I'm trying to figure out how to parse a string in this format into a tree like data structure of arbitrary depth.
"{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}"
[[["Hello big" "Hi" "Hey"]
["world" "earth"]]
[["Goodbye" "farewell"]
["planet" "rock" "globe" ["."
"!"]]]]
I've tried playing with some regular expressions for this (such as #"{([^{}]*)}" ), but everything I've tried seems to "flatten" the tree into a big list of lists. I could be approaching this from the wrong angle, or maybe a regex just isn't the right tool for the job.
Thanks for your help!
Don't use regular expressions for this task. An easier method would be to describe your string with a grammar (BNF or EBNF) and then write a parser to parse the string according to the grammar. You can generate a parse-tree from your EBNF and BNF and so you naturally end up with a tree structure.
You can start with something like this:
element ::= element-type, { ["|"], element-type }
element-type ::= primitive | "{", element, "}"
primitive ::= symbol | word
symbol ::= "." | "!"
word ::= character { character }
character ::= "a" | "b" | ... | "z"
Note: I wrote this up quickly, and so it may not be completely correct. But it should give you an idea.
Trying to match the whole thing with a single regular expression isn't going to get you too far, since regular expressions output at most a list of matching substring positions, nothing tree-like. You want a lexer or grammar which does something like this:
Divide the input into tokens - atomic pieces like '{', '|', and 'world', then process those tokens in order. Start with an empty tree with a single root node.
Every time you find {
, create and go to a child node.
Every time you find |
, create and go to a sibling node.
Every time you find }
, go up to the parent node.
Every time you find a word, put that word in the current leaf node.
if you want a quick hack:
- replace the { chars with [
- replace the } chars with ]
- replace the | chars with spaces
- hope you dont get input with spaces.
read
it in so it comes up as nested arrays.
ps: I agree that a reg-ex can't do this.
pss: set * read-eval * to false (you don't want the input running it's self)
You can use amotoen to build grammar and parse this:
(ns pegg.core
(:gen-class)
(:use
(com.lithinos.amotoen
core string-wrapper))
(:use clojure.contrib.pprint))
(def input "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}")
(def grammar
{
:Start :List
:ws #"^[ \n\r\t]*"
:Sep "|"
:String #"^[A-Za-z !.]+"
:Item '(| :String :List)
:Items [:Item '(+ [:Sep :Item])]
:List [:ws "{" '(* (| :Items :Item)) "}" :ws]
})
(def parser (create-parser grammar))
(defn parse
[^String input]
(validate grammar)
(pprint (parser (wrap-string input))))
Result:
pegg.core> (parse input)
{:List [{:ws ""} "{" ({:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Hello big"}} ([{:Sep "|"} {:Item {:String "Hi"}}] [{:Sep "|"} {:Item {:String "Hey"}}])]}) "}" {:ws " "}]}} {:Items [{:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "world"}} ([{:Sep "|"} {:Item {:String "earth"}}])]}) "}" {:ws ""}]}} ([{:Sep "|"} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "Goodbye"}} ([{:Sep "|"} {:Item {:String "farewell"}}])]}) "}" {:ws " "}]}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "planet"}} ([{:Sep "|"} {:Item {:String "rock"}}] [{:Sep "|"} {:Item {:String "globe"}}])]} {:Item {:List [{:ws ""} "{" ({:Items [{:Item {:String "."}} ([{:Sep "|"} {:Item {:String "!"}}])]}) "}" {:ws ""}]}}) "}" {:ws ""}]}}) "}" {:ws ""}]}
P.S. This is one of my first peg grammar and it can be better. Also see http://en.wikipedia.org/wiki/Parsing_expression_grammar
来源:https://stackoverflow.com/questions/3826395/parse-string-into-a-tree-structure