How to create/write a simple XML parser from scratch?
Rather than code samples, I'd like to know the simplified, basic steps in plain English.
How is a good par
For an event-based parser, the user needs to pass it some functions (startNode(name,attrs), endNode(name) and someText(txt), likely through an interface), and the parser calls them when needed as it passes over the file.
The parser has a while loop that alternates between reading until < and reading until >, converting what it read to the proper parameter types.
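import std.array, std.stdio, std.string;

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length != 0){
        //not using a rewritable buffer, to take advantage of slicing;
        //converting this to an implementation with a rewritable buffer is quick, though
        if(str.length > 1)p.someText(str.chomp("<"));
        str = file.readln('>');
        str = str.chomp(">");
        if(str.length == 0)continue;//nothing between < and >, or end of file
        //detect closing and self-closing tags, then drop the slash
        bool closing = str[0] == '/';
        bool selfClosing = !closing && str[$-1] == '/';
        if(closing)str = str[1..$];
        else if(selfClosing)str = str[0..$-1];
        //split str into name and attrs
        auto parts = str.split();
        if(parts.length == 0)continue;
        string name = parts[0];
        string[string] attrs;
        foreach(attribute;parts[1..$]){
            auto splitAttr = attribute.split("=");
            //the value keeps its surrounding quotes
            if(splitAttr.length == 2)attrs[splitAttr[0]] = splitAttr[1];
        }
        if(closing)p.endNode(name);
        else {
            p.startNode(name,attrs);
            if(selfClosing)p.endNode(name);//self-closing tag
        }
    }
}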
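The parse function above takes an EventParser, but that type is never spelled out. A minimal sketch of what it might look like, assuming the three callbacks named earlier: the LoggingParser class, its log and depth fields, and the hand-written calls in main are hypothetical and only illustrate the order in which a parser would fire events for <a><b>hi</b></a>.

```d
import std.array : replicate;
import std.stdio : writeln;

// Assumed shape of the callback interface that parse() expects.
interface EventParser
{
    void startNode(string name, string[string] attrs);
    void endNode(string name);
    void someText(string txt);
}

// Hypothetical handler that records every event, indented by nesting depth.
class LoggingParser : EventParser
{
    string[] log;
    size_t depth;

    override void startNode(string name, string[string] attrs)
    {
        log ~= replicate("  ", depth) ~ "start " ~ name;
        depth++;
    }

    override void endNode(string name)
    {
        if (depth > 0) depth--; // guard against an unmatched closing tag
        log ~= replicate("  ", depth) ~ "end " ~ name;
    }

    override void someText(string txt)
    {
        log ~= replicate("  ", depth) ~ "text " ~ txt;
    }
}

void main()
{
    auto p = new LoggingParser;
    // The calls a parser would make for <a><b>hi</b></a>, written out by hand.
    p.startNode("a", null);
    p.startNode("b", null);
    p.someText("hi");
    p.endNode("b");
    p.endNode("a");
    foreach (line; p.log) writeln(line);
}
```

Any class implementing the same three methods can be handed to the parser, which is what makes the DOM builder described next possible.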
You can build a DOM parser on top of an event-based parser. The basic functionality you'll need for each node is getChildren, getParent, getName and getAttributes (with setters for use while building ;) ).
The event handler object for the DOM parser, built on the methods described above:
class DOMEventParser : EventParser{
    DOMNode root;//kept so the tree can be reached after parsing
    DOMNode current;
    this(){
        root = new RootNode();
        current = root;
    }
    override void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    override void endNode(string name){
        assert(name == current.name);//closing tag must match the open element
        current = current.parent;
    }
    override void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}
When parsing ends, the root node holds the root of the DOM tree.
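The node classes used by DOMEventParser are not defined in this answer. One possible sketch, assuming plain public fields in place of the getters and setters mentioned above: the constructor signatures follow how the classes are called in DOMEventParser, while the "#root" and "#text" names and the text field are assumptions made for illustration.

```d
// Hypothetical base class: every node knows its parent, name, attributes and children.
class DOMNode
{
    DOMNode parent;
    string name;
    string[string] attrs;
    DOMNode[] children;

    this(DOMNode parent, string name, string[string] attrs)
    {
        this.parent = parent;
        this.name = name;
        this.attrs = attrs;
    }

    // Links the child to this node in both directions.
    void appendChild(DOMNode child)
    {
        child.parent = this;
        children ~= child;
    }
}

// The root has no parent; "#root" is a placeholder name.
class RootNode : DOMNode
{
    this() { super(null, "#root", null); }
}

class ElementNode : DOMNode
{
    this(DOMNode parent, string name, string[string] attrs)
    {
        super(parent, name, attrs);
    }
}

// A text node gets its parent when it is appended.
class TextNode : DOMNode
{
    string text;
    this(string text)
    {
        super(null, "#text", null);
        this.text = text;
    }
}

void main()
{
    // Builds the tree for <a id="1">hi</a> by hand.
    auto root = new RootNode;
    auto a = new ElementNode(root, "a", ["id": "1"]);
    root.appendChild(a);
    a.appendChild(new TextNode("hi"));

    assert(root.children.length == 1);
    assert(a.parent is root);
    assert((cast(TextNode) a.children[0]).text == "hi");
}
```

Keeping parent as a plain reference is what lets endNode step back up the tree with current = current.parent.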
Note: I didn't put any verification code in there to ensure the XML is correct.
Edit: the attribute parsing has a bug in it. Splitting on whitespace breaks on quoted values that contain spaces, so a regex is a better fit there.
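A rough sketch of the regex approach, assuming D's std.regex: the parseAttributes helper and its pattern are hypothetical, accept only quoted values, and do no entity decoding or well-formedness checks.

```d
import std.regex : matchAll, regex;
import std.stdio : writeln;

// Hypothetical helper: pulls name="value" and name='value' pairs out of a tag body.
string[string] parseAttributes(string tagBody)
{
    // group 1: attribute name, group 2: double-quoted value, group 3: single-quoted value
    auto attrPattern = regex(`([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; matchAll(tagBody, attrPattern))
    {
        // Only one of the two value groups takes part in a given match.
        attrs[m[1]] = m[2].length ? m[2] : m[3];
    }
    return attrs;
}

void main()
{
    auto attrs = parseAttributes(`a href="some page.html" title='hi'`);
    assert(attrs.length == 2);
    assert(attrs["href"] == "some page.html"); // the space inside the value survives
    assert(attrs["title"] == "hi");
    writeln(attrs);
}
```

Because the pattern only matches name=value pairs, the tag name at the start of the string is skipped and can still be taken from the first whitespace-separated token.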