How to create/write a simple XML parser from scratch?
Rather than code samples, I want to know what the simplified, basic steps are, in English.
How is a good par
If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler-writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullman) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.
One word of warning: if you want to write a fully conformant XML parser, then 90% of your effort will be spent getting edge cases right in obscure corners of the spec, dealing with things such as parameter entities that most XML users aren't even aware of.
The first thing in the document should be the prolog. This states the XML version, the encoding, whether the file is standalone, and maybe some other stuff. The prolog opens with `<?xml`.
After the prolog, there are tags with metadata. The special tags, such as comments, doctypes, and element definitions, start with `<!`. Processing instructions start with `<?`. Tags can be nested here: in a DTD-style XML document, the `<!DOCTYPE` tag can contain `<!ELEMENT` and `<!ATTLIST` tags. See Wikipedia for a thorough example.
There should be exactly one top-level element. It's the only tag at the top level that doesn't start with `<!` or `<?`. There may be more metadata tags after the top-level element; process those first.
For the explicit parsing: first identify tags (they all start with `<`), then determine what kind of tag it is and what its closure looks like. `<!--` is a comment tag, and cannot have `--` anywhere except at its end. `<?` ends with `?>`. `<!` ends with `>`. To repeat: `<!DOCTYPE` can have tags nested before its closure, and there may be other nested tags I don't know of.
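As a rough illustration of the "identify the tag, then find its closure" step, here is a minimal sketch in D. The names `MarkupKind`, `classify`, `terminator` and `markupEnd` are made up for this example. It assumes the input slice starts at a `<`, and it deliberately ignores the nested case (`<!DOCTYPE` with an internal subset) and CDATA sections, so treat it as a starting point only.

```d
import std.algorithm.searching : startsWith;
import std.string : indexOf;

enum MarkupKind { comment, processingInstruction, declaration, closingTag, element }

/// Classify the markup that begins at s[0] (s must start with '<').
MarkupKind classify(string s)
{
    assert(s.length > 0 && s[0] == '<');
    // Order matters: "<!--" must be tested before the more general "<!".
    if (s.startsWith("<!--")) return MarkupKind.comment;
    if (s.startsWith("<?"))   return MarkupKind.processingInstruction;
    if (s.startsWith("<!"))   return MarkupKind.declaration;
    if (s.startsWith("</"))   return MarkupKind.closingTag;
    return MarkupKind.element;
}

/// The text that closes each kind of markup.
string terminator(MarkupKind kind)
{
    final switch (kind)
    {
        case MarkupKind.comment:               return "-->";
        case MarkupKind.processingInstruction: return "?>";
        case MarkupKind.declaration:
        case MarkupKind.closingTag:
        case MarkupKind.element:               return ">";
    }
}

/// Index just past the end of the markup starting at s[0], or -1 if it is unterminated.
ptrdiff_t markupEnd(string s)
{
    auto kind = classify(s);
    // Skip the opener so that it cannot overlap with the terminator (e.g. "<!-->").
    size_t opener = 1;
    if (kind == MarkupKind.comment) opener = 4;
    else if (kind == MarkupKind.processingInstruction) opener = 2;

    string term = terminator(kind);
    auto pos = s[opener .. $].indexOf(term);
    return pos < 0 ? -1 : cast(ptrdiff_t)(opener + pos + term.length);
}
```

A caller would slice off `s[0 .. markupEnd(s)]`, handle it according to its kind, and continue scanning from there.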
Once you find a tag, you'll want to find its closing tag. Check if the tag is self closing first; otherwise, find its closure.
For data structures: I would recommend a tree structure, where each element is a node, and each node has an indexed/mapped list of subelements.
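One way such a tree might look in D is sketched below. The class name `Node` and its members are assumptions made for this example; the point is only to show children kept both in document order (indexed) and grouped by tag name (mapped).

```d
/// One element of the parsed document (hypothetical layout).
class Node
{
    string name;
    string[string] attributes;
    string text;                   // character data directly inside this element
    Node parent;
    Node[] children;               // children in document order (indexed access)
    Node[][string] childrenByName; // the same children, grouped by tag name (mapped access)

    this(string name, Node parent = null)
    {
        this.name = name;
        this.parent = parent;
    }

    /// Create a child element, register it in both containers, and return it.
    Node addChild(string childName)
    {
        auto child = new Node(childName, this);
        children ~= child;
        childrenByName[childName] ~= child;
        return child;
    }
}
```

For example, after `auto root = new Node("scene"); root.addChild("pointlight");`, the new child is reachable both as `root.children[0]` and as `root.childrenByName["pointlight"][0]`.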
Obviously, a full parser will require a lot more research; I hope this is enough to get you started.
There is a difference between a parser and a nodelist. The parser is the piece that takes a bunch of plain text XML and tries to determine what nodes are in there. Then there is an internal structure you save the nodes in. In a layer over that structure you find the DOM, the Document Object Model. This is a structure of nested nodes that make up your XML document. The parser only needs to know the generic DOM interface to create nodes.
I wouldn't use regex as a parser for this. I think the best approach is to traverse the string character by character and check that what you get matches what you should get.
But why not use one of the existing XML parsers? There are many ways to encode data, and many exceptions. If your parser doesn't handle them all, it hardly deserves the title of XML parser.
Since D is rather closely related to Java, one option might be to generate an XML parser with ANTLR (there are most probably XML EBNF grammars for ANTLR already that you could use) and then convert the generated Java parser code to D. At least that would give you a starting point, and you could then put some effort into optimizing the code specifically for D.
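To make the character-by-character idea concrete, here is a small sketch in D. The `Scanner` struct and its methods are hypothetical names for this example, and `readName` only accepts an ASCII subset of what the XML spec allows in names.

```d
import std.ascii : isAlpha, isAlphaNum;
import std.exception : enforce;

/// A cursor over the input that is advanced one character at a time.
struct Scanner
{
    string input;
    size_t pos;

    bool atEnd() const { return pos >= input.length; }
    char peek() const { return atEnd() ? '\0' : input[pos]; }

    /// Consume `c`, or throw if the next character is something else.
    void expect(char c)
    {
        enforce(!atEnd() && input[pos] == c, "expected '" ~ c ~ "'");
        ++pos;
    }

    /// Read a tag or attribute name (ASCII-only simplification of the XML Name rule).
    string readName()
    {
        immutable start = pos;
        enforce(isAlpha(peek()) || peek() == '_' || peek() == ':', "expected a name");
        while (!atEnd() && (isAlphaNum(peek()) || peek() == '_' || peek() == ':'
                            || peek() == '-' || peek() == '.'))
            ++pos;
        return input[start .. pos];
    }
}
```

Scanning `<light/>` would then be `expect('<')`, `readName()` (which yields `light`), `expect('/')`, `expect('>')`; any mismatch throws instead of silently producing a wrong tree.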
At least ANTLR is not at all as hard as many seem to think. I got started after knowing nothing about it, by watching 3-4 of this great set of screencasts on ANTLR.
Btw, I found ANTLRWorks a breeze to work with (as opposed to the Eclipse plugin used in the screencast ... but the screencast content applies anyway).
Just my 0.02c.
A parser must fit the needs of your input language; in your case, simple XML. The first thing to know about XML is that it is context-free and unambiguous: everything is wrapped between two tokens, and this is what makes XML famous for being easy to parse. Finally, XML is always naturally represented by a tree structure. As stated, you can either parse your XML and execute code as you go, or parse the XML into a tree and then execute code according to that tree.
D provides a very interesting way to write an XML parser very easily, for example:
doc.onStartTag["pointlight"] = (ElementParser xml)
{
    debug writefln("Parsing pointlight element");
    auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                              to!int(xml.tag.attr["y"]),
                              to!int(xml.tag.attr["z"]),
                              to!ubyte(xml.tag.attr["red"]),
                              to!ubyte(xml.tag.attr["green"]),
                              to!ubyte(xml.tag.attr["blue"]));
    lights ~= l;
    xml.parse();
};
For an event-based parser, the user needs to pass it some functions (startNode(name, attrs), endNode(name) and someText(txt), likely through an interface), and the parser calls them when needed as it passes over the file.
The parser will have a while loop that alternates between reading until < and reading until >, and does the proper conversions to the parameter types:
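import std.array : split;
import std.stdio : File;
import std.string : chomp;

void parse(EventParser p, File file)
{
    string str;
    while ((str = file.readln('<')).length != 0)
    {
        // Not using a rewritable buffer, to take advantage of slicing;
        // converting to an implementation with a rewritable buffer is quick, though.
        auto text = str.chomp("<");
        if (text.length > 0) p.someText(text);

        str = file.readln('>');
        if (str.length == 0) break;    // end of input: no tag follows the text
        str = str.chomp(">");
        if (str.length == 0) continue; // empty "<>": nothing to report

        // Remember the tag kind, then strip the slash so it does not end up in the name.
        immutable closing = str[0] == '/';
        immutable selfClosing = !closing && str[$ - 1] == '/';
        if (closing) str = str[1 .. $];
        if (selfClosing) str = str[0 .. $ - 1];

        // Split str into name and attrs (naive: values keep their quotes).
        auto parts = str.split();
        if (parts.length == 0) continue; // e.g. "</>": no name to report
        string name = parts[0];
        string[string] attrs;
        foreach (attribute; parts[1 .. $])
        {
            auto splitAttr = attribute.split("=");
            if (splitAttr.length == 2) attrs[splitAttr[0]] = splitAttr[1];
        }

        if (closing) p.endNode(name);
        else
        {
            p.startNode(name, attrs);
            if (selfClosing) p.endNode(name); // self-closing tag
        }
    }
}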
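The parse function above takes an EventParser without defining it. A minimal sketch of what that interface might look like, inferred from the three callbacks named in the text, is shown below together with a small handler that records events. `EventLogger` is a made-up name for this example.

```d
/// Callbacks the parser invokes as it walks the input (assumed shape).
interface EventParser
{
    void startNode(string name, string[string] attrs);
    void endNode(string name);
    void someText(string txt);
}

/// Example handler that records each event as one line of text.
class EventLogger : EventParser
{
    string[] events;

    void startNode(string name, string[string] attrs) { events ~= "start " ~ name; }
    void endNode(string name) { events ~= "end " ~ name; }
    void someText(string txt) { events ~= "text " ~ txt; }
}
```

Passing an `EventLogger` to the parser and inspecting `events` afterwards is a convenient way to check the order in which callbacks fire before building anything more elaborate on top.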
You can build a DOM parser on top of an event-based parser. The basic functionality you'll need for each node is getChildren, getParent, getName and getAttributes (with setters when building ;) ).
The object for the DOM parser, using the methods described above:
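class DOMEventParser : EventParser
{
    DOMNode root;    // stays at the top of the tree
    DOMNode current; // the node currently being filled

    this()
    {
        // Created here rather than in a field initializer, so that
        // every parser instance gets its own root node.
        root = new RootNode();
        current = root;
    }

    override void startNode(string name, string[string] attrs)
    {
        DOMNode tmp = new ElementNode(current, name, attrs);
        current.appendChild(tmp);
        current = tmp;
    }

    override void endNode(string name)
    {
        assert(name == current.name);
        current = current.parent;
    }

    override void someText(string txt)
    {
        current.appendChild(new TextNode(txt));
    }
}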
When the parsing ends, the root node will hold the root of the DOM tree.
Note: I didn't put any verification code in there to ensure the correctness of the XML.
Edit: the parsing of the attributes has a bug in it; instead of splitting on whitespace, a regex is better for that.
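A possible regex-based replacement for the whitespace split, sketched with Phobos' std.regex, is shown below. The function name `parseAttributes` and the exact pattern are assumptions for this example; the pattern accepts an ASCII subset of XML names and does not decode entity references such as `&amp;` inside values.

```d
import std.regex : matchAll, regex;

/// Extract name="value" and name='value' pairs from the inside of a start tag.
string[string] parseAttributes(string tagBody)
{
    // Group 1: attribute name; group 2: double-quoted value; group 3: single-quoted value.
    auto attrPattern = regex(`([A-Za-z_:][\w:.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; matchAll(tagBody, attrPattern))
        attrs[m[1]] = m[2].length ? m[2] : m[3]; // the group that did not match is empty
    return attrs;
}
```

Unlike the split on whitespace, this keeps values that contain spaces intact and strips the surrounding quotes, so `name='key light'` yields the value `key light`.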