问题
I read some articles about the XML parsers and came across SAX and DOM.
SAX is event-based and DOM is tree model -- I don't understand the differences between these concepts.
From what I have understood, event-based means some kind of event happens to the node. Like when one clicks a particular node it will give all the sub nodes rather than loading all the nodes at the same time. But in the case of DOM parsing it will load all the nodes and make the tree model.
Is my understanding correct?
Please correct me If I am wrong or explain to me event-based and tree model in a simpler manner.
回答1:
Well, you are close.
In SAX, events are triggered when the XML is being parsed. When the parser is parsing the XML, and encounters a tag starting (e.g. <something>
), then it triggers the tagStarted
event (actual name of event might differ). Similarly when the end of the tag is met while parsing (</something>
), it triggers tagEnded
. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event.
In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.
In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.
回答2:
In just a few words...
SAX (Simple API for XML): Is a stream-based processor. You only have a tiny part in memory at any time and you "sniff" the XML stream by implementing callback code for events like tagStarted()
etc. It uses almost no memory, but you can't do "DOM" stuff, like use xpath or traverse trees.
DOM (Document Object Model): You load the whole thing into memory - it's a massive memory hog. You can blow memory with even medium sized documents. But you can use xpath and traverse the tree etc.
回答3:
Here in simpler words:
DOM
Tree model parser (Object based) (Tree of nodes).
DOM loads the file into the memory and then parse- the file.
Has memory constraints since it loads the whole XML file before parsing.
DOM is read and write (can insert or delete nodes).
If the XML content is small, then prefer DOM parser.
Backward and forward search is possible for searching the tags and evaluation of the information inside the tags. So this gives the ease of navigation.
Slower at run time.
SAX
Event based parser (Sequence of events).
SAX parses the file as it reads it, i.e. parses node by node.
No memory constraints as it does not store the XML content in the memory.
SAX is read only i.e. can’t insert or delete the node.
Use SAX parser when memory content is large.
SAX reads the XML file from top to bottom and backward navigation is not possible.
Faster at run time.
回答4:
You are correct in your understanding of the DOM based model. The XML file will be loaded as a whole and all its contents will be built as an in-memory representation of the tree the document represents. This can be time- and memory-consuming, depending on how large the input file is. The benefit of this approach is that you can easily query any part of the document, and freely manipulate all the nodes in the tree.
The DOM approach is typically used for small XML structures (where small depends on how much horsepower and memory your platform has) that may need to be modified and queried in different ways once they have been loaded.
SAX on the other hand is designed to handle XML input of virtually any size. Instead of the XML framework doing the hard work for you in figuring out the structure of the document and preparing potentially lots of objects for all the nodes, attributes etc., SAX completely leaves that to you.
What it basically does is read the input from the top and invoke callback methods you provide when certain "events" occur. An event might be hitting an opening tag, an attribute in the tag, finding text inside an element or coming across an end-tag.
SAX stubbornly reads the input and tells you what it sees in this fashion. It is up to you to maintain all state-information you require. Usually this means you will build up some sort of state-machine.
While this approach to XML processing is a lot more tedious, it can be very powerful, too. Imagine you want to just extract the titles of news articles from a blog feed. If you read this XML using DOM it would load all the article contents, all the images etc. that are contained in the XML into memory, even though you are not even interested in it.
With SAX you can just check if the element name is (e. g.) "title" whenever your "startTag" event method is called. If so, you know that you needs to add whatever the next "elementText" event offers you. When you receive the "endTag" event call, you check again if this is the closing element of the "title". After that, you just ignore all further elements, until either the input ends, or another "startTag" with a name of "title" comes along. And so on...
You could read through megabytes and megabytes of XML this way, just extracting the tiny amount of data you need.
The negative side of this approach is of course, that you need to do a lot more book-keeping yourself, depending on what data you need to extract and how complicated the XML structure is. Furthermore, you naturally cannot modify the structure of the XML tree, because you never have it in hand as a whole.
So in general, SAX is suitable for combing through potentially large amounts of data you receive with a specific "query" in mind, but need not modify, while DOM is more aimed at giving you full flexibility in changing structure and contents, at the expense of higher resource demand.
回答5:
You're comparing apples and pears. SAX is a parser that parses serialized DOM structures. There are many different parsers, and "event-based" refers to the parsing method.
Maybe a small recap is in order:
The document object model (DOM) is an abstract data model that describes a hierarchical, tree-based document structure; a document tree consists of nodes, namely element, attribute and text nodes (and some others). Nodes have parents, siblings and children and can be traversed, etc., all the stuff you're used to from doing JavaScript (which incidentally has nothing to do with the DOM).
A DOM structure may be serialized, i.e. written to a file, using a markup language like HTML or XML. An HTML or XML file thus contains a "written out" or "flattened out" version of an abstract document tree.
For a computer to manipulate, or even display, a DOM tree from a file, it has to deserialize, or parse, the file and reconstruct the abstract tree in memory. This is where parsing comes in.
Now we come to the nature of parsers. One way to parse would be to read in the entire document and recursively build up a tree structure in memory, and finally expose the entire result to the user. (I suppose you could call these parsers "DOM parsers".) That would be very handy for the user (I think that's what PHP's XML parser does), but it suffers from scalability problems and becomes very expensive for large documents.
On the other hand, event-based parsing, as done by SAX, looks at the file linearly and simply makes call-backs to the user whenever it encounters a structural piece of data, like "this element started", "that element ended", "some text here", etc. This has the benefit that it can go on forever without concern for the input file size, but it's a lot more low-level because it requires the user to do all the actual processing work (by providing call-backs). To return to your original question, the term "event-based" refers to those parsing events that the parser raises as it traverses the XML file.
The Wikipedia article has many details on the stages of SAX parsing.
回答6:
I will provide general Q&A-oriented answer for this question:
Answer to Questions
Why do we need XML parser?
We need XML parser because we do not want to do everything in our application from scratch, and we need some "helper" programs or libraries to do something very low-level but very necessary to us. These low-level but necessary things include checking the well-formedness, validating the document against its DTD or schema (just for validating parsers), resolving character reference, understanding CDATA sections, and so on. XML parsers are just such "helper" programs and they will do all these jobs. With XML parser, we are shielded from a lot of these complexities and we could concentrate ourselves on just programming at high-level through the API's implemented by the parsers, and thus gain programming efficiency.
Which one is better, SAX or DOM ?
Both SAX and DOM parser have their advantages and disadvantages. Which one is better should depend on the characteristics of your application (please refer to some questions below).
Which parser can get better speed, DOM or SAX parsers?
SAX parser can get better speed.
What's the difference between tree-based API and event-based API?
A tree-based API is centered around a tree structure and therefore provides interfaces on components of a tree (which is a DOM document) such as Document interface,Node interface, NodeList interface, Element interface, Attr interface and so on. By contrast, however, an event-based API provides interfaces on handlers. There are four handler interfaces, ContentHandler interface, DTDHandler interface, EntityResolver interface and ErrorHandler interface.
What is the difference between a DOM Parser and a SAX Parser?
DOM parsers and SAX parsers work in different ways:
A DOM parser creates a tree structure in memory from the input document and then waits for requests from client. But a SAX parser does not create any internal structure. Instead, it takes the occurrences of components of a input document as events, and tells the client what it reads as it reads through the input document. A
DOM parser always serves the client application with the entire document no matter how much is actually needed by the client. But a SAX parser serves the client application always only with pieces of the document at any given time.
- With DOM parser, method calls in client application have to be explicit and forms a kind of chain. But with SAX, some certain methods (usually overriden by the cient) will be invoked automatically (implicitly) in a way which is called "callback" when some certain events occur. These methods do not have to be called explicitly by the client, though we could call them explicitly.
How do we decide on which parser is good?
Ideally a good parser should be fast (time efficient),space efficient, rich in functionality and easy to use. But in reality, none of the main parsers have all these features at the same time. For example, a DOM Parser is rich in functionality (because it creates a DOM tree in memory and allows you to access any part of the document repeatedly and allows you to modify the DOM tree), but it is space inefficient when the document is huge, and it takes a little bit long to learn how to work with it. A SAX Parser, however, is much more space efficient in case of big input document (because it creates no internal structure). What's more, it runs faster and is easier to learn than DOM Parser because its API is really simple. But from the functionality point of view, it provides less functions which mean that the users themselves have to take care of more, such as creating their own data structures. By the way, what is a good parser? I think the answer really depends on the characteristics of your application.
What are some real world applications where using SAX parser is advantageous than using DOM parser and vice versa? What are the usual application for a DOM parser and for a SAX parser?
In the following cases, using SAX parser is advantageous than using DOM parser.
- The input document is too big for available memory (actually in this case SAX is your only choice)
- You can process the document in small contiguous chunks of input. You do not need the entire document before you can do useful work
- You just want to use the parser to extract the information of interest, and all your computation will be completely based on the data structures created by yourself. Actually in most of our applications, we create data structures of our own which are usually not as complicated as the DOM tree. From this sense, I think, the chance of using a DOM parser is less than that of using a SAX parser.
In the following cases, using DOM parser is advantageous than using SAX parser.
- Your application needs to access widely separately parts of the document at the same time.
- Your application may probably use a internal data structure which is almost as complicated as the document itself.
- Your application has to modify the document repeatedly.
- Your application has to store the document for a significant amount of time through many method calls.
Example (Use a DOM parser or a SAX parser?):
Assume that an instructor has an XML document containing all the personal information of the students as well as the points his students made in his class, and he is now assigning final grades for the students using an application. What he wants to produce, is a list with the SSN and the grades. Also we assume that in his application, the instructor use no data structure such as arrays to store the student personal information and the points. If the instructor decides to give A's to those who earned the class average or above, and give B's to the others, then he'd better to use a DOM parser in his application. The reason is that he has no way to know how much is the class average before the entire document gets processed. What he probably need to do in his application, is first to look through all the students' points and compute the average, and then look through the document again and assign the final grade to each student by comparing the points he earned to the class average. If, however, the instructor adopts such a grading policy that the students who got 90 points or more, are assigned A's and the others are assigned B's, then probably he'd better use a SAX parser. The reason is, to assign each student a final grade, he do not need to wait for the entire document to be processed. He could immediately assign a grade to a student once the SAX parser reads the grade of this student. In the above analysis, we assumed that the instructor created no data structure of his own. What if he creates his own data structure, such as an array of strings to store the SSN and an array of integers to sto re the points ? In this case, I think SAX is a better choice, before this could save both memory and time as well, yet get the job done. Well, one more consideration on this example. What if what the instructor wants to do is not to print a list, but to save the original document back with the grade of each student updated ? In this case, a DOM parser should be a better choice no matter what grading policy he is adopting. He does not need to create any data structure of his own. What he needs to do is to first modify the DOM tree (i.e., set value to the 'grade' node) and then save the whole modified tree. If he choose to use a SAX parser instead of a DOM parser, then in this case he has to create a data structure which is almost as complicated as a DOM tree before he could get the job done.
An Example
Problem statement: Write a Java program to extract all the information about circles which are elements in a given XML document. We assume that each circle element has three child elements(i.e., x, y and radius) as well as a color attribute. A sample document is given below:
<?xml version="1.0"?>
<!DOCTYPE shapes [
<!ELEMENT shapes (circle)*>
<!ELEMENT circle (x,y,radius)>
<!ELEMENT x (#PCDATA)>
<!ELEMENT y (#PCDATA)>
<!ELEMENT radius (#PCDATA)>
<!ATTLIST circle color CDATA #IMPLIED>
]>
<shapes>
<circle color="BLUE">
<x>20</x>
<y>20</y>
<radius>20</radius>
</circle>
<circle color="RED" >
<x>40</x>
<y>40</y>
<radius>20</radius>
</circle>
</shapes>
Program with DOMparser
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;
public class shapes_DOM {
static int numberOfCircles = 0; // total number of circles seen
static int x[] = new int[1000]; // X-coordinates of the centers
static int y[] = new int[1000]; // Y-coordinates of the centers
static int r[] = new int[1000]; // radius of the circle
static String color[] = new String[1000]; // colors of the circles
public static void main(String[] args) {
try{
// create a DOMParser
DOMParser parser=new DOMParser();
parser.parse(args[0]);
// get the DOM Document object
Document doc=parser.getDocument();
// get all the circle nodes
NodeList nodelist = doc.getElementsByTagName("circle");
numberOfCircles = nodelist.getLength();
// retrieve all info about the circles
for(int i=0; i<nodelist.getLength(); i++) {
// get one circle node
Node node = nodelist.item(i);
// get the color attribute
NamedNodeMap attrs = node.getAttributes();
if(attrs.getLength() > 0)
color[i]=(String)attrs.getNamedItem("color").getNodeValue();
// get the child nodes of a circle node
NodeList childnodelist = node.getChildNodes();
// get the x and y value
for(int j=0; j<childnodelist.getLength(); j++) {
Node childnode = childnodelist.item(j);
Node textnode = childnode.getFirstChild();//the only text node
String childnodename=childnode.getNodeName();
if(childnodename.equals("x"))
x[i]= Integer.parseInt(textnode.getNodeValue().trim());
else if(childnodename.equals("y"))
y[i]= Integer.parseInt(textnode.getNodeValue().trim());
else if(childnodename.equals("radius"))
r[i]= Integer.parseInt(textnode.getNodeValue().trim());
}
}
// print the result
System.out.println("circles="+numberOfCircles);
for(int i=0;i<numberOfCircles;i++) {
String line="";
line=line+"(x="+x[i]+",y="+y[i]+",r="+r[i]+",color="+color[i]+")";
System.out.println(line);
}
} catch (Exception e) {e.printStackTrace(System.err);}
}
}
Program with SAXparser
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.apache.xerces.parsers.SAXParser;
public class shapes_SAX extends DefaultHandler {
static int numberOfCircles = 0; // total number of circles seen
static int x[] = new int[1000]; // X-coordinates of the centers
static int y[] = new int[1000]; // Y-coordinates of the centers
static int r[] = new int[1000]; // radius of the circle
static String color[] = new String[1000]; // colors of the circles
static int flagX=0; //to remember what element has occurred
static int flagY=0; //to remember what element has occurred
static int flagR=0; //to remember what element has occurred
// main method
public static void main(String[] args) {
try{
shapes_SAX SAXHandler = new shapes_SAX (); // an instance of this class
SAXParser parser=new SAXParser(); // create a SAXParser object
parser.setContentHandler(SAXHandler); // register with the ContentHandler
parser.parse(args[0]);
} catch (Exception e) {e.printStackTrace(System.err);} // catch exeptions
}
// override the startElement() method
public void startElement(String uri, String localName,
String rawName, Attributes attributes) {
if(rawName.equals("circle")) // if a circle element is seen
color[numberOfCircles]=attributes.getValue("color"); // get the color attribute
else if(rawName.equals("x")) // if a x element is seen set the flag as 1
flagX=1;
else if(rawName.equals("y")) // if a y element is seen set the flag as 2
flagY=1;
else if(rawName.equals("radius")) // if a radius element is seen set the flag as 3
flagR=1;
}
// override the endElement() method
public void endElement(String uri, String localName, String rawName) {
// in this example we do not need to do anything else here
if(rawName.equals("circle")) // if a circle element is ended
numberOfCircles += 1; // increment the counter
}
// override the characters() method
public void characters(char characters[], int start, int length) {
String characterData =
(new String(characters,start,length)).trim(); // get the text
if(flagX==1) { // indicate this text is for <x> element
x[numberOfCircles] = Integer.parseInt(characterData);
flagX=0;
}
else if(flagY==1) { // indicate this text is for <y> element
y[numberOfCircles] = Integer.parseInt(characterData);
flagY=0;
}
else if(flagR==1) { // indicate this text is for <radius> element
r[numberOfCircles] = Integer.parseInt(characterData);
flagR=0;
}
}
// override the endDocument() method
public void endDocument() {
// when the end of document is seen, just print the circle info
System.out.println("circles="+numberOfCircles);
for(int i=0;i<numberOfCircles;i++) {
String line="";
line=line+"(x="+x[i]+",y="+y[i]+",r="+r[i]+",color="+color[i]+")";
System.out.println(line);
}
}
}
回答7:
In practical: book.xml
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</bookstore>
- DOM presents the xml document as a the following tree-structure in memory.
- DOM is W3C standard.
- DOM parser works on Document Object Model.
- DOM occupies more memory, preferred for small XML documents
- DOM is Easy to navigate either forward or backward.
- SAX presents the xml document as event based like
start element:abc
,end element:abc
. - SAX is not W3C standard, it was developed by group of developers.
- SAX does not use memory, preferred for large XML documents.
- Backward navigation is not possible as it sequentially process the documents.
- Event happens to a node/element and it gives all sub nodes(Latin nodus, ‘knot’).
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
start element: bookstore
start element: book with an attribute category equal to cooking
start element: title with an attribute lang equal to en
Text node, with data equal to Everyday Italian
....
end element: title
.....
end element: book
end element: bookstore
回答8:
DOM Stands for Document Object Model and it represent an XML Document into tree format which each element representing tree branches. DOM Parser creates an In Memory tree representation of XML file and then parses it, so it requires more memory and its advisable to have increased heap size for DOM parser in order to avoid Java.lang.OutOfMemoryError:java heap space . Parsing XML file using DOM parser is quite fast if XML file is small but if you try to read a large XML file using DOM parser there is more chances that it will take a long time or even may not be able to load it completely simply because it requires lot of memory to create XML Dom Tree. Java provides support DOM Parsing and you can parse XML files in Java using DOM parser. DOM classes are in w3c.dom package while DOM Parser for Java is in JAXP (Java API for XML Parsing) package.
SAX XML Parser in Java
SAX Stands for Simple API for XML Parsing. This is an event based XML Parsing and it parse XML file step by step so much suitable for large XML Files. SAX XML Parser fires event when it encountered opening tag, element or attribute and the parsing works accordingly. It’s recommended to use SAX XML parser for parsing large xml files in Java because it doesn't require to load whole XML file in Java and it can read a big XML file in small parts. Java provides support for SAX parser and you can parse any xml file in Java using SAX Parser, I have covered example of reading xml file using SAX Parser here. One disadvantage of using SAX Parser in java is that reading XML file in Java using SAX Parser requires more code in comparison of DOM Parser.
Difference between DOM and SAX XML Parser
Here are few high level differences between DOM parser and SAX Parser in Java:
1) DOM parser loads whole xml document in memory while SAX only loads small part of XML file in memory.
2) DOM parser is faster than SAX because it access whole XML document in memory.
3) SAX parser in Java is better suitable for large XML file than DOM Parser because it doesn't require much memory.
4) DOM parser works on Document Object Model while SAX is an event based xml parser.
Read more: http://javarevisited.blogspot.com/2011/12/difference-between-dom-and-sax-parsers.html#ixzz2uz1bJQqZ
回答9:
Both SAX and DOM are used to parse the XML document. Both has advantages and disadvantages and can be used in our programming depending on the situation
SAX:
Parses node by node
Does not store the XML in memory
We cant insert or delete a node
Top to bottom traversing
DOM
Stores the entire XML document into memory before processing
Occupies more memory
We can insert or delete nodes
Traverse in any direction.
If we need to find a node and does not need to insert or delete we can go with SAX itself otherwise DOM provided we have more memory.
回答10:
1) DOM parser loads whole XML document in memory while SAX only loads a small part of the XML file in memory.
2) DOM parser is faster than SAX because it access whole XML document in memory.
3) SAX parser in Java is better suitable for large XML file than DOM Parser because it doesn't require much memory.
4) DOM parser works on Document Object Model while SAX is an event based XML parser.
Read more: http://javarevisited.blogspot.com/2011/12/difference-between-dom-and-sax-parsers.html#ixzz498y3vPFR
来源:https://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom