Question
The task is to parse a simple XML document and analyze its contents by line number.
The right Python package seems to be xml.sax. But how do I use it?
After some digging in the documentation, I found:
- The xmlreader.Locator interface has the information: getLineNumber().
- The handler.ContentHandler interface has setDocumentLocator().
The first thought would be to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.
BUT, xmlreader.Locator is only a skeleton interface: its getLineNumber() and getColumnNumber() methods can only return -1.
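To see the placeholder behaviour concretely, here is a minimal sketch (Python 3 syntax; the variable names are only illustrative) that instantiates the base class directly, with no parser behind it:

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface: nothing feeds it positions,
# so its position methods return placeholder values.
base = Locator()
line = base.getLineNumber()      # placeholder, not a real line
column = base.getColumnNumber()  # placeholder, not a real column
print("line:", line, "column:", column)
```

A locator only becomes useful once a real parser stands behind it.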
So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own??
I'll answer my own question presently.
(Well I would have, except for the arbitrary, annoying rule that says I can't.)
I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).
The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py.
And... it has its own Locator, an ExpatLocator. But there is no access to this thing.
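A quick way to confirm this without reading the source is to ask the parser object for its type. This is a small sketch in Python 3 syntax, assuming a standard CPython install where the expat driver is available (the default); the variable names are illustrative:

```python
import xml.sax
import xml.sax.expatreader  # imported explicitly so the class can be referenced

parser = xml.sax.make_parser()
parser_type = type(parser)

# Show where the concrete parser class lives.
print(parser_type.__module__, parser_type.__name__)

# True when the default expat-based driver was selected.
is_expat = isinstance(parser, xml.sax.expatreader.ExpatParser)
print("expat parser:", is_expat)
```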
Much head-scratching came between this and a solution.
- write your own ContentHandler, which knows about a Locator, and uses it to determine line numbers
- create an ExpatParser with xml.sax.make_parser()
- create an ExpatLocator, passing it the ExpatParser instance
- make the ContentHandler, giving it this ExpatLocator
- pass the ContentHandler to the parser's setContentHandler()
- call parse() on the Parser
For example:
from __future__ import print_function  # lets the same file run on Python 2.6+ and 3

import sys
import xml.sax
import xml.sax.expatreader  # imported explicitly; make_parser() only loads it lazily
import xml.sax.handler

class EltHandler( xml.sax.handler.ContentHandler ):
    def __init__( self, locator ):
        xml.sax.handler.ContentHandler.__init__( self )
        # Keep our own reference to the locator we were given.
        self.loc = locator
        self.setDocumentLocator( self.loc )

    def startElement( self, name, attrs ): pass

    def endElement( self, name ): pass

    def characters( self, data ):
        # Ask the locator where the parser currently is.
        lineNo = self.loc.getLineNumber()
        print( "LINE", lineNo, data, file=sys.stdout )

def spit_lines( filepath ):
    try:
        parser = xml.sax.make_parser()
        # The ExpatLocator reads positions from the parser it is bound to.
        locator = xml.sax.expatreader.ExpatLocator( parser )
        handler = EltHandler( locator )
        parser.setContentHandler( handler )
        parser.parse( filepath )
    except IOError as e:
        print( e, file=sys.stderr )

if len( sys.argv ) > 1:
    filepath = sys.argv[1]
    spit_lines( filepath )
else:
    print( "Try providing a path to an XML file.", file=sys.stderr )
Martijn Pieters points out below another approach with some advantages.
If the superclass initializer of the ContentHandler is properly called, then it turns out a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator.
Advantage: you don't have to create your own Locator (or find out how to create it).
Disadvantage: it isn't documented anywhere, and using an undocumented private variable is sloppy.
Thanks Martijn!
Answer 1:
The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler, then you can count on those four methods being present on the locator.
The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide one.
If you subclass xml.sax.handler.ContentHandler then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called your content handler instance will have self._locator set:
from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize your handler

    def startElement(self, name, attrs):
        # _locator is filled in by the inherited setDocumentLocator()
        # when the parser calls it.
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))
Answer 2:
This is an old question, but I think that there is a better answer to it than the one given, so I'm going to add another answer anyway.
While there may indeed be an undocumented private data member named _locator in the ContentHandler superclass, as described in the above answer by Martijn, accessing location information using this data member does not appear to me to be the intended use of the location facilities.
In my opinion, Steve White raises good questions about why this member is not documented. I think the answer to those questions is that it was probably not intended to be for public use. It appears to be a private implementation detail of the ContentHandler superclass. Since it is an undocumented private implementation detail, it could disappear without warning with any future release of the SAX library, so relying on it could be dangerous.
It appears to me, from reading the documentation for the ContentHandler class, and specifically the documentation for ContentHandler.setDocumentLocator, that the designers intended for users to instead override the ContentHandler.setDocumentLocator function so that when the parser calls it, the user's content handler subclass can save a reference to the passed-in locator object (which was created by the SAX parser), and can later use that saved object to get location information. For example:
from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self._mylocator = None
        # initialize your handler

    def setDocumentLocator(self, locator):
        # Called by the parser; save the locator it hands us.
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))
With this approach, there is no need to rely on undocumented fields.
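To try this end to end, a minimal self-contained sketch in Python 3 might look like the following. The class name LineRecordingHandler, the helper element_lines, and the SAMPLE document are made up for illustration; only ContentHandler, setDocumentLocator() and xml.sax.parseString() come from the standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineRecordingHandler(ContentHandler):
    """Hypothetical handler that records the line of each start tag."""

    def __init__(self):
        super().__init__()
        self._mylocator = None
        self.seen = []  # (element name, line number) pairs

    def setDocumentLocator(self, locator):
        # The parser calls this, if it supports locators at all.
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        # Fall back to None when the parser supplied no locator.
        line = loc.getLineNumber() if loc is not None else None
        self.seen.append((name, line))


def element_lines(xml_bytes):
    """Parse xml_bytes and return (name, line) pairs in document order."""
    handler = LineRecordingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.seen


# Illustrative three-line document.
SAMPLE = b"<root>\n  <child>text</child>\n</root>\n"

for name, line in element_lines(SAMPLE):
    print("start of {} element at line {}".format(name, line))
```

Collecting the positions in a list rather than printing them inside the handler is only a convenience here: it keeps the handler easy to check and reuse.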
Source: https://stackoverflow.com/questions/15477363/xml-sax-parser-and-line-numbers-etc