I am working on an XML SAX parser to parse XML files; below is my code.
xml file code:
Registered Nurse-Epilepsy&
You need to implement a characters handler too:
    def characters(self, content):
        print(content)
but this may give you the text in several chunks instead of as one block per tag.
Do yourself a big favour, though, and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.
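    from xml.etree import ElementTree as ET
    etree = ET.parse('/path/to/xml_file.xml')
    # find() searches relative to the root element, so this path assumes <job>
    # is a child of the root; if <job> is the root itself, use find('title')
    jobtitle = etree.find('job/title').text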
If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.
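For a flat record like a single job, a straight conversion can also be sketched in a few lines with ElementTree alone. This is a minimal sketch and not the recipe itself: the sample XML and the name `element_to_dict` are made up for illustration, and it assumes every child element holds only text (no nesting, no repeated tags).

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the layout of the real file may differ.
SAMPLE = """<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>881723</job-code>
  <city/>
</job>"""


def element_to_dict(elem):
    """Map each direct child's tag to its stripped text ('' if empty)."""
    return {child.tag: (child.text or '').strip() for child in elem}


job = element_to_dict(ET.fromstring(SAMPLE))
print(job)
```

Nested elements or repeated tags would need a recursive version, which is what a fuller recipe has to handle.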
If you have a set of existing elements you want to look for, just use these in the find() method:
    fieldnames = [
        'title', 'job-code', 'detail-url', 'job-category', 'description',
        'summary', 'posted-date', 'location', 'address', 'city', 'state',
        'zip', 'country', 'company', 'name', 'url']
    fields = {}
    etree = ET.parse('/path/to/xml_file.xml')
    for field in fieldnames:
        elem = etree.find(field)
        # Skip fields that are missing from the file or have no text
        if elem is not None and elem.text is not None:
            fields[field] = elem.text
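If the file holds several jobs, the same lookup can be repeated per job element. The sketch below assumes a `<jobs>` wrapper element around the `<job>` entries, which is a guess about the file layout; the second job and the shortened `FIELDNAMES` list are made up for illustration. It uses `findtext()`, which returns a default when the child is missing, instead of checking `find()` for `None`.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the <jobs> wrapper and the second job are assumptions.
SAMPLE = """<jobs>
  <job><title>Registered Nurse-Epilepsy</title><job-code>881723</job-code></job>
  <job><title>Example Title</title></job>
</jobs>"""

FIELDNAMES = ['title', 'job-code']  # shortened list, for illustration only

root = ET.fromstring(SAMPLE)
jobs = []
for job_elem in root.iter('job'):
    # findtext() returns the default ('') when the child element is missing
    jobs.append({name: job_elem.findtext(name, default='') for name in FIELDNAMES})
print(jobs)
```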
To get the text content of a node, you need to implement a characters method. E.g.
    import xml.sax

    class Exact(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self.curpath = []
        def startElement(self, name, attrs):
            print(name, attrs)
        def endElement(self, name):
            print('end ' + name)
        def characters(self, content):
            print(content)
This would output something like:
job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>
title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title
job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
end detail-url
(snipped)
I would recommend using pulldom (xml.dom.pulldom). It lets you load a document with a SAX parser and, when you find a node you are interested in, load just that node into a DOM fragment.
Here is an article on using it with some examples: https://www.ibm.com/developerworks/xml/library/x-tipulldom/index.html
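As a rough illustration of that approach, here is a minimal sketch using the standard library's `xml.dom.pulldom`. The sample XML (including the `<jobs>` wrapper and the second job) and the helper names `iter_jobs` and `text_of` are made up for this example; only the `pulldom` calls themselves come from the standard library.

```python
from xml.dom import pulldom

# Hypothetical sample: the <jobs> wrapper and the second job are assumptions.
SAMPLE = """<jobs>
  <job><title>Registered Nurse-Epilepsy</title><job-code>881723</job-code></job>
  <job><title>Example Title</title><job-code>0</job-code></job>
</jobs>"""


def iter_jobs(xml_text):
    """Stream through the document and yield each <job> as a small DOM subtree."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == 'job':
            events.expandNode(node)  # pull in only this element's children
            yield node


def text_of(elem):
    """Join the text nodes directly under elem."""
    return ''.join(c.data for c in elem.childNodes if c.nodeType == c.TEXT_NODE)


titles = [text_of(job.getElementsByTagName('title')[0]) for job in iter_jobs(SAMPLE)]
print(titles)
```

Only one `<job>` subtree is held as DOM at a time, which is the main reason to prefer this over building a DOM for the whole file.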
To get the content of an element, you need to override the characters method. Add this to your handler class:
    def characters(self, data):
        print(data)
Be careful with this, though: the parser is not required to give you all the data in a single chunk. You should collect it in an internal buffer and read it when needed. In most of my XML/SAX code I do something like this:
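    import xml.sax

    class MyHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self._charBuffer = []

        def _flushCharBuffer(self):
            # Join the buffered chunks and reset the buffer
            s = ''.join(self._charBuffer)
            self._charBuffer = []
            return s

        def characters(self, data):
            # May be called several times per element, so only collect here
            self._charBuffer.append(data)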
... and then call the flush method on the end of elements where I need the data.
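To make the flush step concrete, here is a minimal sketch of a handler that keeps only the text of `<title>` elements. The class name `TitleHandler`, the `titles` attribute, and the sample document are made up for illustration. The sample contains an entity reference, which is the kind of input whose text may arrive in more than one `characters()` call.

```python
import xml.sax


class TitleHandler(xml.sax.handler.ContentHandler):
    """Collects the text of every <title> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self.titles = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        self._flushCharBuffer()  # discard text that came before this tag

    def endElement(self, name):
        text = self._flushCharBuffer()  # flush on every end tag
        if name == 'title':
            self.titles.append(text.strip())


handler = TitleHandler()
xml.sax.parseString(
    b"<job>\n  <title>Registered Nurse &amp; Epilepsy</title>\n</job>", handler)
print(handler.titles)
```

Flushing in `startElement` as well keeps whitespace between tags from leaking into the next element's text.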
For your whole use case, assuming you have a file containing multiple job descriptions and want a list of jobs, with each job being a dictionary of its fields, do something like this:
    import xml.sax

    class MyHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self._charBuffer = []
            self._result = []
            self._inJob = False  # True while between <job> and </job>

        def _getCharacterData(self):
            data = ''.join(self._charBuffer)
            self._charBuffer = []
            return data.strip()  # remove strip() if whitespace is important

        def parse(self, f):
            xml.sax.parse(f, self)
            return self._result

        def characters(self, data):
            self._charBuffer.append(data)

        def startElement(self, name, attrs):
            if name == 'job':
                self._result.append({})  # start a new job record
                self._inJob = True

        def endElement(self, name):
            data = self._getCharacterData()  # always reset the buffer
            if name == 'job':
                self._inJob = False
            elif self._inJob:  # ignore a wrapper element around the jobs, if any
                self._result[-1][name] = data

    jobs = MyHandler().parse("job-file.xml")  # a list of all jobs
If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method: just set _result to a dict and assign to it directly in endElement.
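A minimal sketch of that single-job variant might look like the following. The class name `SingleJobHandler`, the `parseString` helper, and the inline sample are made up for illustration, and it assumes the document's root element is the one `<job>`.

```python
import xml.sax


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Builds one dict from a document whose root is a single <job>."""

    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self._result = {}

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()

    def parseString(self, text):
        xml.sax.parseString(text, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def endElement(self, name):
        if name != 'job':
            self._result[name] = self._getCharacterData()


job = SingleJobHandler().parseString(
    b"<job>\n  <title>Registered Nurse-Epilepsy</title>\n"
    b"  <job-code>881723</job-code>\n</job>")
print(job)
```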