I have to monitor an XML file being written by a tool running all the day. But the XML file is properly completed and closed only at the end of the day.
Same constraints as XML stream processing:
- Parse an incomplete XML file on-the-fly and trigger actions
- Keep track of the last position within the file to avoid processing it again from the beginning
On answer of Need to read XML files as a stream using BeautifulSoup in Python, slezica suggests xml.sax
, xml.etree.ElementTree
and cElementTree
. But no success with my attempts to use xml.etree.ElementTree
and cElementTree
. There are also xml.dom
, xml.parsers.expat
and lxml
but I do not see support for "on-the-fly parsing".
I need more obvious examples...
I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on new Python 3.x features. I also use watchdog
to detect XML file modifications => Optionally, reuse the watchdog
mechanism. Optionally support also Windows.
Please provide easy to understand/maintain solutions. If it is too complex, I may just use tell()
/seek()
to move within the file, use stupid text search in the raw XML and finally extract the values using basic regex.
XML sample:
<dfxml xmloutputversion='1.0'>
<creator version='1.0'>
<program>TCPFLOW</program>
<version>1.4.6</version>
</creator>
<configuration>
<fileobject>
<filename>file1</filename>
<filesize>288</filesize>
<tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
</fileobject>
<fileobject>
<filename>file2</filename>
<filesize>352</filesize>
<tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
</fileobject>
<fileobject>
<filename>file3</filename>
<filesize>456</filesize>
...
...
First test using SAX failed:
import xml.sax
class StreamHandler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print 'start: name=', name
def endElement(self, name):
print 'end: name=', name
if name == 'root':
raise StopIteration
if __name__ == '__main__':
parser = xml.sax.make_parser()
parser.setContentHandler(StreamHandler())
with open('f.xml') as f:
parser.parse(f)
Shell:
$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end: name= program
start: name= version
end: name= version
Traceback (most recent call last):
File "./test-using-sax.py", line 17, in <module>
parser.parse(f)
File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
self.close()
File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
self.feed("", isFinal = 1)
File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
self._err_handler.fatalError(exc)
File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
Since yesterday I found the Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler
.
This example is similar to the other one but uses xml.etree.ElementTree
(and watchdog
).
It does not work when ElementTree
is replaced by cElementTree
:-/
import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree
class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
def __init__(self):
watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
self.xml_file = None
self.parser = xml.etree.ElementTree.XMLTreeBuilder()
def end_tag_event(tag):
node = self.parser._end(tag)
print 'tag=', tag, 'node=', node
self.parser._parser.EndElementHandler = end_tag_event
def on_modified(self, event):
if not self.xml_file:
self.xml_file = open(event.src_path)
buffer = self.xml_file.read()
if buffer:
self.parser.feed(buffer)
if __name__ == '__main__':
observer = watchdog.observers.Observer()
event_handler = XmlFileEventHandler()
observer.schedule(event_handler, path='.')
try:
observer.start()
while True:
time.sleep(10)
finally:
observer.stop()
observer.join()
While the script is running, do not forget to touch
one XML file, or simulate the on-the-fly writing using this one line script:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
For information, the xml.etree.ElementTree.iterparse
does not seem to support a file being written. My test code:
from __future__ import print_function, division
import xml.etree.ElementTree
if __name__ == '__main__':
context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
for action, elem in context:
print(action, elem.tag)
My output:
end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
File "./iter.py", line 9, in <module>
for action, elem in context:
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
self._root = self._parser.close()
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
Three hours after posting my question, no answer received. But I have finally implemented the simple example I was looking for.
My inspiration is from saaj's answer and is based on xml.sax
and watchdog
.
from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax
class XmlStreamHandler(xml.sax.handler.ContentHandler):
def startElement(self, tag, attributes):
print(tag, 'attributes=', attributes.items())
self.tag = tag
def characters(self, content):
print(self.tag, 'content=', content)
class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
def __init__(self):
watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
self.file = None
self.parser = xml.sax.make_parser()
self.parser.setContentHandler(XmlStreamHandler())
def on_modified(self, event):
if not self.file:
self.file = open(event.src_path)
self.parser.feed(self.file.read())
if __name__ == '__main__':
observer = watchdog.observers.Observer()
event_handler = XmlFileEventHandler()
observer.schedule(event_handler, path='.')
try:
observer.start()
while True:
time.sleep(10)
finally:
observer.stop()
observer.join()
While the script is running, do not forget to touch
one XML file, or simulate the on-the-fly writing using the following command:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
来源:https://stackoverflow.com/questions/44395089/read-xml-file-while-it-is-being-written-in-python