My input file is actually multiple XML files appending to one file. (It\'s from Google Patents). It has below structure:
Here's my take on it, using a generator and lxml.etree
. Extracted information purely for example.
import urllib2, os, zipfile
from lxml import etree
def xmlSplitter(data,separator=lambda x: x.startswith(' 10: break
doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
title = first(doc.xpath('//invention-title/text()'))
assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
print "DocID: {0}\nTitle: {1}\nAssignee: {2}\n".format(docID,title,assignee)
Yields:
DocID: US-D0629996-S1-20110104 Title: Glove backhand Assignee: Blackhawk Industries Product Group Unlimited LLC DocID: US-D0629997-S1-20110104 Title: Belt sleeve Assignee: None DocID: US-D0629998-S1-20110104 Title: Underwear Assignee: X-Technology Swiss GmbH DocID: US-D0629999-S1-20110104 Title: Portion of compression shorts Assignee: Nike, Inc. DocID: US-D0630000-S1-20110104 Title: Apparel Assignee: None DocID: US-D0630001-S1-20110104 Title: Hooded shirt Assignee: None DocID: US-D0630002-S1-20110104 Title: Hooded shirt Assignee: None DocID: US-D0630003-S1-20110104 Title: Hooded shirt Assignee: None DocID: US-D0630004-S1-20110104 Title: Headwear cap Assignee: None DocID: US-D0630005-S1-20110104 Title: Footwear Assignee: Vibram S.p.A.