Python to parse non-standard XML file

野趣味 2021-01-05 11:24

My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the following structure:



        
3 Answers
  •  北海茫月
    2021-01-05 12:09

    Here's my take on it, using a generator and lxml.etree. The extracted fields are purely illustrative.

    import urllib2, os, zipfile
    from lxml import etree
    
    def xmlSplitter(data,separator=lambda x: x.startswith(' [...]
      [...] 10: break
      doc = etree.XML(item)
      docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
      title = first(doc.xpath('//invention-title/text()'))
      assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
      print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)
    

    Yields:

    DocID:    US-D0629996-S1-20110104
    Title:    Glove backhand
    Assignee: Blackhawk Industries Product Group Unlimited LLC
    
    DocID:    US-D0629997-S1-20110104
    Title:    Belt sleeve
    Assignee: None
    
    DocID:    US-D0629998-S1-20110104
    Title:    Underwear
    Assignee: X-Technology Swiss GmbH
    
    DocID:    US-D0629999-S1-20110104
    Title:    Portion of compression shorts
    Assignee: Nike, Inc.
    
    DocID:    US-D0630000-S1-20110104
    Title:    Apparel
    Assignee: None
    
    DocID:    US-D0630001-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630002-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630003-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630004-S1-20110104
    Title:    Headwear cap
    Assignee: None
    
    DocID:    US-D0630005-S1-20110104
    Title:    Footwear
    Assignee: Vibram S.p.A.
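
A self-contained Python 3 sketch of the same generator idea, for readers who want something runnable as-is. It is not the answerer's exact code: it swaps lxml for the standard library's `xml.etree.ElementTree` so it needs no extra packages, and it assumes that every document in the combined file begins on a line starting with `<?xml`. The names `split_documents`, `first_text`, `summarize`, the `<doc>` root element, and the `SAMPLE` data are all hypothetical and made up for illustration; only the three element paths come from the answer above.

```python
import io
import xml.etree.ElementTree as ET


def split_documents(lines, is_start=lambda line: line.startswith(b"<?xml")):
    """Yield each concatenated XML document as one bytes object.

    Assumption: a new document begins on every line that starts
    with an XML declaration.
    """
    buff = []
    for line in lines:
        if is_start(line) and buff:
            yield b"".join(buff)
            buff = []
        buff.append(line)
    if buff:
        yield b"".join(buff)


def first_text(root, path, default=None):
    """Return the text of the first element matching path, or default."""
    node = root.find(path)
    return node.text if node is not None else default


def summarize(doc_bytes):
    """Parse one document and return (doc_id, title, assignee)."""
    root = ET.fromstring(doc_bytes)
    id_parts = [
        el.text
        for el in root.findall(".//publication-reference/document-id/*")
        if el.text
    ]
    return (
        "-".join(id_parts),
        first_text(root, ".//invention-title"),
        first_text(root, ".//assignee/addressbook/orgname"),
    )


# Made-up sample: two tiny documents appended into one stream.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <publication-reference><document-id>
    <country>XX</country><doc-number>0000001</doc-number><kind>A1</kind><date>20000101</date>
  </document-id></publication-reference>
  <invention-title>Sample widget</invention-title>
  <assignee><addressbook><orgname>Example Corp</orgname></addressbook></assignee>
</doc>
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <publication-reference><document-id>
    <country>XX</country><doc-number>0000002</doc-number><kind>A1</kind><date>20000101</date>
  </document-id></publication-reference>
  <invention-title>Sample gadget</invention-title>
</doc>
"""

# A real file would be opened in binary mode, e.g. open(path, "rb").
records = [summarize(doc) for doc in split_documents(io.BytesIO(SAMPLE))]
for doc_id, title, assignee in records:
    print("DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(doc_id, title, assignee))
```

Splitting first matters because a parser treats a second XML declaration in the same input as an error, so the combined file cannot be parsed in one call. Since the generator holds only one document in memory at a time, the approach should also scale to large weekly dumps.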
