Python to parse non-standard XML file

野趣味 2021-01-05 11:24

My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the following structure:



        
3 Answers
  • 2021-01-05 12:02

    I'd opt for parsing each chunk of XML separately.

    You seem to already be doing that in your sample code. Here's my take on your code:

    from xml.dom import minidom
    
    def parse_xml_buffer(buffer):
        dom = minidom.parseString("".join(buffer))  # join list into string of XML
        # .... parse dom ...
    
    # `file` is the already-open input file object
    buffer = [file.readline()]  # initialise with the first line
    for line in file:
        if line.startswith("<?xml "):  # a new document starts here
            parse_xml_buffer(buffer)
            buffer = []  # reset buffer
        buffer.append(line)  # list operations are faster than concatenating strings
    parse_xml_buffer(buffer)  # parse final chunk
    

    Once you've broken the file down to individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options are lxml, minidom, elementtree, expat, BeautifulSoup, etc.
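
    As one illustration of the options above, here is a minimal sketch that uses the standard library's xml.etree.ElementTree on in-memory data. The sample documents and the name `split_documents` are invented for the example and are not taken from the real patent file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the concatenated file: two XML documents back to back.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<patent><doc-number>D0000001</doc-number></patent>\n'
    '<?xml version="1.0"?>\n'
    '<patent><doc-number>D0000002</doc-number></patent>\n'
)

def split_documents(lines):
    """Yield one XML document (as a string) per '<?xml ' declaration."""
    buffer = []
    for line in lines:
        if line.startswith("<?xml ") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)

doc_numbers = []
for chunk in split_documents(io.StringIO(SAMPLE)):
    root = ET.fromstring(chunk)  # each chunk is well-formed on its own
    doc_numbers.extend(el.text for el in root.iter("doc-number"))

print(doc_numbers)
```

    Checking that the buffer is non-empty before yielding avoids handing an empty string to the parser when the file begins with a declaration.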


    Update:

    Starting from scratch, here's how I would do it (using BeautifulSoup):

    #!/usr/bin/env python3
    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    
    def separated_xml(infile):
        """Yield each XML document in the concatenated file as a string."""
        with open(infile, "r") as file:
            buffer = [file.readline()]
            for line in file:
                if line.startswith("<?xml "):
                    yield "".join(buffer)
                    buffer = []
                buffer.append(line)
            yield "".join(buffer)
    
    for xml_string in separated_xml("ipgb20110104.xml"):
        soup = BeautifulSoup(xml_string, "html.parser")
        for num in soup.find_all("doc-number"):
            print(num.contents[0])
    

    This returns:

    D0629996
    29316765
    D471343
    D475175
    6715152
    D498899
    D558952
    D571528
    D577177
    D584027
    .... (lots more)...
    
  • 2021-01-05 12:08

    I don't know minidom, or much about XML parsing in general, but I have used XPath to parse XML/HTML, e.g. with the lxml module.

    Here you can find some XPath Examples: http://www.w3schools.com/xpath/xpath_examples.asp
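
    For a feel of what that looks like without installing anything, the standard library's xml.etree.ElementTree supports a limited subset of XPath in `find`, `findall` and `findtext`; lxml's `xpath()` method accepts full XPath 1.0. The document in this sketch is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample; the element names are illustrative only.
XML = """
<library>
  <book year="2005"><title>Alpha</title></book>
  <book year="2011"><title>Beta</title></book>
</library>
"""

root = ET.fromstring(XML)

# './/title' selects every <title> descendant of the root.
titles = [t.text for t in root.findall(".//title")]

# An attribute predicate narrows the match to one <book>.
recent = root.findtext(".//book[@year='2011']/title")

print(titles, recent)
```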

  • 2021-01-05 12:09

    Here's my take on it, using a generator and lxml.etree. The fields extracted are purely illustrative.

    import os, zipfile
    from urllib.request import urlopen
    from lxml import etree
    
    def xmlSplitter(data, separator=lambda x: x.startswith(b'<?xml')):
      """Yield each XML document from an iterable of byte lines."""
      buff = []
      for line in data:
        if separator(line):
          if buff:
            yield b''.join(buff)
            buff[:] = []
        buff.append(line)
      yield b''.join(buff)
    
    def first(seq, default=None):
      """Return the first item of seq, or default if seq is empty."""
      for item in seq:
        return item
      return default
    
    datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
    filename = datasrc.split('/')[-1]
    
    # Download the archive once and cache it locally
    if not os.path.exists(filename):
      with open(filename, 'wb') as file_write:
        r = urlopen(datasrc)
        file_write.write(r.read())
    
    zf = zipfile.ZipFile(filename)
    xml_file = first([x for x in zf.namelist() if x.endswith('.xml')])
    assert xml_file is not None
    
    count = 0
    for item in xmlSplitter(zf.open(xml_file)):  # zf.open yields lines as bytes
      count += 1
      if count > 10: break  # only show the first ten documents
      doc = etree.XML(item)
      docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
      title = first(doc.xpath('//invention-title/text()'))
      assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
      print("DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID, title, assignee))
    

    Yields:

    DocID:    US-D0629996-S1-20110104
    Title:    Glove backhand
    Assignee: Blackhawk Industries Product Group Unlimited LLC
    
    DocID:    US-D0629997-S1-20110104
    Title:    Belt sleeve
    Assignee: None
    
    DocID:    US-D0629998-S1-20110104
    Title:    Underwear
    Assignee: X-Technology Swiss GmbH
    
    DocID:    US-D0629999-S1-20110104
    Title:    Portion of compression shorts
    Assignee: Nike, Inc.
    
    DocID:    US-D0630000-S1-20110104
    Title:    Apparel
    Assignee: None
    
    DocID:    US-D0630001-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630002-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630003-S1-20110104
    Title:    Hooded shirt
    Assignee: None
    
    DocID:    US-D0630004-S1-20110104
    Title:    Headwear cap
    Assignee: None
    
    DocID:    US-D0630005-S1-20110104
    Title:    Footwear
    Assignee: Vibram S.p.A.
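
    The same idea of splitting straight out of the zip archive can be tried without downloading anything. The sketch below builds a tiny archive in memory (its contents and the names `split_xml` and `sample.xml` are invented), reads the member as bytes the way `zf.open` does, and parses each chunk with the standard library's ElementTree rather than lxml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def split_xml(lines, marker=b"<?xml"):
    """Yield each concatenated XML document as one bytes object."""
    buff = []
    for line in lines:
        if line.startswith(marker) and buff:
            yield b"".join(buff)
            buff = []
        buff.append(line)
    if buff:
        yield b"".join(buff)

# Build a throwaway archive holding two documents in a single member.
payload = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<doc><invention-title>Glove</invention-title></doc>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<doc><invention-title>Belt</invention-title></doc>\n'
)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("sample.xml", payload)

titles = []
with zipfile.ZipFile(archive) as zf:
    with zf.open("sample.xml") as member:  # iterating yields lines as bytes
        for chunk in split_xml(member):
            titles.append(ET.fromstring(chunk).findtext(".//invention-title"))

print(titles)
```

    Working on bytes lets the parser honour the encoding named in each document's own XML declaration.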