Open a zip file and stream the xml file inside of the zip file

Posted by 人走茶凉 on 2019-12-13 08:37:19

Question


I am trying to open bulk data from the USPTO. Each XML file inside the zips is a concatenation of many XML documents, so it contains multiple XML declarations, and the files are quite large. I am trying to read lines from the XML file only until I reach the next XML declaration. I found this related question, without code.

What I want to create is a function that does the following:

  1. For each *.zip file
  2. Extract all xml file(s) (or open xml file(s) for reading)
  3. Read lines from the xml file(s)
  4. Append each line until the next xml declaration
  5. Return the string

So far, I've been able to open the zip file, find all the xml file(s) and extract each xml file. I would prefer to not write the xml file to disk, but instead create a string that is a single xml document that I then further parse.

import glob
import zipfile

def main():
    path = 'bulk/'
    # Collect every zip archive in the directory, in sorted order.
    allFiles = glob.glob(path + '*.zip')
    allFiles.sort()

    for file in allFiles:
        try:
            with zipfile.ZipFile(file, mode='r', allowZip64=True) as fin:
                print(fin, '- ok')
                print(fin.namelist())
                for name in fin.namelist():
                    if name.endswith('xml'):
                        print(name)  # all files that end in 'xml'
                        # Writes the member to disk, which I would like to avoid.
                        fin.extract(name, path='bulk/')
                        print('extracted ', name)
                        # TODO function to read lines of the xml file and
        except zipfile.BadZipFile:
            print(file, '- Bad zip file')

if __name__ == '__main__':
    main()

Answer 1:


Use ZipFile.read instead of ZipFile.extract. Given a member name, it returns that file's contents as bytes without writing anything to disk. Keep in mind that this decompresses the whole member into memory, so check how much data the archive actually holds against the memory you have available.

For example, the following function returns a dict with the names of a zip archive's files as keys, and the files' contents as values:

from zipfile import ZipFile

def extract(f):
    # The context manager closes the archive once all members are read.
    with ZipFile(f) as zf:
        return {name: zf.read(name) for name in zf.namelist()}
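
If holding a whole member in memory is a concern, ZipFile.open may be a better fit, since it returns a file-like object that decompresses as it is read. The sketch below is one possible way to walk such a stream line by line and yield each concatenated XML document as a string. The function name iter_xml_documents, its parameters, and the default UTF-8 encoding are illustrative choices, not part of the original question, and the code assumes every document begins with an XML declaration at the start of a line.

```python
import io
import zipfile


def iter_xml_documents(zip_path, member_name, encoding="utf-8"):
    """Yield each concatenated XML document in a zip member as a string.

    Hypothetical helper: assumes every document starts with an XML
    declaration ('<?xml') at the beginning of a line, and that the
    member is encoded as `encoding` (UTF-8 is an assumption).
    """
    with zipfile.ZipFile(zip_path) as zf:
        # ZipFile.open gives a binary stream that decompresses on demand,
        # so nothing is written to disk.
        with zf.open(member_name) as raw:
            lines = io.TextIOWrapper(raw, encoding=encoding)
            buffer = []
            for line in lines:
                # A new declaration closes the document collected so far.
                if line.startswith("<?xml") and buffer:
                    yield "".join(buffer)
                    buffer = []
                buffer.append(line)
            # Emit the last document, which has no declaration after it.
            if buffer:
                yield "".join(buffer)
```

Only one document is buffered at a time, so memory use depends on the largest single document rather than on the size of the whole file. Each yielded string can then be passed to a parser of your choice, for instance xml.etree.ElementTree.fromstring.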


Source: https://stackoverflow.com/questions/54030279/open-a-zip-file-and-stream-the-xml-file-inside-of-the-zip-file
