Open a zip file and stream the xml file inside of the zip file

Question

I am trying to open bulk data from the USPTO. The xml files within the zips are concatenated xml files containing multiple xml declarations and are quite large. I am trying to read lines from the xml only until I reach the next xml declaration. I found a related question, but it did not include code.

What I want to create is a function that does the following:

For each *.zip file
Extract all xml file(s) (or open xml file(s) for reading)
Read lines from the xml file(s)
Append each line until the next xml declaration
Return the string

So far, I've been able to open the zip file, find all the xml file(s) and extract each xml file. I would prefer to not write the xml file to disk, but instead create a string that is a single xml document that I then further parse.

import glob
import zipfile

def main():
    path = 'bulk/'
    # Collect the zip archives in a stable order
    allFiles = sorted(glob.glob(path + '*.zip'))

    for file in allFiles:
        try:
            with zipfile.ZipFile(file, mode='r', allowZip64=True) as fin:
                print(fin, '- ok')
                print(fin.namelist())
                for name in fin.namelist():
                    if name.endswith('xml'):
                        print(name)  # all files that end in 'xml'
                        # Currently writes the member to disk under path
                        fin.extract(name, path=path)
                        print('extracted ', name)
                        # TODO function to read lines of the xml file and
        except zipfile.BadZipFile:
            print(file, '- Bad zip file')

if __name__ == '__main__':
    main()

Answer 1:

Use ZipFile.read instead of ZipFile.extract. Given a member name, it returns that file's contents as bytes without writing anything to disk. Keep in mind that this decompresses the whole member into memory, so check how much data that is against the memory you have available.

For example, the following function returns a dict with the names of a zip archive's files as keys, and the files' contents as values:

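from zipfile import ZipFile

def extract(f):
    # Read every member of the archive into memory as bytes,
    # closing the archive once the dict has been built
    with ZipFile(f) as zf:
        return {name: zf.read(name) for name in zf.namelist()}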
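
If the xml members are too large to hold in memory at once, ZipFile.open may be a better fit than read, since it returns a file-like object that decompresses as it is consumed. The following is only a sketch of the line-by-line approach described in the question. The name iter_xml_documents is hypothetical, and it assumes that each xml declaration starts at the beginning of a line and that the files are UTF-8 encoded:

```python
import io
import zipfile


def iter_xml_documents(zip_path, encoding="utf-8"):
    """Yield each concatenated xml document in every .xml member as a string.

    Hypothetical helper: assumes every '<?xml' declaration begins a line.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            # open() gives a binary stream that is decompressed lazily
            with zf.open(name) as raw:
                lines = []
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    # A new declaration marks the start of the next document
                    if line.startswith("<?xml") and lines:
                        yield "".join(lines)
                        lines = []
                    lines.append(line)
                # Emit the last document of this member
                if lines:
                    yield "".join(lines)
```

Because this is a generator, only one document is held in memory at a time, and each yielded string can be passed to an xml parser as a single document.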

Source: https://stackoverflow.com/questions/54030279/open-a-zip-file-and-stream-the-xml-file-inside-of-the-zip-file

Tags

python

xml

zipfile