问题
I am trying to open bulk data from the USPTO. The xml files within the zips are concatenated xml files containing multiple xml declarations and are quiet large. I am trying to only read lines from the xml until i get to the next xml declaration. I found this related question, without code.
What I want to create is a function that does the following:
- For each *.zip file
- Extract all xml file(s) (or open xml file(s) for reading)
- Read lines from the xml file(s)
- Append each line until the next xml declaration
- Return the string
So far, I've been able to open the zip file, find all the xml file(s) and extract each xml file. I would prefer to not write the xml file to disk, but instead create a string that is a single xml document that I then further parse.
def main():
path = 'bulk/'
allFiles = glob.glob(path + '*.zip')
allFiles.sort()
for file in allFiles:
try:
with zipfile.ZipFile(file, mode = 'r', allowZip64 = True) as fin:
print(fin, '- ok')
print(fin.namelist())
for name in fin.namelist():
if name.endswith('xml'):
print(name) # all files that end in 'xml'
fin.extract(name, path='bulk/')
print('extracted ', name)
# TODO function to read lines of the xml file and
except zipfile.BadZipFile:
print(file,'- Bad zip file')
if __name__ == '__main__': main()
回答1:
Use read instead of extract
. It returns the bytes of a file in the zip, given a name. It's important to understand that you're essentially extracting the archive to memory, so be aware of how much data is actually going to be extracted and your limitations in that regard.
For example, the following function returns a dict with the names of a zip archive's files as keys, and the files' contents as values:
from zipfile import ZipFile
def extract(f):
zf = ZipFile(f)
return {name: zf.read(name) for name in zf.namelist()}
来源:https://stackoverflow.com/questions/54030279/open-a-zip-file-and-stream-the-xml-file-inside-of-the-zip-file