Question
I am trying to work with csv files contained in a tar.gz file and I am having issues passing the correct data/object through to the csv module.
Say I have a tar.gz file with a number of csv files formatted as follows.
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
I want to be able to access each csv file in memory without extracting each file from the tar file and writing them to disk. For example:
import tarfile
import csv
tar = tarfile.open("tar-file.tar.gz")
for member in tar.getmembers():
    f = tar.extractfile(member).read()
    content = csv.reader(f)
    for row in content:
        print(row)
tar.close()
This produces the following error.
for row in content:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
I have also tried parsing f as a string as described in the csv module documentation.
content = csv.reader([f])
The above produces the same error.
I have tried parsing the file object f as ascii.
f = tar.extractfile(member).read().decode('ascii')
but this iterates over each character of the data instead of iterating over rows containing lists of elements.
['1']
['0']
['7']
['9']
['', '']
['S']
['A']
['M']
['P']
['L']
['E']
['_']
['A']
['', '']
['G']
['R']
snip...
['2']
['0']
['1']
['7']
['/']
['0']
['2']
['/']
['1']
['5']
[' ']
['2']
['2']
[':']
['5']
['7']
[':']
['3']
['8']
[]
[]
Trying to both parse f as ascii and read it as a string
f = tar.extractfile(member).read().decode('ascii')
content = csv.reader([f])
produces the following error.
for row in content:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
To demonstrate the different outputs I used the following code.
import tarfile
import csv
tar = tarfile.open("tar-file.tar.gz")
for member in tar.getmembers():
    f = tar.extractfile(member).read()
    print(member.name)
    print('Raw :', type(f))
    print(f)
    print()
    f = f.decode('ascii')
    print('ASCII:', type(f))
    print(f)
tar.close()
This produces the following output. (Each csv contains the same data for this example.)
./raw_data/csv-file1.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
./raw_data/csv-file2.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
./raw_data/csv-file3.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
How can I get the csv module to correctly read a file in memory provided by the tar module? Thanks.
Answer 1:
You just need to use io.StringIO() to produce a file-like object for the csv library to use. For example:
import tarfile
import csv
import io
with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():  # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            # Read the whole member, decode it, and wrap the text in a file-like object
            csv_file = io.StringIO(tar.extractfile(member).read().decode('ascii'))
            for row in csv.reader(csv_file):
                print(row)
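To see why the attempts in the question fail, it may help to remember that csv.reader accepts any iterable that yields one line of text per iteration. The following is only a small sketch: the sample bytes are copied from the question, and the variable names are made up for illustration.

```python
import csv
import io

# Two sample lines taken from the question, as the bytes returned by read().
raw = (
    b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
    b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
)

# Iterating over bytes yields integers, which is what triggers
# "iterator should return strings, not int".
first_item = next(iter(raw))

text = raw.decode("ascii")

# Iterating over a str yields single characters, so csv.reader treats
# every character as its own line.
first_char_row = next(csv.reader(text))

# A list of lines or a file-like object yields whole lines instead.
rows_from_lines = list(csv.reader(text.splitlines()))
rows_from_stringio = list(csv.reader(io.StringIO(text)))

print(first_item)
print(first_char_row)
print(rows_from_lines)
```

Both text.splitlines() and io.StringIO(text) give csv.reader whole lines, so either works once the bytes have been decoded. Note that splitlines() would also split on line breaks inside quoted fields, so the file-like object is the safer choice for general csv data.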
Answer 2:
This question was raised again almost 3 years later. Please note that in the related question "python: use CSV reader with single file extracted from tarfile", a better solution was found after a short discussion:
import tarfile
import csv
import io
with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():  # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            # Wrap the binary file object so it is decoded as it is read
            csv_file = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            for row in csv.reader(csv_file):
                print(row)
The TextIOWrapper will perform better for larger files because it does not need to consume a complete file at once. In contrast, when tar.extractfile(member).read() is executed, the complete member file is loaded into memory.
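For anyone who wants to try the streaming approach without a real archive on disk, here is a self-contained sketch. The helper names iter_csv_rows and build_sample_archive are hypothetical, the sample archive is built in memory purely as test data, and newline="" is passed because the csv documentation recommends it for file objects given to csv.reader.

```python
import csv
import io
import tarfile


def iter_csv_rows(tar, encoding="utf-8"):
    """Yield (member name, row) for every regular file in an open TarFile."""
    for member in tar:
        if not member.isreg():
            continue
        # extractfile() returns a binary file object; TextIOWrapper decodes
        # it as it is read instead of loading the whole member first.
        binary_file = tar.extractfile(member)
        with io.TextIOWrapper(binary_file, encoding=encoding, newline="") as text_file:
            for row in csv.reader(text_file):
                yield member.name, row


def build_sample_archive():
    """Create a small gzip-compressed tar archive in memory (test data only)."""
    payload = (
        b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
        b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="raw_data/csv-file1.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_sample_archive(), mode="r:gz") as tar:
    rows = list(iter_csv_rows(tar))

for name, row in rows:
    print(name, row)
```

With a real file, the last part would become tarfile.open("tar-file.tar.gz") and the rows could be processed one by one inside the with block rather than collected into a list.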
Source: https://stackoverflow.com/questions/43466516/python3-working-with-csv-files-in-tar-files