Question
I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):
product/productId: D7SDF9S9
review/userId: asdf9uas0d8u9f
review/score: 5.0
review/some text here
product/productId: D39F99
review/userId: fasd9fasd9f9f
review/score: 4.1
review/some text here
Each record is separated by two newline characters (\n\n). I have written a parser below.
import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n", fullstr)

articles = []
for i, s in enumerate(allsplits[0:]):
    splits = re.split("\n.*?: ", s)
    productId = splits[0]
    userId = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating = splits[4]
    time = splits[5]
    summary = splits[6]
    text = splits[7]
    fw = open(outnamename, 'w')
    fw.write(productId+"|"+userId+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")
The problem is the file I am reading in is so large that I run out of memory before it can complete.
I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line.
Can someone let me know of a way to just read in one record at a time, parse it, write it to a file, and then move to the next record?
Answer 1:
Don't read the whole file into memory in one go; produce records by making use of those blank lines between records. Write the data with the csv module for ease of writing out your pipe-delimited records.
The following code reads the input file one line at a time, and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus one record being constructed.
import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue

        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)
This code does assume that the text before the colon on each line has the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV columns; the fields list at the top reflects these keys.
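For illustration, here is a quick sketch of what the split and partition calls produce for one of the sample lines from the question:
>>> line = "product/productId: D7SDF9S9\n"
>>> field, value = line.split(': ', 1)
>>> field.partition('/')[-1].strip()
'productId'
>>> value.strip()
'D7SDF9S9'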
Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list:
import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue

        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)
This version requires that record fields are all present and are written to the file in a fixed order.
Answer 2:
Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.
Answer 3:
Don't read the whole file into memory at once; instead, iterate over it line by line, and use Python's csv module to parse the records:
import csv

with open('hugeinputfile.txt', 'rb') as infile, open('outputfile.txt', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='|')
    for record in csv.reader(infile, delimiter='\n', lineterminator='\n\n'):
        values = [item.split(':')[-1].strip() for item in record[:-1]] + [record[-1]]
        writer.writerow(values)
A couple things to note here:
- Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script. Thus:
with open('myfile.txt') as f:
    do_stuff_to_file(f)
is equivalent to:
f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()
To be continued... (I'm out of time ATM)
Source: https://stackoverflow.com/questions/21653738/parsing-large-9gb-file-using-python