Looking for a more efficient way to reorganize a massive CSV in Python


Question


I've been working on a problem where I have data from a large output .txt file, and now have to parse and reorganize certain values into the form of a .csv.

I've already written a script that inputs all the data into a .csv in columns based on what kind of data it is (Flight ID, Latitude, Longitude, etc.), but it's not in the correct order. All values are meant to be grouped by the same Flight ID, in order from the earliest time stamp to the latest. Fortunately, my .csv has all values in the correct time order, but they're not grouped together appropriately by Flight ID.

To clear up my description, it looks like this right now ("Time x" is just to illustrate):

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)                               
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)                               
20110117559574, , , , , , , , ,2391,6284,390,54  (Time 0)                               
20110117559587, , , , , , , , ,2385,6273,390,54  (Time 0)                               
20110117559588, , , , , , , , ,2816,6847,250,32  (Time 0) 
... 

and it's supposed to be ordered like this:

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 1)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 2)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 3)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time N)
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)
20110117559572, , , , , , , , ,23xx,62xx,4xx,54  (Time 1)
... and so on

There are some 1.3 million rows in the .csv I output to make things easier. I'm 99% confident the logic in the next script I wrote to fix the ordering is correct, but my fear is that it's extremely inefficient. I ended up adding a progress bar just to see if it's making any progress, and unfortunately this is what I see:

[screenshot of the progress bar, barely moving]

Here's my code handling the crunching (skip down to the problem area if you like):

## a class I wrote to handle the huge .csv's ##
from BIGASSCSVParser import BIGASSCSVParser               
import collections                                                              


x = open('newtrajectory.csv')  #file to be reordered                                                  
linetlist = []                                                                  
tidict = {}               

# To save braincells I stored all the required values
# of each line into a dictionary of tuples.
# Index: Tuple

for line in x:
    y = line.replace(',', ' ')  # blank fields simply disappear when we split on whitespace
    y = y.split()
    tup = (y[0], y[1], y[2], y[3], y[4])
    linetlist.append(tup)
for k, v in enumerate(linetlist):
    tidict[k] = v
x.close()                                                                       


trj = BIGASSCSVParser('newtrajectory.csv')                                      
uniquelFIDs = []                                                                
z = trj.column(0)   # List of out of order Flight ID's                                                     
for i in z:         # like in the example above                                                           
    if i in uniquelFIDs:                                                        
        continue                                                                
    else:                                                                       
        uniquelFIDs.append(i)  # Create list of unique FID's to refer to later                                               

queue = []                                                                              
p = collections.OrderedDict()                                                   
for k,v in enumerate(trj.column(0)):                                            
    p[k] = v  

All good so far, but it's in this next segment that my computer either chokes, or my code just sucks:

for k in uniquelFIDs:
    matches = [i for i, x in p.items() if x == k]  # full scan of all 1.3M rows per unique ID
    queue.extend(matches)

The idea was that for every unique value, in order, I'd iterate over the 1.3 million values and return, in order, each occurrence's index, then append those indexes to a list. After that I was just going to read off that large list of indexes and write each row's data into another .csv file. Ta da! Probably hugely inefficient.
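(For comparison, the same list of indexes could be built in one pass by grouping them into a dict of lists keyed by Flight ID. A rough sketch of that alternative, reusing p and uniquelFIDs from above:)

from collections import defaultdict

groups = defaultdict(list)      # Flight ID -> row indexes, already in time order
for i, fid in p.items():        # one pass over the 1.3 million rows
    groups[fid].append(i)

queue = []
for fid in uniquelFIDs:         # keep the first-seen Flight ID order
    queue.extend(groups[fid])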

What's wrong here? Is there a more efficient way to do this problem? Is my code flawed, or am I just being cruel to my laptop?

Update:

I've found that with the amount of data I'm crunching, it'll take 9-10 hours. I had half of it correctly spat out in 4.5 hours. An overnight crunch I can get away with for now, but I'll probably look to use a database or another language next time. I would have if I'd known what I was getting into ahead of time, lol.

After adjusting sleep settings for my SSD, it only took 3 hours to crunch.


Answer 1:


If the CSV file fits into your RAM (e.g. it's less than 2GB), then you can just read the whole thing in and sort it:

import csv

with open('infile.csv', newline='') as fn, open('outfile.csv', 'w', newline='') as outfn:
    data = list(csv.reader(fn))
    data.sort(key=lambda line: line[0])  # sort on the Flight ID column
    csv.writer(outfn).writerows(data)

That shouldn't take nearly as long if you don't thrash. Note that .sort is a stable sort, so it will preserve the time order of your file when the keys are equal.

If it won't fit into RAM, you will probably want to do something a bit clever. For example, you can store the file offsets of each line, along with the necessary information from the line (timestamp and flight ID), then sort on those, and write the output file using the line offset information.
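A rough sketch of that offset-based approach (assuming, as in the question, that the Flight ID is the first comma-separated field; since the rows are already in time order, a stable sort on the ID alone is enough and the timestamp never needs to be parsed):

index = []                               # (flight_id, byte offset) per line
with open('newtrajectory.csv', 'rb') as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        index.append((line.split(b',', 1)[0], offset))

index.sort(key=lambda item: item[0])     # stable, so time order per ID survives

with open('newtrajectory.csv', 'rb') as f, open('outfile.csv', 'wb') as out:
    for _, offset in index:
        f.seek(offset)
        out.write(f.readline())

Only the (ID, offset) pairs have to fit in memory, which is far smaller than the full rows.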




Answer 2:


You can try the UNIX sort utility:

sort -n -s -t, -k1,1 infile.csv > outfile.csv

-n uses numeric comparison, -s makes the sort stable (so the existing time order within each Flight ID is preserved), -t sets the field delimiter (a comma here), and -k1,1 sorts on the first field only. GNU sort also copes with files larger than RAM, since it does an external merge sort through temporary files.



Source: https://stackoverflow.com/questions/15148983/looking-for-a-more-efficient-way-to-reorganize-a-massive-csv-in-python
