“Large data” workflows using pandas

被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 Answers
  •  攒了一身酷
    2020-11-21 07:56

    I recently came across a similar issue. I found that simply reading the data in chunks and appending each processed chunk to the same csv works well. My problem was adding a date column based on information in another table, using the values of certain columns, as follows. This may help those who, like me, are more familiar with pandas than with dask or HDF5.

    import pandas as pd

    def addDateColumn():
        """Add a time column to the daily rainfall data. Reads the csv in
        chunks of 100k rows at a time and appends each processed chunk to
        a single output csv. Uses the raster-ID column to look up the date.
        """
        # pathlist (directory paths) and newyears (raster ID -> date lookup)
        # are defined elsewhere in the original script
        df = pd.read_csv(pathlist[1] + "CHIRPS_tanz.csv", iterator=True,
                         chunksize=100000)  # read csv file as 100k-row chunks

        '''Do some stuff'''

        count = 1     # for indexing items in the time list
        chunknum = 0  # number of chunks processed so far
        for chunk in df:  # for each 100k rows
            newtime = []  # repeated times for the rows of this chunk
            toiterate = chunk[chunk.columns[2]]  # raster IDs that determine the time
            while count <= toiterate.max():
                for i in toiterate:
                    if i == count:
                        newtime.append(newyears[count])
                count += 1
            chunknum += 1
            print("Finished", chunknum, "chunks")
            chunk["time"] = newtime  # new column based on the looked-up times
            outname = "CHIRPS_tanz_time2.csv"
            # append each chunk to the same csv, writing no header
            chunk.to_csv(pathlist[2] + outname, mode='a',
                         header=False, index=False)
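    The same read-in-chunks / append pattern generalizes to any per-chunk transformation. Below is a minimal sketch of the idea with placeholder file names and column ("input.csv", "output.csv", "value" are not from the answer's actual data); it also writes the header exactly once, on the first chunk, so the output stays a valid csv.

    import pandas as pd

    first = True
    # "input.csv", "output.csv", and the "value" column are placeholders
    for chunk in pd.read_csv("input.csv", chunksize=100_000):
        chunk["scaled"] = chunk["value"] * 2  # any per-chunk transform
        chunk.to_csv("output.csv", mode="a", header=first, index=False)
        first = False  # header is written only on the first chunk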
    
