“Large data” workflows using pandas

被撕碎了的回忆 asked 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 answers
  • 2020-11-21 07:56

    I recently came across a similar issue. I found that simply reading the data in chunks and appending it, chunk by chunk, to the same csv works well. My problem was adding a date column based on information in another table, using the values of certain columns, as follows. This may help those who are confused by dask and hdf5 but more familiar with pandas, like myself.

    def addDateColumn():
        """Adds time to the daily rainfall data. Reads the csv in chunks of 100k
           rows at a time and outputs them, appending as needed, to a single csv.
           Uses the column of the raster names to get the date.
        """
        # pathlist and newyears are defined elsewhere in the original script
        df = pd.read_csv(pathlist[1] + "CHIRPS_tanz.csv", iterator=True,
                         chunksize=100000)  # read csv file in 100k-row chunks

        '''Do some stuff'''

        count = 1     # for indexing items in the time list
        chunknum = 0  # number of chunks processed so far
        for chunk in df:  # for each 100k rows
            newtime = []  # times to repeat across the rows of this chunk
            toiterate = chunk[chunk.columns[2]]  # IDs of raster nums to base the time on
            while count <= toiterate.max():
                for i in toiterate:
                    if i == count:
                        newtime.append(newyears[count])
                count += 1
            chunknum += 1
            print("Finished", chunknum, "chunks")
            chunk["time"] = newtime  # create new column in dataframe based on time
            outname = "CHIRPS_tanz_time2.csv"
            # append each output to the same csv, with no header
            chunk.to_csv(pathlist[2] + outname, mode='a', header=False, index=False)
    
  • 2020-11-21 08:00

    There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it.
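
    For a flavour of the API (a minimal sketch; the file pattern, column names, and aggregation are illustrative assumptions, not from the original answer), dask.dataframe mirrors a large part of pandas while executing lazily and out of core:

    import dask.dataframe as dd

    # point dask at a set of csv files that together exceed RAM;
    # nothing is read yet, only a task graph is built
    df = dd.read_csv('data-*.csv')

    # the familiar pandas-style API; .compute() triggers the chunked,
    # parallel execution and returns an ordinary pandas object
    result = df.groupby('key').value.mean().compute()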

  • 2020-11-21 08:00

    It is worth mentioning Ray here as well;
    it is a distributed computation framework that has its own distributed implementation of pandas.

    Just replace the pandas import, and the code should work as is:

    # import pandas as pd
    import ray.dataframe as pd
    
    #use pd as usual
    

    You can read more details here:

    https://rise.cs.berkeley.edu/blog/pandas-on-ray/
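
    As an aside (not part of the original answer): the Pandas on Ray prototype described in that post later continued as the Modin project, which keeps the same drop-in idea:

    # import pandas as pd
    import modin.pandas as pd  # Modin grew out of the Pandas on Ray work

    # use pd as usual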

  • 2020-11-21 08:01

    I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back to.

    It's worth reading the docs and, later in this thread, the several suggestions for how to store your data.

    Give as much detail as you can about the points below, and I can help you develop a structure; these details will affect how you store your data:

    1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
    2. What will typical operations look like? E.g. query on columns to select a bunch of rows and specific columns, then do an operation in memory, create new columns, and save these.
      (Giving a toy example could enable us to offer more specific recommendations.)
    3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
    4. Input flat files: how many, and what is the rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file, with all of the fields in each file?
    5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
    6. Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in those columns explicitly until final results time)?

    Solution

    Ensure you have pandas at least 0.10.1 installed.

    Read files in iteratively, chunk by chunk, and use multiple-table queries.

    Since PyTables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this would also work with one big table, but it's more efficient this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
    (The following is pseudocode.)

    import numpy as np
    import pandas as pd
    
    # create a store
    store = pd.HDFStore('mystore.h5')
    
    # this is the key to your storage:
    #    this maps your fields to a specific group, and defines 
    #    what you want to have as data_columns.
    #    you might want to create a nice class wrapping this
    #    (as you will want to have this map and its inversion)  
    group_map = dict(
        A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
        B = dict(fields = ['field_10',......        ], dc = ['field_10']),
        .....
        REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
    
    )
    
    group_map_inverted = dict()
    for g, v in group_map.items():
        group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
    

    Reading in the files and creating the storage (essentially doing what append_to_multiple does):

    for f in files:
       # read in the file, additional options may be necessary here
       # the chunksize is not strictly necessary, you may be able to slurp each 
       # file into memory in which case just eliminate this part of the loop 
       # (you can also change chunksize if necessary)
       for chunk in pd.read_table(f, chunksize=50000):
           # we are going to append to each table by group
           # we are not going to create indexes at this time
           # but we *ARE* going to create (some) data_columns
    
           # figure out the field groupings
           for g, v in group_map.items():
                 # create the frame for this group
                 frame = chunk.reindex(columns = v['fields'], copy = False)    
    
                 # append it
                 store.append(g, frame, index=False, data_columns = v['dc'])
    

    Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).

    This is how you get columns and create new ones:

    frame = store.select(group_that_I_want)
    # you can optionally specify:
    # columns = a list of the columns IN THAT GROUP (if you wanted to
    #     select only say 3 out of the 20 columns in this sub-table)
    # and a where clause if you want a subset of the rows
    
    # do calculations on this frame
    new_frame = cool_function_on_frame(frame)
    
    # to 'add columns', create a new group (you probably want to
    # limit the columns in this new_group to be only NEW ones
    # (e.g. so you don't overlap from the other tables)
    # add this info to the group_map
    store.append(new_group,
                 new_frame.reindex(columns = new_columns_created, copy = False),
                 data_columns = new_columns_created)
    

    When you are ready for post_processing:

    # This may be a bit tricky; and depends what you are actually doing.
    # I may need to modify this function to be a bit more general:
    report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)
    

    About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on that column. E.g. something like:

    store.select(group, where = ['field_1000=foo', 'field_1001>0'])
    

    They may be most interesting to you in the final report-generation stage (essentially a data column is segregated from the other columns, which might impact efficiency somewhat if you define a lot of them).

    You also might want to:

    • create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does); a sketch of such a helper follows this list. This way the structure would be pretty transparent to you.
    • create indexes on certain data columns (this makes row-subsetting much faster).
    • enable compression.
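
    A rough sketch of that helper and of the indexing/compression points (the function name is an illustrative assumption, and it reuses store, pd, and group_map_inverted from the code above):

    def select_fields(store, group_map_inverted, fields):
        """Look up which group each requested field lives in, select those
           groups from the store, and concatenate them column-wise.
           Assumes the groups were appended from the same chunks in the same
           order, so their rows line up; for row-filtered selections across
           groups use store.select_as_multiple as shown earlier."""
        groups = sorted({group_map_inverted[f] for f in fields})
        frames = [store.select(g) for g in groups]
        return pd.concat(frames, axis=1)[fields]

    # indexes on the data columns of a group make row-subsetting much faster
    store.create_table_index('A', columns=['field_1'], kind='full')

    # compression is chosen when the store is created, e.g.
    # store = pd.HDFStore('mystore.h5', complevel=9, complib='blosc')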

    Let me know when you have questions!

  • 2020-11-21 08:03

    One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

    >>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float64(5)
    memory usage: 3.7 GB
    
    >>> df.astype(np.float32).info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float32(5)
    memory usage: 1.9 GB
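
    As a follow-up usage note (a sketch; the file path and column names are placeholders), the same saving can be applied while reading, so a full float64 copy never has to exist in memory:

    import numpy as np
    import pandas as pd

    # cast the float columns to 32-bit as they are parsed, instead of
    # loading them as float64 and converting afterwards
    df = pd.read_csv('big_file.csv',
                     dtype={'value_a': np.float32, 'value_b': np.float32})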
    
  • 2020-11-21 08:08

    Why Pandas? Have you tried standard Python?

    Consider using the standard Python library. Pandas is subject to frequent updates, even with the recent release of the stable version.

    Using the standard Python library, your code will always run.

    One way of doing it is to have an idea of how you want your data to be stored and which questions you want to answer about it. Then draw a schema of how you can organise your data (think tables) in a way that will help you query it; normalisation is not strictly necessary.

    You can make good use of :

    • lists of dictionaries to store the data in memory (think Amazon EC2) or on disk, one dict being one row,
    • generators to process the data row by row so you do not overflow your RAM (see the sketch below),
    • list comprehensions to query your data,
    • Counter, defaultdict, ... from the collections module,
    • storage of your data on your hard drive using whatever format you have chosen; JSON could be one of them.

    RAM and HDD are becoming cheaper and cheaper with time, and standard Python 3 is widely available and stable.

    The fundamental question you are trying to solve is "how do I query large sets of data?". The HDFS architecture is more or less what I am describing here (data modelling, with the data stored on disk).

    Let's say you have 1000 petabytes of data; there is no way you will be able to hold it in Dask or Pandas. Your best chance here is to store it on disk and process it with generators.
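
    As a minimal sketch of that standard-library pattern (the file name, field names, and threshold are placeholders, not from the original answer):

    import csv
    import json
    from collections import Counter

    def rows(path):
        """Generator: yield one dict per row without loading the whole file."""
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                yield row

    # 'query' the data with a comprehension over the generator
    big_sales = (r for r in rows('sales.csv') if float(r['amount']) > 5)

    # aggregate while streaming, using Counter
    totals = Counter()
    for r in big_sales:
        totals[r['region']] += 1

    # persist the (small) result to disk, json being one option
    with open('totals.json', 'w') as f:
        json.dump(totals, f)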
