“Large data” workflows using pandas

2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 Answers
  • 2020-11-21 08:09

    One more variation

    Many of the operations done in pandas can also be expressed as a database query (SQL, MongoDB).

    Using an RDBMS or MongoDB lets you perform some of the aggregations in the database query itself, which is optimized for large data and uses caches and indexes efficiently.

    Later, you can do the post-processing in pandas.

    The advantage of this method is that you gain the database's optimizations for working with large data, while still defining the logic in a high-level declarative syntax - and not having to deal with the details of deciding what to do in memory and what to do out of core.

    And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to the other.

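    As an illustration of the pattern - the table, columns and connection string below are placeholders - the heavy aggregation runs in the database and pandas only receives the small result:

    # A minimal sketch, assuming a SQLAlchemy-compatible connection string and a
    # hypothetical "trades" table; only the aggregated rows are pulled into pandas.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@host/dbname")  # placeholder DSN

    query = """
        SELECT symbol, trade_date,
               SUM(volume) AS total_volume,
               AVG(price)  AS avg_price
        FROM trades
        GROUP BY symbol, trade_date
    """
    summary = pd.read_sql(query, engine)

    # Post-processing in pandas on the already-aggregated (small) data.
    pivot = summary.pivot(index="trade_date", columns="symbol", values="total_volume")
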
  • 2020-11-21 08:09

    Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.

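    A minimal sketch of such a pipeline with Ruffus - the chunk file names and the "key" column are placeholders; @transform, @merge, suffix and pipeline_run are Ruffus's documented primitives:

    # A minimal sketch, assuming per-chunk CSV files already exist under chunks/.
    import glob

    import pandas as pd
    from ruffus import transform, merge, suffix, pipeline_run

    chunk_files = glob.glob("chunks/*.csv")

    @transform(chunk_files, suffix(".csv"), ".summary.csv")
    def summarize_chunk(input_file, output_file):
        # Each chunk is small enough to fit in memory on its own.
        pd.read_csv(input_file).groupby("key").sum().to_csv(output_file)

    @merge(summarize_chunk, "combined_summary.csv")
    def combine_summaries(input_files, output_file):
        pd.concat(pd.read_csv(f) for f in input_files).to_csv(output_file, index=False)

    if __name__ == "__main__":
        pipeline_run([combine_summaries])
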
  • 2020-11-21 08:17

    I'd like to point out the Vaex package.

    Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc., on an N-dimensional grid up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy and lazy computations for best performance (no memory wasted).

    Have a look at the documentation: https://vaex.readthedocs.io/en/latest/. The API is very close to that of pandas.

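    A bare-bones sketch of what this looks like in practice - the file name big.hdf5 and the column x are placeholders:

    # A minimal sketch; vaex.open memory-maps the file, and statistics such as
    # mean/std/count are computed out of core on expressions.
    import vaex

    df = vaex.open("big.hdf5")            # nothing is loaded into memory yet
    print(df.mean(df.x), df.std(df.x))    # out-of-core statistics on column "x"

    # Statistics on a grid, e.g. counts in 64 bins of x:
    counts = df.count(binby=df.x, shape=64)
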
  • 2020-11-21 08:17

    At the moment I am working "like" you, just on a lower scale, which is why I don't have a PoC for my suggestion.

    However, I have had success using pickle as a caching system and outsourcing the execution of various functions into files, executing those files from my command/main file; for example, I use a prepare_use.py to convert object types and split a data set into test, validation and prediction sets.

    How does the caching with pickle work? I use strings to access pickle files that are created dynamically, depending on which parameters and data sets were passed (with that I try to capture and determine whether the program has already been run, using .shape for the data set and a dict for the passed parameters). With these measures in place, I get a string with which I try to find and read a .pickle file and can, if it is found, skip the processing time and jump straight to the step I am currently working on.

    Using databases I encountered similar problems, which is why I found joy in this solution; there are certainly constraints, though - for example, storing huge pickle sets redundantly. Updating a table from before to after a transformation can be done with proper indexing; validating information opens up a whole other book (I tried consolidating crawled rent data and basically stopped using the database after two hours, as I would have liked to be able to jump back after every transformation step).

    I hope my 2 cents help you in some way.

    Greetings.

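    A stripped-down sketch of that kind of pickle cache - cached_run, compute and cache_dir are placeholders; the cache key is built from the data set's .shape and the passed parameter dict, as described above:

    # A minimal sketch using only the standard library; "compute" stands in for
    # whatever expensive transformation is being cached.
    import hashlib
    import os
    import pickle


    def cached_run(df, params, compute, cache_dir="cache"):
        # Build a key from the data set's shape and the passed parameters.
        key_source = repr((df.shape, sorted(params.items())))
        key = hashlib.md5(key_source.encode("utf-8")).hexdigest()
        path = os.path.join(cache_dir, key + ".pickle")

        if os.path.exists(path):
            # Already computed for this shape/parameter combination: skip the work.
            with open(path, "rb") as fh:
                return pickle.load(fh)

        result = compute(df, **params)
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
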
  • 2020-11-21 08:18

    I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

    My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it's only operating row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.
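
    The first, row-by-row conversion step might look roughly like this - a minimal sketch with a hypothetical column layout, using the modern snake_case PyTables API rather than the legacy camelCase names in the code below:

    # A minimal sketch, assuming a gzipped CSV with "id", "balance" and "rate" columns.
    import csv
    import gzip

    import tables


    class LoanRow(tables.IsDescription):
        id = tables.Int64Col()
        balance = tables.Float64Col()
        rate = tables.Float64Col()


    with tables.open_file("rows.h5", mode="w") as h5, gzip.open("input.csv.gz", "rt") as fh:
        table = h5.create_table("/", "loans", LoanRow)
        row = table.row
        for rec in csv.DictReader(fh):
            # Convert each row to real data types and append it; only one row
            # is held in memory at a time.
            row["id"] = int(rec["id"])
            row["balance"] = float(rec["balance"])
            row["rate"] = float(rec["rate"])
            row.append()
        table.flush()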

    The table transpose looks like:

    import logging

    import tables

    logger = logging.getLogger(__name__)


    # Note: getNode/createGroup/createCArray are the legacy PyTables 2.x names;
    # in PyTables 3.x these are get_node/create_group/create_carray.
    def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
        # Get a reference to the input data.
        tb = h_in.getNode(table_path)
        # Create the output group to hold the columns.
        grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
        for col_name in tb.colnames:
            logger.debug("Processing %s", col_name)
            # Get the column data (one full column at a time).
            col_data = tb.col(col_name)
            # Create the output array.
            arr = h_out.createCArray(grp,
                                     col_name,
                                     tables.Atom.from_dtype(col_data.dtype),
                                     col_data.shape)
            # Store the data.
            arr[:] = col_data
        h_out.flush()
    

    Reading it back in then looks like:

    import numpy as np
    import pandas as pd
    import tables


    def read_hdf5(hdf5_path, group_path="/data", columns=None):
        """Read a transposed data set from a HDF5 file."""
        # Accept either an already-open PyTables file handle or a path.
        if isinstance(hdf5_path, tables.file.File):
            hf = hdf5_path
        else:
            hf = tables.openFile(hdf5_path)  # tables.open_file() in PyTables 3.x

        # Read every column array in the group, or only the requested ones.
        grp = hf.getNode(group_path)
        if columns is None:
            data = [(child.name, child[:]) for child in grp]
        else:
            data = [(child.name, child[:]) for child in grp if child.name in columns]

        # Convert any float32 columns to float64 for processing.
        for i in range(len(data)):
            name, vec = data[i]
            if vec.dtype == np.float32:
                data[i] = (name, vec.astype(np.float64))

        # Only close the file if we opened it ourselves.
        if not isinstance(hdf5_path, tables.file.File):
            hf.close()
        # DataFrame.from_items was removed in recent pandas;
        # pd.DataFrame(dict(data)) is the modern equivalent.
        return pd.DataFrame.from_items(data)
    

    Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

    This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

    Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.

  • 2020-11-21 08:19

    I think the answers above are missing a simple approach that I've found very useful.

    When I have a file that is too large to load into memory, I break it up into multiple smaller files (either by rows or by columns).

    Example: In the case of 30 days' worth of trading data of ~30GB, I break it into one file per day of ~1GB. I subsequently process each file separately and aggregate the results at the end.

    One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)

    The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.

    This approach doesn't cover all scenarios, but is very useful in a lot of them

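    A bare-bones sketch of that split/process/aggregate pattern - the file names, the "symbol"/"volume" columns and the per-file aggregation are placeholders:

    # A minimal sketch, assuming per-day files named like trades_2020-01-01.csv.
    import glob
    from multiprocessing import Pool

    import pandas as pd


    def process_file(path):
        # Each worker handles one ~1GB daily file independently.
        df = pd.read_csv(path)
        return df.groupby("symbol")["volume"].sum()   # per-day aggregate


    if __name__ == "__main__":
        files = sorted(glob.glob("trades_*.csv"))
        with Pool(processes=4) as pool:
            daily = pool.map(process_file, files)
        # Combine the per-day aggregates into the final result.
        result = pd.concat(daily, axis=1).sum(axis=1)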