“Large data” work flows using pandas

被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 Answers
  •  花落未央
    2020-11-21 08:18

    I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

    My basic workflow is to first get a CSV file from the database. I gzip it so it's not as huge. Then I convert it to a row-oriented HDF5 file by iterating over it in Python, converting each row to real data types, and writing it to the HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it only operates row by row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.

    The table transpose looks like:

    import logging

    import tables

    logger = logging.getLogger(__name__)

    def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
        # Get a reference to the input table.
        tb = h_in.get_node(table_path)
        # Create the output group to hold the column arrays.
        grp = h_out.create_group(group_path, group_name,
                                 filters=tables.Filters(complevel=1))
        for col_name in tb.colnames:
            logger.debug("Processing %s", col_name)
            # Read the whole column into memory.
            col_data = tb.col(col_name)
            # Create a chunked, compressed output array with the same dtype.
            arr = h_out.create_carray(grp,
                                      col_name,
                                      tables.Atom.from_dtype(col_data.dtype),
                                      col_data.shape)
            # Store the data.
            arr[:] = col_data
        h_out.flush()
    
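    The effect of the transpose can be seen in miniature with a NumPy structured array standing in for the row-oriented table (the field names here are invented):

```python
import numpy as np

# A tiny stand-in for a row-oriented table: a structured array.
rows = np.array([(1, 0.5), (2, 1.5), (3, 2.5)],
                dtype=[("loan_id", "i8"), ("balance", "f4")])

# The "transpose" amounts to pulling each field out as its own contiguous
# column array, which is what tb.col(col_name) returns for a real table.
columns = {name: rows[name] for name in rows.dtype.names}
```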

    Reading it back in then looks like:

    import numpy as np
    import pandas as pd
    import tables

    def read_hdf5(hdf5_path, group_path="/data", columns=None):
        """Read a transposed data set from an HDF5 file."""
        if isinstance(hdf5_path, tables.file.File):
            hf = hdf5_path
        else:
            hf = tables.open_file(hdf5_path)
    
        grp = hf.get_node(group_path)
        if columns is None:
            data = [(child.name, child[:]) for child in grp]
        else:
            data = [(child.name, child[:]) for child in grp if child.name in columns]
    
        # Convert any float32 columns to float64 for processing.
        for i, (name, vec) in enumerate(data):
            if vec.dtype == np.float32:
                data[i] = (name, vec.astype(np.float64))
    
        if not isinstance(hdf5_path, tables.file.File):
            hf.close()
        # Python 3.7+ dicts preserve insertion order, so column order is kept
        # (DataFrame.from_items was removed in pandas 1.0).
        return pd.DataFrame(dict(data))
    

    Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

    This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

    Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.
