“Large data” workflows using pandas

被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.
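
For context, one pandas-native way to work on a file larger than memory is chunked iteration, via the chunksize argument to read_csv. A minimal sketch, where big.csv and the columns key and value are placeholders:

    >>> import pandas as pd
    >>> total = None
    >>> for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    ...     # Aggregate each chunk, then fold the partial result into the total.
    ...     part = chunk.groupby("key")["value"].sum()
    ...     total = part if total is None else total.add(part, fill_value=0)
    ...
    >>> total  # per-key sums computed without loading the whole file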

16 Answers
  •  南方客 (OP) 2020-11-21 08:03

    One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float64(5)
    memory usage: 3.7 GB

    >>> df.astype(np.float32).info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float32(5)
    memory usage: 1.9 GB
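
    The same downcast can be applied selectively to a frame that is already loaded. A minimal sketch, assuming the goal is to shrink every float64 column; shrink_floats is a hypothetical helper, not a pandas API:

    >>> def shrink_floats(df):
    ...     # Hypothetical helper: cast each float64 column down to float32,
    ...     # halving the memory those columns occupy.
    ...     out = df.copy()
    ...     cols = out.select_dtypes(include="float64").columns
    ...     out[cols] = out[cols].astype(np.float32)
    ...     return out
    ...
    >>> df = pd.DataFrame(np.random.randn(1_000_000, 5))
    >>> shrink_floats(df).info()  # reports float32(5), roughly half the memory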
    
