“Large data” workflows using pandas

被撕碎了的回忆  2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 Answers
  •  不知归路
    2020-11-21 08:19

    I think the answers above are missing a simple approach that I've found very useful.

    When I have a file that is too large to load into memory, I break it up into multiple smaller files (either by rows or by columns).

    Example: In the case of 30 days' worth of trading data of ~30GB size, I break it into one file per day of ~1GB size. I then process each file separately and aggregate the results at the end.
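    A minimal sketch of this split-then-aggregate pattern. The file name "trades.csv" and the "date", "symbol", and "volume" columns are hypothetical names chosen for illustration, not from the original answer:

        import glob
        import os

        import pandas as pd

        # Split the big file into one file per day, reading it in manageable chunks.
        for chunk in pd.read_csv("trades.csv", chunksize=1_000_000, parse_dates=["date"]):
            for day, frame in chunk.groupby(chunk["date"].dt.date):
                path = f"trades_{day}.csv"
                # Append, since rows for one day may be spread across several chunks.
                frame.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

        # Process each daily file separately, then aggregate the per-file results.
        def summarize(path):
            df = pd.read_csv(path, parse_dates=["date"])
            return df.groupby("symbol")["volume"].sum()

        parts = [summarize(p) for p in sorted(glob.glob("trades_*.csv"))]
        total = pd.concat(parts).groupby(level=0).sum()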

    One of the biggest advantages is that it allows parallel processing of the files (with either multiple threads or multiple processes).
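    One way to exploit that parallelism, sketched with the standard-library multiprocessing module; the summarize function and "trades_*.csv" file pattern are the same hypothetical names used in the sketch above:

        from multiprocessing import Pool
        import glob

        import pandas as pd

        def summarize(path):
            # Same hypothetical per-day summary as in the sketch above.
            df = pd.read_csv(path, parse_dates=["date"])
            return df.groupby("symbol")["volume"].sum()

        if __name__ == "__main__":
            paths = sorted(glob.glob("trades_*.csv"))
            # Each daily file is independent, so a worker pool can process them concurrently.
            with Pool(processes=4) as pool:
                parts = pool.map(summarize, paths)
            total = pd.concat(parts).groupby(level=0).sum()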

    The other advantage is that file manipulation (like adding or removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.

    This approach doesn't cover all scenarios, but it is very useful in a lot of them.
