“Large data” work flows using pandas

被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support.

16 Answers
  •  爱一瞬间的悲伤
    2020-11-21 08:08

    Why Pandas? Have you tried standard Python?

    Consider using the standard Python library. Pandas is subject to frequent updates, even with the recent release of a stable version.

    With the standard Python library, your code will always run.

    One way of doing it is to decide how you want your data to be stored and which questions you want to answer about it. Then draw a schema of how you can organise your data (think tables) so that it is easy to query, not necessarily a normalised one.

    You can make good use of:

    • a list of dictionaries to store the data in memory (think Amazon EC2) or on disk, one dict being one row,
    • generators to process the data row after row so you do not overflow your RAM,
    • list comprehensions to query your data,
    • Counter, defaultdict, ... from the collections module,
    • whatever storage solution you have chosen to keep your data on your hard drive; JSON could be one of them (see the sketch after this list).
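    Here is a minimal sketch of this approach, assuming a hypothetical sales dataset and a file called sales.jsonl (names invented for illustration); everything below uses only the standard library:

        import json
        from collections import Counter, defaultdict

        # Toy rows, one dict per row; in practice they would come from your source files.
        rows = [
            {"region": "east", "product": "widget", "amount": 10.0},
            {"region": "west", "product": "widget", "amount": 7.5},
            {"region": "east", "product": "gadget", "amount": 3.2},
        ]

        # Persist to disk as one JSON object per line ("JSON lines") so the data
        # can be read back lazily later.
        with open("sales.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

        def read_rows(path="sales.jsonl"):
            """Generator: yield one dict at a time so the whole file never sits in RAM."""
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        # "Query" with a list comprehension, filtering row by row.
        east_sales = [r for r in read_rows() if r["region"] == "east"]

        # Aggregate with Counter / defaultdict instead of a groupby.
        orders_per_region = Counter(r["region"] for r in read_rows())
        revenue_per_region = defaultdict(float)
        for r in read_rows():
            revenue_per_region[r["region"]] += r["amount"]

        print(orders_per_region)         # Counter({'east': 2, 'west': 1})
        print(dict(revenue_per_region))  # {'east': 13.2, 'west': 7.5}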

    RAM and HDD are becoming cheaper and cheaper over time, and standard Python 3 is widely available and stable.

    The fundamental question you are trying to solve is "how do I query large sets of data?". The HDFS architecture is more or less what I am describing here (data modelling, with the data stored on disk).

    Let's say you have 1000 petabytes of data: there is no way you will be able to hold it all in Dask or Pandas, so your best chance is to store it on disk and process it with generators, as sketched below.
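    As a sketch of that "store on disk, query with generators" idea, here is a hypothetical events dataset partitioned into one JSON-lines file per region, a very rough single-machine analogue of how HDFS lays data out over files; the directory name, field names and helper functions are invented for illustration:

        import json
        import os
        from collections import Counter

        DATA_DIR = "events_by_region"  # hypothetical partitioned dataset

        def write_partitioned(rows, key="region", data_dir=DATA_DIR):
            """Append each row to the file of its partition, so a later query
            can open only the partitions it actually needs."""
            os.makedirs(data_dir, exist_ok=True)
            for row in rows:
                path = os.path.join(data_dir, f"{row[key]}.jsonl")
                with open(path, "a", encoding="utf-8") as f:
                    f.write(json.dumps(row) + "\n")

        def scan_partition(value, data_dir=DATA_DIR):
            """Generator over one partition: only the requested file is read,
            and only one row is in memory at a time."""
            path = os.path.join(data_dir, f"{value}.jsonl")
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        # Toy data; a real dataset would be written partition by partition as it arrives.
        write_partitioned([
            {"region": "east", "product": "widget"},
            {"region": "west", "product": "gadget"},
            {"region": "east", "product": "gadget"},
        ])

        # Query: count products sold in the "east" partition without touching the rest.
        print(Counter(r["product"] for r in scan_partition("east")))
        # Counter({'widget': 1, 'gadget': 1})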
