“Large data” work flows using pandas

前端 未结 16 1774
被撕碎了的回忆
被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it\'s out-of-core support.

16条回答
  •  执笔经年
    2020-11-21 08:17

    At the moment I am working "like" you, just on a lower scale, which is why I don't have a PoC for my suggestion.

    However, I seem to find success in using pickle as caching system and outsourcing execution of various functions into files - executing these files from my commando / main file; For example i use a prepare_use.py to convert object types, split a data set into test, validating and prediction data set.

    How does your caching with pickle work? I use strings in order to access pickle-files that are dynamically created, depending on which parameters and data sets were passed (with that i try to capture and determine if the program was already run, using .shape for data set, dict for passed parameters). Respecting these measures, i get a String to try to find and read a .pickle-file and can, if found, skip processing time in order to jump to the execution i am working on right now.

    Using databases I encountered similar problems, which is why i found joy in using this solution, however - there are many constraints for sure - for example storing huge pickle sets due to redundancy. Updating a table from before to after a transformation can be done with proper indexing - validating information opens up a whole other book (I tried consolidating crawled rent data and stopped using a database after 2 hours basically - as I would have liked to jump back after every transformation process)

    I hope my 2 cents help you in some way.

    Greetings.

提交回复
热议问题