“Large data” workflows using pandas


I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support.

16 Answers
  • 2020-11-21 08:20

    I know this is an old thread but I think the Blaze library is worth checking out. It's built for these types of situations.

    From the docs:

    Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.
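
    For flavor, here is a minimal sketch in the style of the Blaze docs (the file name and column names are hypothetical; the same expression could just as well point at a SQL table or a Spark backend):

    from blaze import data, by, compute

    d = data('trades.csv')                         # could also be e.g. 'postgresql://user@host::trades'
    expr = by(d.symbol, avg_price=d.price.mean())  # the group-by is expressed once...
    print(compute(expr))                           # ...and pushed down to whichever backend holds the data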

    Edit: By the way, it's supported by ContinuumIO and Travis Oliphant, author of NumPy.

  • 2020-11-21 08:21

    If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.

  • 2020-11-21 08:22

    This is the case for pymongo. I have also prototyped using SQL Server, SQLite, HDF, and an ORM (SQLAlchemy) in Python. First and foremost, pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection, and you can have many collections (people, stock market, income).

    pd.DataFrame -> pymongo. Note: I use chunksize in read_csv to keep it to 5-10k records (pymongo drops the socket if larger):

    aCollection.insert((a[1].to_dict() for a in df.iterrows()))
    
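    A sketch of that chunked load (file, database, and collection names are hypothetical; insert_many is the modern pymongo spelling of the insert call above):

    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    aCollection = client['mydb']['people']

    # read the CSV in 5-10k-row chunks so pymongo doesn't drop the socket
    for chunk in pd.read_csv('people.csv', chunksize=5000):
        aCollection.insert_many(chunk.to_dict('records'))   # one document per row
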

    querying: gt = greater than...

    pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))
    

    .find() returns an iterator so I commonly use ichunked to chop into smaller iterators.
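
    A sketch of that batching, assuming ichunked here is the one from more_itertools (toolz.partition_all would work similarly):

    import pandas as pd
    from more_itertools import ichunked

    cursor = mongoCollection.find({'anAttribute': {'$gt': 2887000, '$lt': 2889000}})
    for batch in ichunked(cursor, 1000):       # 1000 documents at a time
        sub_df = pd.DataFrame(list(batch))
        # ... process sub_df, then let the next chunk stream in ...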

    How about a join since I normally get 10 data sources to paste together:

    aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))
    

    then (in my case, sometimes I have to aggregate on aJoinDF first before it is "mergeable"):

    df = pandas.merge(df, aJoinDF, on=aKey, how='left')
    

    And you can then write the new info back to your main collection via the update method below (one logical collection vs. physical data sources):

    collection.update({primarykey:foo},{key:change})
    
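    One caveat: an update document without operators (like the one above) replaces the whole matched document. A sketch of the same idea with the current pymongo method and $set, using hypothetical field names, changes just one field:

    # hypothetical fields: flag one person's record without touching the rest of it
    collection.update_one(
        {'person_id': 12345},          # filter on your primary key field
        {'$set': {'reviewed': True}},  # $set modifies only this field
    )
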

    On smaller lookups, just denormalize. For example, if a document carries a code, just add the code's text as an extra field and do a dict lookup as you create documents.

    Now that you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally, you can read your 3-to-memory-max key indicators into pandas and do pivots/aggregation/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...

    You can also use the two methods built into MongoDB (MapReduce and the aggregation framework). See here for more info about the aggregation framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice that I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, Python toolset, MongoDB helps me just get to work :)
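
    As an illustration only (field names are hypothetical), an aggregation-framework pipeline that does the grouping server-side before anything reaches pandas might look like:

    import pandas as pd

    # average income per code, computed inside MongoDB
    pipeline = [
        {'$match': {'income': {'$exists': True}}},
        {'$group': {'_id': '$code', 'avg_income': {'$avg': '$income'}}},
    ]
    agg_df = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))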

  • 2020-11-21 08:22

    As noted by others, after some years an 'out-of-core' pandas equivalent has emerged: dask. Though dask is not a drop-in replacement for pandas and all of its functionality, it stands out for several reasons:

    Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scales from laptops to clusters.

    Dask emphasizes the following virtues:

    • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
    • Native: Enables distributed computing in Pure Python with access to the PyData stack.
    • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
    • Scales up: Runs resiliently on clusters with 1000s of cores
    • Scales down: Trivial to set up and run on a laptop in a single process
    • Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

    and to add a simple code sample:

    import dask.dataframe as dd
    df = dd.read_csv('2015-*-*.csv')
    df.groupby(df.user_id).value.mean().compute()
    

    replaces some pandas code like this:

    import pandas as pd
    df = pd.read_csv('2015-01-01.csv')
    df.groupby(df.user_id).value.mean()
    

    and, especially noteworthy, it provides a general infrastructure for submitting custom tasks through a concurrent.futures-style interface:

    from dask.distributed import Client
    client = Client('scheduler:port')
    
    futures = []
    for fn in filenames:
        future = client.submit(load, fn)
        futures.append(future)
    
    summary = client.submit(summarize, futures)
    summary.result()
    