“Large data” workflows using pandas

被撕碎了的回忆 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support.

16 Answers
  •  梦谈多话
    2020-11-21 08:22

    This is the case for pymongo. I have also prototyped using SQL Server, SQLite, HDF, and an ORM (SQLAlchemy) in Python. First and foremost, pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection, and you can have many collections (people, stock market, income).

    pd.DataFrame -> pymongo. Note: I use chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if larger):

    # insert_many() supersedes the old insert(), which was removed in PyMongo 4
    aCollection.insert_many(row.to_dict() for _, row in df.iterrows())
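
    As a concrete sketch of that chunked load (the file name, database/collection names, and chunk size here are illustrative assumptions, not from the original):

    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient()                      # defaults to localhost:27017
    aCollection = client['mydb']['people']      # hypothetical db/collection names

    # stream the CSV in 10k-row chunks so the full file never sits in memory
    for chunk in pd.read_csv('people.csv', chunksize=10_000):
        aCollection.insert_many(row.to_dict() for _, row in chunk.iterrows())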
    

    Querying (gt = greater than, lt = less than):

    # pull the matching documents into memory and build a frame from them
    pd.DataFrame(list(mongoCollection.find({'anAttribute': {'$gt': 2887000, '$lt': 2889000}})))
    

    .find() returns an iterator, so I commonly use ichunked to chop it into smaller iterators.
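
    For example, a sketch of chunking a cursor this way (the chunk size of 1000 is an assumption; ichunked comes from the more-itertools package):

    from more_itertools import ichunked   # pip install more-itertools

    cursor = mongoCollection.find({'anAttribute': {'$gt': 2887000}})
    for chunk in ichunked(cursor, 1000):          # 1000 documents per sub-iterator
        sub = pd.DataFrame(list(chunk))
        # ... process each manageable sub-frame here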

    How about a join, since I normally have about 10 data sources to paste together:

    # fetch only the documents whose key appears in Att_Keys, then frame them
    aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute': {'$in': Att_Keys}})))
    

    then (in my case I sometimes have to aggregate on aJoinDF first before it is "mergeable"):

    # a standard left join back onto the main frame
    df = pandas.merge(df, aJoinDF, on=aKey, how='left')
    

    You can then write the new info back to your main collection via the update method below (one logical collection vs. physical data sources).

    # update_one() supersedes the old update(); $set changes only the named field
    collection.update_one({primarykey: foo}, {'$set': {key: change}})
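
    A sketch of that write-back for the merged frame (the key and column names here are made up for illustration):

    # assumed: 'aKey' identifies the document and 'newAttr' came from the merge
    for _, row in df.iterrows():
        collection.update_one({'aKey': row['aKey']},
                              {'$set': {'newAttr': row['newAttr']}})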
    

    For smaller lookups, just denormalize. For example, if each document carries a code, just add the code's text as a field and do a dict lookup as you create the documents.
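
    A minimal sketch of that denormalization (the lookup table and field names are hypothetical):

    code_text = {'A1': 'approved', 'B2': 'pending'}   # hypothetical lookup dict

    doc = {'person_id': 42, 'code': 'A1'}
    doc['code_text'] = code_text[doc['code']]         # denormalize at creation time
    collection.insert_one(doc)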

    Now that you have a nice dataset built around a person, you can unleash your logic on each case and make more attributes. Finally, you can read your 3-to-memory-max key indicators into pandas and do pivots/aggregation/data exploration. This works for me with 3 million records mixing numbers/big text/categories/codes/floats/...
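
    A sketch of that last step, projecting just a few indicator columns and pivoting (the column names are assumptions):

    key_cols = ['person_id', 'segment', 'income']     # the few indicators that fit in memory
    frame = pd.DataFrame(list(mongoCollection.find({}, {c: 1 for c in key_cols})))
    frame.pivot_table(values='income', index='segment', aggfunc='mean')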

    You can also use the two methods built into MongoDB (MapReduce and the aggregation framework). The aggregation framework seems easier than MapReduce and looks handy for quick aggregate work (see the MongoDB documentation for more). Notice I never needed to define my fields or relations, and I can add items to a document at any time. At the current state of the rapidly changing numpy, pandas, Python toolset, MongoDB helps me just get to work :)
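
    For instance, a sketch of a server-side group-by with the aggregation framework (the 'segment' and 'income' fields are assumptions):

    pipeline = [
        {'$match': {'income': {'$gt': 0}}},                                  # filter server-side
        {'$group': {'_id': '$segment', 'avg_income': {'$avg': '$income'}}},  # group-by + mean
    ]
    results = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))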
