Best way to join two large datasets in Pandas

眼角桃花 2020-11-28 10:06

I'm downloading two datasets from two different databases that need to be joined. Each of them separately is around 500MB when I store them as CSV. Separately they fit into memory, but loading both and merging them in Pandas runs into memory errors.

2 Answers
  • 2020-11-28 10:34

    This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame specific details.

    import dask.dataframe as dd
    
    # Read in the csv files.
    df1 = dd.read_csv('file1.csv')
    df2 = dd.read_csv('file2.csv')
    
    # Merge the csv files.
    df = dd.merge(df1, df2, how='outer', on=['product','version'])
    
    # Write the output.
    df.to_csv('file3.csv', index=False)
    

    Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

    df = dd.concat([df1, df2]).drop_duplicates()
    

    I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
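
    If the merge itself turns out to be the bottleneck, another possible workaround is to merge on the index instead of on columns. A minimal sketch, assuming the same file and column names as above (the combined 'key' column is purely illustrative, since a dask index is a single column):

    import dask.dataframe as dd
    
    df1 = dd.read_csv('file1.csv')
    df2 = dd.read_csv('file2.csv')
    
    # dask indexes are single columns, so collapse the composite key into
    # one string column before indexing (hypothetical 'key' column).
    df1['key'] = df1['product'].astype(str) + '|' + df1['version'].astype(str)
    df2['key'] = df2['product'].astype(str) + '|' + df2['version'].astype(str)
    
    # set_index shuffles each dataframe once up front...
    df1 = df1.set_index('key')
    df2 = df2.set_index('key')
    
    # ...so the merge can align partitions on the index instead of
    # shuffling both dataframes during the join.
    df = dd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    
    # dask writes one CSV file per partition when given a glob pattern.
    df.to_csv('file3-*.csv', index=False)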

  • 2020-11-28 10:49

    I would recommend using an RDBMS like MySQL for this...

    So you would need to load your CSV files into tables first.
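
    One way to do the loading step from Python without reading a whole file into memory is to stream each CSV into its table in chunks with pandas and SQLAlchemy. A rough sketch (the connection string, file names and table names are assumptions; pymysql is one possible driver):

    import pandas as pd
    from sqlalchemy import create_engine
    
    # Hypothetical connection string; adjust user, password, host and database.
    engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
    
    # Stream each CSV into its table in 100k-row chunks so memory stays bounded.
    for csv_file, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
        for chunk in pd.read_csv(csv_file, chunksize=100_000):
            chunk.to_sql(table, engine, if_exists='append', index=False)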

    After that you can perform your checks:

    which products and versions are in the left table only

    SELECT a.product, a.version
    FROM table_a a
    LEFT JOIN table_b b
    ON a.product = b.product AND a.version = b.version
    WHERE b.product IS NULL;
    

    which products and versions are in the right table only

    SELECT b.product, b.version
    FROM table_a a
    RIGHT JOIN table_b b
    ON a.product = b.product AND a.version = b.version
    WHERE a.product IS NULL;
    

    which products and versions are in both tables

    SELECT a.product, a.version
    FROM table_a a
    JOIN table_b b
    ON a.product = b.product AND a.version = b.version;
    

    Configure your MySQL server so that it uses at least 2 GB of RAM.

    You may also want to use the MyISAM engine for your tables; in that case the relevant memory setting is different (see the sketch below).
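
    A rough idea of what that configuration might look like in my.cnf (the values are assumptions; tune them to the RAM you actually have):

    [mysqld]
    # For InnoDB tables, the buffer pool is the main knob.
    innodb_buffer_pool_size = 2G
    # For MyISAM tables, the key buffer matters instead.
    key_buffer_size = 2G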

    It may be slower than Pandas, but you definitely won't run into memory issues.

    Other possible solutions:

    • increase your RAM
    • use Apache Spark SQL (distributed DataFrames) on multiple cluster nodes, though it will be much cheaper simply to increase your RAM; see the sketch after this list
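
    For completeness, a minimal sketch of how the same outer join might look in PySpark (file and column names assumed from the answers above):

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName('join_csvs').getOrCreate()
    
    # Spark reads the CSVs lazily and spills to disk if they don't fit in RAM.
    df1 = spark.read.csv('file1.csv', header=True)
    df2 = spark.read.csv('file2.csv', header=True)
    
    # Outer join on the same composite key as the SQL queries above.
    joined = df1.join(df2, on=['product', 'version'], how='outer')
    
    # Write the result out as CSV (Spark writes one file per partition).
    joined.write.csv('file3', header=True)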