I'm downloading two datasets from two different databases that need to be joined. Each of them separately is around 500 MB when I store them as CSV. Separately they fit into memory, but loading and joining both at once does not.
This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd

# Read in the CSV files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the two dataframes on the shared key columns.
df = dd.merge(df1, df2, how='outer', on=['product', 'version'])

# Write the output. By default dask writes one CSV per partition;
# single_file=True combines them into a single output file.
df.to_csv('file3.csv', index=False, single_file=True)
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try; a sketch of an index-aligned merge follows.
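If you could get away with joining on a single key, an index-aligned merge is the faster path. A minimal sketch, assuming (hypothetically) that 'product' alone were enough to join on:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Setting the join key as the index lets dask align partitions,
# avoiding the full shuffle that a column-based merge requires.
df1 = df1.set_index('product')
df2 = df2.set_index('product')

df = df1.merge(df2, how='outer', left_index=True, right_index=True)

Note that set_index itself shuffles the data, so this mainly pays off if you reuse the indexed frames for several operations.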
I would recommend using an RDBMS like MySQL for that...
You would need to load your CSV files into tables first; one way to do that from Python is sketched below.
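To load the files without reading either one fully into memory, you can stream them through pandas in chunks. A minimal sketch; the connection string, table names, and chunk size are placeholder assumptions to adapt:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials and database name; adjust to your setup.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

# Stream each CSV into its table in chunks, so the whole file
# never has to sit in memory at once.
for path, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
    for chunk in pd.read_csv(path, chunksize=100_000):
        chunk.to_sql(table, engine, if_exists='append', index=False)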
After that you can perform your checks:
Which products and versions are in the left table only:
SELECT a.product, a.version
FROM table_a a
LEFT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE b.product IS NULL;
Which products and versions are in the right table only:
SELECT b.product, b.version
FROM table_a a
RIGHT JOIN table_b b
ON a.product = b.product AND a.version = b.version
WHERE a.product IS NULL;
Which products and versions are in both:
SELECT a.product, a.version
FROM table_a a
JOIN table_b b
ON a.product = b.product AND a.version = b.version;
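If you'd rather run these checks from Python than from the mysql client, something like the following works; the connection string is a placeholder, and the query shown is the left-only check from above:

from sqlalchemy import create_engine, text

# Hypothetical connection string; adjust credentials and database.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

left_only = text("""
    SELECT a.product, a.version
    FROM table_a a
    LEFT JOIN table_b b
      ON a.product = b.product AND a.version = b.version
    WHERE b.product IS NULL
""")

# Stream the results rather than fetching them all at once.
with engine.connect() as conn:
    for product, version in conn.execute(left_only):
        print(product, version)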
Configure your MySQL server so that it uses at least 2 GB of RAM; for InnoDB tables the relevant setting is innodb_buffer_pool_size.
You may also want to use the MyISAM engine for your tables, in which case key_buffer_size is the setting to raise instead.
It might work slower compared to pandas, but you definitely won't have memory issues.
Other possible solutions: