Best way to join two large datasets in Pandas

眼角桃花 2020-11-28 10:06

I'm downloading two datasets from two different databases that need to be joined. Each of them separately is around 500MB when I store them as CSV. Separately they fit into memory, but loading both and merging them in Pandas runs into memory errors.

2 Answers
  • 2020-11-28 10:34

    This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame specific details.

    import dask.dataframe as dd
    
    # Read in the csv files.
    df1 = dd.read_csv('file1.csv')
    df2 = dd.read_csv('file2.csv')
    
    # Merge the csv files.
    df = dd.merge(df1, df2, how='outer', on=['product','version'])
    
    # Write the output.
    df.to_csv('file3.csv', index=False)
    

    Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

    df = dd.concat([df1, df2]).drop_duplicates()
    

    I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
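
    If the merge itself turns out to be the bottleneck, another possible workaround is to merge on the index instead of on columns. A minimal sketch, assuming the same file and column names as above (the combined 'key' column is purely illustrative, since a dask index is a single column):

    import dask.dataframe as dd
    
    df1 = dd.read_csv('file1.csv')
    df2 = dd.read_csv('file2.csv')
    
    # dask indexes are single columns, so collapse the composite key into
    # one string column before indexing (hypothetical 'key' column).
    df1['key'] = df1['product'].astype(str) + '|' + df1['version'].astype(str)
    df2['key'] = df2['product'].astype(str) + '|' + df2['version'].astype(str)
    
    # set_index shuffles each dataframe once up front...
    df1 = df1.set_index('key')
    df2 = df2.set_index('key')
    
    # ...so the merge can align partitions on the index instead of
    # shuffling both dataframes during the join.
    df = dd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    
    # dask writes one CSV file per partition when given a glob pattern.
    df.to_csv('file3-*.csv', index=False)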

  • 2020-11-28 10:49

    I would recommend using an RDBMS like MySQL for this...

    So you would need to load your CSV files into tables first.
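
    One way to do the loading step from Python without reading a whole file into memory is to stream each CSV into its table in chunks with pandas and SQLAlchemy. A rough sketch (the connection string, file names and table names are assumptions; pymysql is one possible driver):

    import pandas as pd
    from sqlalchemy import create_engine
    
    # Hypothetical connection string; adjust user, password, host and database.
    engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
    
    # Stream each CSV into its table in 100k-row chunks so memory stays bounded.
    for csv_file, table in [('file1.csv', 'table_a'), ('file2.csv', 'table_b')]:
        for chunk in pd.read_csv(csv_file, chunksize=100_000):
            chunk.to_sql(table, engine, if_exists='append', index=False)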

    After that you can perform your checks:

    which products and versions are in the left table only

    SELECT a.product, a.version
    FROM table_a a
    LEFT JOIN table_b b
    ON a.product = b.product AND a.version = b.version
    WHERE b.product IS NULL;
    

    which products and versions are in the right table only

    SELECT b.product, b.version
    FROM table_a a
    RIGHT JOIN table_b b
    ON a.product = b.product AND a.version = b.version
    WHERE a.product IS NULL;
    

    which products and versions are in both tables

    SELECT a.product, a.version
    FROM table_a a
    JOIN table_b b
    ON a.product = b.product AND a.version = b.version;
    

    Configure your MySQL server so that it uses at least 2 GB of RAM.

    You may also want to use the MyISAM engine for your tables; in that case the relevant memory setting is different (see the sketch below).
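
    A rough idea of what that configuration might look like in my.cnf (the values are assumptions; tune them to the RAM you actually have):

    [mysqld]
    # For InnoDB tables, the buffer pool is the main knob.
    innodb_buffer_pool_size = 2G
    # For MyISAM tables, the key buffer matters instead.
    key_buffer_size = 2G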

    It may be slower than Pandas, but you definitely won't run into memory issues.

    Other possible solutions:

    • increase your RAM
    • use Apache Spark SQL (distributed DataFrames) on multiple cluster nodes, though it will be much cheaper simply to increase your RAM; see the sketch after this list
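
    For completeness, a minimal sketch of how the same outer join might look in PySpark (file and column names assumed from the answers above):

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName('join_csvs').getOrCreate()
    
    # Spark reads the CSVs lazily and spills to disk if they don't fit in RAM.
    df1 = spark.read.csv('file1.csv', header=True)
    df2 = spark.read.csv('file2.csv', header=True)
    
    # Outer join on the same composite key as the SQL queries above.
    joined = df1.join(df2, on=['product', 'version'], how='outer')
    
    # Write the result out as CSV (Spark writes one file per partition).
    joined.write.csv('file3', header=True)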