How do I read a large csv file with pandas?

隐瞒了意图╮ 2020-11-21 07:12

I am trying to read a large CSV file (approx. 6 GB) with pandas and I am getting a memory error:

MemoryError                               Traceback (most recent call last)


        
15 Answers
  •  忘了有多久
    2020-11-21 07:34

    I want to give a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.

    Option 1: dtypes

    "dtypes" is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. Pandas, on default, try to infer dtypes of the data.

    Every piece of data that is stored requires a memory allocation. At a basic level, refer to the values below (the table below illustrates the limits for the C programming language):

    The maximum value of UNSIGNED CHAR = 255                                    
    The minimum value of SHORT INT = -32768                                     
    The maximum value of SHORT INT = 32767                                      
    The minimum value of INT = -2147483648                                      
    The maximum value of INT = 2147483647                                       
    The minimum value of CHAR = -128                                            
    The maximum value of CHAR = 127                                             
    The minimum value of LONG = -9223372036854775808                            
    The maximum value of LONG = 9223372036854775807
    

    Refer to this page to see the mapping between NumPy and C types.
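
    If you prefer to check these ranges from Python, NumPy exposes the same limits; a small sketch:

    import numpy as np

    # NumPy counterparts of the C limits above
    print(np.iinfo(np.int8))     # min = -128, max = 127
    print(np.iinfo(np.uint8))    # min = 0, max = 255
    print(np.iinfo(np.int16))    # min = -32768, max = 32767
    print(np.iinfo(np.int64))    # min = -9223372036854775808, max = 9223372036854775807
    print(np.finfo(np.float32))  # precision and range of 32-bit floats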

    Let's say you have an array of integers made up of single digits. You could, both in theory and in practice, store it as an array of 16-bit integers, but you would then allocate more memory than you actually need. To prevent this, you can set the dtype option on read_csv. You do not want to store the items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).

    Observe the dtype map at the following source: https://pbpython.com/pandas_dtypes.html

    You can pass the dtype parameter to pandas read methods as a dict of the form {column: type}.

    import numpy as np
    import pandas as pd

    # Map each column to the smallest dtype that can hold its values
    df_dtype = {
            "column_1": int,
            "column_2": str,
            "column_3": np.int16,
            "column_4": np.uint8,
            # ...
            "column_n": np.float32
    }

    df = pd.read_csv('path/to/file', dtype=df_dtype)
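
    One way to decide which columns are worth downcasting is to peek at a small sample of the file first; a minimal sketch (the sample size is arbitrary):

    # Read only the first rows to inspect inferred dtypes and per-column memory
    sample = pd.read_csv('path/to/file', nrows=10_000)
    print(sample.dtypes)                    # what pandas inferred per column
    print(sample.memory_usage(deep=True))   # bytes per column for the sample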
    

    Option 2: Read by Chunks

    Reading the data in chunks lets you hold only part of it in memory at a time, so you can preprocess each chunk and keep the processed data rather than the raw data. It is even better if you combine this option with the first one, dtypes.

    I want to point out the pandas cookbook sections for this process, which you can find here. Note these two sections (a sketch follows the list):

    • Reading a csv chunk-by-chunk
    • Reading only certain rows of a csv chunk-by-chunk
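
    A minimal sketch of chunk-by-chunk reading, combined with the dtype map from Option 1 and keeping only the processed part of each chunk (the filter on column_4 is just a hypothetical preprocessing step):

    chunks = []
    # chunksize makes read_csv return an iterator of DataFrames instead of one frame
    for chunk in pd.read_csv('path/to/file', dtype=df_dtype, chunksize=100_000):
        # Preprocess each chunk and keep only what you need
        processed = chunk[chunk['column_4'] > 0]
        chunks.append(processed)

    df = pd.concat(chunks, ignore_index=True)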

    Option 3: Dask

    Dask is a framework that is described on Dask's website as:

    Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

    It was born to cover the areas pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.

    You can use Dask to preprocess your data as a whole; Dask takes care of the chunking, so unlike with pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly triggered by compute and/or persist (see the answer here for the difference).
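
    A minimal sketch with Dask, assuming dask is installed (the columns and the aggregation are hypothetical):

    import dask.dataframe as dd

    # read_csv is lazy here: the file is split into partitions, nothing is loaded yet
    ddf = dd.read_csv('path/to/file')

    # Define processing steps; still lazy at this point
    result = ddf[ddf['column_4'] > 0].groupby('column_2')['column_3'].mean()

    # Only compute() actually reads the file and runs the computation,
    # returning an ordinary pandas object that fits in memory
    df = result.compute()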

    Other Aids (Ideas)

    • Design an ETL flow for the data, keeping only what is needed from the raw data.
      • First, apply ETL to whole data with frameworks like Dask or PySpark, and export the processed data.
      • Then see if the processed data fits in memory as a whole (a sketch follows this list).
    • Consider increasing your RAM.
    • Consider working with that data on a cloud platform.
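
    A possible sketch for the ETL idea: do one pass over the raw CSV with Dask, keep only what you need, export it to Parquet, and then try plain pandas on the result (the paths, the columns, and the availability of a Parquet engine such as pyarrow are assumptions):

    import dask.dataframe as dd
    import pandas as pd

    # One-off ETL pass over the raw CSV: keep only the needed columns and rows
    ddf = dd.read_csv('path/to/file', usecols=['column_1', 'column_3', 'column_4'])
    ddf = ddf[ddf['column_4'] > 0]
    ddf.to_parquet('path/to/processed_parquet')   # hypothetical output directory

    # Later: check whether the processed data fits in memory as a whole
    df = pd.read_parquet('path/to/processed_parquet')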
