How do I read a large csv file with pandas?


I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting a memory error:

    MemoryError                               Traceback (most recent call last)

15 Answers
  • 2020-11-21 07:36

    If you use pandas to read the large file in chunks and then yield it row by row, here is what I have done:

    import pandas as pd

    def chunk_generator(filename, chunk_size=10 ** 5):
        # Lazily yield the file one chunk (DataFrame) at a time;
        # parse_dates=[1] parses the second column as dates
        for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                                 chunksize=chunk_size, parse_dates=[1]):
            yield chunk

    def row_generator(filename, chunk_size=10 ** 5):
        # Flatten the chunks so the caller receives one row at a time
        for chunk in chunk_generator(filename, chunk_size=chunk_size):
            for row in chunk.itertuples(index=False):
                yield row

    if __name__ == "__main__":
        filename = r'file.csv'
        for row in row_generator(filename):
            print(row)
    
  • 2020-11-21 07:37

    For large data I recommend you use the library "dask", e.g.:

    # Dataframes implement the Pandas API
    import dask.dataframe as dd
    df = dd.read_csv('s3://.../2018-*-*.csv')
    

    You can read more in the dask documentation.
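
    Note that dask evaluates lazily: read_csv only builds a task graph, and nothing is loaded until you ask for a result. A minimal sketch, assuming a local file.csv with a user_id column (both are placeholders):

    import dask.dataframe as dd

    df = dd.read_csv('file.csv')            # lazy: nothing is read yet
    counts = df['user_id'].value_counts()   # still lazy: extends the task graph
    print(counts.compute())                 # compute() streams the chunks and materializes the result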

    Another great alternative would be to use modin, because all the functionality is identical to pandas, yet it leverages distributed dataframe libraries such as dask.
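
    A minimal sketch of that drop-in usage (assumes modin is installed with a dask or ray backend; file.csv is a placeholder):

    # only the import line changes relative to plain pandas
    import modin.pandas as pd

    df = pd.read_csv('file.csv')  # modin partitions the read across cores
    print(df.head())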

  • 2020-11-21 07:37

    The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use read_table in your program.
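
    For example, these two calls read the same comma-separated file (file.csv is a placeholder):

    import pandas as pd

    df1 = pd.read_csv('file.csv')             # comma is the default delimiter
    df2 = pd.read_table('file.csv', sep=',')  # read_table defaults to tab, so pass sep=','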

    import pandas as pd

    def get_from_action_data(fname, chunk_size=100000):
        reader = pd.read_csv(fname, header=0, iterator=True)
        chunks = []
        loop = True
        while loop:
            try:
                # pull the next chunk, keeping only the columns we need
                chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
                chunks.append(chunk)
            except StopIteration:
                loop = False
                print("Iteration is stopped")

        # stitch the chunks back into one DataFrame
        df_ac = pd.concat(chunks, ignore_index=True)
        return df_ac
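
    A quick usage sketch (the "actions.csv" filename is a placeholder; the user_id and type columns come from the function above). Memory stays manageable because only those two columns are kept from each chunk:

    df = get_from_action_data("actions.csv", chunk_size=100000)
    print(df.shape)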
    