How do I read a large csv file with pandas?

隐瞒了意图╮ 2020-11-21 07:12

I am trying to read a large csv file (approx. 6 GB) with pandas and I am getting a memory error:

MemoryError                               Traceback (most recent call last)
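
Presumably the read that raises this is a plain single-shot call; a minimal sketch of the kind of call that typically fails on a ~6 GB file (the file name here is a placeholder):

    import pandas as pd

    # Reading the whole file in one call materializes every row in memory,
    # which raises MemoryError when the file is larger than the available RAM.
    df = pd.read_csv("large_file.csv")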


        
15 Answers
  •  無奈伤痛
    2020-11-21 07:32

    You can read the data in chunks and save each chunk as a pickle file.

    import pandas as pd
    import pickle

    in_path = ""          # path to the large csv file
    out_path = ""         # directory to save the pickle files to
    chunk_size = 400000   # pick a chunk size that fits your available memory
    separator = "~"

    reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                         low_memory=False)

    # Write each chunk to its own pickle file so only one chunk is in memory at a time.
    for i, chunk in enumerate(reader):
        out_file = out_path + "/data_{}.pkl".format(i + 1)
        with open(out_file, "wb") as f:
            pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
    

    In the next step you read the pickles back in and concatenate them into the desired DataFrame.

    import glob
    import pandas as pd

    pickle_path = ""   # same path as out_path above, i.e. where the pickle files are

    # DataFrame.append in a loop is slow and was removed in pandas 2.0;
    # collect the pickle files and concatenate them in a single pass instead.
    data_p_files = sorted(glob.glob(pickle_path + "/data_*.pkl"))
    df = pd.concat((pd.read_pickle(name) for name in data_p_files), ignore_index=True)
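
    If the end goal is simply one DataFrame and the combined data fits in memory, the pickle round-trip can also be skipped by concatenating the chunks as they are read; a minimal sketch using the same chunked reader (paths and separator as above):

    import pandas as pd

    in_path = ""        # path to the large csv file, as above
    separator = "~"
    chunk_size = 400000

    # pd.read_csv with chunksize returns an iterator of DataFrames,
    # which pd.concat stitches together in a single pass.
    reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                         low_memory=False)
    df = pd.concat(reader, ignore_index=True)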
    
