Defining Data Type during csv file import based on column index in pandas

夕颜 2020-12-21 09:36

I need to import a CSV file that has 300+ columns. Only the first column needs to be specified as a category; all the remaining columns should be float.

3 answers
  • 2020-12-21 09:57

    Read the file twice: the first time to get all the column names, the second time passing a dtype for each column.

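    import pandas as pd
    import numpy as np

    # Build a small sample file to test with.
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.to_csv('tmp.csv', index=False)

    path = 'tmp.csv'
    # First read: only used to discover the column names.
    df = pd.read_csv(path)

    # Map column 'A' to category and every other column to float32.
    type_dict = {}
    for key in df.columns:
        if key == 'A':
            type_dict[key] = 'category'
        else:
            type_dict[key] = np.float32

    # Second read: apply the dtypes while parsing.
    df = pd.read_csv(path, dtype=type_dict)
    print(df.dtypes)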
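
If the file fits comfortably in memory, a single read followed by `astype` is one possible alternative to parsing the file twice. This is only a sketch of a different technique, not the method from the answer above: the inline CSV text is made up for illustration, and the first column is picked by position rather than by name. The intermediate frame briefly holds the inferred dtypes, so peak memory may be higher than when dtypes are applied during parsing.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "A,B,C,D\n1,2,3,4\n5,6,7,8\n1,9,10,11\n"

# Single read with inferred dtypes.
df = pd.read_csv(StringIO(csv_text))

# Select columns by position: index 0 becomes category, the rest float32.
first, rest = df.columns[0], df.columns[1:]
df = df.astype({first: "category", **{col: "float32" for col in rest}})

print(df.dtypes)
```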
    
  • 2020-12-21 10:03

    I think the following will serve the purpose:

    df = pd.read_csv(path, low_memory=False, dtype={'Col_A':'category'})
    

    Or, if you know it will be the first column:

    df = pd.read_csv(path, low_memory=False, dtype={0:'category'})
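
Both calls above set a dtype only for the first column, so pandas still infers the types of the others. A minimal sketch of one way to also force every remaining column to float, assuming pandas 1.5 or later, where `read_csv` accepts a `defaultdict` for `dtype` and uses its default for columns that are not listed. The inline CSV text and its column names are hypothetical.

```python
from collections import defaultdict
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = "label,x,y,z\na,1,2.5,3\nb,4,5,6.5\na,7,8,9\n"

# Read only the header row to learn the name of the first column.
first_col = pd.read_csv(StringIO(csv_text), nrows=0).columns[0]

# Any column not listed explicitly falls back to float64 (needs pandas >= 1.5).
dtypes = defaultdict(lambda: "float64", {first_col: "category"})

df = pd.read_csv(StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

With this approach the 300+ float columns never have to be spelled out individually.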
    
  • 2020-12-21 10:10

    There are two scenarios:

    1. You know and can therefore specify the optimal type for each column in advance; or
    2. You don't know optimal types in advance and have to convert to optimal types after reading the file.

    Specify in advance

    This is the straightforward case. Use a dictionary:

    type_dict = {'Col_A': 'category', 'Col_B': 'int16',
                 'Col_C': 'float16', 'Col_D': 'float32'}
    
    # sep=r'\s+' splits on whitespace; delim_whitespace is deprecated since pandas 2.2.
    df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)
    

    If you don't know your column names in advance, just read the columns as an initial step:

    # nrows=0 returns an empty frame that still carries the header.
    cols = pd.read_csv(myfile, sep=r'\s+', nrows=0).columns
    # Index(['Col_A', 'Col_B', 'Col_C', 'Col_D'], dtype='object')
    
    # First column (by position) as category, all remaining columns as float32.
    type_dict = {cols[0]: 'category', **{col: 'float32' for col in cols[1:]}}
    
    df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)
    

    Specify after reading

    Often you won't know the optimal type beforehand. In this case, you can read in data as normal and perform conversions for int and float explicitly in a subsequent step:

    df = pd.read_csv(myfile, sep=r'\s+', dtype={'Col_A': 'category'})
    
    # Group the column names by their inferred kind: integer or float.
    cols = {k: df.select_dtypes([k]).columns for k in ('integer', 'float')}
    
    # Downcast each group to the smallest dtype that can hold its values.
    for col_type, col_names in cols.items():
        df[col_names] = df[col_names].apply(pd.to_numeric, downcast=col_type)
    
    print(df.dtypes)
    
    Col_A    category
    Col_B        int8
    Col_C     float32
    Col_D     float32
    dtype: object
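
To see why the output above shows `int8` and `float32`, here is a small isolated sketch of what `pd.to_numeric` does with `downcast`. The two Series are made-up values similar to the test data: integers are shrunk to the smallest integer dtype that holds them, while floats are never shrunk below `float32`.

```python
import pandas as pd

# Made-up values resembling the integer and float columns of the test file.
ints = pd.Series([1, 2, 3, 4])            # default integer dtype
floats = pd.Series([2.0, 3.0, 4.5, 6.5])  # float64

# downcast picks the smallest dtype of the requested kind that fits the data.
small_ints = pd.to_numeric(ints, downcast="integer")
small_floats = pd.to_numeric(floats, downcast="float")

print(small_ints.dtype)    # int8
print(small_floats.dtype)  # float32
```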
    

    Setup used for testing

    from io import StringIO
    
    myfile = StringIO("""Col_A   Col_B   Col_C   Col_D
    001       1       2      1.2
    002       2       3      3.5
    003       3       4.5      7
    004       4       6.5     10""")
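
One caveat when reusing this setup for the two-step approach: `myfile` here is an in-memory `StringIO` buffer rather than a path, so each `read_csv` call continues from wherever the previous one stopped. A second read without rewinding would typically find no data left. A minimal sketch, assuming the same test data, that rewinds the buffer with `seek(0)` between the header read and the full read:

```python
from io import StringIO

import pandas as pd

myfile = StringIO("""Col_A   Col_B   Col_C   Col_D
001       1       2      1.2
002       2       3      3.5
003       3       4.5      7
004       4       6.5     10""")

# Header-only read to get the column names.
cols = pd.read_csv(myfile, sep=r"\s+", nrows=0).columns

# Rewind the buffer; not needed when myfile is a file path.
myfile.seek(0)

# First column as category, all remaining columns as float32.
type_dict = {cols[0]: "category", **{c: "float32" for c in cols[1:]}}
df = pd.read_csv(myfile, sep=r"\s+", dtype=type_dict)

print(df.dtypes)
```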
    