Defining Data Type during csv file import based on column index in pandas

夕颜 2020-12-21 09:36

I need to import a CSV file that has 300+ columns. Only the first column needs to be specified as a category; the rest of the columns should be float.

3 answers

礼貌的吻别 (original poster)

2020-12-21 10:10

There are two scenarios:

You know and can therefore specify the optimal type for each column in advance; or
You don't know optimal types in advance and have to convert to optimal types after reading the file.

Specify in advance

This is the straightforward case. Use a dictionary:

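import pandas as pd

# Map each column name to the dtype it should be parsed as.
type_dict = {'Col_A': 'category', 'Col_B': 'int16',
             'Col_C': 'float16', 'Col_D': 'float32'}

# sep=r'\s+' splits on runs of whitespace (equivalent to the deprecated delim_whitespace=True).
df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)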

If you don't know your column names in advance, read only the header row as an initial step and build the dictionary from column positions:

# nrows=0 parses only the header row.
cols = pd.read_csv(myfile, sep=r'\s+', nrows=0).columns
# Index(['Col_A', 'Col_B', 'Col_C', 'Col_D'], dtype='object')

# First column (by position) becomes a category; all remaining columns become float32.
type_dict = {cols[0]: 'category', **{col: 'float32' for col in cols[1:]}}

df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)
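
Applied to the question's case, where only the first column is a category and every other column is a float, the same idea can be driven purely by column position. The following is a minimal, self-contained sketch, not a definitive implementation: the `data` string is a made-up stand-in for the real 300+ column file, and it assumes whitespace-separated values, read here with `sep=r"\s+"`.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for the real 300+ column file.
data = """Col_A Col_B Col_C Col_D
001 1 2 1.2
002 2 3 3.5
003 3 4.5 7
004 4 6.5 10"""

# Read only the header row to learn the column names.
cols = pd.read_csv(StringIO(data), sep=r"\s+", nrows=0).columns

# Column at position 0 -> category; every other column -> float32.
type_dict = {cols[0]: "category", **{col: "float32" for col in cols[1:]}}

# A fresh StringIO is created for each read, so no rewinding is needed.
df = pd.read_csv(StringIO(data), sep=r"\s+", dtype=type_dict)
print(df.dtypes)
```

Because the dictionary is built from `cols[0]` and `cols[1:]`, no column name has to be typed out by hand, which is what makes the approach workable for hundreds of columns.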

Specify after reading

Often you won't know the optimal type beforehand. In this case, you can read in data as normal and perform conversions for int and float explicitly in a subsequent step:

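# Fix the category column up front; let pandas infer the numeric columns.
df = pd.read_csv(myfile, sep=r'\s+', dtype={'Col_A': 'category'})

# Group the inferred numeric columns by kind: integer vs. float.
cols = {k: df.select_dtypes([k]).columns for k in ('integer', 'float')}

# Downcast each group to the smallest dtype that can hold its values.
for col_type, col_names in cols.items():
    df[col_names] = df[col_names].apply(pd.to_numeric, downcast=col_type)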

print(df.dtypes)

Col_A    category
Col_B        int8
Col_C     float32
Col_D     float32
dtype: object
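
To see what the `downcast` argument does in isolation, here is a small sketch on two standalone Series. The values are illustrative and mirror the test data; the variable names are made up for this example.

```python
import pandas as pd

ints = pd.Series([1, 2, 3, 4])            # inferred as a 64-bit integer column
floats = pd.Series([2.0, 3.0, 4.5, 6.5])  # inferred as a 64-bit float column

# downcast picks the smallest dtype of the requested kind that fits the values.
small_ints = pd.to_numeric(ints, downcast="integer")
small_floats = pd.to_numeric(floats, downcast="float")

print(ints.dtype, "->", small_ints.dtype)
print(floats.dtype, "->", small_floats.dtype)

# Bytes used by the values alone, before and after downcasting.
print(ints.memory_usage(index=False), "->", small_ints.memory_usage(index=False))
```

For float data, `float32` is the smallest dtype `pd.to_numeric` will downcast to, which is why both float columns in the output above end up as `float32`.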

Setup used for testing

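from io import StringIO

# A StringIO buffer is consumed once read; call myfile.seek(0) before passing it to read_csv again.
myfile = StringIO("""Col_A   Col_B   Col_C   Col_D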
001       1       2      1.2
002       2       3      3.5
003       3       4.5      7
004       4       6.5     10""")
