Defining Data Type during csv file import based on column index in pandas


I need to import a CSV file that has 300+ columns. Of these, only the first column needs to be specified as a category; the rest of the columns should be float.


3 Answers

萌比男神i

2020-12-21 09:57

Read the file twice: the first time to get the column names, the second time with dtype specified for each column.

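import pandas as pd
import numpy as np

# Build a small test file with four integer columns A-D.
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df.to_csv('tmp.csv', index=False)

path = 'tmp.csv'
# First pass: nrows=0 reads only the header, which is all we need here.
df = pd.read_csv(path, nrows=0)
type_dict = {}

# Map the first column ('A') to category and every other column to float32.
for key in df.columns:
    if key == 'A':
        type_dict[key] = 'category'
    else:
        type_dict[key] = np.float32

# Second pass: read the data with the explicit dtypes.
df = pd.read_csv(path, dtype=type_dict)
print(df.dtypes)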


一向

2020-12-21 10:03
I think the following will serve the purpose:
```
df = pd.read_csv(path, low_memory=False, dtype={'Col_A':'category'})
```
or if you know it will be the first column:
```
df = pd.read_csv(path, low_memory=False, dtype={0:'category'})
```
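Note that both calls above only fix the type of one column and leave the rest to pandas' type inference. If the remaining columns should be forced to float, one option is to pass a `defaultdict` as `dtype`. This is only a sketch: it assumes pandas 1.5 or later (where `read_csv` accepts a `defaultdict`), and the in-memory CSV and the column names `Col_A`, `Col_B`, `Col_C` are made up for illustration.

```python
from collections import defaultdict
from io import StringIO

import pandas as pd

# Hypothetical in-memory CSV standing in for the real 300+ column file.
csv_text = "Col_A,Col_B,Col_C\nx,1,2.5\ny,3,4.5\nx,5,6.5\n"

# Col_A is listed explicitly; every column that is not listed
# falls back to the dtype returned by the default factory.
dtype_map = defaultdict(lambda: "float32", {"Col_A": "category"})

df = pd.read_csv(StringIO(csv_text), dtype=dtype_map)
print(df.dtypes)
```

The first column name still has to be known here; if it is not, it can be read from the header first.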

礼貌的吻别

2020-12-21 10:10

There are two scenarios:

1. You know and can therefore specify the optimal type for each column in advance; or
2. You don't know optimal types in advance and have to convert to optimal types after reading the file.

Specify in advance

This is the straightforward case. Use a dictionary:

type_dict = {'Col_A': 'category', 'Col_B': 'int16',
             'Col_C': 'float16', 'Col_D': 'float32'}

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated since pandas 2.2
df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)

If you don't know your column names in advance, just read the columns as an initial step:

# nrows=0 parses only the header row (sep=r'\s+' replaces the deprecated delim_whitespace=True)
cols = pd.read_csv(myfile, sep=r'\s+', nrows=0).columns
# Index(['Col_A', 'Col_B', 'Col_C', 'Col_D'], dtype='object')

type_dict = {'Col_A': 'category', **{col: 'float32' for col in cols[1:]}}

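# If myfile is a buffer such as StringIO rather than a path,
# call myfile.seek(0) before reading it a second time.
df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)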
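For the exact case in the question (first column categorical, everything else float), the two reads can be wrapped in a small helper. This is a minimal sketch, not a pandas API: the function name `read_first_as_category`, its `float_dtype` parameter, and the sample data are invented for illustration, and it assumes a file with a header row.

```python
from io import StringIO

import pandas as pd


def read_first_as_category(source, float_dtype="float32", **read_kwargs):
    """Read a CSV with the first column as category and the rest as float.

    `source` is assumed to be a file path or a seekable text buffer.
    """
    # Pass 1: header only, so no data rows are parsed.
    cols = pd.read_csv(source, nrows=0, **read_kwargs).columns
    if hasattr(source, "seek"):
        source.seek(0)  # rewind buffers before the second pass
    type_dict = {cols[0]: "category", **{c: float_dtype for c in cols[1:]}}
    # Pass 2: full read with the explicit dtypes.
    return pd.read_csv(source, dtype=type_dict, **read_kwargs)


# Hypothetical sample data standing in for the real file.
sample = StringIO("id,v1,v2\na,1,2.5\nb,3,4.5\na,5,6.5\n")
df = read_first_as_category(sample)
print(df.dtypes)
```

Extra keyword arguments are forwarded to both `read_csv` calls, so a whitespace-delimited file could be read with `read_first_as_category(path, sep=r"\s+")`.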

Specify after reading

Often you won't know the optimal type beforehand. In this case, you can read in data as normal and perform conversions for int and float explicitly in a subsequent step:

# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated since pandas 2.2
df = pd.read_csv(myfile, sep=r'\s+', dtype={'Col_A': 'category'})

cols = {k: df.select_dtypes([k]).columns for k in ('integer', 'float')}

for col_type, col_names in cols.items():
    df[col_names] = df[col_names].apply(pd.to_numeric, downcast=col_type)

print(df.dtypes)

Col_A    category
Col_B        int8
Col_C     float32
Col_D     float32
dtype: object
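To check whether the downcast is worth it, the per-column memory footprint can be compared before and after. The following sketch reuses the test data from this answer; the variable names `raw`, `before`, and `after` are illustrative, and on a table this small the figures only show the direction of the change.

```python
from io import StringIO

import pandas as pd

raw = """Col_A   Col_B   Col_C   Col_D
001       1       2      1.2
002       2       3      3.5
003       3       4.5      7
004       4       6.5     10"""

# sep=r"\s+" splits on runs of whitespace.
df = pd.read_csv(StringIO(raw), sep=r"\s+", dtype={"Col_A": "category"})
before = df.memory_usage(deep=True, index=False)

# Downcast integer and float columns to the smallest type that fits.
cols = {k: df.select_dtypes([k]).columns for k in ("integer", "float")}
for col_type, col_names in cols.items():
    df[col_names] = df[col_names].apply(pd.to_numeric, downcast=col_type)

after = df.memory_usage(deep=True, index=False)
# Bytes used by each column before and after the downcast.
print(pd.DataFrame({"before": before, "after": after}))
```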

Setup used for testing

from io import StringIO

myfile = StringIO("""Col_A   Col_B   Col_C   Col_D
001       1       2      1.2
002       2       3      3.5
003       3       4.5      7
004       4       6.5     10""")
