I need to import a CSV file that has 300+ columns. Only the first column needs to be specified as a category; the rest of the columns should be float.
Read the file twice: the first pass collects the column names, and the second pass supplies an explicit dtype for each column.
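import pandas as pd
import numpy as np

# Build a small example file to read back
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
df.to_csv('tmp.csv', index=False)
path = 'tmp.csv'

# First pass: nrows=0 parses only the header, which is all we need here
df = pd.read_csv(path, nrows=0)
type_dict = {}
for key in df.columns:
    if key == 'A':
        type_dict[key] = 'category'
    else:
        type_dict[key] = np.float32

# Second pass: read the data with the dtypes fixed up front
df = pd.read_csv(path, dtype=type_dict)
print(df.dtypes)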
I think the following will serve the purpose:
df = pd.read_csv(path, low_memory=False, dtype={'Col_A':'category'})
Or, if you know it will be the first column:
df = pd.read_csv(path, low_memory=False, dtype={0:'category'})
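Note that the calls above only fix the dtype of one column; the remaining columns are still inferred. If every other column should share one dtype, a single-pass option is to pass a collections.defaultdict as dtype, which read_csv accepts from pandas 1.5 onward. The sketch below is only an illustration under that version assumption; the column names and the in-memory CSV are made up for the example.

```python
from collections import defaultdict
from io import StringIO

import pandas as pd

# Hypothetical data standing in for the real file (not from the original post).
csv_text = "name,x,y,z\na,1,2,3.5\nb,4,5,6.5\na,7,8,9.5\n"

# Explicit entry for the category column; every other column falls back to float32.
# Requires pandas >= 1.5, where defaultdict support for dtype was added.
dtypes = defaultdict(lambda: "float32", {"name": "category"})

df = pd.read_csv(StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

This avoids reading the header separately, at the cost of depending on a newer pandas.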
There are two scenarios: either you know the types you want in advance, or you don't.
If you know the types, this is the straightforward case. Use a dictionary:
type_dict = {'Col_A': 'category', 'Col_B': 'int16',
             'Col_C': 'float16', 'Col_D': 'float32'}
# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)
If you don't know your column names in advance, just read the columns as an initial step:
cols = pd.read_csv(myfile, sep=r'\s+', nrows=0).columns
# Index(['Col_A', 'Col_B', 'Col_C', 'Col_D'], dtype='object')
# If myfile is a file-like object rather than a path, rewind it with myfile.seek(0) before the next read
type_dict = {'Col_A': 'category', **{col: 'float32' for col in cols[1:]}}
df = pd.read_csv(myfile, sep=r'\s+', dtype=type_dict)
Often you won't know the optimal type beforehand. In this case, you can read in data as normal and perform conversions for int and float explicitly in a subsequent step:
df = pd.read_csv(myfile, sep=r'\s+', dtype={'Col_A': 'category'})
# Group column names by kind, then downcast each group to the smallest dtype that fits
cols = {k: df.select_dtypes([k]).columns for k in ('integer', 'float')}
for col_type, col_names in cols.items():
    df[col_names] = df[col_names].apply(pd.to_numeric, downcast=col_type)
print(df.dtypes)

Col_A    category
Col_B        int8
Col_C     float32
Col_D     float32
dtype: object
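To see what the downcasting step buys you, one option is to compare DataFrame.memory_usage before and after. The following is a rough sketch on made-up data (the frame, its size, and the value ranges are assumptions for illustration, not the data from this answer); actual savings depend on your value ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one category column, one int64 column, one float64 column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Col_A": pd.Categorical(rng.choice(["001", "002", "003"], size=1000)),
    "Col_B": rng.integers(0, 100, size=1000),      # small integers, fit in int8
    "Col_C": rng.integers(0, 400, size=1000) / 4,  # multiples of 0.25, exact in float32
})

before = df.memory_usage(deep=True).sum()

# Same downcasting loop as above: handle integer and float columns separately.
for kind in ("integer", "float"):
    names = df.select_dtypes([kind]).columns
    df[names] = df[names].apply(pd.to_numeric, downcast=kind)

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```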
Setup used for testing
from io import StringIO
myfile = StringIO("""Col_A Col_B Col_C Col_D
001 1 2 1.2
002 2 3 3.5
003 3 4.5 7
004 4 6.5 10""")
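One caveat with this setup: a StringIO buffer is consumed by the first read_csv call, so it has to be rewound before it is read again. Below is a minimal end-to-end sketch of the header-first approach on the test data, assuming an in-memory buffer rather than a file path (with a path, no rewinding is needed). It uses sep=r"\s+", which is equivalent to delim_whitespace=True.

```python
from io import StringIO

import pandas as pd

myfile = StringIO("""Col_A Col_B Col_C Col_D
001 1 2 1.2
002 2 3 3.5
003 3 4.5 7
004 4 6.5 10""")

# Pass 1: parse only the header row to learn the column names.
cols = pd.read_csv(myfile, sep=r"\s+", nrows=0).columns

# The buffer is now exhausted; rewind it before reading the data.
myfile.seek(0)

# First column becomes a category, every remaining column float32.
type_dict = {cols[0]: "category", **{col: "float32" for col in cols[1:]}}
df = pd.read_csv(myfile, sep=r"\s+", dtype=type_dict)
print(df.dtypes)
```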