I have a dataframe with this type of data (too many columns):
col1 int64
col2 int64
col3 category
col4 category
col5 categ
@Quickbeam2k1 ,see below -
dataset=pd.read_csv('Data2.csv')
np.set_printoptions(threshold=np.nan)
X = dataset.iloc[:,:].values
Using sklearn
from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
Here multiple columns need to be converted. So, one approach i used is ..
for col_name in df.columns:
if(df[col_name].dtype == 'object'):
df[col_name]= df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes
This converts all string / object type columns to categorical. Then applies codes to each type of category.
For a certain column, if you don't care about the ordering, use this
df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])
If you care about the ordering, specify them as a list and use this
df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))
First, to convert a Categorical column to its numerical codes, you can do this easier with: dataframe['c'].cat.codes
.
Further, it is possible to select automatically all columns with a certain dtype in a dataframe using select_dtypes
. This way, you can apply above operation on multiple and automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then by using select_dtypes
to select the columns, and then applying .cat.codes
on each of these columns, you can get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
For converting categorical data in column C of dataset data, we need to do the following:
from sklearn.preprocessing import LabelEncoder
labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C']) #fitting and transforming the desired categorical column.
One of the simplest ways to convert the categorical variable into dummy/indicator variables is to use get_dummies provided by pandas.
Say for example we have data in which sex
is a categorical value (male & female)
and you need to convert it into a dummy/indicator here is how to do it.
tranning_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex", ] //here sex is catagorical value
X_train = pd.get_dummies(tranning_data[features])
print(X_train)
Age Sex_female Sex_male
20 0 1
33 1 0
40 1 0
22 1 0
54 0 1