I have the following numpy matrix:
M = [
    ['a', 5, 0.2, ''],
    ['a', 2, 1.3, 'as'],
    ['b', 1, 2.3, 'as'],
]
M = np.array(M)
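You can use scikit-learn's DictVectorizer, which one-hot encodes string-valued features and passes numeric ones through unchanged: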
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
dv = DictVectorizer(sparse=False)
# convert_objects was removed in pandas 1.0; convert the numeric columns explicitly
df = pd.DataFrame(M)
df[[1, 2]] = df[[1, 2]].apply(pd.to_numeric)
dv.fit_transform(df.to_dict(orient='records'))
array([[ 5. , 0.2, 1. , 0. , 1. , 0. ],
[ 2. , 1.3, 1. , 0. , 0. , 1. ],
[ 1. , 2.3, 0. , 1. , 0. , 1. ]])
dv.feature_names_ holds the feature name for each column of the output:
[1, 2, '0=a', '0=b', '3=', '3=as']
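For recent library versions, here is a minimal self-contained sketch of the same approach. It assumes the numeric columns are known in advance, and it uses hypothetical string column names (`c0` to `c3`, not in the original). With integer column labels, the numeric feature names stay integers while the encoded ones are strings, and sorting that mix may fail on Python 3. Because of the renamed columns, the feature order differs from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

M = np.array([
    ['a', 5, 0.2, ''],
    ['a', 2, 1.3, 'as'],
    ['b', 1, 2.3, 'as'],
])

# Hypothetical column names; strings keep every feature name the same type.
df = pd.DataFrame(M, columns=['c0', 'c1', 'c2', 'c3'])

# np.array stored everything as strings, so convert the numeric columns back.
df[['c1', 'c2']] = df[['c1', 'c2']].apply(pd.to_numeric)

dv = DictVectorizer(sparse=False)
# String values become one-hot "name=value" features; numbers pass through.
X = dv.fit_transform(df.to_dict(orient='records'))

names = list(dv.feature_names_)
print(names)  # ['c0=a', 'c0=b', 'c1', 'c2', 'c3=', 'c3=as']
print(X)
```

Each row of `X` lines up with `names`, so the first row here reads as c0=a, not c0=b, c1 = 5, c2 = 0.2, c3 empty.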