I am trying to implement the SelectKBest algorithm on my data to get the best features out of it. For this I am first preprocessing my data using DictVectorizer and the data
In addition to the above answers, you could also use the storage-friendly LabelBinarizer
class to build your own custom vectorizer. Here is the code:
    from sklearn.preprocessing import LabelBinarizer

    def dictsToVecs(list_of_dicts):
        # Assumes every dict has the same keys; one binarizer is fitted per key.
        X = [[] for _ in list_of_dicts]
        for key in list_of_dicts[0]:
            # Collect this key's value from every sample.
            vals = [d[key] for d in list_of_dicts]
            enc = LabelBinarizer()
            encoded = enc.fit_transform(vals).tolist()
            # Append this key's encoded columns to each sample's row.
            for row, enc_row in zip(X, encoded):
                row.extend(enc_row)
        return X
Further, if you have separate train and test sets, it helps to save the fitted binarizer for each dictionary key at training time, then load them at test time and call their transform()
method, so that both sets are encoded with the same columns.
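As a rough sketch of that train/test idea, you could keep one fitted LabelBinarizer per key in a dict and reuse it later. The helper names fit_binarizers and transform_dicts and the sample data are made up for illustration, and the code assumes every dict has the same keys. pickle is used here only as one possible way to persist the encoders; it is round-tripped in memory to keep the example short.

```python
import pickle

from sklearn.preprocessing import LabelBinarizer


def fit_binarizers(train_dicts):
    """Fit one LabelBinarizer per key (hypothetical helper, same keys assumed)."""
    return {
        key: LabelBinarizer().fit([d[key] for d in train_dicts])
        for key in train_dicts[0]
    }


def transform_dicts(dicts, binarizers):
    """Encode dicts with already-fitted binarizers (hypothetical helper)."""
    rows = [[] for _ in dicts]
    for key, enc in binarizers.items():
        encoded = enc.transform([d[key] for d in dicts]).tolist()
        # Append this key's encoded columns to each sample's row.
        for row, enc_row in zip(rows, encoded):
            row.extend(enc_row)
    return rows


# Made-up sample data for illustration only.
train = [
    {"color": "red", "size": "S"},
    {"color": "green", "size": "M"},
    {"color": "blue", "size": "L"},
]
test = [{"color": "green", "size": "L"}]

# Train time: fit the encoders and serialize them.
binarizers = fit_binarizers(train)
X_train = transform_dicts(train, binarizers)
saved = pickle.dumps(binarizers)  # in practice, write these bytes to a file

# Test time: load the encoders and only call transform().
loaded = pickle.loads(saved)
X_test = transform_dicts(test, loaded)
```

Because the encoders are fitted only on the training data, the test rows end up with the same number and order of columns as the training rows, which is what a feature selector such as SelectKBest fitted on the training matrix would expect.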