MemoryError in toarray when using DictVectorizer of Scikit Learn


無奈伤痛 2021-01-06 05:47

I am trying to implement the SelectKBest algorithm on my data to get the best features out of it. For this I am first preprocessing my data using DictVectorizer and the data

7 answers

星月不相逢 (original poster)

2021-01-06 06:38
In addition to the answers above, you could also try the storage-friendly LabelBinarizer class to build your own custom vectorizer. Here is the code:
```
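from sklearn.preprocessing import LabelBinarizer

def dictsToVecs(list_of_dicts):
    """Binarize each key's values and concatenate the columns row by row."""
    # One output row per input dict; assumes every dict has the same keys.
    X = [[] for _ in list_of_dicts]

    for key in list_of_dicts[0]:
        # Look values up by key rather than by position in dict.values(),
        # and avoid shadowing the built-in name `dict`.
        vals = [d[key] for d in list_of_dicts]

        # A fresh binarizer per key yields that key's indicator columns.
        enc = LabelBinarizer()
        encoded = enc.fit_transform(vals).tolist()

        # Append this key's columns to each row.
        for row, cols in zip(X, encoded):
            row.extend(cols)

    return X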
```
Further, if you have separate train and test data sets, it helps to save the fitted binarizer for each dictionary key at training time, then load them at test time and call transform() instead of fitting again.
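A minimal sketch of that train/test split, assuming every dict shares the same keys. The helper names `fit_encoders` and `transform_dicts` and the sample data are made up for illustration, and the in-memory pickle round trip only stands in for saving the encoders to disk and loading them later:

```python
import pickle

from sklearn.preprocessing import LabelBinarizer

def fit_encoders(train_dicts):
    """Fit one LabelBinarizer per key on the training dicts."""
    encoders = {}
    for key in train_dicts[0]:
        enc = LabelBinarizer()
        enc.fit([d[key] for d in train_dicts])
        encoders[key] = enc
    return encoders

def transform_dicts(dicts, encoders):
    """Encode dicts with already-fitted binarizers, one row per dict."""
    rows = [[] for _ in dicts]
    for key, enc in encoders.items():
        encoded = enc.transform([d[key] for d in dicts]).tolist()
        for row, cols in zip(rows, encoded):
            row.extend(cols)
    return rows

# Hypothetical sample data.
train = [
    {"color": "red", "size": "S"},
    {"color": "green", "size": "M"},
    {"color": "blue", "size": "L"},
]
test = [{"color": "green", "size": "L"}]

# Train time: fit the encoders, then serialize them.
encoders = fit_encoders(train)
saved = pickle.dumps(encoders)
X_train = transform_dicts(train, encoders)

# Test time: restore the encoders and only call transform().
restored = pickle.loads(saved)
X_test = transform_dicts(test, restored)
```

Because the test rows are encoded with the classes learned from the training data, the train and test matrices end up with the same column layout.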