MemoryError in toarray when using DictVectorizer of Scikit Learn

無奈伤痛 2021-01-06 05:47

I am trying to implement the SelectKBest algorithm on my data to get the best features out of it. For this I am first preprocessing my data using DictVectorizer and the data

7 answers
  •  星月不相逢
    2021-01-06 06:38

    In addition to the above answers, you can build your own custom vectorizer with the storage-friendly LabelBinarizer class. Here is the code:

    from sklearn.preprocessing import LabelBinarizer

    def dictsToVecs(list_of_dicts):
        """One-hot encode each key of the dicts and concatenate the columns per row."""
        X = [[] for _ in list_of_dicts]

        # Assumes every dict has the same keys as the first one.
        for key in list_of_dicts[0]:
            vals = [d[key] for d in list_of_dicts]

            # Fit a separate binarizer for this key.
            enc = LabelBinarizer()
            encoded = enc.fit_transform(vals).tolist()

            # Append this key's encoded columns to each row.
            for row, enc_row in zip(X, encoded):
                row.extend(enc_row)

        return X
    

    Further, if the training and test sets are separate, it helps to save the binarizer instance fitted for each dictionary key at training time, then load these at test time and call only their transform() method.
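
    The train/test split described above could look roughly like the sketch below. The helper names fit_binarizers and transform_dicts, the sample data, and the use of pickle for persistence are assumptions made for illustration, not part of the original answer; it also assumes every dict carries the same keys.

```python
import pickle

from sklearn.preprocessing import LabelBinarizer


def fit_binarizers(train_dicts):
    """Fit one LabelBinarizer per key on the training dicts (hypothetical helper)."""
    # Assumes every dict has the same keys as the first one.
    return {
        key: LabelBinarizer().fit([d[key] for d in train_dicts])
        for key in train_dicts[0]
    }


def transform_dicts(list_of_dicts, encoders):
    """Encode dicts with already-fitted binarizers, one row per dict (hypothetical helper)."""
    X = [[] for _ in list_of_dicts]
    for key, enc in encoders.items():
        # Only transform() is called here, so the training-time classes are reused.
        encoded = enc.transform([d[key] for d in list_of_dicts]).tolist()
        for row, enc_row in zip(X, encoded):
            row.extend(enc_row)
    return X


# Made-up sample data for illustration.
train = [
    {"color": "red", "size": "S"},
    {"color": "green", "size": "M"},
    {"color": "blue", "size": "L"},
]
test = [{"color": "green", "size": "L"}]

encoders = fit_binarizers(train)
X_train = transform_dicts(train, encoders)

# Persist the fitted encoders, then reload them as one would at test time.
restored = pickle.loads(pickle.dumps(encoders))
X_test = transform_dicts(test, restored)
```

    In practice the pickled bytes would be written to a file after training and read back in the test script. Reusing the fitted instances keeps the column layout of the test matrix identical to that of the training matrix.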
