I am trying to apply the SelectKBest algorithm to my data to get the best features out of it. For this I am first preprocessing my data using DictVectorizer and the data
When performing fit_transform, instead of passing the whole list of dictionaries to it, create a list that keeps only the unique occurrences of each value. Here is an example:
Transform dictionary:
Before
[ {'A': 1, 'B': 22.1, 'C': 'Red',   'D': 'AB12'},
  {'A': 2, 'B': 23.3, 'C': 'Blue',  'D': 'AB12'},
  {'A': 3, 'B': 20.2, 'C': 'Green', 'D': 'AB65'},
]
After
[ {'A': 1, 'B': 22.1, 'C': 'Red', 'D': 'AB12'},
  {'C': 'Blue'},
  {'C': 'Green', 'D': 'AB65'},
]
This saves a lot of space.
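A minimal sketch of one way to apply this idea, assuming the reduced list is only used to learn the vocabulary with fit() while the full data then goes through transform(); the helper logic and names below are illustrative:

from sklearn.feature_extraction import DictVectorizer

records = [
    {'A': 1, 'B': 22.1, 'C': 'Red',   'D': 'AB12'},
    {'A': 2, 'B': 23.3, 'C': 'Blue',  'D': 'AB12'},
    {'A': 3, 'B': 20.2, 'C': 'Green', 'D': 'AB65'},
]

# Keep each categorical value only the first time it appears, and each
# numeric key only once, so the vectorizer sees every feature exactly once.
seen = set()
reduced = []
for row in records:
    unique_part = {}
    for k, v in row.items():
        marker = (k, v) if isinstance(v, str) else k
        if marker not in seen:
            seen.add(marker)
            unique_part[k] = v
    if unique_part:
        reduced.append(unique_part)

dv = DictVectorizer(sparse=True)
dv.fit(reduced)               # learn the vocabulary from the small list
X = dv.transform(records)     # sparse matrix for the full data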
I was using DictVectorizer to transform categorical database entries into one-hot vectors and kept getting this memory error. I was making the following fatal mistake: d = DictVectorizer(sparse=False). When I called d.transform() on fields with 2000 or more categories, Python would crash. The solution that worked was to instantiate DictVectorizer with sparse=True, which is the default behavior anyway. If you are building one-hot representations of items with many categories, dense arrays are not an efficient structure to use, and calling .toarray() in that case is very wasteful.
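As a rough sketch of the difference, with made-up field names and category counts:

from sklearn.feature_extraction import DictVectorizer

# 10,000 rows, one categorical field with ~2,000 distinct values (illustrative)
rows = [{'item_id': 'item_%d' % (i % 2000)} for i in range(10000)]

d_sparse = DictVectorizer()                # sparse=True is the default
X_sparse = d_sparse.fit_transform(rows)    # scipy.sparse matrix, ~one stored value per row

d_dense = DictVectorizer(sparse=False)     # forces a dense numpy array
X_dense = d_dense.fit_transform(rows)      # 10,000 x 2,000 floats, mostly zeros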
In matrix multiplication, the purpose of a one-hot vector is to select a row or column from some matrix. The same thing can be done more efficiently by simply indexing with the position where the 1 occurs. This is an implicit form of the multiplication that requires orders of magnitude fewer operations than the explicit product.
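For example, a small NumPy sketch of the equivalence:

import numpy as np

W = np.random.rand(2000, 50)        # some weight matrix
one_hot = np.zeros(2000)
one_hot[42] = 1.0

row_explicit = one_hot @ W          # full matrix-vector product: ~2000 * 50 multiplications
row_implicit = W[42]                # plain indexing: no multiplications at all

assert np.allclose(row_explicit, row_implicit)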
If your data has high cardinality because it represents text, you can try a more memory-friendly vectorizer such as HashingVectorizer.
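A brief sketch, assuming the input is raw text documents (the documents and n_features value are illustrative):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['red widget AB12', 'blue widget AB12', 'green widget AB65']

# No vocabulary is stored: tokens are mapped to a fixed number of columns by hashing.
hv = HashingVectorizer(n_features=2**18)
X = hv.transform(docs)   # sparse matrix of shape (3, 262144)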
In addition to the above answers, you can also try using the storage-friendly LabelBinarizer class to build your own custom vectorizer. Here is the code:
from sklearn.preprocessing import LabelBinarizer

def dictsToVecs(list_of_dicts):
    X = []
    # Binarize one key (column) at a time and concatenate the results row by row.
    for key in list_of_dicts[0].keys():
        vals = [d[key] for d in list_of_dicts]
        enc = LabelBinarizer()
        vals = enc.fit_transform(vals).tolist()
        if len(X) == 0:
            X = vals
        else:
            for idx, row in enumerate(X):
                row.extend(vals[idx])
    return X
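For instance, called on the example dictionaries from the earlier answer:

data = [
    {'A': 1, 'B': 22.1, 'C': 'Red',   'D': 'AB12'},
    {'A': 2, 'B': 23.3, 'C': 'Blue',  'D': 'AB12'},
    {'A': 3, 'B': 20.2, 'C': 'Green', 'D': 'AB65'},
]
X = dictsToVecs(data)   # one binarized block per key, concatenated per row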
Further, with separate train and test sets, it can be helpful to save the fitted binarizer instance for each dictionary key at training time, so they can be loaded again and their transform() method applied at test time.
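A hedged sketch of that idea, assuming joblib is used for persistence (the file name, keys, and data are illustrative):

import joblib
from sklearn.preprocessing import LabelBinarizer

# --- training time ---
train_dicts = [{'C': 'Red'}, {'C': 'Blue'}, {'C': 'Green'}]
binarizers = {}
for key in train_dicts[0]:
    enc = LabelBinarizer()
    enc.fit([d[key] for d in train_dicts])
    binarizers[key] = enc
joblib.dump(binarizers, 'binarizers.joblib')

# --- test time ---
binarizers = joblib.load('binarizers.joblib')
test_dicts = [{'C': 'Blue'}, {'C': 'Red'}]
cols = [binarizers[key].transform([d[key] for d in test_dicts]) for key in binarizers]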
The problem was toarray().
DictVectorizer from sklearn (which is designed for vectorizing categorical features with high cardinality) outputs sparse matrices by default. You are running out of memory because you force the dense representation by calling fit_transform().toarray().
Just use:
quote_data = DV.fit_transform(quote_data)
@Serendipity Using the fit_transform function, I also ran into the memory error, and removing a column was not an option in my case. So I removed .toarray() and the code worked fine.
I ran two tests on a smaller dataset, with and without .toarray(), and in both cases the result was an identical matrix.
In short, removing .toarray() did the job!
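Since the original question mentioned SelectKBest: score functions such as chi2 accept the sparse matrix from DictVectorizer directly, so the dense conversion isn't needed there either. A minimal sketch with illustrative data and k:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

data = [{'C': 'Red', 'D': 'AB12'}, {'C': 'Blue', 'D': 'AB12'}, {'C': 'Green', 'D': 'AB65'}]
y = [0, 1, 1]

DV = DictVectorizer()                    # sparse by default
quote_data = DV.fit_transform(data)      # no .toarray()

selector = SelectKBest(chi2, k=2)        # illustrative k
best = selector.fit_transform(quote_data, y)   # works on the sparse matrix directly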