AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn

Submitted by 核能气质少年 on 2019-12-08 17:26:38

Question


I have a corpus as such:

X_train = [ ['this is an dummy example'],
      ['in reality this line is very long'],
      ...
      ['here is a last text in the training set']
    ]

and some labels:

y_train = [1, 5, ... , 3]

I would like to use Pipeline and GridSearch as follows:

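from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])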


parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X_train, y_train)

When I run this, I get an error saying AttributeError: lower not found.

I searched and found a question about this error here, which led me to believe that there was a problem with my text not being tokenized (which sounded like it hit the nail on the head, since I was using a list of lists as input data, where each inner list contained one single unbroken string).

I cooked up a quick and dirty tokenizer to test this theory:

def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist

which does what it is supposed to, but when I use it in the arguments to the CountVectorizer:

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),

...I still get the same error as if nothing happened.

I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline, which is strange: I didn't think you could use the TfidfTransformer() without first having a data structure to transform (in this case, the matrix of counts).

Why do I keep getting this error? Actually, it would be nice to know what this error means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace). Am I misusing the Pipeline...or is the problem really an issue with the arguments to the CountVectorizer alone?

Any advice would be greatly appreciated.


Answer 1:


This happens because your dataset is in the wrong format. CountVectorizer's fit function (or the pipeline's, it makes no difference) expects "An iterable which yields either str, unicode or file objects", not an iterable of other iterables that contain the texts, as in your code. In your case the outer list is the iterable, so you should pass a flat list whose members are strings, not other lists.
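To see where the word "lower" in the error comes from: by default CountVectorizer lowercases every document (lowercase=True), so it calls .lower() on each item the iterable yields. The snippet below is a simplified stand-in for that step, not scikit-learn's actual code; the function name lowercase_all is made up for illustration. It needs only plain Python.

```python
def lowercase_all(documents):
    # Hypothetical stand-in for the vectorizer's default lowercasing step:
    # it assumes every item is a str and calls .lower() on it.
    return [doc.lower() for doc in documents]


# Nested format from the question: each item is a list, not a str.
nested = [['this is an dummy example'],
          ['in reality this line is very long']]

# Flat format: each item is a str.
flat = ['this is an dummy example',
        'in reality this line is very long']

flat_result = lowercase_all(flat)  # works, every item has .lower()

try:
    lowercase_all(nested)
    nested_error = None
except AttributeError as exc:
    # A list has no .lower() method, so the call fails here.
    nested_error = str(exc)

print(nested_error)
```

The exact wording of the message depends on the type of the offending item, but the cause is the same: something other than a string reached the lowercasing step.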

i.e. your dataset should look like:

X_train = ['this is an dummy example',
      'in reality this line is very long',
      ...
      'here is a last text in the training set'
    ]
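If the data already exists in the nested form, it can be flattened before fitting. The following is a minimal sketch under some assumptions: every inner list holds exactly one string, the three documents and labels are just the visible ones from the question, and random_state is added only to make the run repeatable. The grid search is left out to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

# Nested corpus as in the question (toy subset).
nested = [['this is an dummy example'],
          ['in reality this line is very long'],
          ['here is a last text in the training set']]
y_train = [1, 5, 3]

# Assumption: each inner list contains exactly one string.
X_flat = [doc[0] for doc in nested]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor(random_state=0)),
])

# With a flat list of strings the vectorizer can lowercase and tokenize
# each document, so fitting no longer raises AttributeError.
pipeline.fit(X_flat, y_train)
predictions = pipeline.predict(X_flat)
print(predictions.shape)
```

The same X_flat can be passed to grid_search.fit(X_flat, y_train) in place of the nested X_train.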

Take a look at this very useful example: Sample pipeline for text feature extraction and evaluation
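As for the custom tokenizer in the question: the callable passed as tokenizer is applied to one document at a time, after the lowercasing step, so it never gets a chance to "fix" a nested corpus and the error appears before it is called. A tokenizer that fits this contract might look like the sketch below; whitespace_tokenizer is a hypothetical name, and the two documents are toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokenizer(doc):
    # Hypothetical helper: receives ONE document (a str, already lowercased)
    # and returns its list of tokens.
    return doc.split()


docs = ['this is an dummy example',
        'here is a last text in the training set']

vectorizer = CountVectorizer(tokenizer=whitespace_tokenizer)
counts = vectorizer.fit_transform(docs)

# One row per document, one column per distinct token.
print(counts.shape)
print(sorted(vectorizer.vocabulary_))
```

Recent scikit-learn versions may emit a warning that token_pattern is unused when a tokenizer is supplied; this is harmless here.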



Source: https://stackoverflow.com/questions/33605946/attributeerror-lower-not-found-using-a-pipeline-with-a-countvectorizer-in-scik
