Scikit learn - fit_transform on the test set

礼貌的吻别 2020-12-03 14:17

I am struggling to use Random Forest in Python with Scikit learn. My problem is that I use it for text classification (in 3 classes - positive/negative/neutral) and the feat

1 Answer
  • 2020-12-03 15:20

    Do not call fit_transform on your test data; call only transform. fit_transform refits the vectorizer, so the test set would get a different vocabulary (and therefore different feature columns) than the one the classifier was trained on.
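A minimal sketch of the difference, assuming a CountVectorizer (the question's actual vectorizer is not shown) and hypothetical toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; replace with your own documents.
train_docs = ["good movie", "bad movie", "average plot"]
test_docs = ["good plot", "terrible acting"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)  # learns the vocabulary from the training data only
X_test = vect.transform(test_docs)        # reuses that vocabulary; unseen words are ignored

# Both matrices have the same columns, so a model trained on X_train can score X_test.
print(X_train.shape, X_test.shape)

# Refitting on the test set builds a different vocabulary,
# so the columns no longer match what the model was trained on.
X_wrong = CountVectorizer().fit_transform(test_docs)
print(X_wrong.shape)
```

The column count of X_test matches X_train, while X_wrong has its own, unrelated columns.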

    For the memory issue, I recommend TfidfVectorizer, which has several options for reducing the dimensionality (for example, min_df drops rare terms and max_features caps the vocabulary size).
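As a rough illustration of how those options shrink the feature space, here is a sketch on hypothetical toy documents. The thresholds (min_df=2, max_features=2) are arbitrary values chosen for the example, not recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your training documents.
docs = ["good movie", "bad movie", "good plot", "rare word"]

full = TfidfVectorizer().fit(docs)                  # keeps every term
pruned = TfidfVectorizer(min_df=2).fit(docs)        # keeps terms found in at least 2 documents
capped = TfidfVectorizer(max_features=2).fit(docs)  # keeps the 2 most frequent terms

for name, v in [("full", full), ("min_df=2", pruned), ("max_features=2", capped)]:
    print(name, len(v.vocabulary_), sorted(v.vocabulary_))
```

On a real corpus the right thresholds depend on the data, so it is worth checking the resulting vocabulary size before training.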

    UPDATE

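    If the only problem is fitting the test data into memory, simply split it into small chunks. Instead of something like

    x = vect.transform(test)
    evaluate(x)  # placeholder for whatever you do with the features
    

    you can do

    K = 10
    size = max(1, -(-len(test) // K))  # chunk size, rounded up so no samples are dropped
    for start in range(0, len(test), size):
        x = vect.transform(test[start : start + size])
        evaluate(x)
    

    and record the results/stats for each chunk and analyze them afterwards.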

    In particular:

    from sklearn.metrics import accuracy_score

    predictions = []
    
    K = 10
    size = max(1, -(-len(test) // K))  # chunk size, rounded up so no samples are dropped
    for start in range(0, len(test), size):
        x = vect.transform(test[start : start + size])
        predictions.extend(rf.predict(x))  # predict returns a NumPy array; extend appends its labels
    
    print(accuracy_score(true_labels, predictions))
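The snippet above assumes vect, rf, test and true_labels already exist. The following self-contained sketch runs the same idea end to end on hypothetical toy data and checks that chunked prediction gives the same labels as predicting everything at once. The helper iter_chunks, the CountVectorizer, and the forest settings are assumptions made for the example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score


def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Hypothetical toy data; replace with your own documents and labels.
train_docs = ["good movie", "great film", "bad movie",
              "awful film", "okay movie", "average film"]
train_labels = ["positive", "positive", "negative",
                "negative", "neutral", "neutral"]
test_docs = ["good film", "bad film", "okay film", "great movie", "awful movie"]
true_labels = ["positive", "negative", "neutral", "positive", "negative"]

vect = CountVectorizer()
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(vect.fit_transform(train_docs), train_labels)  # fit the vectorizer on training data only

# Vectorize and predict one chunk at a time to bound memory use.
predictions = []
for chunk in iter_chunks(test_docs, chunk_size=2):
    predictions.extend(rf.predict(vect.transform(chunk)))

# Sanity check: chunking should not change the predicted labels.
all_at_once = list(rf.predict(vect.transform(test_docs)))
print(predictions == all_at_once)
print(accuracy_score(true_labels, predictions))
```

Only the vectorized chunk is held in memory at any time; the list of predicted labels is small by comparison.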
    