Pipeline and GridSearch for Doc2Vec

Question

I currently have the following script, which helps find the best parameters for a doc2vec model. It works like this: first it trains a few models based on the given parameters, then it tests them against a classifier. Finally, it outputs the best model and classifier (I hope).

Data

Example data (data.csv) can be downloaded here: https://pastebin.com/takYp6T8 Note that the data is structured so that an ideal classifier should reach an accuracy of 1.0.

Script

import sys
import os
from time import time
from operator import itemgetter
import pickle
import pandas as pd
import numpy as np
from argparse import ArgumentParser

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.base import BaseEstimator
from gensim import corpora

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


dataset = pd.read_csv("data.csv")

class Doc2VecModel(BaseEstimator):

    def __init__(self, dm=1, size=1, window=1):
        self.d2v_model = None
        self.size = size
        self.window = window
        self.dm = dm

    def fit(self, raw_documents, y=None):
        # Initialize model (vector_size/epochs are the newer gensim names for size/iter)
        self.d2v_model = Doc2Vec(vector_size=self.size, window=self.window, dm=self.dm, epochs=5, alpha=0.025, min_alpha=0.001)
        # Tag docs
        tagged_documents = []
        for index, row in raw_documents.items():
            tag = '{}_{}'.format("type", index)
            tokens = row.split()
            tagged_documents.append(TaggedDocument(words=tokens, tags=[tag]))
        # Build vocabulary
        self.d2v_model.build_vocab(tagged_documents)
        # Train model
        self.d2v_model.train(tagged_documents, total_examples=len(tagged_documents), epochs=self.d2v_model.epochs)
        return self

    def transform(self, raw_documents):
        X = []
        for index, row in raw_documents.items():
            # infer_vector expects a list of tokens, not a raw string
            X.append(self.d2v_model.infer_vector(row.split()))
        X = pd.DataFrame(X, index=raw_documents.index)
        return X

    def fit_transform(self, raw_documents, y=None):
        self.fit(raw_documents)
        return self.transform(raw_documents)


# Parameter keys must use the pipeline step names ('doc2vec' and 'log')
param_grid = {'doc2vec__window': [2, 3],
              'doc2vec__dm': [0, 1],
              'doc2vec__size': [100, 200],
              'log__C': [0.1, 1],
}

pipe_log = Pipeline([('doc2vec', Doc2VecModel()), ('log', LogisticRegression())])

log_grid = GridSearchCV(pipe_log, 
                        param_grid=param_grid,
                        scoring="accuracy",
                        verbose=3,
                        n_jobs=1)

fitted = log_grid.fit(dataset["posts"], dataset["type"])

# Best parameters
print("Best Parameters: {}\n".format(log_grid.best_params_))
print("Best accuracy: {}\n".format(log_grid.best_score_))
print("Finished.")

I have the following questions about my script (I combine them here to avoid three posts with the same code snippet):

1) What's the purpose of def __init__(self, dm=1, size=1, window=1):? Can I remove this part somehow (I tried, unsuccessfully)?
2) How can I add a RandomForest classifier (or others) to the GridSearch workflow/pipeline?
3) How could a train/test data split be added to the code above, as the current script only trains on the full dataset?

Answer 1:

1) __init__() lets you define the parameters you would like your class to take at initialization (it is the equivalent of a constructor in Java).

Please look at these questions for more details:

Python __init__ and self what do they do?
Python constructors and __init__
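
On whether `__init__` can be removed: scikit-learn's `BaseEstimator.get_params` reads the parameter names from the `__init__` signature, and `GridSearchCV` relies on `get_params`, `set_params` and `clone` to apply grid entries such as `doc2vec__size`. So the method most likely has to stay if those parameters are to be tuned. A minimal sketch of that mechanism, using a hypothetical class `Doc2VecParams` that only stores the hyperparameters and trains nothing:

```python
from sklearn.base import BaseEstimator, clone


class Doc2VecParams(BaseEstimator):
    """Hypothetical stand-in: stores hyperparameters only, no gensim involved."""

    def __init__(self, dm=1, size=1, window=1):
        # Each __init__ argument is stored under an attribute of the same name,
        # which is what get_params/set_params expect.
        self.dm = dm
        self.size = size
        self.window = window


est = Doc2VecParams()
params = est.get_params()           # names are discovered from the __init__ signature
est.set_params(size=100, window=3)  # roughly what GridSearchCV does for each candidate
copy = clone(est)                   # a fresh, unfitted estimator with the same parameters

print(params)
print(copy.get_params())
```

Without the `__init__` arguments, `get_params` would return an empty dict and the `doc2vec__...` entries in `param_grid` would be rejected as invalid parameters.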

2) Why do you want to add the RandomForestClassifier and what will be its input?

Looking at your other two questions, do you want to compare the output of RandomForestClassifier with that of LogisticRegression here? If so, the approach in this question of yours is on the right track.
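
If the goal is indeed to compare classifiers, one common scikit-learn pattern is to pass `param_grid` as a list of dicts in which the final pipeline step itself is a parameter. The sketch below is only an illustration under assumptions: it uses toy numeric data from `make_classification`, `StandardScaler` stands in for the doc2vec step, and the step names `features` and `clf` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data; StandardScaler is only a stand-in for the doc2vec step.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([("features", StandardScaler()),
                 ("clf", LogisticRegression())])

# One dict per classifier: the "clf" entry replaces the final step,
# and the "clf__..." entries tune that classifier's own parameters.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [10, 50]},
]

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=3)
grid.fit(X, y)

print("Best classifier:", type(grid.best_params_["clf"]).__name__)
print("Best accuracy: {:.3f}".format(grid.best_score_))
```

In the original script the same idea would apply with `Doc2VecModel()` as the first step, and the `doc2vec__...` entries would be repeated in each dict.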

3) You have already imported train_test_split, so just use it.

X_train, X_test, y_train, y_test = train_test_split(dataset["posts"], dataset["type"])

fitted = log_grid.fit(X_train, y_train)
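
After fitting on the training part, the held-out part can be used to check the best model. A minimal sketch, again assuming toy numeric data and a `StandardScaler` stand-in for the doc2vec step (neither comes from the original script); `classification_report` is the function the question already imports:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for dataset["posts"] and dataset["type"].
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([("features", StandardScaler()),
                 ("log", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, param_grid={"log__C": [0.1, 1]}, scoring="accuracy", cv=3)

# Cross-validation and refitting only ever see the training part.
grid.fit(X_train, y_train)

# The refitted best pipeline is then evaluated once on the unseen test part.
y_pred = grid.predict(X_test)
test_accuracy = grid.score(X_test, y_test)
report = classification_report(y_test, y_pred)

print("Test accuracy: {:.3f}".format(test_accuracy))
print(report)
```

Note that `best_score_` is the mean cross-validated score on the training part, so the test accuracy is the more honest estimate of how the chosen model generalizes.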

Source: https://stackoverflow.com/questions/50278744/pipeline-and-gridsearch-for-doc2vec

Tags

scikit-learn

pipeline

gensim

grid-search