Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?

孤城傲影 2021-02-04 19:31

I am doing multi-label classification, trying to predict the correct tags for questions:

(X = questions, y = list of tags for each question from X).

I am wo

2 answers
  •  悲&欢浪女
    2021-02-04 19:54

    The decision functions have different shapes because ovo trains one classifier for every pair of classes, whereas ovr trains one classifier per class, each fitted against all the other classes.

    The best example I could find is in the scikit-learn documentation (http://scikit-learn.org):

    SVC and NuSVC implement the “one-against-one” approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows to aggregate the results of the “one-against-one” classifiers to a decision function of shape (n_samples, n_classes).

    >>> from sklearn import svm
    >>> X = [[0], [1], [2], [3]]
    >>> Y = [0, 1, 2, 3]
    >>> clf = svm.SVC(decision_function_shape='ovo')
    >>> clf.fit(X, Y) 
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
        max_iter=-1, probability=False, random_state=None, shrinking=True,
        tol=0.001, verbose=False)
    >>> dec = clf.decision_function([[1]])
    >>> dec.shape[1] # 4 classes: 4*3/2 = 6
    6
    >>> clf.decision_function_shape = "ovr"
    >>> dec = clf.decision_function([[1]])
    >>> dec.shape[1] # 4 classes
    4
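
Since the question wraps SVC in OneVsRestClassifier, it may help to check whether decision_function_shape matters at all in that setup. OneVsRestClassifier fits one binary SVC per label, and the scikit-learn documentation states that decision_function_shape is ignored for binary classification, so both settings should give the same scores. The following is a minimal sketch on synthetic data; the dataset, the linear kernel, and the variable names are assumptions made for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multi-label data (an assumption for illustration): 4 possible labels.
X, y = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=4, random_state=0
)

scores = {}
for shape in ("ovo", "ovr"):
    # Each wrapped SVC only ever sees a binary problem (one label vs. the rest).
    clf = OneVsRestClassifier(SVC(kernel="linear", decision_function_shape=shape))
    clf.fit(X, y)
    scores[shape] = clf.decision_function(X)

# Compare the two settings: one column of scores per label in both cases.
same_shape = scores["ovo"].shape == scores["ovr"].shape
same_values = np.allclose(scores["ovo"], scores["ovr"])
print(scores["ovo"].shape, same_shape, same_values)
```

If the two results match, as expected here, the choice of decision_function_shape can be left at its default when SVC is used inside OneVsRestClassifier.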
    

    What does this mean in simple terms?

    To understand what n_class * (n_class - 1) / 2 means, generate two-class combinations using itertools.combinations.

    import itertools

    def ovo_classifiers(classes):
        # Number of pairwise classifiers: n_class * (n_class - 1) / 2
        n_class = len(classes)
        n = n_class * (n_class - 1) // 2  # integer division keeps the count an int
        combos = itertools.combinations(classes, 2)
        return (n, list(combos))
    
    >>> ovo_classifiers(['a', 'b', 'c'])
    (3, [('a', 'b'), ('a', 'c'), ('b', 'c')])
    >>> ovo_classifiers(['a', 'b', 'c', 'd'])
    (6, [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')])
    

    Which estimator should be used for multi-label classification?

    In your situation, each question has multiple tags (as here on StackOverflow). If you know your tags (classes) in advance, I would suggest OneVsRestClassifier(LinearSVC()), though you could also try DecisionTreeClassifier or RandomForestClassifier (I think):

    import re

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.multiclass import OneVsRestClassifier
    
    df = pd.DataFrame({
      'Tags': [['python', 'pandas'], ['c#', '.net'], ['ruby'],
               ['python'], ['c#'], ['sklearn', 'python']],
      'Questions': ['This is a post about python and pandas is great.',
               'This is a c# post and i hate .net',
               'What is ruby on rails?', 'who else loves python',
               'where to learn c#', 'sklearn is a python package for machine learning']},
                      columns=['Questions', 'Tags'])
    
    X = df['Questions']
    # Turn each list of tags into a binary indicator row (one column per tag).
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(df['Tags'].values)
    
    # Tokenize on the known tags only; re.escape keeps the '.' in '.net' literal.
    tag_pattern = '|'.join(re.escape(tag) for tag in mlb.classes_)
    pipeline = Pipeline([
      ('vect', CountVectorizer(token_pattern=tag_pattern)),
      ('linear_svc', OneVsRestClassifier(LinearSVC()))
      ])
    pipeline.fit(X, y)
    
    final = pd.DataFrame(pipeline.predict(X), index=X, columns=mlb.classes_)
    
    def predict(text):
      return pd.DataFrame(pipeline.predict(text), index=text, columns=mlb.classes_)
    
    test = ['is python better than c#', 'should i learn c#',
            'should i learn sklearn or tensorflow',
            'ruby or c# i am a dinosaur',
            'is .net still relevant']
    print(predict(test))
    

    Output:

                                          .net  c#  pandas  python  ruby  sklearn
    is python better than c#                 0   1       0       1     0        0
    should i learn c#                        0   1       0       0     0        0
    should i learn sklearn or tensorflow     0   0       0       0     0        1
    ruby or c# i am a dinosaur               0   1       0       0     1        0
    is .net still relevant                   1   0       0       0     0        0
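
The 0/1 rows above can be turned back into tag names with MultiLabelBinarizer.inverse_transform. The sketch below is a small standalone illustration: the tag lists are a subset of the example data, and the `predicted` array is hand-written for the demonstration rather than produced by the pipeline.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# A subset of the tag lists from the example above.
tags = [['python', 'pandas'], ['c#', '.net'], ['ruby']]
mlb = MultiLabelBinarizer()
mlb.fit(tags)  # classes_ are sorted: ['.net', 'c#', 'pandas', 'python', 'ruby']

# Hand-written indicator rows (hypothetical predictions, one column per class).
predicted = np.array([[0, 1, 0, 1, 0],
                      [1, 0, 0, 0, 0]])

# Map each row back to a tuple of tag names.
decoded = mlb.inverse_transform(predicted)
print(decoded)  # [('c#', 'python'), ('.net',)]
```

With the fitted pipeline from the answer, the same call would presumably be mlb.inverse_transform(pipeline.predict(test)).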
    
