How to fix “ValueError: Expected 2D array, got 1D array instead” in sklearn/python?

鱼传尺愫 2021-01-13 21:51

Hi there. I just started with machine learning, using a simple example to try and learn. I want to classify the files on my disk based on file type by making use of

3 Answers
  • 2021-01-13 22:32

    A simple solution that gives you a 2D array without an explicit reshape: instead of using

    X=dataset.iloc[:, 0].values
    

    you can use

    X=dataset.iloc[:, :-1].values
    

    This applies when you have only two columns and want the first one as the feature. The slice :-1 selects every column except the last, so the result keeps its second dimension.
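
    To make the difference concrete, here is a minimal sketch using a made-up two-column DataFrame (the column names and values are assumptions for illustration, not the asker's data). It compares the shapes produced by an integer column index, a slice, and a one-element list index.

```python
import pandas as pd

# Hypothetical two-column dataset: one feature column, one target column.
dataset = pd.DataFrame({"hours": [1.0, 2.0, 3.0], "score": [10.0, 20.0, 30.0]})

X_1d = dataset.iloc[:, 0].values     # integer index selects a Series -> 1D array
X_2d = dataset.iloc[:, :-1].values   # slice selects a DataFrame -> 2D array
X_alt = dataset.iloc[:, [0]].values  # list index also selects a DataFrame -> 2D array

print(X_1d.shape)   # (3,)
print(X_2d.shape)   # (3, 1)
print(X_alt.shape)  # (3, 1)
```

    The list form [0] may be the safer choice when the dataset has more than two columns, since :-1 would then pull in every column except the last.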

  • 2021-01-13 22:44
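    from sklearn.linear_model import LinearRegression

    # dataset: a pandas DataFrame, assumed to be loaded already
    X = dataset.iloc[:, 0].values   # 1D array, shape (n_samples,)
    y = dataset.iloc[:, 1].values

    regressor = LinearRegression()
    X = X.reshape(-1, 1)            # assign the result back: shape (n_samples, 1)
    regressor.fit(X, y)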
    

    I had the code above. reshape is not an in-place operation: it returns a new array and leaves X unchanged, so X has to be reassigned to the reshaped result.
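
    A short sketch of that behaviour with a toy NumPy array (the values are arbitrary): calling reshape without assigning the result leaves the original shape untouched.

```python
import numpy as np

X = np.array([1, 2, 3])

X.reshape(-1, 1)                  # result is discarded; X itself is unchanged
shape_without_assignment = X.shape
print(shape_without_assignment)   # (3,)

X = X.reshape(-1, 1)              # rebind X to the reshaped array
shape_after_assignment = X.shape
print(shape_after_assignment)     # (3, 1)
```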

  • 2021-01-13 22:57

    When passing your input to the classifiers, pass 2D arrays of shape (M, N), where M is the number of samples and N >= 1 is the number of features, not 1D arrays (which have shape (N,)). The error message is pretty clear:

    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
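
    As an illustration, the sketch below reproduces the error with a toy 1D array and then applies both reshapes. The data and the choice of LinearRegression are assumptions made for the example; any scikit-learn estimator that validates its input should behave similarly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])   # 1D, shape (4,)
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(x, y)      # 1D input is rejected
    raised = False
except ValueError as err:
    raised = True
    print(err)

as_feature = x.reshape(-1, 1)   # shape (4, 1): four samples, one feature
as_sample = x.reshape(1, -1)    # shape (1, 4): one sample, four features

# Four samples of a single feature is what this toy regression needs.
model = LinearRegression().fit(as_feature, y)
```

    Which reshape is right depends on what the numbers mean: a column of measurements is (-1, 1), while one observation with several features is (1, -1).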

    from sklearn.model_selection import train_test_split

    # X.shape should be (N, M) where M >= 1; double brackets keep X a 2D DataFrame
    X = mydata[['script']]
    # y.shape should be (N,)
    y = mydata['label']
    # perform label encoding if "label" contains strings
    # y = pd.factorize(mydata['label'])[0]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    ...

    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
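
    For readers who want something they can run end to end, here is a self-contained sketch of the same flow. The toy DataFrame, its numeric "script" values, and the DecisionTreeClassifier are all assumptions chosen for illustration; the snippet above does not say which classifier or data was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature and a string label.
mydata = pd.DataFrame({
    "script": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "label": ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"],
})

X = mydata[["script"]]                 # DataFrame, shape (12, 1): 2D
y = pd.factorize(mydata["label"])[0]   # integer codes, shape (12,): 1D

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

clf = DecisionTreeClassifier(random_state=0)  # assumed classifier
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(X.shape, y.shape, score)
```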
    

    Some other helpful tips:

    1. Split your data into separate train and test portions. Do not test on your training data; that gives an overly optimistic estimate of your classifier's strength.
    2. I'd recommend factorizing your labels so that you're dealing with integers. It's just easier.
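
    The factorizing tip can be sketched with pandas.factorize on a made-up label column (the file-type strings here are hypothetical). It returns integer codes plus the unique values, so the original labels can be recovered later.

```python
import pandas as pd

# Hypothetical string labels, e.g. file types.
labels = pd.Series(["txt", "py", "txt", "csv", "py"])

codes, uniques = pd.factorize(labels)
print(codes.tolist())    # [0, 1, 0, 2, 1]
print(list(uniques))     # ['txt', 'py', 'csv']

# Map the integer codes back to the original strings.
decoded = list(uniques[codes])
print(decoded)           # ['txt', 'py', 'txt', 'csv', 'py']
```

    Codes are assigned in order of first appearance, so the mapping depends on row order; keep uniques around if predictions need to be translated back into labels.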