How to convert a Scikit-learn dataset to a Pandas dataset?

清酒与你 2020-11-28 19:10

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()


        
22 Answers
  • 2020-11-28 19:28

    There might be a better way but here is what I have done in the past and it works quite well:

    # A Bunch is dict-like, so read its fields by name rather than by position
    mydata = pd.DataFrame(data['data'], columns=data['feature_names'])  # attributes with their column names
    mydata['target'] = data['target']                                   # adds a column for the target variable
    

    Now mydata will have everything you need: the attributes, the target variable and the column names.
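
    A Bunch behaves like a dictionary whose fields can also be read as attributes, so looking fields up by name is less fragile than relying on their position. The following is a minimal sketch of how one might inspect the object before building a DataFrame; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# A Bunch is a dict subclass, so its fields can be listed by name.
field_names = sorted(bunch.keys())
print(field_names)

# Key access and attribute access return the same underlying object.
same_object = bunch["data"] is bunch.data

# Shapes of the two fields that make up the table.
data_shape = bunch["data"].shape      # one row per sample, one column per feature
target_shape = bunch["target"].shape  # one label per sample
```

    The exact set of keys can differ between scikit-learn versions, which is another reason to address fields by name.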

  • 2020-11-28 19:33

    As of version 0.23, you can directly return a DataFrame using the as_frame argument. For example, loading the iris data set:

    from sklearn.datasets import load_iris
    iris = load_iris(as_frame=True)
    df = iris.data
    

    As I understand it from the provisional release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and california_housing data sets.
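
    Because as_frame only exists from version 0.23 onwards, code that must run on older installations may want a fallback. Here is a rough sketch, assuming the loader raises TypeError when it does not know the as_frame keyword; load_as_frame is a hypothetical helper name, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_as_frame(loader):
    """Hypothetical helper: return features and target as one DataFrame."""
    try:
        # Newer scikit-learn: the Bunch carries a ready-made `frame`.
        return loader(as_frame=True).frame
    except TypeError:
        # Assumed failure mode for loaders without the as_frame keyword:
        # build the DataFrame by hand from the NumPy arrays.
        bunch = loader()
        df = pd.DataFrame(bunch.data, columns=list(bunch.feature_names))
        df["target"] = bunch.target
        return df


df = load_as_frame(load_iris)
print(df.head())
```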

  • 2020-11-28 19:33

    Here is another built-in approach that may be helpful.

    from sklearn.datasets import load_iris
    iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)
    type(iris_X), type(iris_y)
    

    The features iris_X are returned as a pandas DataFrame and the target iris_y is returned as a pandas Series.
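
    If a single table is wanted instead of a separate DataFrame and Series, the two pieces can be joined column-wise. This is a small sketch using pd.concat; the explicit rename to "target" is a precaution of mine so the column name does not depend on how the Series happens to be named.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)

# Join features and target side by side; both share the same row index.
iris_df = pd.concat([iris_X, iris_y.rename("target")], axis=1)
print(iris_df.head())
```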

  • 2020-11-28 19:34

    TOMDLt's solution is not generic enough for all the datasets in scikit-learn. For example, it does not work for the Boston housing dataset. I propose a different, more universal solution, which also does not need numpy.

    from sklearn import datasets
    import pandas as pd
    
    # load_boston was removed in scikit-learn 1.2, so this needs an older release
    boston_data = datasets.load_boston()
    df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
    df_boston['target'] = pd.Series(boston_data.target)
    df_boston.head()
    

    As a general function:

    def sklearn_to_df(sklearn_dataset):
        df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
        df['target'] = pd.Series(sklearn_dataset.target)
        return df
    
    df_boston = sklearn_to_df(datasets.load_boston())
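
    For classification datasets the numeric target can also be translated into readable class labels. The sketch below is one possible extension of the general function, assuming the Bunch exposes target_names (iris does; regression datasets generally do not). The function name and the "target_name" column are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris


def sklearn_to_labelled_df(bunch):
    """Hypothetical variant: also add a column with the class label text."""
    df = pd.DataFrame(bunch.data, columns=list(bunch.feature_names))
    df["target"] = bunch.target
    # Map integer codes 0..k-1 onto the class names stored in the Bunch.
    df["target_name"] = pd.Categorical.from_codes(
        bunch.target, categories=list(bunch.target_names)
    )
    return df


df_iris = sklearn_to_labelled_df(load_iris())
print(df_iris.head())
```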
    
  • 2020-11-28 19:34

    This works for me.

    import numpy as np
    iris = load_iris()  # load_iris and pd are imported as in the question
    dataFrame = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                             columns=list(iris['feature_names']) + ['target'])
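
    The key step in this answer is np.c_, which stacks arrays as columns. Here is a tiny sketch of what it does, using made-up numbers rather than the real iris values:

```python
import numpy as np

# Made-up values standing in for a feature matrix and a label vector.
features = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])
labels = np.array([0, 0, 1])

# np.c_ treats the 1-D label vector as one extra column.
stacked = np.c_[features, labels]
print(stacked.shape)
print(stacked.dtype)
```

    Note that the result is a single NumPy array with one dtype, so the integer labels end up as floats in the resulting DataFrame column.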
    
  • 2020-11-28 19:35

    Update: 2020

    You can use the parameter as_frame=True to get pandas DataFrames.

    If the as_frame parameter is available (e.g. load_iris):

    from sklearn import datasets
    X,y = datasets.load_iris(return_X_y=True) # numpy arrays
    
    dic_data = datasets.load_iris(as_frame=True)
    print(dic_data.keys())
    
    df = dic_data['frame'] # pandas dataframe data + target
    df_X = dic_data['data'] # pandas dataframe data only
    ser_y = dic_data['target'] # pandas series target only
    dic_data['target_names'] # numpy array
    
    

    If the as_frame parameter is NOT available (e.g. load_boston):

    from sklearn import datasets
    import pandas as pd
    
    fnames = [ i for i in dir(datasets) if 'load_' in i]
    print(fnames)
    
    fname = 'load_boston'
    loader = getattr(datasets,fname)()
    df = pd.DataFrame(loader['data'],columns= loader['feature_names'])
    df['target'] = loader['target']
    df.head(2)
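
    The manual route assumes the Bunch has a feature_names field, which may not hold for every loader or version. Below is a hedged sketch of a converter that falls back to generated column names. bunch_to_df and the x0, x1, ... naming are my own illustrative choices, and the toy Bunch stands in for a real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch


def bunch_to_df(bunch):
    """Hypothetical helper: Bunch -> DataFrame, tolerating missing feature_names."""
    n_features = bunch["data"].shape[1]
    names = bunch.get("feature_names")  # Bunch is a dict, so .get() is available
    if names is None:
        names = [f"x{i}" for i in range(n_features)]  # placeholder column names
    df = pd.DataFrame(bunch["data"], columns=list(names))
    df["target"] = bunch["target"]
    return df


# Toy Bunch without feature_names, used only to exercise the fallback.
toy = Bunch(data=np.zeros((2, 3)), target=np.array([0, 1]))
toy_df = bunch_to_df(toy)
print(toy_df)
```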
    