How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
p
There might be a better way but here is what I have done in the past and it works quite well:
items = data.items() #Gets all the data from this Bunch - a huge list
mydata = pd.DataFrame(items[1][1]) #Gets the Attributes
mydata[len(mydata.columns)] = items[2][1] #Adds a column for the Target Variable
mydata.columns = items[-1][1] + [items[2][0]] #Gets the column names and updates the dataframe
Now mydata will have everything you need - attributes, target variable and columnnames
As of version 0.23, you can directly return a DataFrame using the as_frame
argument.
For example, loading the iris data set:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df = iris.data
In my understanding using the provisionally release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and california_houses data sets.
Here's another integrated method example maybe helpful.
from sklearn.datasets import load_iris
iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)
type(iris_X), type(iris_y)
The data iris_X are imported as pandas DataFrame and the target iris_y are imported as pandas Series.
TOMDLt's solution is not generic enough for all the datasets in scikit-learn. For example it does not work for the boston housing dataset. I propose a different solution which is more universal. No need to use numpy as well.
from sklearn import datasets
import pandas as pd
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()
As a general function:
def sklearn_to_df(sklearn_dataset):
df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
df['target'] = pd.Series(sklearn_dataset.target)
return df
df_boston = sklearn_to_df(datasets.load_boston())
This works for me.
dataFrame = pd.dataFrame(data = np.c_[ [iris['data'],iris['target'] ],
columns=iris['feature_names'].tolist() + ['target'])
You can use the parameter as_frame=True
to get pandas dataframes.
from sklearn import datasets
X,y = datasets.load_iris(return_X_y=True) # numpy arrays
dic_data = datasets.load_iris(as_frame=True)
print(dic_data.keys())
df = dic_data['frame'] # pandas dataframe data + target
df_X = dic_data['data'] # pandas dataframe data only
ser_y = dic_data['target'] # pandas series target only
dic_data['target_names'] # numpy array
from sklearn import datasets
fnames = [ i for i in dir(datasets) if 'load_' in i]
print(fnames)
fname = 'load_boston'
loader = getattr(datasets,fname)()
df = pd.DataFrame(loader['data'],columns= loader['feature_names'])
df['target'] = loader['target']
df.head(2)