How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
p
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
This tutorial maybe of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html
Other way to combine features and target variables can be using np.column_stack
(details)
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(np.column_stack((data.data, data.target)), columns = data.feature_names+['target'])
print(df.head())
Result:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0
If you need the string label for the target
, then you can use replace
by convertingtarget_names
to dictionary
and add a new column:
df['label'] = df.target.replace(dict(enumerate(data.target_names)))
print(df.head())
Result:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target label
0 5.1 3.5 1.4 0.2 0.0 setosa
1 4.9 3.0 1.4 0.2 0.0 setosa
2 4.7 3.2 1.3 0.2 0.0 setosa
3 4.6 3.1 1.5 0.2 0.0 setosa
4 5.0 3.6 1.4 0.2 0.0 setosa
The API is a little cleaner than the responses suggested. Here, using as_frame
and being sure to include a response column as well.
import pandas as pd
from sklearn.datasets import load_wine
features, target = load_wine(as_frame=True).data, load_wine(as_frame=True).target
df = features
df['target'] = target
df.head(2)
Basically what you need is the "data", and you have it in the scikit bunch, now you need just the "target" (prediction) which is also in the bunch.
So just need to concat these two to make the data complete
data_df = pd.DataFrame(cancer.data,columns=cancer.feature_names)
target_df = pd.DataFrame(cancer.target,columns=['target'])
final_df = data_df.join(target_df)
Manually, you can use pd.DataFrame
constructor, giving a numpy array (data
) and a list of the names of the columns (columns
).
To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...]
(note the []
):
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
# save load_iris() sklearn dataset to iris
# if you'd like to check dataset type use: type(load_iris())
# if you'd like to view list of attributes use: dir(load_iris())
iris = load_iris()
# np.c_ is the numpy concatenate function
# which is used to concat iris['data'] and iris['target'] arrays
# for pandas column argument: concat iris['feature_names'] list
# and string list (in this case one string); you can make this anything you'd like..
# the original dataset would probably call this ['Species']
data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
columns= iris['feature_names'] + ['target'])
from sklearn.datasets import load_iris
import pandas as pd
iris_dataset = load_iris()
datasets = pd.DataFrame(iris_dataset['data'], columns =
iris_dataset['feature_names'])
target_val = pd.Series(iris_dataset['target'], name =
'target_values')
species = []
for val in target_val:
if val == 0:
species.append('iris-setosa')
if val == 1:
species.append('iris-versicolor')
if val == 2:
species.append('iris-virginica')
species = pd.Series(species)
datasets['target'] = target_val
datasets['target_name'] = species
datasets.head()