Question
My question is about preprocessing CSV files before inputting them into a neural network.
I want to build a deep neural network for the famous iris dataset using tflearn in python 3.
Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
I'm using tflearn to load the CSV file. However, the class column of my dataset contains strings such as iris-setosa, iris-versicolor, and iris-virginica.
Neural networks work only with numbers, so I have to find a way to convert the classes from words to numbers. Since it is a very small dataset, I can do it manually using Excel or a text editor, and I manually assigned numbers to the different classes.
But I can't possibly do that for every dataset I work with, so I tried using pandas to perform one-hot encoding.
preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")
preprocess_data = pd.get_dummies(preprocess_data)
But now, I can't use this piece of code:
data, labels = load_csv('filepath', categorical_labels=True,
n_classes=3)
'filepath' has to be a path to the CSV file on disk; it cannot be a variable such as preprocess_data that holds a DataFrame.
Original Dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 iris-versicolor
85 6.0 3.4 4.5 1.6 iris-versicolor
31 5.4 3.4 1.5 0.4 iris-setosa
52 6.9 3.1 4.9 1.5 iris-versicolor
111 6.4 2.7 5.3 1.9 iris-virginica
Manually modified dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 1
85 6.0 3.4 4.5 1.6 1
31 5.4 3.4 1.5 0.4 0
52 6.9 3.1 4.9 1.5 1
111 6.4 2.7 5.3 1.9 2
Here's my code, which runs perfectly, but only because I modified the dataset manually.
import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv
data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'
data, labels = load_csv(data_source, categorical_labels=True,
n_classes=3)
network = input_data(shape=[None, 4], name='InputLayer')
network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')
network = fully_connected(network, 3, activation='softmax', name='Output_Layer')
network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)
model = tflearn.DNN(network)
model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)
I want to know if there's any other built-in function in tflearn (or in any other module, for that matter) that I can use to modify the value of my classes from words to numbers. I don't think manually modifying the datasets would be productive.
I'm also a beginner with tflearn and neural networks. Any help would be appreciated. Thanks.
Answer 1:
Use LabelEncoder from the sklearn library:
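import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# iris.data has no header row, so name the columns explicitly
df = pd.read_csv('iris_data.csv', header=None)
df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
enc = LabelEncoder()
# Replace each class name with an integer code
df['Class'] = enc.fit_transform(df['Class'])
print(df.head(5))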
If you want one-hot encoding, first label-encode the classes and then one-hot encode them:
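enc = LabelEncoder()
enc_1 = OneHotEncoder()
df['Class'] = enc.fit_transform(df['Class'])
# OneHotEncoder expects 2-D input of shape (n_samples, 1), hence df[['Class']]
one_hot = enc_1.fit_transform(df[['Class']]).toarray()
# One row per sample, one column per class
print(one_hot[:5])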
These encoders sort the class names alphabetically and then assign the labels in that order. To see which label is assigned to which class, run:
for k in enc.classes_:
    print('name ::{}, label ::{}'.format(k, enc.transform([k])))
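The alphabetical ordering can also be checked on a small in-memory list, and the same encoder can turn integer predictions back into class names. This is only a sketch: the list `classes` and the variable names `mapping` and `decoded` are made up for illustration, and it assumes scikit-learn is installed.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of class names, deliberately not in alphabetical order
classes = ['iris-virginica', 'iris-setosa', 'iris-versicolor', 'iris-setosa']

enc = LabelEncoder()
codes = enc.fit_transform(classes)

# classes_ holds the sorted unique names; each name's position is its integer code
mapping = {name: code for code, name in enumerate(enc.classes_)}
print(mapping)

# inverse_transform maps integer codes (e.g. model predictions) back to names
decoded = list(enc.inverse_transform([0, 2]))
print(decoded)
```

Keeping the fitted encoder around is what makes it possible to report predictions as class names instead of bare integers.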
If you want to save this DataFrame as a CSV file, do:
df.to_csv('Processed_Irisdataset.csv',sep=',')
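Since the question points out that load_csv only takes a file path, one option is to skip it and build the feature and label arrays straight from the DataFrame. The sketch below is an assumption-laden illustration, not the answer author's method: it swaps in pandas categorical codes for LabelEncoder, inlines a few sample rows instead of reading a file, and uses `np.eye` to produce one-hot rows similar to what `categorical_labels=True` would give.

```python
import io

import numpy as np
import pandas as pd

# A few rows in the iris.data layout (no header), inlined so the sketch runs without a file
raw = io.StringIO(
    "5.5,2.5,4.0,1.3,iris-versicolor\n"
    "5.4,3.4,1.5,0.4,iris-setosa\n"
    "6.4,2.7,5.3,1.9,iris-virginica\n"
    "6.0,3.4,4.5,1.6,iris-versicolor\n"
)
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
df = pd.read_csv(raw, header=None, names=columns)

# Categorical codes follow the alphabetical order of the class names
classes = df['Class'].astype('category')
codes = classes.cat.codes.to_numpy()
n_classes = len(classes.cat.categories)

# Features as a float array, labels as one-hot rows
data = df[columns[:4]].to_numpy(dtype=np.float32)
labels = np.eye(n_classes, dtype=np.float32)[codes]

print(data.shape, labels.shape)
```

Under the assumption that tflearn's `model.fit` accepts NumPy arrays, `data` and `labels` could then be passed to it in place of the values returned by load_csv.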
Answer 2:
The simplest solution is to map by a dict of all possible values:
df['Class'] = df['Class'].map({'iris-versicolor': 1, 'iris-setosa': 0, 'iris-virginica': 2})
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 1
1 85 6.0 3.4 4.5 1.6 1
2 31 5.4 3.4 1.5 0.4 0
3 52 6.9 3.1 4.9 1.5 1
4 111 6.4 2.7 5.3 1.9 2
If you want to generate the dictionary from all unique values:
d = {v:k for k, v in enumerate(df['Class'].unique())}
print (d)
{'iris-versicolor': 0, 'iris-virginica': 2, 'iris-setosa': 1}
df['Class'] = df['Class'].map(d)
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 0
1 85 6.0 3.4 4.5 1.6 0
2 31 5.4 3.4 1.5 0.4 1
3 52 6.9 3.1 4.9 1.5 0
4 111 6.4 2.7 5.3 1.9 2
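pandas also has a built-in that does the enumerate-over-unique step in one call, `pd.factorize`. The following is a minimal sketch on a made-up five-row column matching the sample above; the DataFrame contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical class column in the same order as the sample rows above
df = pd.DataFrame({'Class': ['iris-versicolor', 'iris-versicolor', 'iris-setosa',
                             'iris-versicolor', 'iris-virginica']})

# factorize returns integer codes plus the unique values in order of first appearance
codes, uniques = pd.factorize(df['Class'])

# Rebuild the name -> code dictionary for later reference
d = {name: code for code, name in enumerate(uniques)}
df['Class'] = codes

print(d)
print(df)
```

Codes are assigned in order of first appearance, so they depend on row order; passing `sort=True` to `pd.factorize` should give alphabetical codes instead.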
Source: https://stackoverflow.com/questions/50129485/preprocessing-csv-files-to-use-with-tflearn