Question
I am using a datatable Frame. How can I split it into train and test datasets?
As I would with a pandas DataFrame, I tried train_test_split(dt_df,classes)
from sklearn.model_selection, but it does not work and raises an error.
import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>
I tried a workaround, converting the Frame to a NumPy array:
classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()
This works, but I don't know whether there is a way to make train_test_split
work directly on the Frame, as it does with a pandas DataFrame.
Edit 1: The CSV file has string column names, and the values are unsigned integers. print(dt_df)
gives:
   | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …
-- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---
 0 |   0    0    0    0    2    0    1    0    0    1  …
 1 |   0    0    0    0    1    0    2    1    0    1  …
 2 |   0    0    0    1    1    0    1    0    1    2  …
 3 |   0    0    0    1    1    0    1    0    1    2  …
 4 |   0    0    0    1    1    0    1    0    1    2  …
 5 |   0    0    0    1    1    0    1    0    1    2  …
 6 |   0    0    0    1    0    0    3    0    0    2  …
 7 |   0    0    0    1    1    0    0    0    1    2  …
 8 |   0    0    0    1    1    0    1    0    1    2  …
 9 |   0    0    1    0    1    0    1    0    1    3  …
10 |   0    0    1    0    1    0    1    0    1    3  …
...
Thanks for your help.
Answer 1:
I don't know of a function that can split a datatable Frame directly, but you can read the file with pandas and split that instead:
import pandas as pd
dt_df = pd.read_csv(csv_file_path)
classe = dt_df["classe"]
del dt_df["classe"]
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
and then convert the pandas DataFrames to datatable Frames with:
X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)
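Another option that avoids the pandas round trip, offered only as a minimal sketch and not as a tested datatable recipe, is to split an array of row indices with train_test_split and then select rows with the resulting indices. The helper name split_row_indices and the synthetic data are hypothetical. The example uses a NumPy array as a stand-in for the table; the datatable selection mentioned in the comment assumes that a list of integers is accepted as a row selector.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_row_indices(n_rows, labels, test_size=0.25, seed=0):
    """Split row positions (not the table itself) into train and test parts."""
    row_idx = np.arange(n_rows)
    # The index array and the labels are shuffled and split together,
    # so y_train[i] is the label of row train_idx[i].
    train_idx, test_idx, y_train, y_test = train_test_split(
        row_idx, labels, test_size=test_size, random_state=seed
    )
    return train_idx, test_idx, y_train, y_test


# Synthetic stand-in for the real data (hypothetical values).
features = np.arange(40).reshape(20, 2)
labels = np.arange(20) % 2

train_idx, test_idx, y_train, y_test = split_row_indices(len(features), labels)

# With NumPy the indices select rows directly. For a datatable Frame the
# assumed equivalent would be dt_df[train_idx.tolist(), :].
X_train, X_test = features[train_idx], features[test_idx]
print(X_train.shape, X_test.shape)
```

Because only the indices are shuffled, the table is never copied into another library's format before the split.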
Answer 2:
The solution I use to split a datatable Frame into train and test datasets in Python with train_test_split(dt_df,classes)
from sklearn.model_selection is to convert the Frame to a NumPy array, as I mentioned in my question, or to a pandas DataFrame, as suggested in a comment by @Manoor Hassan (converting to pandas and back again):
Source code before the split:
import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
dt_df = dt.fread(csv_file_path)
classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"]
Source code after the split:
ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)
Method 1: convert to NumPy
# source code before split method
dt_df = dt_df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
# source code after split method
Method 2: convert to NumPy, then convert back to a datatable Frame after the split:
# source code before split method
dt_df = dt_df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
X_train = dt.Frame(X_train)
# source code after split method
Method 3: convert to a pandas DataFrame
# source code before split method
dt_df = dt_df.to_pandas()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
# source code after split method
All three methods work, but they differ in training time (ExTrCl.fit) and prediction time (ExTrCl.predict). For a CSV file of about 500 MB I got these results:
Method                   T.convert   T.train   T.pred
M1: to_numpy             3           85        0.5
M2: to_numpy and back    3.5         29        0.5
M3: to pandas            4           37        4
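The answer does not say how these times were measured. As a rough sketch under that caveat, fit and predict times can be measured separately with time.perf_counter. The helper name time_fit_predict, the small synthetic data, and the n_estimators and random_state settings are all assumptions chosen to keep the example quick; they are not taken from the original benchmark.

```python
import time

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split


def time_fit_predict(X, y, test_size=0.25, seed=0):
    """Return (fit seconds, predict seconds, predictions) for one split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    clf = ExtraTreesClassifier(n_estimators=10, random_state=seed)

    start = time.perf_counter()
    clf.fit(X_train, y_train)
    t_train = time.perf_counter() - start

    start = time.perf_counter()
    pred = clf.predict(X_test)
    t_pred = time.perf_counter() - start
    return t_train, t_pred, pred


# Small synthetic count data standing in for the real CSV (hypothetical).
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 10))
y = rng.integers(0, 2, size=200)

t_train, t_pred, pred = time_fit_predict(X, y)
print(f"train: {t_train:.4f} s, predict: {t_pred:.4f} s")
```

Passing the NumPy array, the datatable Frame, or the pandas DataFrame as X in turn would give one timing row per method, comparable to the table above.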
Source: https://stackoverflow.com/questions/63022043/how-to-split-datatable-dataframe-into-train-and-test-dataset-in-python