How to split a datatable dataframe into train and test datasets in Python

Question

I am using a datatable dataframe. How can I split it into train and test datasets?
As I would with a pandas dataframe, I tried train_test_split(dt_df, classes) from sklearn.model_selection, but it fails with an error.

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split

dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>

I tried a workaround that converts the dataframe to a numpy array:

classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()

This works, but I don't know whether there is a way to make train_test_split work directly on a datatable dataframe, as it does on a pandas dataframe.

Edit 1: The CSV file's column names are strings and its values are unsigned integers. print(dt_df) gives:

     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …  
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---     
   0 |   0    0    0    0    2    0    1    0    0    1  …  
   1 |   0    0    0    0    1    0    2    1    0    1  …  
   2 |   0    0    0    1    1    0    1    0    1    2  …  
   3 |   0    0    0    1    1    0    1    0    1    2  …  
   4 |   0    0    0    1    1    0    1    0    1    2  …  
   5 |   0    0    0    1    1    0    1    0    1    2  …  
   6 |   0    0    0    1    0    0    3    0    0    2  …  
   7 |   0    0    0    1    1    0    0    0    1    2  …  
   8 |   0    0    0    1    1    0    1    0    1    2  …  
   9 |   0    0    1    0    1    0    1    0    1    3  …  
  10 |   0    0    1    0    1    0    1    0    1    3  …  
      ...

Thanks for your help.

Answer 1:

I don't know of a function that can split a datatable dataframe directly, but you can load and split the data with pandas:

import pandas as pd

pd_df = pd.read_csv(csv_file_path)
classe = pd_df["classe"]              # target column
pd_df = pd_df.drop(columns="classe")  # features only

X_train, X_test, y_train, y_test = train_test_split(pd_df, classe, test_size=test_size)

and then convert the pandas DataFrames to datatable Frames with:

X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)
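
A possible alternative that avoids the pandas round trip is to split the row positions rather than the data, then select those rows. The sketch below is only an illustration under stated assumptions: split_by_row_index is a hypothetical helper, and it assumes the feature table exposes .shape and accepts a list of integer row positions as its row selector (features[rows, :]). That is true for 2-D numpy arrays and is assumed here to hold for datatable Frames as well. The demo uses a small made-up numpy array as a stand-in so that it runs without datatable installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_by_row_index(features, target, test_size=0.25, seed=0):
    """Split `features` and `target` by row position.

    `features` must support `.shape` and `features[list_of_ints, :]`
    (2-D numpy arrays do; datatable Frames are assumed to).
    `target` is a 1-D numpy array with one label per row.
    """
    n_rows = features.shape[0]
    # Split the row positions instead of the data itself.
    train_idx, test_idx = train_test_split(
        np.arange(n_rows), test_size=test_size, random_state=seed
    )
    # Convert to plain Python lists of ints for use as row selectors.
    train_rows, test_rows = train_idx.tolist(), test_idx.tolist()
    return (
        features[train_rows, :],
        features[test_rows, :],
        target[train_idx],
        target[test_idx],
    )


if __name__ == "__main__":
    # Made-up stand-in data: 20 rows, 2 feature columns, binary labels.
    X = np.arange(40).reshape(20, 2)
    y = np.arange(20) % 2
    X_train, X_test, y_train, y_test = split_by_row_index(X, y, test_size=0.25)
    print(X_train.shape, X_test.shape)
```

With a datatable dataframe, the call would presumably look like split_by_row_index(dt_df, classe, test_size=test_size), where classe is the 1-D array obtained with np.ravel as in the question.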

Answer 2:

To split a datatable dataframe with train_test_split(dt_df, classes) from sklearn.model_selection, I convert it either to a numpy array, as mentioned in my question, or to a pandas dataframe (and optionally back again), as suggested by @Manoor Hassan:

Source code before the split method:

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"]

Source code after the split method:

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

method 1: convert to numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

method 2: convert to numpy, then convert X_train back to a datatable dataframe after the split:

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

X_train = dt.Frame(X_train)

# source code after split method

method 3: convert to pandas dataframe

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

All three methods work, but the training time (ExTrCl.fit) and the prediction time (ExTrCl.predict) differ. For a CSV file of about 500 MB I got these results:

Method                  T.convert   T.train     T.pred
M1 to_numpy             3           85          0.5
M2 to_numpy and back    3.5         29          0.5
M3 to pandas            4           37          4
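
The answer does not say how these times were measured. One way such a comparison could be made is sketched below with time.perf_counter. The helper time_fit_predict, the n_estimators=20 setting, and the synthetic data are assumptions made for illustration only; they are not taken from the original benchmark and will not reproduce its numbers.

```python
import time

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split


def time_fit_predict(X_train, X_test, y_train, seed=0):
    """Return (fit_seconds, predict_seconds, predictions)."""
    # n_estimators is kept small so the sketch runs quickly.
    clf = ExtraTreesClassifier(n_estimators=20, random_state=seed)

    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    pred = clf.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, pred


if __name__ == "__main__":
    # Synthetic stand-in data (made up): small unsigned-integer counts.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(200, 10)).astype(np.uint8)
    y = rng.integers(0, 2, size=200)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    fit_s, pred_s, _ = time_fit_predict(X_train, X_test, y_train)
    print(f"fit: {fit_s:.3f}s  predict: {pred_s:.3f}s")
```

Running the same helper on the outputs of each conversion method would give per-method timings comparable in spirit to the table above.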

Source: https://stackoverflow.com/questions/63022043/how-to-split-datatable-dataframe-into-train-and-test-dataset-in-python

Tags

python

pandas

dataframe

train-test-split