ML model is failing to impute values

Question

I've tried creating an ML model to make some predictions, but I keep running into a stumbling block. Namely, the code seems to be ignoring the imputation instructions I give it, resulting in the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here's my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

data = pd.read_csv("data.csv",index_col=("Unnamed: 0"))
y = data.Installs
x = data.drop("Installs",axis=1)


strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder(sparse=True)


# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))


cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# First Pipeline
imp = make_pipeline((num_imp))
enc_cb = make_pipeline((obj_imp),(cb))
enc_oh = make_pipeline((obj_imp),(oh))

# Col Transformation
col = make_column_transformer((imp,num),
                              (sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
model = AdaBoostRegressor(random_state=(0))

run = make_pipeline((col),(model))
run.fit(x,y)

And here's a link to the data used in the code for reproduction purposes. Can you tell what's wrong? Thanks for your time.

Answer 1:

If you examine your dataset, you'll find NaN values in some fields, such as the Rating field, which explains the input error. How you handle missing data is up to you, and there are many approaches. The pandas documentation on working with missing data is a good place to start.
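As a quick way to see which fields are affected, you can count the missing values per column. The sketch below uses a small made-up frame in place of the question's data.csv; only the Rating and Installs column names come from the thread, and the values and the Category column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv; the values here are invented.
df = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Category": ["GAME", "TOOLS", None, "GAME"],
    "Installs": [1000, 500, 100, 5000],
})

# Count missing entries in each column and show only the affected ones.
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Running the same two lines on the real frame should list every column that needs imputing before it reaches the model.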

Answer 2:

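Your numeric scaling transformer is the likely culprit. The transformers in a column transformer are each applied to the original columns side by side, not one after another, so the StandardScaler receives the numeric columns before any imputation and their NaN values end up in the data passed to the model. You probably wanted to chain the imputer and the scaler, something like this: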

imp_sc = make_pipeline((num_imp),(sc))

# Col Transformation
col = make_column_transformer((imp_sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
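To see the difference between the two layouts, here is a minimal sketch on a made-up numeric frame. The names x_demo and num_demo and the values are assumptions for illustration, and the categorical branches are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing values (not the question's dataset).
x_demo = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Reviews": [10.0, 20.0, np.nan, 40.0],
})
num_demo = list(x_demo.columns)

# Layout from the question: imputer and scaler both see the raw columns.
parallel = make_column_transformer(
    (SimpleImputer(strategy="mean"), num_demo),
    (StandardScaler(), num_demo),
)
out_parallel = parallel.fit_transform(x_demo)

# Layout from this answer: impute first, then scale the imputed values.
chained = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), num_demo),
)
out_chained = chained.fit_transform(x_demo)

print(out_parallel.shape, np.isnan(out_parallel).any())
print(out_chained.shape, np.isnan(out_chained).any())
```

The side-by-side layout produces four output columns (an imputed copy plus a scaled copy that still carries the NaN values), while the chained layout produces two columns with no missing values.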

Source: https://stackoverflow.com/questions/64539168/ml-model-is-failing-to-impute-values

Tags

python

pandas

scikit-learn

data-science

valueerror