ML model is failing to impute values

Submitted by 送分小仙女 on 2020-12-15 06:08:30

Question


I've tried creating an ML model to make some predictions, but I keep running into a stumbling block. Namely, the code seems to be ignoring the imputation instructions I give it, resulting in the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here's my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

data = pd.read_csv("data.csv",index_col=("Unnamed: 0"))
y = data.Installs
x = data.drop("Installs",axis=1)


strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder(sparse=True)


# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))


cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# First Pipeline
imp = make_pipeline((num_imp))
enc_cb = make_pipeline((obj_imp),(cb))
enc_oh = make_pipeline((obj_imp),(oh))

# Col Transformation
col = make_column_transformer((imp,num),
                              (sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
model = AdaBoostRegressor(random_state=(0))

run = make_pipeline((col),(model))
run.fit(x,y)

And here's a link to the data used in the code for reproduction purposes. Can you tell what's wrong? Thanks for your time.


Answer 1:


If you examine your dataset, you will find NaN values in some fields, such as the Rating field, which explains the ValueError. How you handle missing data is up to you, and there are many approaches. The pandas documentation on working with missing data is a good place to start.
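
A quick way to see which fields contain missing values is to count them per column. The sketch below uses a small made-up frame in place of the real data.csv; the values are hypothetical and only the column names Rating and Installs come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data.csv (values are invented)
df = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Category": ["GAME", "TOOLS", None, "GAME"],
    "Installs": [1000, 500, 100, 5000],
})

# Count missing entries per column, then keep only the columns that have any
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

Running the same two lines on the real frame should show which columns the imputers need to cover.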




Answer 2:


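Your numeric scaling transformer is probably the culprit: the StandardScaler is applied to columns that have not been imputed. Each transformer in a column transformer receives the original columns rather than another transformer's output, so listing (imp,num) before (sc,num) does not chain them. You probably wanted something like this: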

imp_sc = make_pipeline((num_imp),(sc))

# Col Transformation
col = make_column_transformer((imp_sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
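
To see the difference between the two layouts, here is a minimal sketch on a toy frame. The data is invented and only the numeric branch is shown; it assumes a scikit-learn version in which StandardScaler ignores NaN during fit and leaves it in place during transform (0.20 or later, as far as I know).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with gaps (not the asker's dataset)
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Reviews": [10.0, 20.0, np.nan, 40.0],
})
num = list(toy.columns)

# Side by side, as in the question: each transformer gets the original columns,
# so the scaler branch never sees the imputed values
parallel = make_column_transformer(
    (SimpleImputer(strategy="mean"), num),
    (StandardScaler(), num),
)
parallel_out = parallel.fit_transform(toy)

# Chained, as suggested above: impute first, then scale the imputed result
chained = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), num),
)
chained_out = chained.fit_transform(toy)

print(parallel_out.shape, np.isnan(parallel_out).any())
print(chained_out.shape, np.isnan(chained_out).any())
```

The side-by-side version also doubles the numeric columns (one imputed copy and one scaled copy), which is usually not what is intended.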


Source: https://stackoverflow.com/questions/64539168/ml-model-is-failing-to-impute-values
