Fill missing values (NaN) by regression on other columns

Submitted by 烈酒焚心 on 2020-06-17 05:28:26

Question


I've got a dataset containing a lot of missing values (NaN). I want to use linear or multiple linear regression in Python to fill all the missing values. You can find the dataset here: Dataset

I used f_regression(X_train, Y_train) to select which features to use. First I converted df['country'] to dummy variables, then I kept the important features and ran a regression on them, but the results are not good.

I defined the following functions to select features and to predict missing values:

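import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.model_selection import train_test_split

def select_features(target, df):
    '''Take a dataset and a target column and print which features are important.'''
    # One-hot encode the categorical columns (e.g. country), then keep complete rows only
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()

    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)

    # Univariate F-test of each feature against the target
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)  # ascending p-value, so the most significant features come first
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])), columns=X_train.columns[inds], index=['f_values', 'p_values']).iloc[:, :15]
    print(results)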

And I defined the following function to predict the missing values:

import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def train(target, features, df, deg=1):
    '''Take a dataset, a target and features, and predict the NaN values in the target column.'''

    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()

    X = df_nonan[features]
    Y = df_nonan[target]

    pol = PolynomialFeatures(degree=deg)

    # 60% train, 20% test, 20% validation
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)

    # Fit the polynomial expansion on the training split only, then reuse it everywhere else
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)
    X_test_n = pol.transform(X_test)

    Y_predtrain = reg.predict(X_train_n)
    print('train', r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test', r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.transform(X_val)
    Y_valpred = reg.predict(X_val_n)
    print('val', r2_score(Y_val, Y_valpred))

    # Rows where the target is missing but every selected feature is present
    X_names = X.columns.values
    X_new = df_dummies.loc[df_dummies[target].isna(), X_names].dropna()
    X_new_n = pol.transform(X_new)

    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    return Y_new, X_names, X_new.index

I then use these functions to fill the NaN values, using only the features with p_values < 0.05. But I am not sure whether this is a good approach. With this method, many missing values remain unpredicted.
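
To make the workflow above concrete, here is a minimal sketch on made-up data. The helper `fill_by_regression`, the toy columns `a`, `b`, `y` and the use of plain `LinearRegression` are assumptions for illustration only. They are not the functions or the dataset from the question. The sketch shows one likely reason why values stay missing: a row can only be predicted when every selected feature is present in that row.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression


def fill_by_regression(df, target, alpha=0.05):
    """Fill NaN in `target` using features whose f_regression p-value is below `alpha`.

    Hypothetical helper for illustration; assumes all columns are numeric.
    """
    # Rank features on the fully observed rows
    complete = df.dropna()
    X_all = complete.drop(columns=[target])
    _, pval = f_regression(X_all, complete[target])
    features = list(X_all.columns[pval < alpha])
    if not features:
        return df.copy(), features

    # Fit on rows where the target and the selected features are all present
    train = df[features + [target]].dropna()
    reg = LinearRegression().fit(train[features], train[target])

    # Only rows with a missing target AND complete features can be predicted
    can_fill = df[target].isna() & df[features].notna().all(axis=1)
    out = df.copy()
    if can_fill.any():
        out.loc[can_fill, target] = reg.predict(df.loc[can_fill, features])
    return out, features


# Toy data (made up): y depends on a, while b is pure noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
y = 3.0 * a + rng.normal(scale=0.1, size=200)
toy = pd.DataFrame({"a": a, "b": b, "y": y})
toy.loc[:19, "y"] = np.nan  # 20 rows with a missing target
toy.loc[:4, "a"] = np.nan   # 5 of those rows also lack the predictor

filled, used = fill_by_regression(toy, "y")
print("features used:", used)
print("targets still missing:", filled["y"].isna().sum())
```

In this toy setup the five rows that lack `a` cannot be predicted, so their `y` stays NaN even though the other fifteen are filled. The same effect would be expected on real data whenever the selected features themselves contain gaps.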

Source: https://stackoverflow.com/questions/58435338/fill-missing-values-nan-by-regression-of-other-columns
