Scaling production data

て烟熏妆下的殇ゞ submitted on 2020-06-26 12:51:22

Question


I have a dataset, say Data, which consists of categorical and numerical variables. After cleaning it, I scaled only the numerical variables (I assume categorical variables must not be scaled) using

Data <- Data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

I then split it randomly into 70% training and 30% test sets using

set.seed(123)
sample_size = floor(0.70*nrow(Data))
xyz <- sample(seq_len(nrow(Data)),size = sample_size)
Train_Set <- Data[xyz,]
Test_Set <- Data[-xyz,]

I built a classification model with ranger, say model_rang, on Train_Set and tested it on Test_Set.

If new data, say new_data, arrives in production, is it enough, after cleaning it, to scale it the same way? I mean

new_data <- new_data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

and then use it to predict the outcome (there are two classes, 0 and 1, and 1 is the class of interest) using

probabilities <- as.data.frame(predict(model_rang, data = new_data, num.trees = 5000, type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilities) - 1,new_data$Class), positive='1')

Is new_data scaled properly, in the same way as Data, or am I missing something crucial for the production data?

Or must I scale Train_Set separately, keep the mean and standard deviation of each variable, and use them to scale Test_Set, so that when new data arrives in production the stored mean and standard deviation from Train_Set are applied to every new data set?


Answer 1:


When you scale the data, you subtract the mean and divide by the standard deviation. The mean and standard deviation of your new data might not be the same as those of the training data used to construct your model.

Imagine that in your random forest one variable was split at 0.555 (on the scaled data). If the standard deviation of your new data is lower, values that would fall below 0.555 under the training scaling can now land above it, and those observations may be classified into a different class.
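To make that concrete, here is a small sketch with made-up numbers (the means, standard deviations and raw value are hypothetical, not taken from the question). It shows the same raw value landing on either side of the 0.555 split depending on which statistics are used to scale it:

```r
# Hypothetical statistics, chosen only for illustration
train_mean <- 10; train_sd <- 4   # from the data used to fit the model
new_mean   <- 10; new_sd   <- 2   # from a new batch with a lower spread

raw_value <- 12

z_train <- (raw_value - train_mean) / train_sd   # scaled with the training statistics
z_new   <- (raw_value - new_mean) / new_sd       # scaled with the new batch's own statistics

split_point <- 0.555
z_train < split_point   # TRUE: the observation follows the "below" branch
z_new   < split_point   # FALSE: the same raw value now follows the other branch
```

The raw value never changed; only the reference statistics did, which is why the model should always see data scaled with the parameters it was trained on.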

One thing you can do is store the scaling attributes from the training data, like in the post you pointed to, and reuse them on new data:

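set.seed(111)

# Training data: one categorical column (A) and two numeric columns (B, C)
data = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Names of the numeric columns
num_cols = names(which(sapply(data,is.numeric)))

# Keep the per-column mean ("scaled:center") and standard deviation ("scaled:scale")
scale_params = attributes(scale(data[,num_cols]))[c("scaled:center","scaled:scale")]

# New data with the same columns
newdata = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Scale the new data with the stored training parameters, not its own
newdata[,num_cols] = scale(newdata[,num_cols],
center=scale_params[[1]],scale=scale_params[[2]])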
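Applied to the workflow in the question, the same idea could look like the sketch below: split first, learn the mean and standard deviation from Train_Set only, then apply them to Test_Set and to any later production data. The toy data, its column names and the helper apply_scaling are assumptions made for illustration; they do not come from the question.

```r
set.seed(123)

# Toy stand-in for the cleaned, unscaled data (column names are made up)
Data <- data.frame(Class = factor(sample(0:1, 200, replace = TRUE)),
                   A = sample(letters[1:3], 200, replace = TRUE),
                   B = runif(200),
                   C = rnorm(200, mean = 5, sd = 2))

# Split first, so the test rows do not influence the scaling parameters
idx <- sample(seq_len(nrow(Data)), size = floor(0.70 * nrow(Data)))
Train_Set <- Data[idx, ]
Test_Set  <- Data[-idx, ]

num_cols <- names(which(sapply(Train_Set, is.numeric)))

# Learn center and scale from the training rows only
train_center <- colMeans(Train_Set[, num_cols])
train_scale  <- apply(Train_Set[, num_cols], 2, sd)

# Hypothetical helper: apply the stored training parameters to any data frame
apply_scaling <- function(df) {
  df[, num_cols] <- scale(df[, num_cols],
                          center = train_center, scale = train_scale)
  df
}

Train_Set <- apply_scaling(Train_Set)
Test_Set  <- apply_scaling(Test_Set)
# new_data <- apply_scaling(new_data)   # same call when production data arrives
```

In production, train_center and train_scale would need to be kept alongside the model (for example with saveRDS()) so that every new batch is scaled with exactly the same values.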


Source: https://stackoverflow.com/questions/62209496/scaling-production-data
