Scaling production data

Question

I have a dataset, say Data, which consists of categorical and numerical variables. After cleaning it, I scaled only the numerical variables (I assume categorical variables should not be scaled) using

Data <- Data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

I then split it randomly into 70% training and 30% test data using

set.seed(123)
sample_size = floor(0.70*nrow(Data))
xyz <- sample(seq_len(nrow(Data)),size = sample_size)
Train_Set <- Data[xyz,]
Test_Set <- Data[-xyz,]

I have built a classification model with ranger, say model_rang, trained on Train_Set and evaluated on Test_Set.

If new data, say new_data, arrives in production, is it enough, after cleaning it, to scale it the same way as above? I mean

new_data <- new_data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

and then use it to predict the outcome (there are two classes, 0 and 1, and 1 is the class of interest) using

probabilities <- as.data.frame(predict(model_rang, data = new_data, num.trees = 5000, type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilities) - 1,new_data$Class), positive='1')

Is the scaling done properly, as it was for Data, or am I missing something crucial for the production data?

Or must I scale Train_Set separately, keep the mean and standard deviation of each variable, and use them to scale Test_Set? And when new data arrives in production, should those stored means and standard deviations from Train_Set be applied to every new data set?

Answer 1:

When you scale the data, you subtract the mean and divide by the standard deviation. The mean and standard deviation of your new data might not be the same as those of the training data used to build your model.

Imagine that in your random forest one variable was split at 0.555 (on the scaled data). If the standard deviation in your new data is lower, values that would have fallen below 0.555 can now land above it, follow a different branch, and end up classified into a different class.

One thing you can do is store the scaling attributes, as in the post you pointed to:
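To make the effect concrete, here is a small sketch with made-up numbers. The vectors train_x and new_x and the raw value 7 are hypothetical and not taken from the question. The same raw value ends up on different sides of a 0.555 split depending on whose mean and standard deviation are used.

```r
# Hypothetical training values and a new batch with a smaller spread
train_x <- c(2, 4, 6, 8, 10)
new_x   <- c(4, 5, 6, 7, 8)

raw_value <- 7

# Scaled with the training mean and standard deviation
z_train <- (raw_value - mean(train_x)) / sd(train_x)

# Same raw value, scaled with the new batch's own mean and standard deviation
z_new <- (raw_value - mean(new_x)) / sd(new_x)

# The two results fall on opposite sides of the 0.555 split
z_train < 0.555
z_new > 0.555
```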

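set.seed(111)

# Example "training" data: one categorical and two numeric columns
data = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Names of the numeric columns (only these get scaled)
num_cols = names(which(sapply(data,is.numeric)))

# scale() stores the column means and standard deviations as attributes; keep them
scale_params = attributes(scale(data[,num_cols]))[c("scaled:center","scaled:scale")]

# Example "new" data with the same columns
newdata = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Scale the new data with the stored training parameters, not its own
newdata[,num_cols] = scale(newdata[,num_cols],
center=scale_params[[1]],scale=scale_params[[2]])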
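One possible way to carry this over to the train/test workflow in the question is sketched below. It assumes the split is done before scaling, so that only the training rows determine the parameters. The simulated Data frame, the names train_center and train_scale, and the helper apply_train_scaling are all hypothetical stand-ins, not part of the original post.

```r
set.seed(123)

# Hypothetical stand-in for the cleaned, unscaled data from the question
Data <- data.frame(Class = factor(sample(0:1, 200, replace = TRUE)),
                   A = sample(letters[1:3], 200, replace = TRUE),
                   B = runif(200),
                   C = rnorm(200, mean = 5, sd = 2))

# Split first, so the test rows do not influence the scaling parameters
idx <- sample(seq_len(nrow(Data)), size = floor(0.70 * nrow(Data)))
Train_Set <- Data[idx, ]
Test_Set  <- Data[-idx, ]

# Numeric columns only; the factor and character columns are left alone
num_cols <- names(which(sapply(Train_Set, is.numeric)))

# Means and standard deviations computed from the training rows only
train_center <- colMeans(Train_Set[, num_cols, drop = FALSE])
train_scale  <- apply(Train_Set[, num_cols, drop = FALSE], 2, sd)

# Hypothetical helper: scale any data frame with the stored training parameters
apply_train_scaling <- function(df) {
  df[, num_cols] <- scale(df[, num_cols, drop = FALSE],
                          center = train_center, scale = train_scale)
  df
}

Train_Set <- apply_train_scaling(Train_Set)
Test_Set  <- apply_train_scaling(Test_Set)
```

In production, the same helper would be applied to new_data before calling predict(). The two parameter vectors could be stored alongside the model, for example with saveRDS(), so that they are available when new data arrives.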

Source: https://stackoverflow.com/questions/62209496/scaling-production-data

Tags

dataframe

classification

scale

scaling