问题
In R, how can I set weights for particular variables and not observations in lm()
function?
Context is as follows. I'm trying to build personal ranking system for particular products, say, for phones. I can build linear model based on price as dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict phone real cost (as opposed to declared price), thus finding best price/goodness coefficient. This is what I have already done.
Now I want to "highlight" some features that are important for me only. For example, I may need a phone with large memory, thus I want to give it higher weight so that linear model is optimized for memory variable.
lm()
function in R has weights
parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()
?
Of course, lm()
function is not the only option. If you know how to do it with other similar solutions (e.g. glm()
), this is pretty fine too.
UPD. After few comments I understood that the way I was thinking about the problem is wrong. Linear model, obtained by call to lm()
, gives optimal coefficients for training examples, and there's no way (and no need) to change weights of variables, sorry for confusion I made. What I'm actually looking for is the way to change coefficients in existing linear model to manually make some parameters more important than others. Continuing previous example, let's say we've got following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes best possible linear model for dependence between price and phone parameters. However, now I want to manually change number 30 in front of memory
variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula doesn't reflect optimal relationship between price and phone parameters any more. Also dependent variable doesn't show actual price, just some value of goodness, taking into account that memory is twice more important for me than for average person (based on coefficients from first formula). But this value of goodness (or, more precisely, value of fraction goodness/price
) is just what I need - having this I can find best (in my opinion) phone with best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in existing linear model, obtained with lm()
? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work of course, but you should get the idea. Note: it is obviously possible to just double values in memory
column in data frame, but I'm looking for more elegant solution, affecting model, not data.
回答1:
The following code is a bit complicated because lm()
minimizes residual sum of squares and with a fixed, non optimal coefficient it is no longed minimal, so that would be against what lm()
is trying to do and the only way is to fix all the rest coefficients too.
To do that, we have to know coefficients of the unrestricted model first. All the adjustments have to be done by changing formula of your model, e.g. we have
price ~ memory + screen_size
, and of course there is a hidden intercept. Now neither changing the data directly nor using I(c*memory)
is good idea. I(c*memory)
is like temporary change of data too, but to change only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size
to price ~ offset(c1*memory) + offset(c2*screen_size)
. But we haven't modified the intercept, which now would try to minimize residual sum of squares and possibly become different than in original model. The final step is to remove the intercept and to add a new, fake variable, i.e. which has the same number of observations as other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
el <- paste0("offset(", weights[-1], "*",
unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
as.formula(paste(as.character(frml)[2], "~",
paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, probably you are going to use mod2
only for forecasting (actually I don't know where else it could be used now) so that could be made in a simpler way, without setCoeffs
:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(5, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
回答2:
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim
function or look into linear or quadratic programming (linprog
and quadprog
packages).
If you insist on using modeling tools like lm
then use the offset
argument in the formula to specify your own multiplyer rather than computing one.
来源:https://stackoverflow.com/questions/12323859/how-to-manually-set-coefficients-for-variables-in-linear-model