Question
A data frame with invalid characters in its column names causes an error in rlm().
Looking deeper, it appears that within rlm() the variable xvars contains the names of the formula's explanatory variables, but with backticks around the offending names. When xvars is then used to index a data frame, namely mf[xvars], it causes the following error:
Error in `[.data.frame`(mf, xvars) : undefined columns selected
Is this the expected behavior? (I realize the key phrase here is "invalid characters".) Curiously, calling lm() on the same model and data frame causes no problems.
# SAMPLE DATA
library(MASS) # provides rlm()
mydf <- data.frame(matrix(rnorm(36),ncol=6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")
rlm(y~., data=mydf) # Error
lm(y~., data=mydf) # No Problem
# Clean up column names
colnames(mydf) <- make.names(colnames(mydf))
rlm(y~., data=mydf) # No Problem
Taking a look at MASS:::rlm.formula, it appears the error is caused by mf[xvars] in the following lines:
xlev <- if (length(xvars) > 0L) {
xlev <- lapply(mf[xvars], levels)
xlev[!sapply(xlev, is.null)]
}
Any thoughts why the backticks are being added but then causing an error?
Additional Info
I copied the rlm() function, added dput(mf) and dput(xvars), and got the following values. Note that the value of xvars differs from the names assigned above (i.e., backticks are added), while the names of mf are the same as the names given above.
# dput yielded
mf <- structure(list(y = c(-0.242914027018629, 0.724255425682537, -0.0578467214604185, -0.274193999595702, -0.38985000750839, 0.406046200943395), x1 = c(1.53071709960635, -1.87493297716611, 1.0936519723035, -0.977011182431237, -0.510890461021046, 1.20136627562427), x2 = c(-0.801995963036553, 1.30590232081605, 0.635922235436178, -1.86824341731708, -2.76797814532917, -0.497992681627495), `x1^2` = c(0.914146279518207, 0.103458073891876, -1.29818230391818, -0.629048606358592, 1.71534374557621, 0.922690967521984), `x2^2` = c(-0.0879726513660469, 1.05299413769867, 1.01955640371072, 0.546413685721721, 0.947757793667223, -0.0998700630220064), `x1:x2` = c(-0.757490494166813, 1.31307393014016, 1.90233916482184, 0.68844011701049, -1.28717997826724, -0.581800325341162)), .Names = c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2"), terms = y ~ x1 + x2 + `x1^2` + `x2^2` + `x1:x2`, row.names = c(NA, 6L), class = "data.frame")
xvars <- c("x1", "x2", "`x1^2`", "`x2^2`", "`x1:x2`")
mf[xvars]
# Error in `[.data.frame`(mf, xvars) : undefined columns selected
# Removing the backticks from xvars eliminates the error.
xvars2 <- gsub("`", "", xvars, fixed = TRUE)
mf[xvars2] # No Error
Answer 1:
Your issue boils down to the fact that you are using non-syntactic variable names.
These should be used with caution, and without expecting package authors to anticipate every issue that may arise.
To quote from the help for formula:
Variable names can be quoted by backticks `like this` in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names.
The issue lies in how xvars is created in rlm.formula:
xvars <- as.character(attr(mt, "variables"))[-1L]
and then in how it is used later on:
xlev <- if (length(xvars) > 0L) {
xlev <- lapply(mf[xvars], levels)
xlev[!sapply(xlev, is.null)]
}
Which, as you show, does not work.
Creating xvars this way produces backtick-quoted strings for non-syntactic names (and, if the names already contain backticks, double-backticked strings). For example, if the column name was "x1^2", the element in xvars becomes "`x1^2`".
This fails with [.data.frame, for example:
x <- data.frame(`a` = 1)
x[, '`a`']
# Error in `[.data.frame`(x, , "`a`") : undefined columns selected
Because the column name is "a", not "`a`".
If you backtick the column name itself, i.e. if the column name was "`x1^2`", the element in xvars becomes "``x1^2``", which again is not a column in your data.frame.
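The mismatch can be reproduced with base R alone. The following is a minimal sketch, assuming the same construction of xvars as in the rlm.formula lines quoted above; the object names lookup_ok and plain are only illustrative. It also shows that all.vars() returns the bare names, which do index the model frame.

```r
# Reproduce the name mismatch with base R only (MASS is not needed here).
set.seed(42)
mydf <- data.frame(matrix(rnorm(36), ncol = 6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

mf <- model.frame(y ~ ., data = mydf)
mt <- attr(mf, "terms")

# Same construction as in rlm.formula: deparsing the variables call
# wraps non-syntactic names in backticks.
xvars <- as.character(attr(mt, "variables"))[-1L]
if ((yvar <- attr(mt, "response")) > 0L) xvars <- xvars[-yvar]

# The model frame keeps the plain column names, so these have no match.
print(setdiff(xvars, names(mf)))

# Indexing with the backticked strings fails; record that without stopping.
lookup_ok <- !inherits(try(mf[xvars], silent = TRUE), "try-error")

# all.vars() returns the bare variable names, which are valid column names.
plain <- all.vars(mt)
print(all(plain %in% names(mf)))
```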
The reason lm works is that it does not attempt this definition and use of xvars; instead it uses model.matrix to define the design matrix x directly and passes it to lm.fit.
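As a rough illustration of that route (a simplified sketch, not the actual body of lm), the design matrix can be built straight from the model frame and its terms, without ever indexing the frame by deparsed variable names:

```r
# Simplified sketch of the model.matrix() route (base R only).
set.seed(42)
mydf <- data.frame(matrix(rnorm(36), ncol = 6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

mf <- model.frame(y ~ ., data = mydf)
mt <- attr(mf, "terms")

# The design matrix and response come from the terms and the model frame;
# no mf[xvars] lookup is involved.
x <- model.matrix(mt, mf)
y <- model.response(mf)

fit <- lm.fit(x, y)
print(colnames(x))
```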
If you want to fit the model y ~ x1 + x2 + x1:x2 + x1^2 + x2^2, then you can use
rlm(y ~ x1*x2 + I(x1^2) + I(x2^2))
In this case you only need three columns in your data.frame (or objects in your evaluation environment): y, x1 and x2. The I() function lets you perform arithmetic operations on a variable inside a formula, because I is parsed as a symbol by terms.formula.
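A short sketch of this approach follows, using simulated data; the data frame d and its coefficients are made up for illustration. With I() terms, the deparsed variable names match the model frame's column names, so the lookup inside rlm has something to find.

```r
library(MASS)  # provides rlm()

# Simulated data with only the three syntactic columns y, x1 and x2.
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1^2 + rnorm(50)

f <- y ~ x1 * x2 + I(x1^2) + I(x2^2)

# The deparsed names (e.g. "I(x1^2)") are also the model frame's column names.
mf <- model.frame(f, data = d)
xvars <- as.character(attr(attr(mf, "terms"), "variables"))[-1L]
print(all(xvars %in% names(mf)))

fit <- rlm(f, data = d)
print(coef(fit))
```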
Source: https://stackoverflow.com/questions/13327287/invalid-characters-causing-error-in-rlm