In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)

Submitted by 一曲冷凌霜 on 2019-11-27 11:07:05

The issue here is how formulas are interpreted. Inside a formula, the infix operators "+", "*", ":" and "^" have entirely different meanings than they do when applied to numeric vectors. The tilde separates the left-hand side from the right-hand side. The ^ operator constructs interactions rather than the mathematical power you might expect, so in a formula x = x^2 = x^3. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2, the R interpreter would have produced (for its own internal use) not the mathematical expansion x^2 + 2xy + y^2 but the symbolic expansion x + y + x:y, where x:y is an interaction term.
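One way to see this expansion for yourself is to ask terms() which term labels a formula produces. The following is a minimal sketch; the helper name labels_of and the variable names y, x and z are placeholders invented for this illustration, and no data is needed because only the formula is inspected.

```r
# Hypothetical helper: list the model terms that a formula expands to.
labels_of <- function(f) attr(terms(f), "term.labels")

labels_of(y ~ x^2)        # "x"             : x interacting with itself is just x
labels_of(y ~ (x + z)^2)  # "x" "z" "x:z"   : main effects plus the interaction
labels_of(y ~ x * z)      # "x" "z" "x:z"   : same expansion written with *
labels_of(y ~ I(x^2))     # "I(x^2)"        : protected, kept as one arithmetic term
```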

?formula

The I() function tells R to treat its argument "as is", i.e. to evaluate it with the ordinary arithmetic meaning you expect rather than as formula syntax. So I(x^2) returns a vector of values raised to the second power.
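To make the difference concrete, here is a small sketch that compares the design matrix columns with and without I(). The data frame d and its values are made up for illustration only.

```r
set.seed(1)                          # made-up data, purely illustrative
d <- data.frame(x = 1:10)
d$y <- 2 + 0.5 * d$x^3 + rnorm(10)

# Without I(), ^ is formula syntax and x^3 collapses to plain x.
colnames(model.matrix(y ~ x^3, data = d))     # "(Intercept)" "x"

# With I(), ^ is ordinary arithmetic and the column holds the cubed values.
mm <- model.matrix(y ~ I(x^3), data = d)
colnames(mm)                                  # "(Intercept)" "I(x^3)"
all(mm[, "I(x^3)"] == d$x^3)                  # TRUE

# The same protected term appears as a coefficient name in the fitted model.
fit <- lm(y ~ I(x^3), data = d)
names(coef(fit))                              # "(Intercept)" "I(x^3)"
```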

The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. In model descriptions it implies an error term, and by default also an intercept term, which will generally be labelled "(Intercept)" in the output. The function context and arguments (for example the family argument of glm()) may further determine a link function such as log or logit.
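As a rough illustration, the sketch below fits models on simulated data to show where the "(Intercept)" label comes from and how a link function can be chosen. The data, the column names and the choice of a Poisson model are assumptions made for this example.

```r
set.seed(42)                         # simulated data, purely illustrative
d <- data.frame(x = 1:20)
d$y <- 3 + 2 * d$x + rnorm(20)

# The intercept is implied by the formula and labelled "(Intercept)".
names(coef(lm(y ~ x, data = d)))       # "(Intercept)" "x"

# Writing "- 1" in the formula removes the implied intercept.
names(coef(lm(y ~ x - 1, data = d)))   # "x"

# In glm(), the family argument determines the link function.
d$counts <- rpois(20, lambda = exp(0.1 * d$x))
g <- glm(counts ~ x, family = poisson(link = "log"), data = d)
family(g)$link                         # "log"
```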

The "+" symbol in a formula does not really add two variables. It is usually an implicit request to calculate regression coefficient(s) for that variable in the context of the other variables on the RHS of the formula. The regression functions use `model.matrix`, which recognizes factors or character vectors in the formula and builds a matrix that expands the levels of those discrete components into columns.
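The factor expansion can be sketched as follows. The data frame is invented for this example, and the column names shown assume R's default treatment contrasts, where the first level ("a") is absorbed into the intercept.

```r
# Made-up data: one numeric predictor and one three-level factor.
d <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

# "+" asks for a coefficient per term; the factor g becomes indicator columns.
mm <- model.matrix(y ~ x + g, data = d)
colnames(mm)   # "(Intercept)" "x" "gb" "gc"  (with default contrasts)
dim(mm)        # 6 rows, 4 columns
```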

In plotting functions, a formula basically reverses the usual (x, y) order of arguments that plot() takes: the response goes on the left, as in plot(y ~ x). A plot.formula method was written so that formulas could be used as a more "mathematical" mode of communicating with R. In graphics::plot.formula, curve, and the 'lattice' and 'ggplot' functions, the formula governs how multiple factors or numeric vectors are displayed and "facetted".

I learned later that ~ is actually an infix (or prefix) primitive function that creates an R 'call', whose parts can be accessed with the list extraction operators. All of that is hidden from the typical user, but it is a facility that more advanced function authors can use.
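A short sketch of that structure, using placeholder variable names y, x and z that never need to exist as data:

```r
f <- y ~ x + z        # two-sided formula (infix use of ~)

class(f)              # "formula"
is.call(f)            # TRUE: a formula is a call to the function `~`
f[[1]]                # the symbol `~`
f[[2]]                # the left-hand side, y
f[[3]]                # the right-hand side, x + z
all.vars(f)           # "y" "x" "z"

g <- ~ x              # one-sided formula (prefix use of ~)
length(f)             # 3: `~`, LHS, RHS
length(g)             # 2: `~`, RHS
```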

The overloading of the "+" operator is discussed in the comments below and is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results, so it acts as a pass-through and layering operator. The aggregation functions that have a formula method use "+" as an "arrangement" and grouping operator.
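For the aggregation case, here is a minimal sketch with stats::aggregate(). The data frame and its column names are invented for the example. On the right-hand side, "+" lists the grouping variables rather than adding them.

```r
# Made-up data: a numeric response and two grouping variables.
d <- data.frame(
  y  = c(1, 2, 3, 4, 5, 6, 7, 8),
  g1 = c("a", "a", "b", "b", "a", "a", "b", "b"),
  g2 = c("u", "v", "u", "v", "u", "v", "u", "v")
)

# Mean of y within each combination of g1 and g2: one row per combination.
res <- aggregate(y ~ g1 + g2, data = d, FUN = mean)
res
nrow(res)                                   # 4
res$y[res$g1 == "a" & res$g2 == "u"]        # 3, the mean of 1 and 5
```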
