Question
I have a variable x that lies between 0 and 1, i.e. in (0,1].
I want to generate 10 dummy variables for the 10 deciles of x. For example, x_0_10 takes the value 1 if x is between 0 and 0.1, x_10_20 takes the value 1 if x is between 0.1 and 0.2, and so on.
The Stata code to do this is something like:
forval p=0(10)90 {
    local Next=`p'+10
    gen x_`p'_`Next'=0
    replace x_`p'_`Next'=1 if x<=`Next'/100 & x>`p'/100
}
I am new to R and wonder how I can do the same thing in R.
Answer 1:
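cut is your friend here; its output is a factor, which R automatically expands into dummy variables when it is used in a model. With an intercept, as below, that means nine dummies, with the first bin [0,0.1] absorbed into the intercept as the reference level.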
set.seed(2932)
x = runif(1e4)
y = 3 + 4 * x + rnorm(1e4)
x_cut = cut(x, 0:10/10, include.lowest = TRUE)
summary(lm(y ~ x_cut))
# Call:
# lm(formula = y ~ x_cut)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.7394 -0.6888 0.0028 0.6864 3.6742
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.16385 0.03243 97.564 <2e-16 ***
# x_cut(0.1,0.2] 0.43932 0.04551 9.654 <2e-16 ***
# x_cut(0.2,0.3] 0.85555 0.04519 18.933 <2e-16 ***
# x_cut(0.3,0.4] 1.26441 0.04588 27.556 <2e-16 ***
# x_cut(0.4,0.5] 1.66181 0.04495 36.970 <2e-16 ***
# x_cut(0.5,0.6] 2.04538 0.04574 44.714 <2e-16 ***
# x_cut(0.6,0.7] 2.44771 0.04533 53.999 <2e-16 ***
# x_cut(0.7,0.8] 2.80875 0.04591 61.182 <2e-16 ***
# x_cut(0.8,0.9] 3.22323 0.04545 70.919 <2e-16 ***
# x_cut(0.9,1] 3.60092 0.04564 78.897 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.011 on 9990 degrees of freedom
# Multiple R-squared: 0.5589, Adjusted R-squared: 0.5585
# F-statistic: 1407 on 9 and 9990 DF, p-value: < 2.2e-16
See ?cut for more customizations.
You can also pass cut directly on the right-hand side of the formula, which makes using predict a bit easier:
reg = lm(y ~ cut(x, 0:10/10, include.lowest = TRUE))
idx = sample(length(x), 500)
plot(x[idx], y[idx])
x_grid = seq(0, 1, length.out = 500L)
lines(x_grid, predict(reg, data.frame(x = x_grid)),
      col = 'red', lwd = 3L, type = 's')
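If the indicator columns themselves are needed, as in the Stata loop from the question, one possible approach is to expand the factor with model.matrix. The following is only a sketch: the simulated x, the seed, and the helper names (lower, dummies, dat) are assumptions for illustration, and the column names are chosen to mirror the x_0_10, x_10_20, ... naming used in the question.

```r
set.seed(1)          # arbitrary seed, only so the sketch is reproducible
x <- runif(1000)     # simulated stand-in for the real variable

# Bin x into [0,0.1], (0.1,0.2], ..., (0.9,1]
x_cut <- cut(x, breaks = 0:10 / 10, include.lowest = TRUE)

# One 0/1 column per bin; "- 1" drops the intercept so every bin gets a column
dummies <- model.matrix(~ x_cut - 1)

# Illustrative names mirroring the Stata loop: x_0_10, x_10_20, ..., x_90_100
lower <- seq(0, 90, by = 10)
colnames(dummies) <- paste0("x_", lower, "_", lower + 10)

# Attach the indicators to the original variable
dat <- data.frame(x = x, dummies)
head(dat)
```

Each row of dummies has exactly one 1, in the column for the bin that contains x. For modelling, passing the factor itself is usually simpler; explicit columns are mainly useful when the indicators are needed outside a formula.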
Answer 2:
This won't fit well into a comment, but for the record, the Stata code can be simplified down to
forval p = 0/9 {
    gen x_`p' = x > `p'/10 & x <= (`p' + 1)/10
}
Note that, contrary to the OP's claim, values of x that are exactly zero will be mapped to zero for all of these variables, both in their code and in mine (which is intended to be a simplification of their code, not a correct way to do it, modulo a difference of taste on variable names). That follows from the fact that 0 is not greater than 0. Likewise, values that are exactly 0.1, 0.2, 0.3, ... will in principle go in the lower bin, not the higher bin, but that is complicated by the fact that most multiples of 0.1 don't have exact binary representations (0.5 is clearly an exception).
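The same two boundary issues can be seen on the R side with cut. This is a small sketch with illustrative variable names (breaks, at_zero_open, at_zero_closed, near_edge), not part of either answer's original code:

```r
breaks <- 0:10 / 10

# Exactly 0 falls outside (0,0.1] unless the lowest break is made inclusive
at_zero_open   <- cut(0, breaks)                         # NA
at_zero_closed <- cut(0, breaks, include.lowest = TRUE)  # [0,0.1]

# A value that "should" be 0.3 but was built by arithmetic is not equal to 0.3
0.1 * 3 == 0.3                     # FALSE

# 0.1 * 3 is slightly above the stored 0.3, so it lands in the higher bin
near_edge <- cut(0.1 * 3, breaks)  # (0.3,0.4]
```

So whether a value sitting on a bin edge goes in the lower or the higher bin can depend on how that value was computed, not only on the inequality used.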
Indeed, depending on details about their set-up that the OP doesn't tell us, indicator variables (dummy variables, in their terminology) may well be available in Stata without a loop or made quite unnecessary by factor variable notation. In that respect Stata is closer to R than may at first appear.
While not answering the question directly, the signal here to Stata and R users alike is that Stata need not be so awkward as might be inferred from the code in the question.
Source: https://stackoverflow.com/questions/60358675/r-equivalent-of-statas-for-loop-over-macros