Find top deciles from dataframe by group

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this

3 answers

心在旅途

2021-01-23 08:14
Stick to your Stata instincts and use a single data set:
```
library(data.table)
DT <- data.table(data)

# r = within-group rank of v2, scaled to (0, 1]
DT[, r := rank(v2) / .N, by = v1]
```
You can see the result by typing DT.
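If data.table is not available, roughly the same within-group rank can be sketched in base R with `ave()`. The data frame `toy` below is invented for illustration and is not the data from the question.

```r
# Hypothetical toy data: two groups of different sizes (not the OP's data).
toy <- data.frame(
  custID = c(1:10, 1:5),
  v1     = rep(c("A", "B"), times = c(10, 5)),
  v2     = c(30:21, 50, 40, 30, 20, 10)
)

# ave() applies FUN within each level of v1 and returns a vector aligned
# with the original rows, much like `:=` with `by=` in data.table.
toy$r <- ave(toy$v2, toy$v1, FUN = function(x) rank(x) / length(x))

# Rows whose within-group rank puts them above the 0.8 mark.
top <- toy[toy$r > 0.8, ]
top
```

Because the rank is divided by the group size, groups of different sizes end up on the same (0, 1] scale.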

From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
```
DT[,g:={
  x = rep(0,.N)
  x[r>.8] = 20
  x[r>.9] = 10
  x
}]
```
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
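As an alternative to the gen/replace pattern, the same binning can probably be written in one step with base R's `cut()`. This is only a sketch: the vector `r` below is a hypothetical stand-in for the rank column.

```r
# Hypothetical within-group ranks, standing in for the r column.
r <- c(1.000, 0.900, 0.975, 0.875, 0.500)

# cut() with labels = FALSE returns the interval index (1, 2 or 3).
# The intervals are (-Inf, 0.8], (0.8, 0.9], (0.9, Inf], which mirrors
# the r > .8 and r > .9 conditions used above.
bin <- cut(r, breaks = c(-Inf, 0.8, 0.9, Inf), labels = FALSE)

# Map interval index to the code used in the answer: 0, 20 or 10.
g <- c(0, 20, 10)[bin]
g
```

The intervals are closed on the right by default, so a rank of exactly 0.9 lands in the 20 bucket, the same as with the strict `>` comparisons.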

Finally, you can subset with
```
DT[g>0]
```
which gives
```
   custID v1 v2     r  g
1:      1  A 30 1.000 10
2:      2  A 29 0.900 20
3:      1  B 20 0.975 10
4:      2  B 19 0.875 20
5:      6  B 20 0.975 10
6:      7  B 19 0.875 20
```
These steps can also be chained together:
```
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
```
(Thanks to @ExperimenteR:)

To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
```
dcast(
  DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
  custID~v1, value.var = "g")
```
Currently, dcast requires the latest version of data.table, available (I think) from GitHub.
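If dcast is not an option, base R's `reshape()` can do a similar long-to-wide step. The sketch below uses a hand-typed data frame, `long`, that copies the custID, v1 and g columns from the subset shown earlier; the name `long` is my own.

```r
# The custID, v1 and g columns from the filtered result, typed in by hand.
long <- data.frame(
  custID = c(1, 2, 1, 2, 6, 7),
  v1     = c("A", "A", "B", "B", "B", "B"),
  g      = c(10, 20, 10, 20, 10, 20)
)

# One row per custID, one g column per level of v1.
# reshape() names the new columns g.A and g.B.
wide <- reshape(long, idvar = "custID", timevar = "v1", direction = "wide")
wide
```

Customers that appear in only one group get NA in the other column.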
借酒劲吻你

2021-01-23 08:19
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
```
lapply(split(data, data$v1), function(df) {
    cutoff <- quantile(df$v2, c(0.8, 0.9))
    top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
    na.omit(data.frame(id=df$custID, top_pct))
})
```
Finding quantiles is done with quantile.
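To see what `quantile` hands back inside that function, here is a small sketch on a made-up vector (the values 1 to 10 are an assumption, not the OP's data):

```r
# Hypothetical v2 values for a single group.
v2 <- 1:10

# A named numeric vector with the 80th and 90th percentiles.
# With the default method the cutoffs are interpolated, so they
# need not be values that actually occur in v2.
cutoff <- quantile(v2, c(0.8, 0.9))
cutoff

# Same labelling rule as in the answer: strictly above the 90% cutoff
# gets 10, strictly above the 80% cutoff gets 20, everything else NA.
top_pct <- ifelse(v2 > cutoff[2], 10, ifelse(v2 > cutoff[1], 20, NA))
top_pct
```

Note that the comparison is strict, so with heavy ties at the cutoff a group may end up with fewer labelled rows than 20% of its size.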

北恋

2021-01-23 08:37

You don't need the function pf to achieve what you want. Try a dplyr/tidyr combination:
```
library(dplyr)
library(tidyr)
data %>%
    group_by(v1) %>%
    arrange(desc(v2)) %>%
    mutate(n = n()) %>%
    filter(row_number() <= round(n * .2)) %>%
    mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
    select(custID, top_pct) %>%
    spread(v1, top_pct)
#  custID  A  B
#1      1 10 10
#2      2 20 20
#3      6 NA 10
#4      7 NA 20
```
