Aggregating mixed data by factor column


旧时难觅i 2021-01-26 07:38

For the past week I have been trying to aggregate my dataset that consists of different weight measurements in different months accompanied by a large volume of background varia

1 answer

面向向阳花 (original poster)

2021-01-26 07:59
You could write your own functions and then apply them with lapply. First, write a function that finds the most frequent level of a factor variable:
```
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}
```
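As a quick sanity check, here is how getmode behaves on a small factor. The vector soda is made up for illustration and is not part of the question's data.

```r
getmode <- function(v) {
  # table() counts each level in the order given by levels(v),
  # so the position of the largest count indexes the right level
  levels(v)[which.max(table(v))]
}

# Hypothetical example vector, not from the original question
soda <- factor(c("No", "Yes", "No", "No"))
getmode(soda)
# [1] "No"
```

Note that table() ignores NA values by default, so missing entries do not count towards any level.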
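Then write a function that returns the per-group mean for a numeric variable or the per-group mode for a factor. A column of any other type (for example character) falls through both checks and returns NULL, so convert such columns to factors first: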
```
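my_summary <- function(x, id, ...) {
  # Numeric columns: mean within each id group
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  # Factor columns: most frequent level within each id group
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
}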
```
Finally, use lapply to calculate the summary for every column and collect the results in a data frame:
```
data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes
```
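For anyone who wants to run this end to end, here is a self-contained version. The data frame is hypothetical, since the question's data is not reproduced here. Only the column names are taken from the output above, and the values were chosen so that the result has the same shape.

```r
# Hypothetical sample data; the original question's data frame is not shown
df <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  Weight    = c(80, 82, 82, 66, 70, 52),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# Most frequent level of a factor
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Mean for numeric columns, mode for factor columns, both per id group
my_summary <- function(x, id, ...) {
  if (is.numeric(x)) return(tapply(x, id, mean))
  if (is.factor(x))  return(tapply(x, id, getmode))
}

# One row per IDnumber, one summary per column
result <- data.frame(lapply(df, my_summary, id = df$IDnumber))
print(result)
```

IDnumber is itself numeric, so it is averaged within each group as well, which simply gives back the ID.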
If two or more levels of a factor are tied for the maximum frequency, which.max just returns the first of them. I understand from your comment that you only want to know how often that happens, so one option is to amend the getmode function slightly so that it appends an asterisk to the level whenever there is a tie:
```
getmode <- function(v) {
  tab <- table(v)
  # More than one level shares the maximum count: flag the result with an asterisk
  if (sum(tab == max(tab)) > 1) return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}
```
(After changing your sample data so that IDnumber == "2" has one Female and one Male:)
```
data.frame(lapply(df, my_summary, id = df$IDnumber))

  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes
```
I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.
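If a plain count of ties is more useful than a marker in the output, something along these lines might work. It is only a sketch: the helper has_tie and the sample data frame are made up for illustration and are not part of the question.

```r
# Hypothetical sample data with one tie: IDnumber 2 has one Female and one Male
df <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# TRUE when more than one level shares the maximum frequency
# (has_tie is an illustrative helper name)
has_tie <- function(v) {
  tab <- table(v)
  sum(tab == max(tab)) > 1
}

# For each factor column, count the ID groups in which the mode is tied
factor_cols <- df[sapply(df, is.factor)]
tie_counts <- sapply(factor_cols, function(x) sum(tapply(x, df$IDnumber, has_tie)))
print(tie_counts)
```

This keeps the summary table itself clean and reports, per factor column, how many groups have an ambiguous mode.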