For the past week I have been trying to aggregate my dataset that consists of different weight measurements in different months accompanied by a large volume of background varia
You could write your own functions and then use lapply. First, write a function to find the most frequent level in a factor variable:
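# Return the most frequent level of the factor v
# (which.max picks the first level if several are tied)
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}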
Then write a function that returns either the mean or the mode per group, depending on the type of variable passed to it:
my_summary <- function(x, id, ...) {
  # Numeric variables: mean within each id
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  # Factor variables: most frequent level within each id
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
}
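One thing to watch for: my_summary only handles numeric and factor columns, and returns NULL for anything else. In R 4.0.0 and later, data.frame() and read.csv() default to stringsAsFactors = FALSE, so text columns may arrive as character rather than factor. A minimal sketch of converting them first, using a made-up data frame (df_chr and its values are hypothetical, not your data):

```r
# Hypothetical data frame whose text column is character, not factor
df_chr <- data.frame(
  IDnumber = c(1, 1, 2),
  Gender   = c("Male", "Male", "Female"),
  Weight   = c(80, 82, 66),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; leave other columns unchanged.
# Assigning into df_chr[] keeps the object a data frame.
df_chr[] <- lapply(df_chr, function(col) {
  if (is.character(col)) factor(col) else col
})

str(df_chr)
```

After this step the factor branch of my_summary should apply to the converted columns.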
Finally, use lapply to calculate the summaries:
data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes
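For anyone who wants to run the whole thing end to end, here is a self-contained sketch. The data frame below is invented (I don't have your data); the values were simply chosen so that the group summaries match the output above.

```r
# Hypothetical sample data: values are made up for illustration only
df <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  Weight    = c(80, 82, 82, 66, 70, 52),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# Most frequent level of a factor (first one wins on ties)
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Mean for numeric variables, mode for factors, computed within each id
my_summary <- function(x, id, ...) {
  if (is.numeric(x)) return(tapply(x, id, mean))
  if (is.factor(x)) return(tapply(x, id, getmode))
}

# Apply the summary to every column and collect the results
result <- data.frame(lapply(df, my_summary, id = df$IDnumber))
print(result)
```

Note that IDnumber itself is numeric, so its "mean" within each group is just the ID again, which is why it shows up as a column in the result.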
If two or more levels of a factor share the same maximum frequency, which.max will just return the first one. I understand from your comment that you just want to know how many of them there are, so one option might be to amend the getmode function slightly, so that it adds an asterisk to the level when there is a tie:
getmode <- function(v) {
  tab <- table(v)
  # More than one level reaches the maximum count: flag the result with ' *'
  if (sum(tab %in% max(tab)) > 1) return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}
(After changing your sample data so that there is one Female and one Male with IDnumber == "2":)
data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes
I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.
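If all you need is a count of how often ties occur, a slightly more direct sketch is to test for a tie within each group and add up the results per factor column. The helper has_tie and the data frame df_tie below are my own hypothetical names and values, not part of your data:

```r
# Hypothetical data: IDnumber 2 has one Female and one Male, so Gender is tied
df_tie <- data.frame(
  IDnumber  = c(1, 1, 1, 2, 2, 3),
  Gender    = factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  LikesSoda = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# TRUE when more than one level reaches the maximum frequency
has_tie <- function(v) {
  tab <- table(v)
  sum(tab == max(tab)) > 1
}

# For each factor column, count the IDs whose mode is ambiguous
factor_cols <- names(df_tie)[sapply(df_tie, is.factor)]
tie_counts <- sapply(df_tie[factor_cols], function(col) {
  sum(tapply(col, df_tie$IDnumber, has_tie))
})
print(tie_counts)
```

This leaves the summary table itself untouched and just tells you, per variable, how many groups are affected.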