Question
I apologize in advance if this is a beginner question, but I searched the forum and couldn't find a way to describe what I am trying to do. I have a training set and want to reduce the number of levels in my categorical variables (in the example below, the categorical variable is the state). I would like to map each state to the mean (rate) of class for that level. Once read into a data frame, my training set looks like this:
   state class mean
1     CA     1    0
2     AZ     1    0
3     NY     0    0
4     CA     0    0
5     NY     0    0
6     AZ     0    0
7     AZ     1    0
8     AZ     0    0
9     CA     0    0
10    VA     1    0
I would like the third column of my data frame to hold the mean of the class variable within each state, so the mean for the CA rows would be 0.333, and so on. The mean column could then be used as a replacement for the state column. Is there a good way of doing this in R without writing an explicit loop?
How does one map new levels (for example, new states) that my training set didn't include? Any links to approaches in R would be greatly appreciated.
Answer 1:
This is really what the ave function was designed for. It can be used to compute any function's result by category, but its default function is mean, hence the name, i.e., ave(rage):
dfrm$mean <- with(dfrm, ave(class, state))  # FUN = mean is the default
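To see this on the data from the question, here is a small self-contained sketch. The data frame is rebuilt by hand from the ten rows listed above, and the name dfrm simply follows the answer; nothing else is assumed.

```r
# Rebuild the example data from the question
dfrm <- data.frame(
  state = c("CA", "AZ", "NY", "CA", "NY", "AZ", "AZ", "AZ", "CA", "VA"),
  class = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 1)
)

# ave() returns a vector as long as its input; each element is the
# mean of class over all rows that share that row's state
dfrm$mean <- with(dfrm, ave(class, state))

print(dfrm)
```

Because ave returns one value per row rather than one per group, the result can be assigned straight into a new column without any join or loop.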
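Answer 2:
library(plyr)
# Compute the per-state mean of class, then left-join it back onto the original rows
state_means <- ddply(data, .(state), summarise, mean = mean(class))
join(data, state_means, by = "state", type = "left")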
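The second part of the question, mapping states that never appeared in the training set, is not addressed by either answer. One possible approach, sketched below in base R, is to keep a per-state lookup table and fall back to the overall training mean for unseen states. The fallback choice, the names train, rates, overall, and new_data, and the unseen state "TX" are all assumptions made for illustration, not something the thread specifies.

```r
# Training data from the question
train <- data.frame(
  state = c("CA", "AZ", "NY", "CA", "NY", "AZ", "AZ", "AZ", "CA", "VA"),
  class = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 1)
)

# Lookup table: one row per state seen in training, with its mean class
rates <- aggregate(class ~ state, data = train, FUN = mean)

# Assumed fallback for unseen states: the overall training mean
overall <- mean(train$class)

# Hypothetical new data containing a state ("TX") absent from training
new_data <- data.frame(state = c("CA", "TX", "AZ"))

# match() yields NA for states missing from the lookup table;
# those rows receive the fallback value instead
idx <- match(new_data$state, rates$state)
new_data$mean <- ifelse(is.na(idx), overall, rates$class[idx])

print(new_data)
```

Other fallbacks (for example, a prior rate or a smoothed estimate) may suit a given model better; the overall mean is only the simplest option.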
Source: https://stackoverflow.com/questions/8735283/create-aggregate-column-based-on-variables-with-r