How do you map every combination of categorical variables in R?

盖世英雄少女心 2021-01-07 00:32

Using R, is there a way to take a data set and map out every possible combination of every categorical variable?

For example, let's say I had 10,000 rows of custom

4 Answers
  • 2021-01-07 00:57

    Yes, there is:

    expand.grid(gender = c("male", "female"), tShirtSize = c("xs", "s","m","l","xl"))

    This returns all the combinations in a data frame. For the summary stats, try aggregate, e.g.:

    country = sample(c("america", "canadian"), 30, replace = TRUE)
    gender = sample(c("male", "female"), 30, replace = TRUE)
    x = abs(rnorm(30) * 1000)
    aggregate(data.frame(x), by = list(country, gender), FUN = mean)
    

    I run into errors if the data frame contains string columns, so I'd subset to the numeric columns first.
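The two pieces above can be combined so that combinations with no rows still show up. This is a minimal sketch, not the answerer's code: the data frame `df` and its columns `gender`, `size` and `x` are made up for illustration, and only base R is assumed.

```r
# Hypothetical example data; names and values are illustrative only
set.seed(1)
df <- data.frame(
  gender = sample(c("male", "female"), 30, replace = TRUE),
  size   = sample(c("xs", "s", "m", "l", "xl"), 30, replace = TRUE),
  x      = abs(rnorm(30) * 1000),
  stringsAsFactors = FALSE
)

# Every possible combination of the category values seen in the data
combos <- expand.grid(
  gender = unique(df$gender),
  size   = unique(df$size),
  stringsAsFactors = FALSE
)

# Mean of x for each combination that actually occurs
means <- aggregate(x ~ gender + size, data = df, FUN = mean)

# Left join: combinations without any rows get NA in x
full <- merge(combos, means, by = c("gender", "size"), all.x = TRUE)
full
```

Taking the values from `unique()` rather than typing them out keeps the grid in sync with the data; if a category can exist without appearing in the data, it would have to be listed by hand as in the `expand.grid` call above.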

  • 2021-01-07 00:58

    Create some dummy data:

    dataset <- data.frame(
        spend=10*runif(100),
        email=sample(c("yahoo","gmail","hotmail","aol"),100,replace=TRUE),
        browser=sample(c("Mozilla","IE","Chrome","Opera"),100,replace=TRUE),
        country=sample(c("USA","Canada","China","Australia",
          "Egypt","S.Korea","Brazil"),100,replace=TRUE))
    

    Average the spend per combination:

    with(dataset,by(spend,list(email,browser,country),mean))
    

    Note the NAs for combinations without entries.

    Or turn this into a three-dimensional array:

    as.table(with(dataset,by(spend,list(email,browser,country),mean)))
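If a flat table is easier to work with than a three-dimensional array, the array can be unrolled into one row per combination. The sketch below uses `tapply` in place of `by` (a swapped-in base R function that produces the same kind of array), recreates the dummy data with an arbitrary seed, and introduces the column name `mean_spend` purely for illustration.

```r
# Same shape of dummy data as above; the seed is an arbitrary choice
set.seed(42)
dataset <- data.frame(
  spend   = 10 * runif(100),
  email   = sample(c("yahoo", "gmail", "hotmail", "aol"), 100, replace = TRUE),
  browser = sample(c("Mozilla", "IE", "Chrome", "Opera"), 100, replace = TRUE),
  country = sample(c("USA", "Canada", "China", "Australia",
                     "Egypt", "S.Korea", "Brazil"), 100, replace = TRUE)
)

# tapply returns a 3-d array; cells with no matching rows are NA
arr <- with(dataset,
            tapply(spend,
                   list(email = email, browser = browser, country = country),
                   mean))

# Unroll the array into one row per combination
long <- as.data.frame(as.table(arr), responseName = "mean_spend")

# Keep only the combinations that occur in the data
observed <- long[!is.na(long$mean_spend), ]
head(observed)
```

`long` keeps every cell of the array, including the empty ones, so it doubles as a listing of all possible combinations.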
    
  • 2021-01-07 01:03

    Here's a method that uses dplyr:

    require(magrittr)
    require(dplyr)    
    
    set.seed(123)
    dat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                     browser=sample(c("mozilla", "ie"), 10000, replace=T),
                     country=sample(c("usa", "canada"), 10000, replace=T),
                     money=runif(10000))  
    dat %>%
      group_by(email, browser, country) %>%
      summarize(mean = mean(money))
    # email browser country      mean
    # 1 gmail      ie  canada 0.5172424
    # 2 gmail      ie     usa 0.4921908
    # 3 gmail mozilla  canada 0.4934892
    # 4 gmail mozilla     usa 0.4993923
    # 5 yahoo      ie  canada 0.5013214
    # 6 yahoo      ie     usa 0.5098280
    # 7 yahoo mozilla  canada 0.4985357
    # 8 yahoo mozilla     usa 0.4919743
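The same pipeline can return several statistics per combination. The following is a sketch rather than part of the original answer: it assumes dplyr 1.0.0 or later (for the `.groups` argument), and the output column names `mean` and `n` are arbitrary.

```r
library(dplyr)

set.seed(123)
dat <- data.frame(email   = sample(c("yahoo", "gmail"), 10000, replace = TRUE),
                  browser = sample(c("mozilla", "ie"), 10000, replace = TRUE),
                  country = sample(c("usa", "canada"), 10000, replace = TRUE),
                  money   = runif(10000))

stats <- dat %>%
  group_by(email, browser, country) %>%
  summarise(mean = mean(money),   # average spend in the group
            n    = n(),           # number of rows in the group
            .groups = "drop")     # return an ungrouped result
stats
```

Only combinations that occur in the data appear in the result; a combination with zero rows is simply absent.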
    

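    EDIT: if you want to pass a list of column names into group_by(), you need a standard-evaluation form. This answer originally used regroup(), which has since been removed from dplyr; with dplyr 1.0.0 or later, group_by(across(all_of(...))) accepts a character vector instead. For example,

    mylist <- list("email", "browser", "country")
    dat %>%
      group_by(across(all_of(unlist(mylist)))) %>%
      summarize(mean = mean(money))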
    

    Also see "dplyr: How to use group_by inside a function?"

  • 2021-01-07 01:13

    You can do this with aggregate:

    set.seed(144)
    dat = data.frame(email=sample(c("yahoo", "gmail"), 10000, replace=T),
                     browser=sample(c("mozilla", "ie"), 10000, replace=T),
                     country=sample(c("usa", "canada"), 10000, replace=T),
                     money=runif(10000))
    aggregate(dat$money, by=list(browser=dat$browser, email=dat$email,
                                 country=dat$country), mean)
    #   browser email country         x
    # 1      ie gmail  canada 0.4905588
    # 2 mozilla gmail  canada 0.5064342
    # 3      ie yahoo  canada 0.4894398
    # 4 mozilla yahoo  canada 0.4959031
    # 5      ie gmail     usa 0.5069363
    # 6 mozilla gmail     usa 0.5088138
    # 7      ie yahoo     usa 0.4957478
    # 8 mozilla yahoo     usa 0.4993698
    

    To get multiple columns like mean and count together, you can do:

    res = aggregate(dat$money, by=list(browser=dat$browser, email=dat$email,
                                       country=dat$country),
                    FUN=function(x) c(mean=mean(x), count=length(x)))
    res
    #   browser email country       x.mean      x.count
    # 1      ie gmail  canada    0.4905588 1261.0000000
    # 2 mozilla gmail  canada    0.5064342 1227.0000000
    # 3      ie yahoo  canada    0.4894398 1267.0000000
    # 4 mozilla yahoo  canada    0.4959031 1253.0000000
    # 5      ie gmail     usa    0.5069363 1240.0000000
    # 6 mozilla gmail     usa    0.5088138 1236.0000000
    # 7      ie yahoo     usa    0.4957478 1213.0000000
    # 8 mozilla yahoo     usa    0.4993698 1303.0000000
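One thing to watch with this approach: when FUN returns a vector, aggregate stores the results in a single matrix column `x`, so `x.mean` and `x.count` are not ordinary columns yet. A common base R idiom for flattening it is sketched below; the name `flat` is made up for illustration.

```r
set.seed(144)
dat <- data.frame(email   = sample(c("yahoo", "gmail"), 10000, replace = TRUE),
                  browser = sample(c("mozilla", "ie"), 10000, replace = TRUE),
                  country = sample(c("usa", "canada"), 10000, replace = TRUE),
                  money   = runif(10000))

res <- aggregate(dat$money,
                 by = list(browser = dat$browser, email = dat$email,
                           country = dat$country),
                 FUN = function(x) c(mean = mean(x), count = length(x)))

# res has 4 columns; the last one, x, is a two-column matrix
is.matrix(res$x)

# Rebuild the data frame so the matrix becomes two separate columns
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, `flat$x.mean` and `flat$x.count` can be used like any other column, for example to sort or filter by group size.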
    