Generate a dummy-variable

遇见更好的自我 2020-11-21 11:41

I have trouble generating the following dummy-variables in R:

I'm analyzing yearly time series data (time period 1948-2009). I have two questions:

  1. <
17 answers
  •  暗喜 (original poster)
    2020-11-21 12:24

    For the use case presented in the question, you can also just multiply the logical condition by 1 (or, better, by 1L, which keeps the result an integer rather than a double):

    # example data
    df1 <- data.frame(yr = 1951:1960)
    
    # create the dummies
    df1$is.1957 <- 1L * (df1$yr == 1957)
    df1$after.1957 <- 1L * (df1$yr >= 1957)
    

    which gives:

    > df1
         yr is.1957 after.1957
    1  1951       0          0
    2  1952       0          0
    3  1953       0          0
    4  1954       0          0
    5  1955       0          0
    6  1956       0          0
    7  1957       1          1
    8  1958       0          1
    9  1959       0          1
    10 1960       0          1
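
    The multiplication is one of several equivalent ways to coerce a logical vector to 0/1. As a small sketch (the column names `is.1957.b` and `after.1957.b` are made up for this comparison and are not part of the answer), `as.integer()` should give the same result as `1L * ...`:

    ```r
    # example data, as above
    df1 <- data.frame(yr = 1951:1960)

    # 1L * logical and as.integer(logical) both yield integer 0/1 vectors
    df1$is.1957      <- 1L * (df1$yr == 1957)
    df1$is.1957.b    <- as.integer(df1$yr == 1957)
    df1$after.1957.b <- as.integer(df1$yr >= 1957)

    # the two spellings produce identical columns
    identical(df1$is.1957, df1$is.1957.b)

    # multiplying by 1 instead of 1L gives a double vector, not an integer one
    is.integer(1 * (df1$yr == 1957))
    ```

    Which spelling to use is mostly a matter of taste; `as.integer()` states the intent more explicitly, while the multiplication is shorter.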
    

    For the use cases presented in, for example, the answers of @zx8754 and @Sotos, there are some further options that haven't been covered yet.

    1) Write your own make_dummies function

    # example data
    df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))
    
    # create a function
    make_dummies <- function(v, prefix = '') {
      s <- sort(unique(v))
      d <- outer(v, s, function(v, s) 1L * (v == s))
      colnames(d) <- paste0(prefix, s)
      d
    }
    
    # bind the dummies to the original dataframe
    cbind(df2, make_dummies(df2$year, prefix = 'y'))
    

    which gives:

      id year y1991 y1992 y1993 y1994
    1  1 1991     1     0     0     0
    2  2 1992     0     1     0     0
    3  3 1993     0     0     1     0
    4  4 1994     0     0     0     1
    5  5 1992     0     1     0     0
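
    The same function is not limited to numeric years. The sketch below repeats the definition so that it runs on its own and applies it to a character vector; the vector `grp` and the prefix `grp_` are hypothetical and only serve as illustration. It also shows what happens with a missing value, which is worth checking before relying on the result:

    ```r
    # same function as above, repeated so the example is self-contained
    make_dummies <- function(v, prefix = '') {
      s <- sort(unique(v))
      d <- outer(v, s, function(v, s) 1L * (v == s))
      colnames(d) <- paste0(prefix, s)
      d
    }

    # hypothetical character input
    grp <- c("b", "a", "c", "a")
    d <- make_dummies(grp, prefix = "grp_")

    colnames(d)   # one column per distinct value, in sorted order
    rowSums(d)    # every row has exactly one 1

    # sort() drops NA, so NA gets no column of its own,
    # and the comparison v == s turns the NA row into all NA
    d_na <- make_dummies(c("a", NA, "b"))
    d_na
    ```

    If missing values should become all-zero rows instead, the result could be post-processed, for example with `d_na[is.na(d_na)] <- 0L`.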
    

    2) Use the dcast function from either data.table or reshape2

    library(reshape2)
    dcast(df2, id + year ~ year, fun.aggregate = length, value.var = 'year')
    

    which gives:

      id year 1991 1992 1993 1994
    1  1 1991    1    0    0    0
    2  2 1992    0    1    0    0
    3  3 1993    0    0    1    0
    4  4 1994    0    0    0    1
    5  5 1992    0    1    0    0
    

    However, this will not work when there are duplicate values in the column for which the dummies have to be created. In that case a specific aggregation function is needed for dcast, and the result of dcast needs to be merged back to the original:

    # example data
    df3 <- data.frame(var = c("B", "C", "A", "B", "C"))
    
    # aggregation function to get dummy values
    f <- function(x) as.integer(length(x) > 0)
    
    # reshape to wide with the custom aggregation function and merge back to the original
    merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)
    

    which gives (note that the result is ordered according to the by column):

      var A B C
    1   A 1 0 0
    2   B 0 1 0
    3   B 0 1 0
    4   C 0 0 1
    5   C 0 0 1
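
    If the original row order matters, the merge step can be avoided. The following is a minimal base R sketch, not part of the answer's dcast approach: it builds the dummy matrix directly with `outer()` (the same idea as in option 1) and binds it to the data frame, so rows stay where they were. The names `lv`, `dummies` and `res` are chosen here for illustration.

    ```r
    # example data, as above
    df3 <- data.frame(var = c("B", "C", "A", "B", "C"))

    # as.character() guards against var being a factor (the default in R < 4.0)
    v  <- as.character(df3$var)
    lv <- sort(unique(v))

    # compare every value with every distinct level; 1L * turns TRUE/FALSE into 1/0
    dummies <- 1L * outer(v, lv, "==")
    colnames(dummies) <- lv

    # cbind keeps the original row order (B, C, A, B, C)
    res <- cbind(df3, dummies)
    res
    ```

    Because no join is involved, duplicate values in `var` need no special aggregation function here.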
    

    3) Use the spread function from tidyr (with mutate from dplyr)

    library(dplyr)
    library(tidyr)
    
    df2 %>% 
      mutate(v = 1, yr = year) %>% 
      spread(yr, v, fill = 0)
    

    which gives:

      id year 1991 1992 1993 1994
    1  1 1991    1    0    0    0
    2  2 1992    0    1    0    0
    3  3 1993    0    0    1    0
    4  4 1994    0    0    0    1
    5  5 1992    0    1    0    0
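
    In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. A rough equivalent of the pipeline above might look like the sketch below; it assumes tidyr >= 1.0.0 is installed, and the `y` prefix is an addition made here (borrowed from option 1) so that the new column names are syntactic. Note that the result is a tibble rather than a plain data frame.

    ```r
    library(dplyr)
    library(tidyr)

    # example data, as above
    df2 <- data.frame(id = 1:5, year = c(1991:1994, 1992))

    res <- df2 %>%
      # v holds the dummy value; yr is a copy of year that gets spread into columns
      mutate(v = 1L, yr = year) %>%
      pivot_wider(
        names_from   = yr,
        values_from  = v,
        values_fill  = list(v = 0L),  # fill absent combinations with 0
        names_prefix = "y"            # assumption: prefix added for readable names
      )

    res
    ```

    The named-list form of `values_fill` is used because it is accepted by tidyr 1.0.0 as well as by later versions.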
    
