Dummyfication of a column/variable [duplicate]

Question

I'm designing a neural network in R. For that I have to prepare my data, and I have imported a table.

For example:

      time    hour Money day
1:  20000616    1  9.35   5
2:  20000616    2  6.22   5 
3:  20000616    3  10.65  5
4:  20000616    4  11.42  5
5:  20000616    5  10.12  5
6:  20000616    6  7.32   5

Now I need to dummify the hour column. My final table should look like this:

      time    Money day  1   2   3   4   5   6   
1:  20000616  9.35   5   1   0   0   0   0   0
2:  20000616  6.22   5   0   1   0   0   0   0
3:  20000616  10.65  5   0   0   1   0   0   0
4:  20000616  11.42  5   0   0   0   1   0   0
5:  20000616  10.12  5   0   0   0   0   1   0
6:  20000616  7.32   5   0   0   0   0   0   1

Is there an easy or smart way to transform my table into the new layout programmatically in R? I need to do this in R, not before the import.

Thanks in advance

Answer 1:

You can easily make dummy variables by using the dummies package.

library(dummies)

df <- data.frame(
  time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616), 
  hour = c(1, 2, 3, 4, 5, 6), 
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32), 
  day = c(5, 5, 5, 5, 5, 5))

# Specify the categorical variables in the dummy.data.frame function.
df_dummy <- dummy.data.frame(df, names=c("hour"), sep="_")
names(df_dummy) <- c("time", 1:6, "Money", "day")
df_dummy <- df_dummy[c("time", "Money", "day", 1:6)]
df_dummy
# time Money day 1 2 3 4 5 6
# 1 20000616  9.35   5 1 0 0 0 0 0
# 2 20000616  6.22   5 0 1 0 0 0 0
# 3 20000616 10.65   5 0 0 1 0 0 0
# 4 20000616 11.42   5 0 0 0 1 0 0
# 5 20000616 10.12   5 0 0 0 0 1 0
# 6 20000616  7.32   5 0 0 0 0 0 1

Answer 2:

A possible solution with data.table (which you are apparently using):

dt[dcast(dt, hour ~ hour, value.var = 'hour', fun.aggregate = length), on = .(hour)]

which gives:

       time hour Money day 1 2 3 4 5 6
1: 20000616    1  9.35   5 1 0 0 0 0 0
2: 20000616    2  6.22   5 0 1 0 0 0 0
3: 20000616    3 10.65   5 0 0 1 0 0 0
4: 20000616    4 11.42   5 0 0 0 1 0 0
5: 20000616    5 10.12   5 0 0 0 0 1 0
6: 20000616    6  7.32   5 0 0 0 0 0 1

I suppose your real dataset will have more variation in time and day; you can then adapt the code to:

dt[dcast(dt, time + day + hour ~ hour, value.var = 'hour', fun.aggregate = length)
   , on = .(time, day, hour)]

Used data:

library(data.table)

dt <- fread(' time    hour Money day
20000616    1  9.35   5
20000616    2  6.22   5
20000616    3  10.65  5
20000616    4  11.42  5
20000616    5  10.12  5
20000616    6  7.32   5')

Answer 3:

A base R solution could be the following:

dat <- data.frame(time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616), 
hour = c(1, 2, 3, 4, 5, 6), 
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32), 
day = c(5, 5, 5, 5, 5, 5) )

dat$dummy_day <- factor(dat$day, levels = 1:7)

model.matrix(~time + hour + Money + day + dummy_day, dat, 
             contrasts = list(dummy_day = "contr.SAS"))

It returns a matrix:

  (Intercept)     time hour Money day dummy_day1 dummy_day2 dummy_day3 dummy_day4 dummy_day5 dummy_day6
1           1 20000616    1  9.35   5          0          0          0          0          1          0
2           1 20000616    2  6.22   5          0          0          0          0          1          0
3           1 20000616    3 10.65   5          0          0          0          0          1          0
4           1 20000616    4 11.42   5          0          0          0          0          1          0
5           1 20000616    5 10.12   5          0          0          0          0          1          0
6           1 20000616    6  7.32   5          0          0          0          0          1          0
attr(,"assign")
 [1] 0 1 2 3 4 5 5 5 5 5 5
attr(,"contrasts")
attr(,"contrasts")$dummy_day
[1] "contr.SAS"
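
Note that the call above expands day rather than hour. As a minimal sketch, assuming the goal is the layout from the question (one indicator column per hour), the intercept can be dropped so that model.matrix keeps every level of the factor. The names hour_f, dummies and result are made up for illustration.

```r
dat <- data.frame(
  time  = rep(20000616, 6),
  hour  = 1:6,
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
  day   = rep(5, 6)
)

# Treat hour as categorical; "- 1" removes the intercept, so no level
# is dropped as the reference and every hour gets its own 0/1 column.
hour_f  <- factor(dat$hour)
dummies <- model.matrix(~ hour_f - 1)

# Use the bare level names ("1", "2", ...) as column names.
colnames(dummies) <- levels(hour_f)

# Append the indicators to the remaining columns.
result <- cbind(dat[c("time", "Money", "day")], dummies)
result
```

The result is a data frame rather than a matrix, because cbind is called with a data frame as its first argument.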

Answer 4:

You should divide your goal into smaller doable problems.

1. Create a matrix of 0's
2. Fill the diagonal with 1's
3. Add the matrix to your original data

# 0. Create data 
df <- mtcars[1:6, 1:4]

                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105

# 1. Create matrix of 0's
foo <- matrix(0, nrow(df), nrow(df))

# 2. Fill diagonal
diag(foo) <- 1

# 3. Combine with original data
cbind(df, foo)

                   mpg cyl disp  hp 1 2 3 4 5 6
Mazda RX4         21.0   6  160 110 1 0 0 0 0 0
Mazda RX4 Wag     21.0   6  160 110 0 1 0 0 0 0
Datsun 710        22.8   4  108  93 0 0 1 0 0 0
Hornet 4 Drive    21.4   6  258 110 0 0 0 1 0 0
Hornet Sportabout 18.7   8  360 175 0 0 0 0 1 0
Valiant           18.1   6  225 105 0 0 0 0 0 1
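
The diagonal trick works here only because every row gets its own column, in order. As a sketch under the assumption that the column to encode may repeat or appear in any order, the same three steps can fill the matrix by (row, column) index pairs instead. The example uses cyl from the same mtcars subset; lev, foo and res are illustrative names.

```r
df <- mtcars[1:6, 1:4]

# Distinct values of the column to encode, in sorted order.
lev <- sort(unique(df$cyl))

# 1. Create a matrix of 0's: one row per observation, one column per level.
foo <- matrix(0, nrow = nrow(df), ncol = length(lev),
              dimnames = list(NULL, paste0("cyl_", lev)))

# 2. Set a 1 at (row i, column of row i's level) via matrix indexing.
foo[cbind(seq_len(nrow(df)), match(df$cyl, lev))] <- 1

# 3. Combine with the original data.
res <- cbind(df, foo)
res
```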

Answer 5:

Some others have mentioned using model.matrix to get the design matrix. This is a good solution, but I find that I usually want to customize how missing values are treated or how rare levels are collapsed. So here is an alternative function that you can customize.

```

    one_hot_encode <- function(DT, cols_to_encode, include_last = TRUE
                               , protected_NA_val = 'NA_MISSING'
    ) {
        for (col in cols_to_encode) {
            level_freq <- DT[, sort(table(get(col), useNA = 'ifany')
                                    , decreasing = TRUE)]
            level_names <- names(level_freq)
            level_names[is.na(level_names)] <- protected_NA_val
            if (!include_last) {
                level_names <- level_names[-length(level_names)]
            }
            for (lev in level_names) {
                new_col_name <- paste('ONE_HOT', col, lev, sep = '_')
                DT[, (new_col_name) := 0]
                if (lev == protected_NA_val) {
                    DT[is.na(get(col)), (new_col_name) := 1]
                } else {
                    DT[get(col) == lev, (new_col_name) := 1]
                }
            }
        }
        return(DT)
    }

```

Applying this function to your dataset then becomes:

```

    library(data.table)

    DT <- data.table(
        time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616)
        , hour = c(1, 2, 3, 4, 5, 6)
        , money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32)
        , day = c(5, 5, 5, 5, 5, 5)
    )
    DT <- one_hot_encode(DT, 'hour')

```

Source: https://stackoverflow.com/questions/48630405/dummyfication-of-a-column-variable

Tags

data.table

time-series

dummy-variable