Question
I'm designing a neural network in R. To prepare my data, I have imported a table.
For example:
time hour Money day
1: 20000616 1 9.35 5
2: 20000616 2 6.22 5
3: 20000616 3 10.65 5
4: 20000616 4 11.42 5
5: 20000616 5 10.12 5
6: 20000616 6 7.32 5
Now I need to dummy-code the hour column, with one indicator column per hour. My final table should look like this:
time Money day 1 2 3 4 5 6
1: 20000616 9.35 5 1 0 0 0 0 0
2: 20000616 6.22 5 0 1 0 0 0 0
3: 20000616 10.65 5 0 0 1 0 0 0
4: 20000616 11.42 5 0 0 0 1 0 0
5: 20000616 10.12 5 0 0 0 0 1 0
6: 20000616 7.32 5 0 0 0 0 0 1
Is there an easy or smart way to transform my table into the new layout programmatically in R? I need to do this in R, not before the import.
Thanks in advance
Answer 1:
You can easily create dummy variables with the dummies package.
library(dummies)
df <- data.frame(
time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616),
hour = c(1, 2, 3, 4, 5, 6),
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
day = c(5, 5, 5, 5, 5, 5))
# Specify the categorical variables in the dummy.data.frame function.
df_dummy <- dummy.data.frame(df, names=c("hour"), sep="_")
# Rename the generated hour_1 ... hour_6 columns to 1 ... 6, then move them to the end.
names(df_dummy) <- c("time", 1:6, "Money", "day")
df_dummy <- df_dummy[c("time", "Money", "day", 1:6)]
df_dummy
# time Money day 1 2 3 4 5 6
# 1 20000616 9.35 5 1 0 0 0 0 0
# 2 20000616 6.22 5 0 1 0 0 0 0
# 3 20000616 10.65 5 0 0 1 0 0 0
# 4 20000616 11.42 5 0 0 0 1 0 0
# 5 20000616 10.12 5 0 0 0 0 1 0
# 6 20000616 7.32 5 0 0 0 0 0 1
Answer 2:
A possible solution with data.table (which you are apparently using):
dt[dcast(dt, hour ~ hour, value.var = 'hour', fun = length), on = .(hour)]
which gives:
time hour Money day 1 2 3 4 5 6
1: 20000616 1 9.35 5 1 0 0 0 0 0
2: 20000616 2 6.22 5 0 1 0 0 0 0
3: 20000616 3 10.65 5 0 0 1 0 0 0
4: 20000616 4 11.42 5 0 0 0 1 0 0
5: 20000616 5 10.12 5 0 0 0 0 1 0
6: 20000616 6 7.32 5 0 0 0 0 0 1
I suppose your real dataset will have more variation in time and day; you can then adapt the code to:
dt[dcast(dt, time + day + hour ~ hour, value.var = 'hour', fun = length)
, on = .(time, day, hour)]
Used data:
dt <- fread(' time hour Money day
20000616 1 9.35 5
20000616 2 6.22 5
20000616 3 10.65 5
20000616 4 11.42 5
20000616 5 10.12 5
20000616 6 7.32 5')
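To make the adapted call easier to follow, here is a minimal sketch that splits it into its two steps. The data is made up for illustration (the same hours on two different days), and the names wide and res are not from the answer.

```r
library(data.table)

# Hypothetical data: the same hours recorded on two different days.
dt <- data.table(
  time  = rep(c(20000616, 20000617), each = 3),
  hour  = rep(1:3, times = 2),
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
  day   = rep(c(5, 6), each = 3)
)

# Step 1: one row per (time, day, hour) and one count column per hour value.
wide <- dcast(dt, time + day + hour ~ hour,
              value.var = "hour", fun.aggregate = length)

# Step 2: join the indicator columns back onto the original rows.
res <- dt[wide, on = .(time, day, hour)]
res
```

Because the aggregate is a count, a key combination that occurs more than once would produce a value above 1, so this sketch assumes each (time, day, hour) combination is unique.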
Answer 3:
A base R solution could be the following (note that this example encodes day, not hour):
dat <- data.frame(time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616),
hour = c(1, 2, 3, 4, 5, 6),
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
day = c(5, 5, 5, 5, 5, 5) )
dat$dummy_day <- factor(dat$day, levels = 1:7)
model.matrix(~time + hour + Money + day + dummy_day, dat,
contrasts = list(dummy_day = "contr.SAS"))
It returns a matrix:
(Intercept) time hour Money day dummy_day1 dummy_day2 dummy_day3 dummy_day4 dummy_day5 dummy_day6
1 1 20000616 1 9.35 5 0 0 0 0 1 0
2 1 20000616 2 6.22 5 0 0 0 0 1 0
3 1 20000616 3 10.65 5 0 0 0 0 1 0
4 1 20000616 4 11.42 5 0 0 0 0 1 0
5 1 20000616 5 10.12 5 0 0 0 0 1 0
6 1 20000616 6 7.32 5 0 0 0 0 1 0
attr(,"assign")
[1] 0 1 2 3 4 5 5 5 5 5 5
attr(,"contrasts")
attr(,"contrasts")$dummy_day
[1] "contr.SAS"
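The example above encodes day. As a rough sketch, the same base R approach can be pointed at hour to get the layout asked for in the question. Removing the intercept with `- 1` makes model.matrix keep a column for every level. The names hour_f, hour_dummies and result are only illustrative.

```r
dat <- data.frame(
  time  = rep(20000616, 6),
  hour  = c(1, 2, 3, 4, 5, 6),
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
  day   = rep(5, 6)
)

# Treat hour as a categorical variable.
dat$hour_f <- factor(dat$hour)

# "- 1" drops the intercept, so every level of hour_f gets its own 0/1 column.
hour_dummies <- model.matrix(~ hour_f - 1, data = dat)

# Rename hour_f1 ... hour_f6 to plain 1 ... 6.
colnames(hour_dummies) <- levels(dat$hour_f)

# Drop the original hour column and append the indicators.
result <- cbind(dat[c("time", "Money", "day")], hour_dummies)
result
```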
Answer 4:
You should divide your goal into smaller doable problems.
- Create matrix of 0's
- Fill diagonal with 1's
- Add matrix to your original data
# 0. Create data
df <- mtcars[1:6, 1:4]
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105
# 1. Create matrix of 0's
foo <- matrix(rep(0, nrow(df) ^ 2), nrow(df))
# 2. Fill diagonal
diag(foo) <- 1
# 3. Combine with original data
cbind(df, foo)
                   mpg cyl disp  hp 1 2 3 4 5 6
Mazda RX4         21.0   6  160 110 1 0 0 0 0 0
Mazda RX4 Wag     21.0   6  160 110 0 1 0 0 0 0
Datsun 710        22.8   4  108  93 0 0 1 0 0 0
Hornet 4 Drive    21.4   6  258 110 0 0 0 1 0 0
Hornet Sportabout 18.7   8  360 175 0 0 0 0 1 0
Valiant           18.1   6  225 105 0 0 0 0 0 1
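Filling the diagonal only gives the right result when row i belongs to category i, as in the six-row example from the question. The sketch below follows the same three steps but sets the cell that matches each row's own value. It assumes the categories live in an hour column; the data is made up so that hours repeat and are out of order.

```r
# Made-up data in which hours repeat and are not sorted.
df <- data.frame(
  hour  = c(3, 1, 3, 2),
  Money = c(9.35, 6.22, 10.65, 11.42)
)

# 1. Create a matrix of 0's: one row per observation, one column per hour value.
hours <- sort(unique(df$hour))
foo <- matrix(0, nrow = nrow(df), ncol = length(hours),
              dimnames = list(NULL, as.character(hours)))

# 2. Instead of the diagonal, set cell (i, column of row i's hour) to 1.
foo[cbind(seq_len(nrow(df)), match(df$hour, hours))] <- 1

# 3. Combine with the original data.
out <- cbind(df, foo)
out
```

Indexing a matrix with a two-column matrix of (row, column) pairs is what lets step 2 fill one cell per row in a single assignment.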
Answer 5:
Some others have mentioned using model.matrix to get the design matrix. This is a good solution, but I find that I usually want to customize how missing values are treated or how rare levels are collapsed. So here is an alternative function that you can customize.
```
one_hot_encode <- function(DT, cols_to_encode, include_last = TRUE
                           , protected_NA_val = 'NA_MISSING'
                           ) {
  for (col in cols_to_encode) {
    level_freq <- DT[, sort(table(get(col), useNA = 'ifany')
                            , decreasing = TRUE)]
    level_names <- names(level_freq)
    level_names[is.na(level_names)] <- protected_NA_val
    if (!include_last) {
      level_names <- level_names[-length(level_names)]
    }
    for (lev in level_names) {
      new_col_name <- paste('ONE_HOT', col, lev, sep = '_')
      DT[, (new_col_name) := 0]
      if (lev == protected_NA_val) {
        DT[is.na(get(col)), (new_col_name) := 1]
      } else {
        DT[get(col) == lev, (new_col_name) := 1]
      }
    }
  }
  return(DT)
}
```
Applying this function to your dataset then becomes:
```
library(data.table)

DT <- data.table(
  time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616)
  , hour = c(1, 2, 3, 4, 5, 6)
  , money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32)
  , day = c(5, 5, 5, 5, 5, 5)
)
# Adds columns ONE_HOT_hour_1 ... ONE_HOT_hour_6 to DT by reference.
DT <- one_hot_encode(DT, 'hour')
```
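The function above handles missing values but does not itself collapse rare levels. One possible way to do that before encoding is sketched below in base R. The helper collapse_rare and its parameters min_count and other are hypothetical names introduced here, not part of the answer.

```r
# Hypothetical helper: replace levels seen fewer than min_count times
# with a single catch-all label. NA values are left as NA.
collapse_rare <- function(x, min_count = 2, other = "OTHER") {
  x <- as.character(x)
  counts <- table(x)                        # NA is not counted by default
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- other
  x
}

x <- c("a", "a", "b", NA, "c", "a")
collapsed <- collapse_rare(x)
collapsed
```

The collapsed vector could then be written back to the column before it is passed to an encoder, so that all rare levels share one indicator column.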
Source: https://stackoverflow.com/questions/48630405/dummyfication-of-a-column-variable