问题
I have a data table and I want to create a new column for each unique day, and then assign a 1 in each row where the day matches the column name
I have done this using a for loop but I was wondering if there was any way to optimise it using data.table and set?
Here is an example
dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
if (Day[i] != "Sunday") {
dt[, Day[i] := ifelse(Week_Day == Day[i], 1, 0)]
}
}
my table is 298k rows and although it doesn't take long to execute (below), its part of a long script and I have quite a few inefficient loops so I am trying to get the overall run time down.
Time to run:
user system elapsed
0.99 0.06 1.05
Thanks in advance.
回答1:
Here's a different approach that, performs better - on my machine - than the original approach in the question
1) Get unique days except Sunday
Day <- setdiff(dt$Week_Day, "Sunday")
2) Initialize new columns with 0:
dt[, (Day) := 0L]
3) Update with 1s by reference in a loop:
for(x in Day) {
set(dt, i = which(dt[["Week_Day"]] == x), j = x, value = 1L)
}
Simple performance comparison:
dt1 <- data.table(Week_Day = sample(c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"), 3e5, TRUE))
dt2 <- copy(dt1)
system.time({
Day <- setdiff(unique(dt$Week_Day), "Sunday")
dt1[, (Day) := 0L]
for(x in Day) {
set(dt1, i = which(dt1[["Week_Day"]] == x), j = x, value = 1L)
}
})
# User System verstrichen
# 0.029 0.003 0.032
system.time({
Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
if (Day[i] != "Sunday") {
dt2[, Day[i] := ifelse(Week_Day == Day[i], 1L, 0L)]
}
}
})
# User System verstrichen
# 0.138 0.070 0.210
all.equal(dt1, dt2)
#[1] TRUE
回答2:
Here is one attempt at a speed up:
Day <- unique(dt$Week_Day)
setkey(dt, Week_Day)
# create columns of 0s
dt[, (Day) := 0L]
for (i in seq_along(head(Day, -1))) {
dt[Day[i], Day[i] := 1L]
}
This implements a couple of the data.table
speed ups including binary search in the second chain and the elimination of ifelse
with replacement by reference.
来源:https://stackoverflow.com/questions/38526086/speed-up-this-loop-to-create-dummy-columns-with-data-table-and-set-in-r