问题
I have asked this question a long time ago, but haven't found the answer yet. I do not know if this is legit in stackoverflow, but I repost it.
I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
I have just tried:
- merging the
DT
andintervals.dt
data.tables by month/year, - creating a new
intervalsstring
column consisting of all the V* columns to one column string, (not very elegant, I admit), and finally - substringing it to a vector, so as I can use it in
findInterval()
but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for the row 3380, the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that by my last command, finding Intervals relies only on the first row of intervalsstring
.
Thank you!
回答1:
Your main problem is that you just didn't do findInterval
for each group. But I also don't see the point of making that large merged data.table
, or the paste
/strsplit
business. This is what I would do:
DT[, interval := findInterval(price,
intervals.dt[.BY][, V1:V10, with = F]),
by = .(year, month)][]
# year month price interval
# 1: 2000 1 30.776611 2
# 2: 2000 1 62.499648 6
# 3: 2000 1 53.581115 6
# 4: 2000 1 48.830599 5
# 5: 2000 1 33.066053 2
# ---
#3376: 2009 10 33.635924 1
#3377: 2009 10 38.993769 1
#3378: 2009 10 75.065820 7
#3379: 2009 10 6.277403 0
#3380: 2009 10 64.189162 5
Note that intervals.dt[.BY]
is a keyed subset.
来源:https://stackoverflow.com/questions/22671400/findinterval-with-varying-intervals-in-data-table-r