问题
I am comparing two years worth of daily soil moisture (SM) measurements. In one year, SM ranged from 0 to 0.6.
In the other year, which had more rain, SM ranged from 0 to 0.8. Amongst the data, I also have some NA's
, where the SM sensor did not work for some reason.
Let's re-create something similar:
library(data.table)
set.seed(24)
dt1 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"),
sm=sample(c(NA, runif(10, min=0, max=0.6)), 365, replace = TRUE))
dt2 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"),
sm=sample(c(NA, runif(10, min=0, max=0.8)), 365, replace = TRUE))
I am trying to compare both datasets based on the proportion of values between classes of SM in each month.
The classes I am interested in are seq(0, 0.8, by=0.2)
. I also need to count the proportion of failed measurements (NA
) per month.
I managed to do that by using akrun
's helpful answer here:
R - Calculate percentage of occurrences in data.table by month
tmp1 <- dt1[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
by=.(month=month(date),
grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]
tmp2 <- dt2[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
by=.(month=month(date),
grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]
However, the output is not exactly what I expect. Since values in dt1
range only from 0 to 0.6, there is no 0.6-0.8
category at all in the resulting data table tmp1
.
It looks like cut
ignores the last category (0.6-0.8
) because there is no SM measurement within that range. This makes my comparison really inconvenient, because I don't have the same groups in the resulting data tables tmp1
and tmp2
.
Does anybody know how to fix this, i.e. how to "force" cut
to consider values outside the break range? I need all SM categories in both tmp1
and tmp2
, even if their count is 0.
Just as a reference, this issue does not happen if we use table
, which always shows all categories even if their count is zero:
t1 <- runif(10, 0, 0.6)
t2 <- runif(10, 0, 0.8)
table(cut(t1, breaks=seq(0, 0.8, by=0.2)))
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
5 3 2 0
table(cut(t2, breaks=seq(0, 0.8, by=0.2)))
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
1 3 2 4
Any input appreciated.
回答1:
Use CJ
to count all levels, even those that don't appear in the table:
f = function(d){
# create month column
d[, month := month(date)]
# roll to make cut-group column
mdt = data.table(sm = c(NA, seq(0, .8, by=.2)))
d[, lb := mdt[.SD, on=.(sm), roll=TRUE, x.sm]]
# join with CJ to ensure all levels are present
res = d[CJ(month = month, lb = mdt$sm, unique = TRUE), on=.(month, lb), .N, by=.EACHI]
# rescale to monthly pct
res[, pct := N/sum(N), by=month][]
}
# try it
f(dt1)
f(dt2)
You could also do this with cut
. The important thing is how you're tabulating results, not how you're grouping them...
来源:https://stackoverflow.com/questions/48836376/r-overcome-cut-ignoring-values-outside-of-range-in-data-table