问题
In a attempt to make a test 2d density plot with ggplot2, I used the code snippet:
ggplot(df, aes(x = S1.x, y = S1.y)) + geom_point() + geom_density_2d()
and I got the error: "Computation failed in stat_density2d()
: bandwidths must be strictly positive"
My dataframe looks like this:
> df
transcriptID S1.x S1.y S2.x S2.y
DQ459412 0.000000 0.000000 0.000000 0.000000
DQ459413 1.584963 2.358379 4.392317 3.085722
DQ459415 0.000000 0.000000 0.000000 0.000000
DQ459418 0.000000 0.000000 0.000000 0.000000
DQ459419 0.000000 0.000000 4.000000 2.891544
DQ459420 0.000000 0.000000 0.000000 0.000000
Also, var(df[,"S1.x"]) > 0
and var(df[,"S1.y"]) > 0
.
Fig 1 - 2d density plot with error
However, I got a density plot without error by running:
ggplot(df, aes(x = S2.x, y = S2.y)) + geom_point() + geom_density_2d()
Fig 2 - density plot without error
How do I address the error in Fig 1?
回答1:
So the real problem is that the S1.x
and S1.y
values only have one non-zero value in their columns. And it turns out that geom_density_2d
can't really estimate a density with only a value or two. But read on...
Update:
This question has been asked before, and the answers are usually that you need to have non-zero variance in your data columns. But you do have non-zero variance, so why isn't it working?
- Looking at the internals of
geom_density_2d
we see that it uses theMASS::kde2d
package function to calculate the distribution. - Looking at
kde2d
we see that it usesMASS::bandwidth.nrd(df$x)
to get an estimate of the bandwidth. - Looking at the help (which has the code) for
bandwidth.nrd
we see it uses a rule of thumb that gets thequantile
of the distribution, and subtracts the 2nd quantile from the 1st quantile to get a bandwidth estimate. - Doing a quantile on your original data we see that the quantiles of the data were zero.
- And running
MASS::kde2d
on your original data with thatbandwidth.nrd
estimate of the bandwidth gives you the same error:
library(MASS) nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420") s1x <- c(0,1.584963,0,0,0,0) s1y <- c(0,2.358379,0,0,0,0) s2x <- c(0,4.392317,0,0,4,0) s2y <- c(0,3.085722,0,0,2.891544,0) df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
> quantile(df$s1x)
0% 25% 50% 75% 100%
0.000000 0.000000 0.000000 0.000000 1.584963
> quantile(df$s1y)
0% 25% 50% 75% 100%
0.000000 0.000000 0.000000 0.000000 2.358379
h <- c(MASS::bandwidth.nrd(df$x), MASS::bandwidth.nrd(df$y)) dens <- MASS::kde2d(df$s1x, df$s1y, h = h, n = n, lims = c(0,1,0,1))
Error in MASS::kde2d(df$s1x, df$s1y, h = h, n = n, lims = c(0, 1, 0, 1)) : bandwidths must be strictly positive
So the real criteria for using geom_density_2D
is that both the x- and the y-data needs to have a non-zero gap between their 1st and 2nd quantiles.
Now to fix it, if I make a small modification - replacing one of the zeros with 0.1, like this:
nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0.1,0)
s1y <- c(0,2.358379,0,0,0.1,0)
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0)
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
print(df)
yielding:
transcriptID S1.x S1.y S2.x S2.y
1 DQ459412 0.000000 0.000000 0.000000 0.000000
2 DQ459413 1.584963 2.358379 4.392317 3.085722
3 DQ459415 0.000000 0.000000 0.000000 0.000000
4 DQ459418 0.000000 0.000000 0.000000 0.000000
5 DQ459419 0.100000 0.100000 4.000000 2.891544
6 DQ459420 0.000000 0.000000 0.000000 0.000000
Then I get this plot instead of your error.
You can let that 0.1
value approach zero, eventually it will not be able to calculate a distribution anymore and you will get your error again.
One general way to deal with this situation is to add a very small quantity of noise to your data, kind of simulating the fact that any meaningful calculation based on a real measurement from a continuous distribution should be impervious to that small quantity of noise.
Hope that helps.
回答2:
The answer of @Mike Wise is pretty solid indeed and my answer is somewhat complementary to it. Actually, the bandwidth.nrd
function computes the difference between the 3rd and the 1st quantile not the 2nd and 1st (code from the function):
r <- quantile(distances, c(0.25, 0.75))
Instead of adding random noise to your data, I would suggest to precompute the bandwidths yourself and pass them to the function, testing for non-zero values like so:
kde2d(df$s1x, df$s1y,
h = c(ifelse(bandwidth.nrd(df$s1x) == 0, 0.1, bandwidth.nrd(df$s1x)),
ifelse(bandwidth.nrd(df$s1y) == 0, 0.1, bandwidth.nrd(df$s1y))))
Hope this helps.
来源:https://stackoverflow.com/questions/53075331/error-using-geom-density-2d-in-r-computation-failed-in-stat-density2d-b