Calculate probability of value based on 2D density plot in R

Question

I'm looking to work out a function to calculate the likelihood of a certain combination for B and R. The current illustration of the data looks like so:

ggplot(df, aes(R,B)) +
geom_bin2d(binwidth = c(1,1))

Is there a way to calculate the probabilities of each combination (e.g. R = 23, B = 30) based on these two discrete correlated variables that are positively skewed?

Could stat_density_2d be used to solve this, or is there a better way?

Thanks.

Answer 1:

stat_density_2d uses MASS::kde2d under the hood. I imagine there are slicker ways to do this, but we can feed the data into that function and convert it into tidy data to get a smoothed version for that type of estimate.

First, some data like yours:

library(tidyverse)
set.seed(42)
df <- tibble(
  R = rlnorm(1E4, 0, 0.2) * 100,
  B = R * rnorm(1E4, 1, 0.2)
)

ggplot(df, aes(R,B)) +
  geom_bin2d(binwidth = c(1,1))

Here's running the density estimate and converting it into a tibble with the same coordinates as the data. (Are there better ways to do this?)

n <- 201 # arbitrary grid size, chosen to be 1 more than the range below
         #   so the breaks are at integers
smooth <- MASS::kde2d(df$R, df$B, lims = c(0, 200, 0, 200),
                      # h = c(20,20),  # could tweak bandwidth here
                      n = n)
# smooth$z is an n x n matrix: rows follow smooth$x (R), columns follow smooth$y (B)
df_smoothed <- smooth$z %>%
  as_tibble() %>%
  pivot_longer(cols = everything(), names_to = "col", values_to = "val") %>%
  mutate(R = rep(smooth$x, each = n), # pivot_longer goes row by row, so R holds
         B = rep(smooth$y, n))        #   within a row while B cycles

df_smoothed now holds every coordinate from 0 to 200 in the R and B dimensions, with the estimated density at each point in the val column. Because the grid cells are 1 x 1, each value can be read as the approximate probability of that combination. These add up to 1, or nearly so (99.6% in this case). I think the remaining smidgen is the probability mass at coordinates outside the specified range.

sum(df_smoothed$val)
#[1] 0.9960702
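
On a grid with a different spacing, the values would no longer sum to about 1 on their own, since kde2d returns densities rather than probabilities. A minimal sketch, assuming simulated data like the above and a coarser grid of 101 points (spacing 2); the names smooth_coarse, cell_area and prob are introduced here for illustration only:

```r
set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)

# Coarser grid: 101 points over 0..200, so neighbouring breaks are 2 apart
smooth_coarse <- MASS::kde2d(R, B, lims = c(0, 200, 0, 200), n = 101)

# kde2d returns density values; multiplying by the cell area gives an
# approximate probability for each grid cell
cell_area <- diff(smooth_coarse$x[1:2]) * diff(smooth_coarse$y[1:2])
prob <- smooth_coarse$z * cell_area

sum(smooth_coarse$z)  # raw densities: well short of 1 on this grid
sum(prob)             # after the area correction: close to 1
```

With n = 201 over 0..200 the cell area is exactly 1, which is why the correction can be skipped there.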

The chance of any particular combination is approximately the density value at that point, since each grid cell is 1 x 1. So the chance of R = 70 and B = 100 would be about 0.0035% (val = 0.0000345 below).

df_smoothed %>%
  filter(R == 70, B == 100)
## A tibble: 1 x 4
#  col        val     R     B
#  <chr>    <dbl> <int> <int>
#1 V101   0.0000345    70   100
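
Since the question asks for a function, the same lookup can be wrapped up to work directly on the kde2d result. This is only a sketch: prob_at is a hypothetical helper, not part of MASS or ggplot2, and it simply reads the nearest grid point rather than interpolating.

```r
set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)
smooth <- MASS::kde2d(R, B, lims = c(0, 200, 0, 200), n = 201)

# Hypothetical helper: approximate probability of the grid cell nearest
# to (r, b), i.e. density at that grid point times the cell area
prob_at <- function(kde, r, b) {
  i <- which.min(abs(kde$x - r))  # row index: x (R) dimension
  j <- which.min(abs(kde$y - b))  # column index: y (B) dimension
  cell_area <- diff(kde$x[1:2]) * diff(kde$y[1:2])
  kde$z[i, j] * cell_area
}

prob_at(smooth, r = 70, b = 100)
prob_at(smooth, r = 23, b = 30)   # the combination from the question
```

Values far out in the tails, like the second call, will be tiny and depend heavily on the bandwidth, so they are best treated as rough estimates.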

The chance of R between 50 and 100 and B between 50 and 100 would be 36.9%:

df_smoothed %>%
  filter(R %>% between(50, 100),
         B %>% between(50, 100)) %>%
  summarize(total_val = sum(val))
## A tibble: 1 x 1
#  total_val
#      <dbl>
#1     0.369
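
One way to sanity-check a range estimate like this is to compare it with the raw share of points that fall in the same box. A rough sketch on the same kind of simulated data; kde_estimate and empirical are illustrative names, and the two numbers should be close but not identical because of smoothing and the inclusive grid endpoints:

```r
set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)
smooth <- MASS::kde2d(R, B, lims = c(0, 200, 0, 200), n = 201)

# Smoothed estimate: sum the density over grid points inside the box
# (cells are 1 x 1 here, so no area factor is needed)
in_x <- smooth$x >= 50 & smooth$x <= 100
in_y <- smooth$y >= 50 & smooth$y <= 100
kde_estimate <- sum(smooth$z[in_x, in_y])

# Raw estimate: share of simulated points inside the same box
empirical <- mean(R >= 50 & R <= 100 & B >= 50 & B <= 100)

c(kde = kde_estimate, empirical = empirical)
```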

Here's how the smooth and the original data look together:

ggplot() +
  geom_tile(data = df_smoothed, aes(R, B, alpha = val), fill = "red") +
  geom_point(data = df %>% sample_n(500), aes(R, B), size = 0.2, alpha = 1/5)

Answer 2:

If it's only about plotting, one could simply turn off the contours and use geom = "raster", as suggested in the ggplot2 reference.

Thanks to @JonSpring for the sample data!

library(tidyverse)

df <- tibble(
  R = rlnorm(1E4, 0, 0.2) * 100,
  B = R * rnorm(1E4, 1, 0.2)
)

ggplot(df, aes(R,B)) +
  # after_stat() needs ggplot2 >= 3.3.0; older versions use stat(density)
  stat_density_2d(geom = "raster", aes(fill = after_stat(density)), contour = FALSE)

Created on 2019-12-28 by the reprex package (v0.3.0)

Source: https://stackoverflow.com/questions/59507974/calculate-probability-of-value-based-on-2d-density-plot-in-r

Tags

ggplot2

probability-density