How to reproduce smoothScatter's outlier plotting in ggplot?

前端 未结 2 942
终归单人心
终归单人心 2021-02-02 17:35

I am trying to get something like what the smoothScatter function does, only in ggplot. I have figured out everything except for plotting the N most sparse points.

2条回答
  •  终归单人心
    2021-02-02 17:51

    Here is a solution to calculate the sparseness of each (bivariate) observation in the data first (or respectively after the transformation of your choice is applied).

    Let's first calculate the most likeliest density value for each observation based on the density calculated from KernSmooth::bkde2D, which is called for convenience via grDevices:::.smoothScatterCalcDensity to make a suitable guess for binwidth if none is provided. This function is useful for other problems as well.

    densVals <- function(x, y = NULL, nbin = 128, bandwidth, range.x) {
      dat <- cbind(x, y)
      # limit dat to strictly finite values
      sel <- is.finite(x) & is.finite(y)
      dat.sel <- dat[sel, ]
      # density map with arbitrary graining along x and y
      map   <- grDevices:::.smoothScatterCalcDensity(dat.sel, nbin, bandwidth)
      map.x <- findInterval(dat.sel[, 1], map$x1)
      map.y <- findInterval(dat.sel[, 2], map$x2)
      # weighted mean of the fitted density map according to how close x and y are
      # to the arbitrary grain of the map
      den <- mapply(function(x, y) weighted.mean(x = c(
        map$fhat[x, y], map$fhat[x + 1, y + 1],
        map$fhat[x + 1, y], map$fhat[x, y + 1]), w = 1 / c(
        map$x1[x] + map$x2[y], map$x1[x + 1] + map$x2[y + 1],
        map$x1[x + 1] + map$x2[y], map$x1[x] + map$x2[y + 1])),
        map.x, map.y)
      # replace missing density estimates with NaN
      res <- rep(NaN, length(sel))
      res[sel] <- den
      res
    }
    

    I use the weighted mean as a (linear) approximation for the ‘true’ density value. Probably, a simple look-up would do as well.

    Here is the actual calculation.

    mydata <- data.frame(x = exp(rnorm(10000)), y = exp(rnorm(10000)))
    # the transformation applied will affect the local density estimate
    mydata$point_density <- densVals(log10(mydata$x), log10(mydata$y))
    

    Now, let's plot. (Building on Troy's answer.)

    library(ggplot2)
    
    ggplot(mydata, aes(x = x, y = y)) +
      stat_density2d(geom = "raster", aes(fill = ..density.. ^ 0.25), contour = FALSE) +
      scale_x_log10() + scale_y_log10() +
      scale_fill_gradientn(colours = colorRampPalette(c("white", blues9))(256)) +
      # select only the 100 sparesest points
      geom_point(data = dplyr::top_n(mydata, 100, -point_density), size = .5)
    

    (final plot) -- Sorry, not allowed to embed images yet.

    No overplotting required. :)

提交回复
热议问题