Integrating ggplot2 with user-defined stat_function()

感情败类 2021-01-07 17:30

I'm trying to superimpose a mixed distribution plot with a plot of identified component distributions, using ggplot2 package and a us

1 answer
  • 2021-01-07 18:02

    I finally figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to this question (which I hadn't seen when I posted mine): Any suggestions for how I can plot mixEM type data using ggplot2.

    My solution is a little different, though. On one hand, I used @Spacedman's approach of calling stat_function(), the same idea I tried in my original version. I like it better than the alternative, which is more flexible but seems a bit too complex. On the other hand, as in @jlhoward's approach, I simplified the parameter passing. I also made some visual improvements, such as automatically selecting well-differentiated colors so that the component distributions are easier to tell apart. For my EDA, I've refactored this code as an R module.

    There was still one issue I was trying to figure out: why the component distribution curves sat below the expected density plot. Any advice on this will be much appreciated!

    UPDATE: I've figured out the scaling issue and updated the code and the figure accordingly. The histogram's y values need to be multiplied by the bin width (0.5 in this case) to account for the number of observations per bin.
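
    To make the scaling concrete: for a bin with count c, n observations in total and bin width h, the histogram density is c / (n * h), so h * density = c / n, the fraction of observations that fall in the bin. Here is a minimal base-R sketch that checks this relation on the same data. The break points are my own assumption for illustration; ggplot2 may place its bins differently.

    ```r
    # Sketch: check that binwidth * density equals the per-bin proportion.
    x <- faithful$waiting          # same built-in data as in the answer
    h <- 0.5                       # bin width used in the answer

    # Hypothetical break points centred on whole and half minutes;
    # ggplot2 may place its bins differently.
    breaks <- seq(min(x) - h / 2, max(x) + h / 2, by = h)
    hb <- hist(x, breaks = breaks, plot = FALSE)

    prop   <- hb$counts / length(x)  # fraction of observations per bin
    scaled <- h * hb$density         # what `0.5 * density` puts on the y axis

    isTRUE(all.equal(scaled, prop))  # rescaled density vs. bin proportion
    isTRUE(all.equal(sum(prop), 1))  # proportions should sum to 1
    ```

    If the waiting times are recorded as whole minutes, as they appear to be in faithful, every other half-minute bin is empty. That would explain why the unscaled density bars stood above the component curves.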

    [Figure: histogram of the data with the component distribution curves overlaid; image not preserved]

    Here's the complete reworked reproducible solution:

    library(ggplot2)
    library(RColorBrewer)
    library(mixtools)
    
    NUM_COMPONENTS <- 2
    
    set.seed(12345) # for reproducibility
    
    data <- faithful$waiting # use R built-in data
    
    # extract 'k' components from mixed distribution 'data'
    mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                            maxit = 100, epsilon = 0.01)
    summary(mix.info)
    
    numComponents <- length(mix.info$sigma)
    message("Extracted number of component distributions: ",
            numComponents)
    
    calc.components <- function(x, mix, comp.number) {
      mix$lambda[comp.number] *
        dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
    }
    
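    BINWIDTH <- 0.5
    
    # scale the density by the bin width (see the UPDATE above);
    # after_stat() needs ggplot2 >= 3.3.0, older versions use ..density..
    g <- ggplot(data.frame(x = data)) +
      geom_histogram(aes(x = x, y = BINWIDTH * after_stat(density)),
                     fill = "white", color = "black", binwidth = BINWIDTH)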
    
    # we could select needed number of colors randomly:
    #DISTRIB_COLORS <- sample(colors(), numComponents)
    
    # or, better, use a palette with more color differentiation:
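    # (brewer.pal() needs n >= 3, so request at least 3 and keep the first k)
    DISTRIB_COLORS <- brewer.pal(max(3, numComponents),
                                 "Set1")[seq_len(numComponents)]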
    
    # one stat_function() layer per component distribution
    distComps <- lapply(seq_len(numComponents), function(i)
      stat_function(fun = calc.components,
                    args = list(mix = mix.info, comp.number = i),
                    geom = "line", # use alpha=.5 for "polygon"
                    size = 2,      # 'linewidth' in ggplot2 >= 3.4.0
                    color = DISTRIB_COLORS[i]))
    print(g + distComps)
    