How to plot the difference between two density distributions

Question

I've trained a model to predict a certain variable. When I use this model to predict that value and compare the predictions to the actual values, I get the following two distributions.

The corresponding R Data Frame looks as follows:

x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted

These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.

Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.

I've tried plotting the difference predicted - actual, and also the difference of each value from the mean of its respective group. However, neither approach has produced the desired result. With the plotted distribution, it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?

Answer 1:

Let's treat the two distributions as data frames df_actual and df_predicted, then calculate:

# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)

Then find the relative % difference:

# df_diff already holds predicted - actual, so divide it by the actual values directly
x_diff <- mean(df_diff$x / df_actual$x) * 100
y_diff <- mean(df_diff$y / df_actual$y) * 100

This gives the mean percentage by which the predictions lie above (+) or below (-) the actual values, in x as well as in y. This is only my opinion; also see this thread for displaying and measuring the area between two distribution curves.

I hope this helps.
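
As a rough sketch of how per-pair relative differences could be turned into the "50% of the predictions are within -X% and +Y%" statement from the question, the quartiles can be read off with quantile(). The data here are simulated, and the names actual, predicted, rel_diff, bounds and inside are placeholders that do not appear in the question. It also assumes that element i of each vector refers to the same observation.

```r
# Simulated paired data (assumption: element i of both vectors is the same observation)
set.seed(1)
actual    <- runif(200, min = 2, max = 8)
predicted <- actual * (1 + rnorm(200, mean = 0, sd = 0.1))

# Relative difference of each prediction, in percent of the actual value
rel_diff <- (predicted - actual) / actual * 100

# The middle 50% of predictions lie between the 25th and 75th percentiles
bounds <- quantile(rel_diff, probs = c(0.25, 0.75))
print(bounds)

# Share of predictions inside those bounds (about 0.5 by construction)
inside <- mean(rel_diff >= bounds[1] & rel_diff <= bounds[2])
print(inside)
```

The two values in bounds play the roles of -X% and +Y%. Other coverage levels follow by changing probs, for example c(0.05, 0.95) for the middle 90%.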

Answer 2:

ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs; otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:

x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...

you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
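
A minimal sketch of that merge-then-subtract approach, assuming the long layout shown above with x as the pairing key. The values are simulated, and the names truth, paired and dens are hypothetical. Base graphics are used for the density plot to keep the example free of extra packages.

```r
# Simulated long-format data in the layout shown above (values are made up)
set.seed(42)
n <- 100
truth <- rnorm(n, mean = 20, sd = 5)
df <- data.frame(
  x    = rep(seq_len(n) - 1, times = 2),
  y    = c(truth, truth + rnorm(n, mean = 0, sd = 1.5)),
  type = rep(c("actual", "predicted"), each = n)
)

# Pair each actual value with its prediction via the shared key x;
# merge() suffixes the clashing columns as y.x (actual) and y.y (predicted)
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ], by = "x")

# Paired differences, actual minus predicted as in the answer
paired$diff <- paired$y.x - paired$y.y

# Kernel density estimate of the differences, with the quartiles marked
dens <- density(paired$diff)
plot(dens, main = "Density of actual - predicted", xlab = "difference")
abline(v = quantile(paired$diff, probs = c(0.25, 0.75)), lty = 2)
```

The dashed lines enclose the middle 50% of the differences. With ggplot2, the same curve could be drawn with ggplot(paired, aes(diff)) + geom_density().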

Answer 3:

To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test.
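
For illustration, a two-sample call to ks.test might look like the sketch below. The samples are simulated stand-ins for the actual and predicted values, and the variable names are placeholders. Note that the test compares the two overall distributions and ignores which prediction belongs to which actual value, so it complements a paired-difference analysis and does not replace it.

```r
# Simulated samples (made-up values standing in for actual and predicted)
set.seed(7)
actual    <- rnorm(300, mean = 5, sd = 1)
predicted <- rnorm(300, mean = 5.2, sd = 1.1)

# Two-sample Kolmogorov-Smirnov test: the null hypothesis is that both
# samples were drawn from the same continuous distribution
res <- ks.test(actual, predicted)
print(res$statistic)  # D: largest vertical gap between the two empirical CDFs
print(res$p.value)    # a small p-value suggests the distributions differ
```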

Source: https://stackoverflow.com/questions/44066444/how-to-plot-the-difference-between-two-density-distributions

Tags

ggplot2

density-plot