问题
I'm running a binomial GLM to predict the probability of a species occurrence, where I am training on one dataset and testing the model on another dataset:
TrainingData<-read.csv("TrainingData.csv")[,-1]
TrainingData[,1]<-as.factor(TrainingData[,1])
TrainingData[,4]<-as.factor(TrainingData[,4])
TestData<-read.csv("TestData.csv")[,-1]
TestData[,1]<-as.factor(TestData[,1])
TestData[,4]<-as.factor(TestData[,4])
mod<-glm(presence~var1+var2+var3, family=binomial, data=TrainingData)
probs=predict(mod, TestData, type="response")
What is the best way (or function) to create response curves to plot the relationship between the probability of presence and each predictor variable?
Thanks!
回答1:
The marginal probabilities can be calculated from predict.glm with type = "terms",
since each of the terms are calculated with the remaining variables set at their mean values.
This is converted back to a probabilty scale with plogis(term + intercept).
Second, because your data set contains and combination of continuous values and factors for your predictor variables, separate plots were made for each type and combined with grid.arrange.
Although this answers your question directly based on the glm model you presented, I would still recommend examining the spatial autocorrelation of both your predictor and response variables, as this could have a likely impact on your final model.
library(reshape2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
TrainingData <- read.csv("~/Downloads/TrainingData.csv", header = TRUE)
TrainingData[['presence']] <- as.factor(TrainingData[['presence']])
TrainingData[['var3']] <- as.factor(TrainingData[['var3']])
TrainingData[['X']] <- NULL # Not used in the model
TestData <- read.csv("~/Downloads/TestData.csv", header = TRUE)
TestData[['presence']] <- as.factor(TestData[['presence']])
TestData[['var3']] <- as.factor(TestData[['var3']])
TestData[['X']] <- NULL
Presence/Absence model
mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = TrainingData)
Get predicted probabilities for each of the centered variables (i.e remaining variables set to their mean).
mod_terms <- predict(mod, newdata = TestData, type = "terms")
mod_prob <- data.frame(idx = 1:nrow(TestData), plogis(mod_terms +
attr(mod_terms, "constant")))
mod_probg <- mod_prob %>% gather(variable, probability, -idx)
Melt the Test data into long format
TestData['idx'] <- 1:nrow(TestData) # Add index to data
TestData[['X']] <- NULL # Drop the X variable since it was not used in the model
data_long <- melt(TestData, id = c("presence","idx"))
data_long[['value']] <- as.numeric(data_df[['value']])
Merge Testdata with predictions and separate the data containing continuous (var1 and var2) and factors (var3).
# Merge Testdata with predictions
data_df <- merge(data_long, mod_probg, by = c("idx", "variable"))
data_df <- data_df %>% arrange(variable, value)
data_continuous <- data_df %>% filter(., variable != "var3") %>%
transform(value = as.numeric(value)) %>% arrange(variable, value)
data_factor <- data_df %>% filter(., variable == "var3") %>%
transform(value = as.factor(value))%>%
arrange(idx)
ggplot output
g_continuous <- ggplot(data_continuous, aes(x = value, y = probability)) + geom_point()+
facet_wrap(~variable, scales = "free_x")
g_factor <- ggplot(data = data_factor, aes(x = value, y = probability)) + geom_boxplot() +
facet_wrap(~variable)
grid.arrange(g_continuous, g_factor, nrow = 1)
来源:https://stackoverflow.com/questions/32915628/best-way-to-create-response-curves-for-a-glm-species-distribution-model-in-r