Question
I want to predict values for the Pop_avg field in my unsurveyed areas based on my surveyed areas. I am using randomForest, following a suggestion on my earlier question.
My surveyed areas:
> surveyed <- read.csv("summer_surveyed.csv", header = T)
> surveyed_1 <- surveyed[, -c(1,2,3,5,6,7,9,10,11,12,13,15)]
> head(surveyed_1, n=1)
VEGETATION Pop_avg Acres_1
1 Acer rubrum-Vaccinium corymbosum-Amelanchier spp. 0 27.68884
My unsurveyed areas:
> unsurveyed <- read.csv("summer_unsurveyed.csv", header = T)
> unsurveyed_1 <- unsurveyed[, -c(2,3,5,6,7,9,10,11,12,13,15)]
> head(unsurveyed_1, n=1)
OBJECTID VEGETATION Pop_avg Acres_1
13 Acer rubrum-Vaccinium corymbosum-Amelanchier spp. 0 4.787381
I then removed the rows of unsurveyed_1 that contained vegetation types not found in surveyed_1, and dropped the unused factor levels.
> setdiff(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION)
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Typha (angustifolia, latifolia) - (Schoenoplectus spp.) Eastern Herbaceous Vegetation", ]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Acer rubrum- Nyssa sylvatica saturated forest alliance",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Prunus serotina",]
> unsurveyed_drop <- droplevels(unsurveyed_1)
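The same filtering can be written without naming each vegetation type by hand. The following is a minimal sketch on made-up toy data; the data values are assumptions for illustration, and re-levelling the factor to match the training data is a common precaution rather than something the original post did.

```r
# Toy stand-ins for the real data frames (all values are made up).
surveyed_1 <- data.frame(
  VEGETATION = factor(c("Acer rubrum", "Pinus rigida", "Acer rubrum")),
  Pop_avg    = c(0, 0.333, 1),
  Acres_1    = c(27.7, 12.1, 5.3)
)
unsurveyed_1 <- data.frame(
  OBJECTID   = c(13, 14, 15),
  VEGETATION = factor(c("Acer rubrum", "Prunus serotina", "Pinus rigida")),
  Acres_1    = c(4.8, 2.2, 9.0)
)

# Keep only rows whose vegetation type also occurs in the surveyed data.
keep <- unsurveyed_1$VEGETATION %in% surveyed_1$VEGETATION
unsurveyed_drop <- droplevels(unsurveyed_1[keep, ])

# Precaution: give the factor the same level set as the training data.
unsurveyed_drop$VEGETATION <- factor(unsurveyed_drop$VEGETATION,
                                     levels = levels(surveyed_1$VEGETATION))
```

Here setdiff() is still useful beforehand to see which vegetation types will be dropped.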
Next I ran randomForest and predict, and added the output to unsurveyed_drop:
> surveyed_pred <- randomForest(Pop_avg ~
+ VEGETATION+Acres_1,
+ data = surveyed_1,
+ importance = TRUE)
> summer_results <- predict(surveyed_pred, unsurveyed_drop,type="response",
+ norm.votes=TRUE, predict.all=F, proximity=FALSE, nodes=FALSE)
> summer_all <- cbind(unsurveyed_drop, summer_results)
> head(summer_all, n=1)
OBJECTID VEGETATION Pop_avg Acres_1 summer_results
13 Acer rubrum-Vaccinium corymbosum-Amelanchier spp. 0 4.787381 0.120077
I would like to estimate values for the column Pop_avg in summer_all. I am assuming that I need to use the proportions generated in summer_results, but I'm unsure how I would do this. Thanks for any help or further suggestions.
More information:
I am looking to get predicted count data for Pop_avg based on VEGETATION and Acres_1. I am not sure if or how to use the probabilities in my output summer_results to achieve this, or whether I need to alter my model or try a different method.
E2
The reason I didn't think the output was right is that Pop_avg, which is Population divided by 3, ranges from .333 upward where deer were seen, while Population ranges from 1 upward (i.e. 10, 20...). When I ran the model to predict either one, I got similar numbers ranging from .9xx to 2 or 3.xxx, especially when I ran it with Population, which didn't seem right.
DATA:
summer_surveyed_sample
summer_unsurveyed_sample
Answer 1:
My problem lay in my training model. I figured out that I needed to use a subset of my surveyed data where Population > 0 to get more accurate predictions.
> surveyed_1 <- surveyed_1[surveyed_1$Population > 0, ]
> surveyed_drop <- droplevels(surveyed_1)
> surveyed_pred <- randomForest(Population ~ VEGETATION + Acres_1,
+ data = surveyed_drop,
+ importance = TRUE)
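To show how the retrained model feeds back into the prediction step, here is a hedged end-to-end sketch on synthetic data. The data frames and the column names Population_pred and Pop_avg_pred are assumptions made for illustration; only the model formula, the Population > 0 subset, and the Pop_avg = Population / 3 relationship come from the post. Because the response is numeric, randomForest fits a regression forest and predict() returns predicted values directly, not probabilities.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the surveyed data; all values are made up.
surveyed <- data.frame(
  VEGETATION = factor(sample(c("A", "B", "C"), 60, replace = TRUE)),
  Acres_1    = runif(60, 1, 30),
  Population = rpois(60, 2)
)

# Train only on rows where deer were seen, as in the answer.
train <- droplevels(surveyed[surveyed$Population > 0, ])
fit <- randomForest(Population ~ VEGETATION + Acres_1, data = train)

# Synthetic unsurveyed rows, using the same factor levels as the training data.
unsurveyed <- data.frame(
  VEGETATION = factor(c("A", "C"), levels = levels(train$VEGETATION)),
  Acres_1    = c(4.8, 12.0)
)

# Numeric response, so predict() gives predicted Population values.
unsurveyed$Population_pred <- as.numeric(predict(fit, newdata = unsurveyed))

# Pop_avg is Population divided by 3, per the question.
unsurveyed$Pop_avg_pred <- unsurveyed$Population_pred / 3
```

Whether dropping the zero-count rows is appropriate depends on the goal: a model trained this way estimates counts given that deer are present, not whether they are present at all.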
Source: https://stackoverflow.com/questions/34864182/predict-estimate-values-using-randomforest-in-r