Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
missing <- sum(is.na(airquality[,Var]))
if (missing > 0) {
print(c(Var,missing))
}
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
You could also use apply
or colSums
on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
We can use map_df
with purrr.
library(mice)
library(purrr)
# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0
More succinct-: sum(is.na(x[1]))
That is
x[1]
Look at the first columnis.na()
true if it'sNA
sum()
TRUE
is1
,FALSE
is0
My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
as_shadow_upset() %>%
upset()
It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from package vignette that lists other interesting visualizations.
Another graphical alternative - plot_missing
function from excellent DataExplorer
package:
Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data)
.
Another function that would help you look at missing data would be df_status from funModeling library
library(funModeling)
iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
For one more graphical solution, visdat
package offers vis_miss
.
library(visdat)
vis_miss(airquality)
Very similar to Amelia
output with a small difference of giving %s on missings out of the box.
If you want to do it for particular column, then you can also use this
length(which(is.na(airquality[1])==T))
I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code will return the logic values of na
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
ExPanDaR’s package function prepare_missing_values_graph
can be used to explore panel data:
来源:https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame