Elegant way to report missing values in a data.frame

问题

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

回答1:

Just use sapply

> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

You could also use apply or colSums on the matrix created by is.na()

> apply(is.na(airquality),2,sum)
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
> colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

回答2:

We can use map_df with purrr.

library(mice)
library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0

回答3:

My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

These examples are from package vignette that lists other interesting visualizations.

回答4:

summary(airquality)

already gives you this information

The VIM packages also offers some nice missing data plot for data.frame

library("VIM")
aggr(airquality)

回答5:

More succinct-: sum(is.na(x[1]))

That is

x[1] Look at the first column
is.na() true if it's NA
sum() TRUE is 1, FALSE is 0

回答6:

Another graphical alternative - plot_missing function from excellent DataExplorer package:

Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data).

回答7:

Another function that would help you look at missing data would be df_status from funModeling library

library(funModeling)

iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.

df_status(iris.2)

This will give you the number and percentage of NAs in each column.

回答8:

For one more graphical solution, visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

Very similar to Amelia output with a small difference of giving %s on missings out of the box.

回答9:

I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

You can also run the following code will return the logic values of na

row.has.na <- apply(training, 1, function(x){any(is.na(x))})

回答10:

Another graphical and interactive way is to use is.na10 function from heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1, 
          showticklabels = c(T,F),
            k_col =3, k_row = 3,
            margins = c(55, 30), 
            colors = c("grey80", "grey20"))

Probably won't work well with large datasets..

回答11:

If you want to do it for particular column, then you can also use this

length(which(is.na(airquality[1])==T))

回答12:

ExPanDaR’s package function prepare_missing_values_graph can be used to explore panel data:

来源：https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame

标签

dataframe

missing-data