Inspecting and visualizing gaps/blanks and structure in large dataframes


遥遥无期

I have a large dataframe (400000 x 50) that I want to visually inspect for structure and blanks/gaps.

Is there an existing library or ggplot2 function, that can spit

For the interested reader...

I present first a baby example on 50 rows, then an example on the 400,000 rows.

For what it's worth, I second the comment by @cmbarbu: viewing 400K rows on a single plot is limited by a screen that has at best about 2K pixels of height, so splitting the plot across pages helps prevent overplotting. Below I attempt this by writing a PDF document of 1,000 plots/pages with 400 rows each.
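As a complement to paging, and not part of the approach described in this answer, one hedged option is to aggregate rows into blocks so that each block maps to roughly one pixel row. The sketch below uses simulated data and an assumed block size (`rows_per_block`, chosen so that 400,000 rows collapse to 2,000 blocks), then plots the fraction of missing cells per block and column with base R only.

```r
set.seed(1)
n <- 400000
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)   # simulated stand-in data
x[450:750, ] <- NA                              # a block of fully missing rows

rows_per_block <- 200                           # assumption: 400,000 / 200 = 2,000 blocks
block <- ceiling(seq_len(n) / rows_per_block)   # block id for every row

# rowsum() adds up the 0/1 missingness indicators within each block;
# dividing by the block sizes turns the counts into fractions.
na_frac <- rowsum(is.na(x) * 1, block) / as.vector(table(block))

# Same orientation trick as image(t(X)[, nrow(X):1]): first block at the top.
image(t(na_frac)[, nrow(na_frac):1],
      col = gray(seq(1, 0, length.out = 20)), axes = FALSE,
      main = "Fraction of NA per block of 200 rows")
```

Darker cells mark blocks with more missing values. A partially missing block shows up as a shade of gray, so short gaps are not lost entirely, although their exact row positions are.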

I do not know of a function that renders the requested plot directly from a data.frame. My approach builds a matrix mask of the data.frame and then uses lasagna() from the lasagnar package on GitHub. lasagna() is a wrapper for image( t(X)[, (nrow(X):1)] ), where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper can toggle grid lines and add legends (legend=TRUE invokes image.plot( t(X)[, (nrow(X):1)] ); in the example below, however, I add a legend explicitly without image.plot()).
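To see why the transpose and the column reversal are needed, here is a minimal base-R sketch on a made-up 3 x 2 matrix (the names `X` and `Z` and the toy values are only for illustration). image() draws the rows of its input along the x-axis and the columns from bottom to top, so a raw matrix would appear rotated relative to how it prints.

```r
# A tiny matrix: rows are records, columns are variables.
X <- matrix(1:6, nrow = 3, ncol = 2,
            dimnames = list(paste0("row", 1:3), c("a", "b")))

# t(X) puts the variables along the x-axis; reversing the columns of the
# transposed matrix puts the first record at the top of the plot.
Z <- t(X)[, nrow(X):1]

print(Z)   # columns now run row3, row2, row1 (drawn bottom to top)
image(Z, axes = FALSE, col = gray.colors(6))
```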

libraries for the task

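library(fields)      # image.plot(), invoked by lasagna(legend=TRUE)
library(colorspace)
library(lubridate)   # ymd(), is.Date(), is.POSIXct()
library(devtools)    # install_github()
install_github("swihart/lasagnar")   # one-time install from GitHub
library(lasagnar)    # lasagna(), lasagna_plain()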

create a sample dataframe of 50 rows (baby example before 400K example)

df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(50),
           col2=rnorm(50),
           col3=rnorm(50),
           col4=rnorm(50),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(50),
           col8=(c("cherlene","randy")),
           col9=rnorm(50),
           stringsAsFactors=FALSE)

induce missingness

df.in[19:23  , 2:4  ] <- NA
df.in[c(7, 9),      ] <- NA
df.in[2:30   , 4    ] <- NA
df.in[10     , 7    ] <- NA
df.in[14     , 6:10 ] <- NA

check structure

str(df.in)

prep the mask matrix

mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))

then cycle through columns for types; apply is.na() at the end

## red for dates (ymd() returns Date in current lubridate, so test for both classes)
mat.out[,sapply(df.in, function(x) is.Date(x) || is.POSIXct(x))] <- 1
## blue for factors
mat.out[,sapply(df.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df.in)] <- 5

row names might be nice for tracing back to the original data

row.names(mat.out) <- 1:nrow(df.in)
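The mask-building steps above can be wrapped in a small helper so that they are written only once. This is a sketch, not part of lasagnar: `type_mask()` is a hypothetical name, and it uses base-R `inherits()` for the date check so that it runs without lubridate. Columns of any other type (for example logical) are left as NA unless they contain missing values.

```r
# Hypothetical helper: encode column types as 1-4 and missing cells as 5.
type_mask <- function(df) {
  mask <- matrix(NA_integer_, nrow = nrow(df), ncol = ncol(df),
                 dimnames = list(seq_len(nrow(df)), names(df)))
  is_date <- vapply(df, function(x) inherits(x, c("Date", "POSIXt")), logical(1))
  mask[, is_date] <- 1L                                   # dates
  mask[, vapply(df, is.factor, logical(1))] <- 2L         # factors
  mask[, vapply(df, is.character, logical(1))] <- 3L      # characters
  mask[, vapply(df, is.numeric, logical(1))] <- 4L        # numerics
  mask[is.na(df)] <- 5L                                   # NA overrides the type code
  mask
}

# Toy input with one column of each type and a single missing value.
toy <- data.frame(when = as.Date("2012-04-07") + 0:3,
                  grp  = factor(c("A", "B", "A", "B")),
                  who  = c("cherlene", "randy", "cherlene", "randy"),
                  val  = c(0.1, NA, 0.3, 0.4),
                  stringsAsFactors = FALSE)
m <- type_mask(toy)
print(m)
```

The result can be passed to lasagna() in place of mat.out, with the same five-colour vector.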

render { lasagna(X) is a wrapper for image( t(X)[, (nrow(X):1)] ) }

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=0.67, main="")


legends are possible:

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="")
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")


turn gridlines off with gridlines=FALSE

lasagna(mat.out, col=c("red","blue","green","white","black"), 
        cex=.67, main="", gridlines=FALSE)
legend("bottom", fill=c("red","blue","green","white","black"),
        legend=c("dates", "factors", "characters", "numeric", "NA"), 
        horiz=TRUE, xpd=NA, inset=c(-.15), border="black")


Let's do an example of OP data size: 400,000 rows x 50 cols

create a sample dataframe

df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                    by = '1 week'),
           col1=rnorm(400000),
           col2=rnorm(400000),
           col3=rnorm(400000),
           col4=rnorm(400000),
           col5=as.factor(c("A","B")),
           col6=as.factor(c("MS","PHD")),
           col7=rnorm(400000),
           col8=(c("cherlene","randy")),
           col9=rnorm(400000),
           stringsAsFactors=FALSE)

induce missingness

df2.10[c(19:23), c(2:4)  ] <- NA
df2.10[c(7,  9),         ] <- NA
df2.10[c(2:30), 4        ] <- NA
df2.10[10     , 7        ] <- NA
df2.10[14     , c(6:10)  ] <- NA    
df2.10[c(450:750), ] <- NA
df2.10[c(399990:399999), ] <- NA

cbind into 50 column wide df; check structure

df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
str(df2.in)

prep the mask matrix

mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))

then cycle through columns for types; apply is.na() at the end

## red for dates (ymd() returns Date in current lubridate, so test for both classes)
mat.out[,sapply(df2.in, function(x) is.Date(x) || is.POSIXct(x))] <- 1
## blue for factors
mat.out[,sapply(df2.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df2.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df2.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df2.in)] <- 5

row names might be nice for tracing back to the original data

row.names(mat.out) <- 1:nrow(df2.in)

render { lasagna_plain(X) has no gridlines or rownames }

pdf("pages1000.pdf")
  system.time(
    for(i in 1:1000){
        lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
                      col=c("red","blue","green","white","black"), cex=1, 
                      main=paste0("rows: ", (i-1)*400+1,  " - ",  (400*i)))
    }
  )
dev.off()

The for-loop completed in about 40 seconds on my machine, and the PDF was written very shortly thereafter. Now standardize the page size in the PDF viewer and page down through the plots, each of which shows 400 rows.
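The loop above hard-codes 1,000 pages of 400 rows, which works because 400,000 is an exact multiple of 400. For other sizes, the page boundaries can be computed up front. The helper below is a sketch; `page_ranges()` is a hypothetical name, not a lasagnar function.

```r
# Hypothetical helper: first and last row of every page, with a shorter
# final page when n_rows is not a multiple of per_page.
page_ranges <- function(n_rows, per_page = 400) {
  first <- seq(1, n_rows, by = per_page)
  last  <- pmin(first + per_page - 1, n_rows)
  data.frame(page = seq_along(first), first = first, last = last)
}

pages <- page_ranges(400000, 400)
nrow(pages)      # number of pages for the 400,000-row example
tail(pages, 1)   # row range covered by the final page

short <- page_ranges(1000, 400)   # uneven case: the last page holds 200 rows
print(short)
```

Inside the plotting loop, `mat.out[pages$first[i]:pages$last[i], ]` would then select the rows for page i.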


爱一瞬间的悲伤

2021-01-02 22:44

Assuming that the blanks/gaps you are talking about are missing values (NA), a one-liner is enough:

image(t(as.matrix(is.na(df))))
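For a version that can be run as-is, here is a small sketch on a made-up data frame (`toy` and its values are assumptions for illustration). It flips the matrix so the first row is drawn at the top, labels the columns, and also prints the number of missing values per column.

```r
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c("x", "y", NA, "z"),
                  c = c(NA, NA, 0.5, 1),
                  stringsAsFactors = FALSE)

miss <- is.na(toy)   # logical matrix, same shape as the data frame

# 1 * miss converts TRUE/FALSE to 1/0; white = present, black = missing.
image(x = seq_len(ncol(miss)), y = seq_len(nrow(miss)),
      z = t(1 * miss)[, nrow(miss):1, drop = FALSE],
      col = c("white", "black"), axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(miss)), labels = colnames(miss))
box()

colSums(miss)   # count of missing values in each column
```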

忘掉有多难

2021-01-02 22:48

You may want to have a look at the tabplot package. With such a big data.frame it will take a while to load, but it should also correctly identify missing values. More info here.

Here's an image example using the diamonds data.frame.

EDIT

I just saw that you said your df has 50 columns. I've used tabplot on data frames of that size and find that the screen width limits how much information can be resolved. The row count can also be an issue, but I personally find that more information is lost when the df is too wide. So may I suggest splitting it into three separate data frames (for example with dplyr) and then running each one through the tableplot() function of tabplot or a similar tool.
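The split itself does not need dplyr. As a sketch in base R (the helper name `split_columns()` and the simulated `wide` data frame are assumptions for illustration), the columns can be divided into consecutive groups like this:

```r
# Hypothetical helper: split a data frame into up to n_groups data frames
# holding consecutive blocks of columns.
split_columns <- function(df, n_groups = 3) {
  size  <- ceiling(ncol(df) / n_groups)        # columns per group
  group <- ceiling(seq_along(df) / size)       # group id for each column
  lapply(split(seq_along(df), group),
         function(idx) df[, idx, drop = FALSE])
}

set.seed(42)
wide  <- as.data.frame(matrix(rnorm(10 * 50), nrow = 10, ncol = 50))
parts <- split_columns(wide, 3)

vapply(parts, ncol, integer(1))   # number of columns in each piece
```

Each element of `parts` keeps all rows, so every piece can be plotted on its own.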

轻奢々

2021-01-02 22:51
Give this a shot.
```
library(Amelia)     # provides missmap() and the freetrade example data
data(freetrade)
missmap(freetrade)  # grid of missing vs. observed cells
```
It won't do the red, blue, and green, but it gets you your grid. I'd also give the VIM package a shot, as it provides numerous options for visualizing missing data.

http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf