I have a large dataframe (400000 x 50) that I want to visually inspect for structure and blanks/gaps.
Is there an existing library or ggplot2 function, that can spit
Have you tried dfviewr in lasagnar? The following reproduces the desired graphic for the 50 row x 10 column df.in in the package:
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)
dfviewr(df=df.in)
## also try:
##dfviewr(df=df.in, legend=FALSE)
##dfviewr(df=df.in, gridlines=FALSE)
To be fair, dfviewr didn't exist at the time of the question. To see some of the ideas that led to its development, and how to actually visualize 400,000 rows, see the for-loop at the very bottom. Don't be too foolhardy and run the function on df2.in (400,000 x 50):
## Do not run:
## system.time(dfviewr(df=df2.in, gridlines=FALSE)) ## 10 minutes before useRaster=TRUE
## 2 minutes after
Also, tabplot::tableplot() doesn't seem to support dates or characters:
library(tabplot)
tableplot(df.in)
produces:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented
and so we eliminate the character column (#9):
tableplot(df.in[,c(-9)])
which produces:
Error in UseMethod("as.hi") :
no applicable method for 'as.hi' applied to an object of class "c('POSIXct', 'POSIXt')"
so we eliminate the first column (Date) as well:
tableplot(df.in[,c(-1,-9)])
and finally get a plot.
For the 400,000 by 50 df2.in, without the Date or character columns, the image rendering was quite quick (6 seconds):
system.time(tableplot(df2.in[,c(-(1+seq(0,40,10)), -(9+seq(0,40,10))) ]))
I present first a baby example on 50 rows, then an example on the 400,000 rows.
For what it's worth, I second the comment by @cmbarbu: looking at 400K rows on a single plot is limited by a screen that has at best 2K pixels in height, so breaking the plot apart across pages can help prevent overplotting. I include an attempt at this by making a PDF document of 1000 plots/pages with 400 rows each.
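A complementary way to cope with the pixel limit is to aggregate rows into bins before plotting, so that each pixel row summarizes many data rows. This is not what the PDF approach does; it is a minimal base-R sketch of a swapped-in technique, and the names tall, bin_size and na_frac, as well as the simulated data, are assumptions made for illustration.

```r
## Simulated stand-in for a tall data.frame (sizes are illustrative)
set.seed(1)
n <- 400000
tall <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
tall[450:750, ] <- NA
tall[100000:120000, "b"] <- NA

## Assign each row to one of 1000 bins of 400 consecutive rows
bin_size <- 400
bin <- ceiling(seq_len(n) / bin_size)

## Fraction of missing values per bin and column (1000 x 3 matrix)
na_frac <- rowsum(is.na(tall) * 1, group = bin) / bin_size

## One pixel row per bin, first bin at the top:
## white = complete, black = entirely missing
image(t(na_frac)[, nrow(na_frac):1],
      col = gray(seq(1, 0, length.out = 64)),
      zlim = c(0, 1), axes = FALSE)
```

With 1000 bins the whole table fits on one screen, at the cost of hiding where exactly inside a bin the gaps fall; the paged PDF keeps that detail.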
I do not know of a function that will render the requested plot with a data.frame as input. My approach is to make a matrix mask of the data.frame and then use lasagna() from the lasagnar package on github. lasagna() is a wrapper for the function image( t(X)[, (nrow(X):1)] ) where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper adds the ability to toggle grid lines and add legends (legend=TRUE will invoke image.plot( t(X)[, (nrow(X):1)] ); however, in the example below I explicitly add a legend without using image.plot()).
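To make the row reordering concrete, here is a minimal base-R sketch of the same idea without lasagnar: build an integer mask and pass its reversed transpose to image(). The toy data.frame and the names toy, mask and flipped are assumptions for illustration, and this is not the package's actual implementation.

```r
## Toy data.frame with mixed column types and a few gaps (illustrative only)
set.seed(42)
toy <- data.frame(when  = as.Date("2020-01-01") + 0:9,
                  value = rnorm(10),
                  group = factor(rep(c("A", "B"), 5)),
                  label = rep(c("x", "y"), 5),
                  stringsAsFactors = FALSE)
toy[3:4, "value"] <- NA
toy[8, ] <- NA

## Integer mask: one code per column type, overwritten by 5 where data are missing
mask <- matrix(NA_integer_, nrow = nrow(toy), ncol = ncol(toy))
mask[, sapply(toy, inherits, what = "Date")] <- 1L
mask[, sapply(toy, is.factor)]    <- 2L
mask[, sapply(toy, is.character)] <- 3L
mask[, sapply(toy, is.numeric)]   <- 4L
mask[is.na(toy)] <- 5L

## image() puts matrix rows along the x axis, starting at the bottom left,
## so transpose and reverse to show data.frame row 1 at the top
flipped <- t(mask)[, nrow(mask):1]
image(flipped, col = c("red", "blue", "green", "white", "black"),
      zlim = c(1, 5), axes = FALSE)
```

Fixing zlim to c(1, 5) keeps the colour assignment stable even when one of the five codes happens not to occur in the data.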
library(fields)
library(colorspace)
library(lubridate)
library(devtools)
install_github("swihart/lasagnar")
library(lasagnar)
df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'),
                             by = '1 week'),
                    col1=rnorm(50),
                    col2=rnorm(50),
                    col3=rnorm(50),
                    col4=rnorm(50),
                    col5=as.factor(c("A","B")),
                    col6=as.factor(c("MS","PHD")),
                    col7=rnorm(50),
                    col8=(c("cherlene","randy")),
                    col9=rnorm(50),
                    stringsAsFactors=FALSE)
df.in[19:23 , 2:4 ] <- NA
df.in[c(7, 9), ] <- NA
df.in[2:30 , 4 ] <- NA
df.in[10 , 7 ] <- NA
df.in[14 , 6:10 ] <- NA
str(df.in)
mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))
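## red for dates (ymd() returns Date or POSIXct depending on the lubridate version)
mat.out[,sapply(df.in, function(x) inherits(x, c("Date","POSIXct")))] <- 1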
## blue for factors
mat.out[,sapply(df.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df.in)] <- 5
row.names(mat.out) <- 1:nrow(df.in)
## plain plot
lasagna(mat.out, col=c("red","blue","green","white","black"),
        cex=0.67, main="")
## same plot with a legend added below
lasagna(mat.out, col=c("red","blue","green","white","black"),
        cex=.67, main="")
legend("bottom", fill=c("red","blue","green","white","black"),
       legend=c("dates", "factors", "characters", "numeric", "NA"),
       horiz=TRUE, xpd=NA, inset=c(-.15), border="black")
## same plot without gridlines
lasagna(mat.out, col=c("red","blue","green","white","black"),
        cex=.67, main="", gridlines=FALSE)
legend("bottom", fill=c("red","blue","green","white","black"),
       legend=c("dates", "factors", "characters", "numeric", "NA"),
       horiz=TRUE, xpd=NA, inset=c(-.15), border="black")
df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'),
                              by = '1 week'),
                     col1=rnorm(400000),
                     col2=rnorm(400000),
                     col3=rnorm(400000),
                     col4=rnorm(400000),
                     col5=as.factor(c("A","B")),
                     col6=as.factor(c("MS","PHD")),
                     col7=rnorm(400000),
                     col8=(c("cherlene","randy")),
                     col9=rnorm(400000),
                     stringsAsFactors=FALSE)
df2.10[c(19:23), c(2:4) ] <- NA
df2.10[c(7, 9), ] <- NA
df2.10[c(2:30), 4 ] <- NA
df2.10[10 , 7 ] <- NA
df2.10[14 , c(6:10) ] <- NA
df2.10[c(450:750), ] <- NA
df2.10[c(399990:399999), ] <- NA
df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
str(df2.in)
mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))
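## red for dates (ymd() returns Date or POSIXct depending on the lubridate version)
mat.out[,sapply(df2.in, function(x) inherits(x, c("Date","POSIXct")))] <- 1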
## blue for factors
mat.out[,sapply(df2.in,is.factor)] <- 2
## green for characters
mat.out[,sapply(df2.in,is.character)] <- 3
## white for numeric
mat.out[,sapply(df2.in,is.numeric)] <- 4
## black for NA
mat.out[is.na(df2.in)] <- 5
row.names(mat.out) <- 1:nrow(df2.in)
pdf("pages1000.pdf")
system.time(
  for(i in 1:1000){
    lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
                  col=c("red","blue","green","white","black"), cex=1,
                  main=paste0("rows: ", (i-1)*400+1, " - ", (400*i)))
  }
)
dev.off()
The for-loop completed in 40 seconds on my machine, and the PDF was written very shortly thereafter. Now standardize the page size in the PDF viewer and page down, viewing pages/plots such as these:
Assuming that the blanks/gaps you are talking about are missing values (NA):
image(t(as.matrix(is.na(df))))
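The same is.na() mask can also be summarized numerically, which may help when the image is too dense to read. The following is only a sketch on simulated data; the names miss, all_missing and gaps are hypothetical.

```r
## Simulated data with one fully missing block and one partial gap
set.seed(7)
df <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))
df[200:260, ] <- NA
df[500:510, "y"] <- NA

## Logical matrix: TRUE where a value is missing
miss <- is.na(df)

## Number of missing values per column
colSums(miss)

## Runs of consecutive rows in which every column is missing
all_missing <- rowSums(miss) == ncol(df)
runs <- rle(all_missing)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
gaps <- data.frame(start = starts, end = ends)[runs$values, ]
gaps
```

The gaps table lists the first and last row of each fully missing block, which gives a quick way to decide which row ranges deserve a closer look in the plot.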
You may want to have a look at the tabplot package. With such a big data.frame it will take a while to load, but it should also correctly identify missing values. More info here.
Here's an image example using the diamonds data.frame.
EDIT
I just saw that you said your df has 50 columns. I've used tabplot on df's that size and find the resolution of information limited by the screen width. The row count can also be an issue, but I personally find more information is lost if the df is too wide. Thus, may I suggest you split it into 3 separate df's (for example using dplyr) and then run them through the tableplot() function of tabplot or similar.
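As a rough sketch of that column split, here is one way to do it in base R rather than dplyr (a deliberate substitution to avoid an extra dependency). The data.frame wide and the names n_parts, per_part and parts are made up for illustration.

```r
## Stand-in for a wide data.frame with 50 columns
set.seed(3)
wide <- as.data.frame(matrix(rnorm(20 * 50), nrow = 20, ncol = 50))

## Assign consecutive columns to 3 groups of (at most) equal size
n_parts <- 3
per_part <- ceiling(ncol(wide) / n_parts)
grp <- ceiling(seq_along(wide) / per_part)

## List of narrower data.frames, one per group of columns
parts <- lapply(split(seq_along(wide), grp),
                function(idx) wide[, idx, drop = FALSE])
sapply(parts, ncol)

## Each piece can then be plotted separately, for example with
## tableplot(parts[[1]]) if the tabplot package is installed
```

Keeping consecutive columns together preserves the original column order within each plot, which makes the pieces easier to compare side by side.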
Give this a shot.
require(Amelia)
data(freetrade)
missmap(freetrade)
It won't do the red, blue, and green colouring, but it gets you the grid. I'd also give the VIM package a shot, as it provides numerous options for visualizing missing data.
http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf