Inspecting and visualizing gaps/blanks and structure in large dataframes

遥遥无期 2021-01-02 21:51

I have a large dataframe (400000 x 50) that I want to visually inspect for structure and blanks/gaps.

Is there an existing library or ggplot2 function, that can spit

4 answers
  • 2021-01-02 22:33

    Have you tried dfviewr() in the lasagnar package? The following reproduces the desired graphic for df.in, the 50-row by 10-column data frame included in the package:

    library(devtools)
    install_github("swihart/lasagnar")
    library(lasagnar)   
    dfviewr(df=df.in)
    ## also try:
    ##dfviewr(df=df.in, legend=FALSE)
    ##dfviewr(df=df.in, gridlines=FALSE)
    

    [image: dfviewr() output for df.in]

    To be fair, dfviewr did not exist when the question was asked. To see some of the ideas that led to its development, and how to actually visualize 400,000 rows, see the for-loop at the very bottom. Running the function directly on df2.in (400,000 x 50) is not advisable:

    ## Do not run:
    ## system.time(dfviewr(df=df2.in, gridlines=FALSE)) ## 10 minutes before useRaster=TRUE                                          
                                                        ##  2 minutes after
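
    The timing comment above credits useRaster=TRUE for the speed-up. As a rough base-R sketch (not dfviewr's actual internals; the matrix `m`, its size and the two-colour coding are made up for illustration), image() can draw the grid as a single bitmap when useRaster=TRUE instead of one rectangle per cell, and the two calls can be timed side by side:

    ```r
    ## Hypothetical 20,000 x 50 mask: 1 = observed, 2 = missing
    set.seed(1)
    m <- matrix(sample(1:2, 20000 * 50, replace = TRUE, prob = c(0.9, 0.1)),
                nrow = 20000, ncol = 50)
    ## Transpose and reverse so that row 1 of m is drawn at the top
    z <- t(m)[, nrow(m):1]
    pdf(NULL)  # null device, so the sketch runs without opening a window
    t_rect   <- system.time(image(z, col = c("white", "black"), useRaster = FALSE))
    t_raster <- system.time(image(z, col = c("white", "black"), useRaster = TRUE))
    dev.off()
    ## Compare the two timings; actual numbers depend on machine and device
    rbind(rectangles = t_rect, raster = t_raster)
    ```

    useRaster=TRUE only applies to regularly spaced grids, which is the case for a plain matrix like this one.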
    

    Also, tabplot::tableplot() doesn't seem to support dates or characters:

    library(tabplot)
    tableplot(df.in)
    

    produces:

    Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented

    and so we eliminate the character column (#9):

    tableplot(df.in[,c(-9)])
    

    which produces:

    Error in UseMethod("as.hi") : no applicable method for 'as.hi' applied to an object of class "c('POSIXct', 'POSIXt')"

    so we eliminate the first column (Date) as well:

    tableplot(df.in[,c(-1,-9)])
    

    and get

    [image: tableplot() output for df.in without the date and character columns]

    And for the 400,000 by 50 df2.in without the Date or character columns, the image rendering was quite quick (6 seconds):

    system.time(tableplot(df2.in[,c(-(1+seq(0,40,10)), -(9+seq(0,40,10))) ]))
    

    [image: tableplot() output for df2.in without the date and character columns]
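
    Dropping columns by position is brittle once the column layout changes. A minimal base-R sketch of the same idea selects by type instead. `keep_for_tableplot` and `toy` are hypothetical names, and the only assumption (taken from the two errors above) is that character and date-time columns must go while numeric and factor columns can stay:

    ```r
    ## Hypothetical helper: keep only numeric and factor columns
    keep_for_tableplot <- function(df) {
      ok <- vapply(df, function(x) is.numeric(x) || is.factor(x), logical(1))
      df[, ok, drop = FALSE]
    }

    ## Made-up data frame with one column of each type used in this answer
    toy <- data.frame(when = as.POSIXct("2012-04-07", tz = "UTC") + 0:3,
                      x    = c(1.5, NA, 3, 4),
                      f    = factor(c("A", "B", "A", "B")),
                      s    = c("cherlene", "randy", "cherlene", "randy"),
                      stringsAsFactors = FALSE)
    names(keep_for_tableplot(toy))
    ```

    The filtered data frame could then be handed to tableplot() in place of the index-based selection.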

    For the interested reader...

    I present first a baby example on 50 rows, then an example on the 400,000 rows.

    For what it's worth, I second the comment by @cmbarbu: looking at 400K rows on a single plot is limited by a screen that has at best 2K pixels in height, so breaking the rows apart across pages helps prevent overplotting. I include an attempt at this at the end, making a PDF document of 1000 plots/pages with 400 rows each.

    I do not know of a function that renders the requested plot with a data.frame as input. My approach makes a matrix mask of the data.frame and then uses lasagna() from the lasagnar package on GitHub. lasagna() is a wrapper for image( t(X)[, (nrow(X):1)] ), where X is a matrix. This call reorders the rows so that they match the order of the data.frame, and the wrapper lets you toggle grid lines and add legends (legend=TRUE will invoke image.plot( t(X)[, (nrow(X):1)] ); in the example below, however, I add a legend explicitly without using image.plot()).
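
    To see why the transpose-and-reverse is needed: image() lays out row i of its input along the x-axis and column j along the y-axis, counting up from the bottom. A small base-R check on a made-up 3 x 2 matrix (the name `X` simply mirrors the notation above) shows where each cell ends up:

    ```r
    X <- matrix(1:6, nrow = 3, ncol = 2,
                dimnames = list(paste0("row", 1:3), paste0("col", 1:2)))
    Z <- t(X)[, nrow(X):1]
    Z
    ## Z has one row per original column (x-axis) and one column per original
    ## row (y-axis, bottom to top). The last column of Z, which image() draws
    ## at the top, is therefore the first row of X.
    stopifnot(identical(unname(Z[, ncol(Z)]), unname(X[1, ])))
    ```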

    libraries for the task

    library(fields)
    library(colorspace)  
    library(lubridate)
    library(devtools)
    install_github("swihart/lasagnar")
    library(lasagnar)   
    

    create a sample dataframe of 50 rows (baby example before 400K example)

    df.in <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                        by = '1 week'),
               col1=rnorm(50),
               col2=rnorm(50),
               col3=rnorm(50),
               col4=rnorm(50),
               col5=as.factor(c("A","B")),
               col6=as.factor(c("MS","PHD")),
               col7=rnorm(50),
               col8=(c("cherlene","randy")),
               col9=rnorm(50),
               stringsAsFactors=FALSE)
    

    induce missingness

    df.in[19:23  , 2:4  ] <- NA
    df.in[c(7, 9),      ] <- NA
    df.in[2:30   , 4    ] <- NA
    df.in[10     , 7    ] <- NA
    df.in[14     , 6:10 ] <- NA
    

    check structure

    str(df.in)
    

    prep the mask matrix

    mat.out <- matrix(NA, nrow=nrow(df.in), ncol=ncol(df.in))
    

    then cycle through columns for types; apply is.na() at the end

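    ## red for dates (ymd() returns Date in current lubridate and POSIXct in
    ## older versions, so test for both classes)
    mat.out[,sapply(df.in, inherits, what=c("Date","POSIXt"))] <- 1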
    ## blue for factors
    mat.out[,sapply(df.in,is.factor)] <- 2
    ## green for characters
    mat.out[,sapply(df.in,is.character)] <- 3
    ## white for numeric
    mat.out[,sapply(df.in,is.numeric)] <- 4
    ## black for NA
    mat.out[is.na(df.in)] <- 5
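
    The same steps can be wrapped in a small function so the mask can be rebuilt for any data frame. This is a sketch, not part of lasagnar: `type_mask` and `toy` are hypothetical names, the codes 1 to 5 follow the colour scheme above, and date detection uses inherits() so that both Date and POSIXct columns are caught:

    ```r
    ## Hypothetical helper: integer mask with 1 = date, 2 = factor,
    ## 3 = character, 4 = numeric, 5 = NA (NA overrides the type code)
    type_mask <- function(df) {
      m <- matrix(NA_integer_, nrow = nrow(df), ncol = ncol(df),
                  dimnames = list(NULL, names(df)))
      is_date <- vapply(df, inherits, logical(1), what = c("Date", "POSIXt"))
      m[, is_date] <- 1L
      m[, vapply(df, is.factor, logical(1))] <- 2L
      m[, vapply(df, is.character, logical(1))] <- 3L
      m[, vapply(df, is.numeric, logical(1))] <- 4L
      m[is.na(df)] <- 5L
      m
    }

    ## Made-up 3-row example with one missing value in three of the columns
    toy <- data.frame(d = as.Date("2012-04-07") + 0:2,
                      f = factor(c("A", NA, "B")),
                      s = c("x", "y", NA),
                      n = c(NA, 2, 3),
                      stringsAsFactors = FALSE)
    type_mask(toy)
    ```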
    

    row names might be nice for tracing back to the original data

    row.names(mat.out) <- 1:nrow(df.in)
    

    render { lasagna(X) is a wrapper for image( t(X)[, (nrow(X):1)] ) }

    lasagna(mat.out, col=c("red","blue","green","white","black"), 
            cex=0.67, main="")
    

    [image: lasagna() plot of the mask matrix]

    legends are possible:

    lasagna(mat.out, col=c("red","blue","green","white","black"), 
            cex=.67, main="")
    legend("bottom", fill=c("red","blue","green","white","black"),
            legend=c("dates", "factors", "characters", "numeric", "NA"), 
            horiz=TRUE, xpd=NA, inset=c(-.15), border="black")
    

    [image: lasagna() plot with legend]

    turn gridlines off with gridlines=FALSE

    lasagna(mat.out, col=c("red","blue","green","white","black"), 
            cex=.67, main="", gridlines=FALSE)
    legend("bottom", fill=c("red","blue","green","white","black"),
            legend=c("dates", "factors", "characters", "numeric", "NA"), 
            horiz=TRUE, xpd=NA, inset=c(-.15), border="black")
    

    [image: lasagna() plot with legend, gridlines off]

    Let's do an example of OP data size: 400,000 rows x 50 cols

    create a sample dataframe

    df2.10 <- data.frame(date=seq(ymd('2012-04-07'),ymd('2013-03-22'), 
                        by = '1 week'),
               col1=rnorm(400000),
               col2=rnorm(400000),
               col3=rnorm(400000),
               col4=rnorm(400000),
               col5=as.factor(c("A","B")),
               col6=as.factor(c("MS","PHD")),
               col7=rnorm(400000),
               col8=(c("cherlene","randy")),
               col9=rnorm(400000),
               stringsAsFactors=FALSE)
    

    induce missingness

    df2.10[c(19:23), c(2:4)  ] <- NA
    df2.10[c(7,  9),         ] <- NA
    df2.10[c(2:30), 4        ] <- NA
    df2.10[10     , 7        ] <- NA
    df2.10[14     , c(6:10)  ] <- NA    
    df2.10[c(450:750), ] <- NA
    df2.10[c(399990:399999), ] <- NA
    

    cbind into 50 column wide df; check structure

    df2.in <- cbind(df2.10, df2.10, df2.10, df2.10, df2.10)
    str(df2.in)
    

    prep the mask matrix

    mat.out <- matrix(NA, nrow=nrow(df2.in), ncol=ncol(df2.in))
    

    then cycle through columns for types; apply is.na() at the end

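    ## red for dates (ymd() returns Date in current lubridate and POSIXct in
    ## older versions, so test for both classes)
    mat.out[,sapply(df2.in, inherits, what=c("Date","POSIXt"))] <- 1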
    ## blue for factors
    mat.out[,sapply(df2.in,is.factor)] <- 2
    ## green for characters
    mat.out[,sapply(df2.in,is.character)] <- 3
    ## white for numeric
    mat.out[,sapply(df2.in,is.numeric)] <- 4
    ## black for NA
    mat.out[is.na(df2.in)] <- 5
    

    row names might be nice for tracing back to the original data

    row.names(mat.out) <- 1:nrow(df2.in)
    

    render { lasagna_plain(X) has no gridlines or rownames }

    pdf("pages1000.pdf")
      system.time(
        for(i in 1:1000){
            lasagna_plain(mat.out[((i-1)*400+1):(400*i),],
                          col=c("red","blue","green","white","black"), cex=1, 
                          main=paste0("rows: ", (i-1)*400+1,  " - ",  (400*i)))
        }
      )
    dev.off()
    

    The for-loop completed in 40 seconds on my machine, and the PDF was written very shortly thereafter. After standardizing the page size in the PDF viewer, just page down through pages/plots such as these:
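
    The loop above relies on the row count being an exact multiple of 400. A base-R sketch for arbitrary sizes (`page_ranges` and its arguments are hypothetical names) computes the start and end row of each page up front, with a shorter final page when needed:

    ```r
    ## Hypothetical helper: first and last row index of every page
    page_ranges <- function(n_rows, page_size = 400) {
      stopifnot(n_rows >= 1, page_size >= 1)
      starts <- seq(1, n_rows, by = page_size)
      data.frame(start = starts,
                 end   = pmin(starts + page_size - 1, n_rows))
    }

    pr <- page_ranges(400000)
    nrow(pr)            # number of pages
    tail(pr, 1)         # row range of the final page
    page_ranges(1000)   # the last page is shorter: rows 801 to 1000
    ```

    Each row of `pr` can then drive one plotting call, for example lasagna_plain(mat.out[pr$start[i]:pr$end[i], ], ...).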

    [images: three sample pages from pages1000.pdf]

  • 2021-01-02 22:44

    Assuming that the blanks/gaps you are talking about are missing values (NA):

    image(t(as.matrix(is.na(df))))
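
    As a small extension of the one-liner, here is a base-R sketch on a made-up data frame (`toy` is hypothetical). It also reverses the rows so that row 1 is drawn at the top, and reports the share of missing values per column:

    ```r
    toy <- data.frame(a = c(1, NA, 3, NA),
                      b = c("x", "y", NA, "z"),
                      c = c(TRUE, FALSE, TRUE, TRUE),
                      stringsAsFactors = FALSE)
    na_mask <- is.na(toy)    # logical matrix with the same shape as toy
    colMeans(na_mask)        # fraction of missing values per column
    pdf(NULL)                # null device; remove pdf()/dev.off() to plot on screen
    ## Multiply by 1 to get a 0/1 numeric matrix: white = present, black = NA
    image(t(na_mask)[, nrow(na_mask):1] * 1, col = c("white", "black"),
          axes = FALSE, useRaster = TRUE)
    dev.off()
    ```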

  • 2021-01-02 22:48

    You may want to have a look at the tabplot package. With such a big data.frame it will take a while to load, but it should also correctly identify missing values. More info here.

    Here's an image example using the diamonds data.frame.

    [image: tabplot_diamonds]

    EDIT

    I just saw that you said your df has 50 columns. I've used tabplot on data frames of that size and find that the resolution of information is limited by the screen width. The row count can also be an issue, but I personally find that more information is lost if the df is too wide. So I suggest you split it into 3 separate data frames (for example using dplyr) and then run each of them through the tableplot() function of tabplot or similar.
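
    One way to do such a split without extra packages, as a sketch: the group count of 3 comes from the suggestion above, while `wide`, `col_groups` and `pieces` are illustrative names and `wide` is a made-up numeric stand-in for the 50-column data frame.

    ```r
    ## Made-up stand-in for a 50-column data frame
    wide <- as.data.frame(matrix(rnorm(20 * 50), nrow = 20, ncol = 50))
    n_groups <- 3
    ## Assign consecutive columns to groups of (at most) ceiling(50 / 3) columns
    group_id   <- ceiling(seq_len(ncol(wide)) / ceiling(ncol(wide) / n_groups))
    col_groups <- split(seq_len(ncol(wide)), group_id)
    ## One narrower data frame per group, columns kept in their original order
    pieces <- lapply(col_groups, function(idx) wide[, idx, drop = FALSE])
    sapply(pieces, ncol)
    ```

    Each element of `pieces` can then be passed to tableplot() in turn.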

  • 2021-01-02 22:51

    Give this a shot.

    library(Amelia)
    data(freetrade)
    missmap(freetrade)
    

    It won't do the red, blue, green color coding, but it gets you your grid. I'd also give the VIM package a shot, as it provides numerous options for visualizing missing data.

    http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf
