Reading Excel in R: how to find the start cell in messy spreadsheets

暗喜 2020-12-28 10:17

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first co

7 answers
  •  囚心锁ツ
    2020-12-28 10:19

    In cases like this it's important to know what conditions your data can be in. I'm going to assume that you only want to remove the columns and rows that don't conform to your table.

    I have this Excel book:

    I added 3 blank columns on the left because, when I loaded the book into R with one blank column, the program omitted it. This confirms that R omits empty columns on the left.

    First: load data

    library(xlsx)
    dat <- read.xlsx('book.xlsx', sheetIndex = 1)
    head(dat)
    
                MY.COMPANY.PTY.LTD            NA.
    1             MC  Pension Fund           
    2    GROSS PERFORMANCE DETAILS           
    3 updated  by IG on 20/04/2017           
    4                          Monthly return
    5                       Mar-14         0.0097
    6                       Apr-14          6e-04
    

    Second: I added some all-NA columns, in case your data contains empty columns (NA or '' values)

    dat$x2 <- NA
    dat$x4 <- NA
    head(dat)
    
                MY.COMPANY.PTY.LTD            NA. x2 x4
    1             MC  Pension Fund            NA NA
    2    GROSS PERFORMANCE DETAILS            NA NA
    3 updated  by IG on 20/04/2017            NA NA
    4                          Monthly return NA NA
    5                       Mar-14         0.0097 NA NA
    6                       Apr-14          6e-04 NA NA
    

    Third: remove the columns in which all values are NA or ''. I have had to deal with this kind of problem in the past

    # TRUE for columns holding at least one value that is neither NA nor ''
    colSelect <- apply(dat, 2, function(x) !all(is.na(x) | x == ''))
    dat2 <- dat[, colSelect, drop = FALSE]
    head(dat2)
    
                MY.COMPANY.PTY.LTD            NA.
    1             MC  Pension Fund           
    2    GROSS PERFORMANCE DETAILS           
    3 updated  by IG on 20/04/2017           
    4                          Monthly return
    5                       Mar-14         0.0097
    6                       Apr-14          6e-04
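
    The column filter above can also be written without apply(), which converts the whole data frame to a character matrix first. The sketch below is only an illustration: is_empty_col and the toy data frame are hypothetical names with made-up values, not part of the original book.xlsx.

    ```r
    # Hypothetical helper: TRUE for columns where every cell is NA or ''.
    is_empty_col <- function(df) {
      vapply(df, function(x) all(is.na(x) | as.character(x) == ""), logical(1))
    }

    # Made-up stand-in for the loaded sheet (not the real spreadsheet).
    toy <- data.frame(
      a  = c("title", NA, "Mar-14"),
      b  = c(NA, "Monthly return", "0.0097"),
      x2 = NA,   # all-NA column
      x4 = "",   # all-'' column
      stringsAsFactors = FALSE
    )

    # drop = FALSE keeps a data frame even if only one column survives.
    toy2 <- toy[, !is_empty_col(toy), drop = FALSE]
    names(toy2)
    ```

    Working column by column with vapply() keeps each column's own type, which may matter if the sheet mixes numbers and text.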
    

    Fourth: keep only the rows with complete observations (that's what I suppose you want, judging from your example)

    rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
    dat3 <- dat2[rowSelect, ]
    head(dat3)
    
       MY.COMPANY.PTY.LTD     NA.
    5              Mar-14  0.0097
    6              Apr-14   6e-04
    7              May-14  0.0189
    8              Jun-14   0.008
    9              Jul-14 -0.0199
    10             Ago-14 0.00697
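
    The row filter above also tells you where the table starts: the first complete row is the first data row. A minimal sketch, assuming the header sits directly above that row; toy is a made-up stand-in for dat2, and start_row and header_row are illustrative names.

    ```r
    # Made-up stand-in for dat2: title rows, a header row, then the data.
    toy <- data.frame(
      V1 = c("MC Pension Fund", "GROSS PERFORMANCE DETAILS", NA, "Mar-14", "Apr-14"),
      V2 = c(NA, NA, "Monthly return", "0.0097", "6e-04"),
      stringsAsFactors = FALSE
    )

    complete   <- complete.cases(toy)   # TRUE where no cell in the row is NA
    start_row  <- which(complete)[1]    # first fully populated row
    header_row <- start_row - 1         # assumption: header is the row just above
    ```

    Note that this rule also drops any genuine data row with a missing value, so it only fits sheets where the table itself has no gaps.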
    

    Finally, if you want to keep the header, you can do something like this:

    colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
    

    or

    colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
    dat3
    
        Month Monthly return
    5  Mar-14         0.0097
    6  Apr-14          6e-04
    7  May-14         0.0189
    8  Jun-14          0.008
    9  Jul-14        -0.0199
    10 Ago-14        0.00697
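
    Since the question is about many sheets, the steps above could be wrapped in one function and applied to each loaded data frame. This is a sketch under the same assumptions as the answer (empty columns are all NA or '', data rows are fully populated, the header is the row just above); trim_to_table is a hypothetical name and the toy data is made up.

    ```r
    # Hypothetical helper: drop empty columns, keep complete rows, and take
    # column names from the row just above the first complete row.
    trim_to_table <- function(df) {
      empty_col <- vapply(df, function(x) all(is.na(x) | as.character(x) == ""),
                          logical(1))
      df <- df[, !empty_col, drop = FALSE]

      keep <- complete.cases(df)
      if (!any(keep)) stop("no fully populated row found")
      first <- which(keep)[1]
      out <- df[keep, , drop = FALSE]

      if (first > 1) {
        hdr <- unname(vapply(df[first - 1, , drop = FALSE], as.character,
                             character(1)))
        # Keep the existing name wherever the header cell is blank.
        names(out) <- ifelse(is.na(hdr) | hdr == "", names(out), hdr)
      }
      out
    }

    # Made-up stand-in for one loaded sheet.
    toy <- data.frame(
      V1 = c("MC Pension Fund", NA, "Mar-14", "Apr-14"),
      V2 = c(NA, "Monthly return", "0.0097", "6e-04"),
      x2 = NA,
      stringsAsFactors = FALSE
    )
    res <- trim_to_table(toy)
    ```

    The first column keeps its old name here because its header cell is blank, which matches the answer's second option of naming it by hand (for example 'Month').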
    
