Reading Excel in R: how to find the start cell in messy spreadsheets

暗喜 2020-12-28 10:17

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first co

7 answers

囚心锁ツ (original poster)

2020-12-28 10:19

In cases like this it's important to know the possible conditions of your data. I'm going to assume that you only want to remove the columns and rows that don't conform to your table.

I have this Excel book:

I added three blank columns on the left because, when I loaded the book into R with a single blank column, that column was omitted. The extra columns confirm that R omits empty columns on the left.

First: load data

library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04

Second: I added a couple of empty columns (NA here; columns of '' are handled the same way in the next step), in case your data contains some

dat$x2 <- NA
dat$x4 <- NA
head(dat)

            MY.COMPANY.PTY.LTD            NA. x2 x4
1             MC  Pension Fund           <NA> NA NA
2    GROSS PERFORMANCE DETAILS           <NA> NA NA
3 updated  by IG on 20/04/2017           <NA> NA NA
4                         <NA> Monthly return NA NA
5                       Mar-14         0.0097 NA NA
6                       Apr-14          6e-04 NA NA

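Third: remove the columns in which every value is NA or ''. I have had to deal with this kind of problem in the past

# TRUE for columns with at least one value that is neither NA nor ''
colSelect <- apply(dat, 2, function(x) !all(is.na(x) | x == ''))
# drop = FALSE keeps a data frame even if only one column survives
dat2 <- dat[, colSelect, drop = FALSE]
head(dat2)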

            MY.COMPANY.PTY.LTD            NA.
1             MC  Pension Fund           <NA>
2    GROSS PERFORMANCE DETAILS           <NA>
3 updated  by IG on 20/04/2017           <NA>
4                         <NA> Monthly return
5                       Mar-14         0.0097
6                       Apr-14          6e-04
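
To try the column filter without an Excel file, here is a minimal sketch on a made-up data frame. The names `toy`, `is_blank` and `keep` are hypothetical and not part of the workbook above; the test itself is the same "all cells are NA or ''" check.

```r
# Hypothetical toy data: 'b' holds only NA, 'c' holds only '' and NA
toy <- data.frame(
  a = c("x", NA, "z"),
  b = c(NA, NA, NA),
  c = c("", NA, ""),
  d = c(1, 2, NA),
  stringsAsFactors = FALSE
)

# A cell counts as blank when it is NA or an empty string
is_blank <- function(x) is.na(x) | x == ""

# Keep a column when at least one of its cells is not blank
keep <- !vapply(toy, function(col) all(is_blank(col)), logical(1))

toy2 <- toy[, keep, drop = FALSE]
names(toy2)  # "a" "d"
```

Using `vapply` over the columns avoids the character-matrix conversion that `apply` performs on a data frame, which may matter if some columns are numeric.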

Fourth: keep only the rows with complete observations (that is what I assume you want, judging from your example)

rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)

   MY.COMPANY.PTY.LTD     NA.
5              Mar-14  0.0097
6              Apr-14   6e-04
7              May-14  0.0189
8              Jun-14   0.008
9              Jul-14 -0.0199
10             Ago-14 0.00697
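
One caveat with the row filter: `is.na()` does not flag '' cells, so a row containing an empty string still counts as complete. If your sheets may contain such cells, one option is to convert '' to NA first. A minimal sketch on hypothetical toy data (the data frame `toy` and its values are made up for illustration):

```r
# Hypothetical toy data: row 2 has an empty string where the others have NA
toy <- data.frame(
  month = c(NA, "", "Mar-14", "Apr-14"),
  ret   = c("Monthly return", "0.5", "0.0097", "6e-04"),
  stringsAsFactors = FALSE
)

# Treat '' as missing so the NA-based row test also catches it
toy[] <- lapply(toy, function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

# Same row test as above: TRUE for rows without any NA
rowSelect <- apply(toy, 1, function(x) !any(is.na(x)))
toy[rowSelect, ]

# Index of the first complete row, i.e. where the data block starts
start <- unname(which(rowSelect)[1])
start
```

The value of `start` is the answer to "where does the data begin" for that sheet, under the assumption that the data block is the first run of fully populated rows.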

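Finally, if you want to keep the header, you can do something like this:

# Use the row just above the first complete row as the header
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])

# That row has no label for the first column, so name it explicitly
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))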
dat3

    Month Monthly return
5  Mar-14         0.0097
6  Apr-14          6e-04
7  May-14         0.0189
8  Jun-14          0.008
9  Jul-14        -0.0199
10 Ago-14        0.00697
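
To apply the same steps to many sheets, they can be folded into one function. This is only a sketch: `extract_table` is a hypothetical helper name, the `raw` data frame stands in for whatever `read.xlsx` returned, and it assumes the header sits in the row directly above the first complete row.

```r
# Hypothetical helper: trims a raw sheet (as a data frame) to its data block
extract_table <- function(raw) {
  # Work on a character matrix so '' and NA can be tested uniformly
  m <- as.matrix(raw)
  m[!is.na(m) & m == ""] <- NA

  # Drop columns that are entirely blank
  m <- m[, colSums(!is.na(m)) > 0, drop = FALSE]

  # Complete rows are those without any blank cell
  complete <- rowSums(is.na(m)) == 0
  if (!any(complete)) stop("no complete rows found")
  start <- unname(which(complete)[1])

  out <- as.data.frame(m[complete, , drop = FALSE], stringsAsFactors = FALSE)

  # Assumption: the header is the row just above the first complete row
  if (start > 1) {
    hdr <- unname(m[start - 1, ])
    # Fill missing header cells with placeholder names V1, V2, ...
    hdr[is.na(hdr)] <- paste0("V", which(is.na(hdr)))
    names(out) <- hdr
  }
  out
}

# Made-up stand-in for a sheet read from Excel
raw <- data.frame(
  X1 = c("MC Pension Fund", NA, "Mar-14", "Apr-14"),
  X2 = c(NA, "Monthly return", "0.0097", "6e-04"),
  X3 = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
res <- extract_table(raw)
res
```

All values come back as character, so numeric columns would still need a conversion such as `as.numeric(res[["Monthly return"]])` afterwards.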
