Reading Excel in R: how to find the start cell in messy spreadsheets

暗喜 2020-12-28 10:17

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column of the data block holds dates and the column next to it is headed "Monthly return".

7 Answers
  • 2020-12-28 10:19

    In cases like this it's important to know what conditions your data can be in. I'm going to assume that you only want to remove the columns and rows that don't conform to your table.

    I have this Excel workbook:

    I added 3 blank columns on the left because, when I first loaded the file into R with a single blank column, it was dropped. That confirms that R omits empty columns on the left.

    First: load data

    library(xlsx)
    dat <- read.xlsx('book.xlsx', sheetIndex = 1)
    head(dat)
    
                MY.COMPANY.PTY.LTD            NA.
    1             MC  Pension Fund           <NA>
    2    GROSS PERFORMANCE DETAILS           <NA>
    3 updated  by IG on 20/04/2017           <NA>
    4                         <NA> Monthly return
    5                       Mar-14         0.0097
    6                       Apr-14          6e-04
    

    Second: I added some columns with NA (and possibly '') values, in case your data contains some

    dat$x2 <- NA
    dat$x4 <- NA
    head(dat)
    
                MY.COMPANY.PTY.LTD            NA. x2 x4
    1             MC  Pension Fund           <NA> NA NA
    2    GROSS PERFORMANCE DETAILS           <NA> NA NA
    3 updated  by IG on 20/04/2017           <NA> NA NA
    4                         <NA> Monthly return NA NA
    5                       Mar-14         0.0097 NA NA
    6                       Apr-14          6e-04 NA NA
    

    Third: remove the columns in which every value is NA or ''. I have had to deal with that kind of problem in the past

    colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
    dat2 <- dat[, colSelect]
    head(dat2)
    
                MY.COMPANY.PTY.LTD            NA.
    1             MC  Pension Fund           <NA>
    2    GROSS PERFORMANCE DETAILS           <NA>
    3 updated  by IG on 20/04/2017           <NA>
    4                         <NA> Monthly return
    5                       Mar-14         0.0097
    6                       Apr-14          6e-04
    

    Fourth: keep only rows with complete observations (that's what I suppose you want from your example)

    rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
    dat3 <- dat2[rowSelect, ]
    head(dat3)
    
       MY.COMPANY.PTY.LTD     NA.
    5              Mar-14  0.0097
    6              Apr-14   6e-04
    7              May-14  0.0189
    8              Jun-14   0.008
    9              Jul-14 -0.0199
    10             Ago-14 0.00697
    
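
    As a side note (not part of the original answer), base R's complete.cases() gives the same row filter in a single call:

    rowSelect <- complete.cases(dat2)   # TRUE only for rows with no NA values
    dat3 <- dat2[rowSelect, ]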

    Finally, if you want to keep the header, you can do something like this:

    colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
    

    or

    colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
    dat3
    
        Month Monthly return
    5  Mar-14         0.0097
    6  Apr-14          6e-04
    7  May-14         0.0189
    8  Jun-14          0.008
    9  Jul-14        -0.0199
    10 Ago-14        0.00697
    
  • 2020-12-28 10:20

    Here is how I would tackle it.

    STEP 1
    Read the excel spreadsheet in without the headers.

    STEP 2
    Find the row index of your anchor string ("Monthly return" in this case).

    STEP 3
    Filter from the identified row (or column or both), prettify a little and done.

    Here is what a sample function looks like. It works for your example no matter where the data sits in the spreadsheet. You can play around with regex to make it more robust (see the sketch after the example below).

    Function Definition:

    library(xlsx)
    extract_return <-  function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
                           filepath = paste(path, "/", filename, sep = "")
                           input = read.xlsx(filepath, sheetnum, header = FALSE)
                           start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
                           output = input[start_idx:dim(input)[1],]
                           rownames(output) <- NULL
                           colnames(output) <- c("Date","Monthly Return")
                           output = output[-1, ]  
                           return(output)
                      }
    

    Example:

    final_df <- extract_return(
                    path = "~/Desktop", 
                    filename = "Apr2017.xlsx", 
                    sheetnum = 2)
    
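    As mentioned above, the exact-match lookup can be made more forgiving with a regular expression. A minimal sketch (the variable names mirror the function above and are only illustrative):

    # case-insensitive, whitespace-tolerant search for the anchor cell
    vals <- as.matrix(input)                      # 'input' is the headerless read.xlsx result
    hit  <- which(grepl("monthly\\s*return", vals, ignore.case = TRUE))[1]
    start_idx <- arrayInd(hit, dim(vals))[1]      # row of the first matching cell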

    No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.

  • 2020-12-28 10:22


    With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.

    library(readxl)
    
    two_passes <- function(path, sheet = NULL, n_max = 10) {
      first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
      skip <- which(first_pass[[2]] == "Monthly return")
      message("For sheet '", if (is.null(sheet)) 1 else sheet,
              "' we'll skip ", skip, " rows.")  
      read_excel(path, sheet = sheet, skip = skip)
    }
    
    (sheets <- excel_sheets("so.xlsx"))
    #> [1] "sheet_one" "sheet_two"
    sheets <- setNames(sheets, sheets)
    lapply(sheets, two_passes, path = "so.xlsx")
    #> For sheet 'sheet_one' we'll skip 4 rows.
    #> For sheet 'sheet_two' we'll skip 6 rows.
    #> $sheet_one
    #> # A tibble: 6 × 2
    #>         X__1 `Monthly return`
    #>       <dttm>            <dbl>
    #> 1 2017-03-14          0.00907
    #> 2 2017-04-14          0.00069
    #> 3 2017-05-14          0.01890
    #> 4 2017-06-14          0.00803
    #> 5 2017-07-14         -0.01998
    #> 6 2017-08-14          0.00697
    #> 
    #> $sheet_two
    #> # A tibble: 6 × 2
    #>         X__1 `Monthly return`
    #>       <dttm>            <dbl>
    #> 1 2017-03-14          0.00907
    #> 2 2017-04-14          0.00069
    #> 3 2017-05-14          0.01890
    #> 4 2017-06-14          0.00803
    #> 5 2017-07-14         -0.01998
    #> 6 2017-08-14          0.00697
    
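    If you are iterating over workbooks rather than worksheets, the same function drops in directly. A sketch with hypothetical file names:

    # purrr::map() (or lapply()) over several files instead of several sheets
    paths <- c(Apr2017 = "Apr2017.xlsx", May2017 = "May2017.xlsx")
    purrr::map(paths, two_passes)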
  • 2020-12-28 10:26
    grep("2014",dat)[1]
    

    This gives you the first column containing a year; or use "-14" or whatever marks your years. In a similar way, grep("Monthly", dat)[1] gives you the second column. A short sketch combining the two follows.
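
    A sketch combining the two lookups (assuming dat is the sheet read in without headers; the names are illustrative):

    # locate the date column, the "Monthly return" column, and the header row
    date_col   <- which(sapply(dat, function(x) any(grepl("-14", x))))[1]
    return_col <- which(sapply(dat, function(x) any(grepl("Monthly", x))))[1]
    header_row <- which(grepl("Monthly", dat[[return_col]]))[1]
    returns    <- dat[(header_row + 1):nrow(dat), c(date_col, return_col)]
    names(returns) <- c("Month", "Monthly return")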

  • 2020-12-28 10:29

    Okay, since the format was specified as xls, here is an update from the csv approach to the suggested xls loading.

    library(readxl)
    data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
    

    You would get something similar to:

    data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("", 
    "Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund", 
    "MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"), 
        V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("", 
        "0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1", 
    "V2"), class = "data.frame", row.names = c(NA, -7L))
    

    Then you can dynamically filter on the "Monthly return" cell and identify your matrix.

    targetCell <- which(data == "Monthly return", arr.ind = T)
    returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
    
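
    A small follow-up, not in the original answer: the extracted block still holds the raw cell text, so you will probably want to name the columns and parse the percentage strings:

    # hypothetical clean-up step for the block extracted above
    names(returns) <- c("Month", "MonthlyReturn")
    returns$MonthlyReturn <- as.numeric(sub("%", "", returns$MonthlyReturn)) / 100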
  • 2020-12-28 10:39

    This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.

    library("tidyxl")
    library("unpivotr")
    library("tidyr")
    library("dplyr")
    
    tidy_solution <- function() {
        raw <- xlsx_cells("messyExcel.xlsx")
    
        start <- raw %>%
            filter_all(any_vars(. %in% c("Monthly return"))) %>%
            select(row, col)
    
        month.col <- raw %>%
            filter(row >= start$row + 1, col == start$col - 1) %>%
            pivot_wider(date, col)
    
        return.col <- raw %>%
            filter(row >= start$row + 1, col == start$col) %>%
            pivot_wider(numeric, col)
     
        output <- cbind(month.col, return.col)
     }
    
    # My Solution
                expr     min       lq     mean   median      uq     max neval
    tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455   100
    
    # Rafael's
                    expr     min      lq     mean   median       uq     max neval
    original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128   100
    
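    For reference, the timings above look like microbenchmark output; they could be reproduced with something along these lines (original_solution() is assumed to wrap the code from the answer being compared against):

    library(microbenchmark)
    microbenchmark(tidy_solution(), original_solution(), times = 100)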