I\'m trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first co
In those cases it's important to know the possible conditions of your data. I'm gonna assume that you want only remove columns and rows that doesn't confrom your table.
I have this Excel book:
I added 3 blank columns at left becouse when I loaded in R with one column the program omits them. Thats for confirm that R omits empty cols at the left.
First: load data
library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some cols with NA
and ''
values in the case that your data contain some
dat$x2 <- NA
dat$x4 <- NA
head(dat)
MY.COMPANY.PTY.LTD NA. x2 x4
1 MC Pension Fund <NA> NA NA
2 GROSS PERFORMANCE DETAILS <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: Remove columns when all values are NA
and ''
. I have to deal with that kind of problems in past
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: Keep only rows with complete observations (it's what I supose from your example)
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)
MY.COMPANY.PTY.LTD NA.
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally if you want to keep the header you can make something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
or
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Here is how I would tackle it.
STEP 1
Read the excel spreadsheet in without
the headers.
STEP 2
Find the row index for your string Monthly return
in this case
STEP 3
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex
to make it more robust.
Function Definition:
library(xlsx)
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
filepath = paste(path, "/", filename, sep = "")
input = read.xlsx(filepath, sheetnum, header = FALSE)
start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
output = input[start_idx:dim(input)[1],]
rownames(output) <- NULL
colnames(output) <- c("Date","Monthly Return")
output = output[-1, ]
return(output)
}
Example:
final_df <- extract_return(
path = "~/Desktop",
filename = "Apr2017.xlsx",
sheetnum = 2)
No matter ho many rows or columns you may have, the idea remains the same.. Give it a try and let me know.
With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply()
or purrr::map()
. This function will encapsulate the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
skip <- which(first_pass[[2]] == "Monthly return")
message("For sheet '", if (is.null(sheet)) 1 else sheet,
"' we'll skip ", skip, " rows.")
read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#>
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
grep("2014",dat)[1]
This gives you first column with year. Or use "-14" or whatever you have for years. Similar way grep("Monthly",dat)[1] gives you second column
Okay, at the format was specified for xls, update from csv to the correctly suggested xls loading.
library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_types = FALSE)
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
then you can dynamincally filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
library("tidyxl")
library("unpivotr")
library("tidyr")
library("dplyr")
tidy_solution <- function() {
raw <- xlsx_cells("messyExcel.xlsx")
start <- raw %>%
filter_all(any_vars(. %in% c("Monthly return"))) %>%
select(row, col)
month.col <- raw %>%
filter(row >= start$row + 1, col == start$col - 1) %>%
pivot_wider(date, col)
return.col <- raw %>%
filter(row >= start$row + 1, col == start$col) %>%
pivot_wider(numeric, col)
output <- cbind(month.col, return.col)
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100