How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?
Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform best (in a few simple situations). I myself have been using xlsx since I started using R, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.
Any benchmarking exercise is fraught with difficulties: some packages are sure to handle certain situations better than others, and there is a waterfall of other caveats.
That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):
NN <- 1000L #or 25000L for the larger benchmark below
set.seed(51423)
dat <- data.frame(
  str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
  str2 = sample(sprintf("%09d", 1:NN)), #ID field 2
  #varying-length string field--think names/addresses, etc.
  str3 =
    replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                         collapse = "")),
  #factor-like string field with 50 "levels"
  str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
  #factor-like string field with 17 levels, varying length
  str5 =
    sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                 collapse = "")), NN, TRUE),
  #lognormally distributed numeric
  num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
  #3 binary strings
  str6 = sample(c("Y", "N"), NN, TRUE),
  str7 = sample(c("M", "F"), NN, TRUE),
  str8 = sample(c("B", "W"), NN, TRUE),
  #right-skewed integer
  int1 = ceiling(rexp(NN)),
  #dates by month
  dat1 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  dat2 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
  #date by day
  dat3 =
    sample(seq(from = as.Date("2015-06-01"),
               to = as.Date("2015-07-15"), by = "day"),
           NN, TRUE),
  #lognormal numeric that can be positive or negative
  num3 =
    (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)
I then wrote this to csv, opened it in LibreOffice, and saved it as an .xlsx file; I then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried versions with and without specifying column types, but this didn't change the rankings).
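For concreteness, the export step might look something like this (a sketch; fl_csv and fl are hypothetical path names, and the .xlsx conversion itself was done by hand in LibreOffice):
fl_csv <- "test_data.csv" #hypothetical path
fl <- "test_data.xlsx"    #created by re-saving the CSV in LibreOffice
write.csv(dat, fl_csv, row.names = FALSE)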
I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (it seems to have been phased out).
I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is included primarily for xlsx, which I've found can at times create memory clogs. Without further ado, here are the results I found:
library(microbenchmark)
#fl is the path of the .xlsx file saved from LibreOffice (see above)
benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex = 1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 194.1958 199.2662 214.1512 201.9063 212.7563 354.0327 100
# openxlsx 142.2074 142.9028 151.9127 143.7239 148.0940 255.0124 100
# readxl 122.0238 122.8448 132.4021 123.6964 130.2881 214.5138 100
# gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345 100
So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum (see the sketch after the table for one way to compute this):
# expr min lq mean median uq max
# 1 xlsx 1.59 1.62 1.62 1.63 1.63 1.65
# 2 openxlsx 1.17 1.16 1.15 1.16 1.14 1.19
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 16.43 16.62 15.77 16.67 16.25 11.31
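Something like the following can produce that relative table (a sketch, assuming benchmark1k from above; summary() of a microbenchmark object returns a data.frame of timing quantiles):
s <- summary(benchmark1k)
cols <- c("min", "lq", "mean", "median", "uq", "max")
rel <- s[ , cols]
#divide each timing column by its minimum and round to 2 digits
rel[] <- lapply(rel, function(x) round(x / min(x), 2))
data.frame(expr = s$expr, rel)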
We see that my own favorite, xlsx, is about 60% slower than readxl.
Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 4451.9553 4539.4599 4738.6366 4762.1768 4941.2331 5091.0057 20
# openxlsx 962.1579 981.0613 988.5006 986.1091 992.6017 1040.4158 20
# readxl 341.0006 344.8904 347.0779 346.4518 348.9273 360.1808 20
# gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826 20
Here's the relative data:
# expr min lq mean median uq max
# 1 xlsx 13.06 13.16 13.65 13.75 14.16 14.13
# 2 openxlsx 2.82 2.84 2.85 2.85 2.84 2.89
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 128.62 128.67 129.22 129.86 129.69 126.75
So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.
Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) the col_types argument in readxl only (as of this writing) accepts some nonstandard R: "text" instead of "character" and "date" instead of "Date".
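For illustration, that vocabulary looks like this in practice (a sketch with a hypothetical file and column layout):
library(readxl)
#note "text" and "date", not "character" and "Date"
my_data <- read_excel("file.xlsx",
                      col_types = c("text", "numeric", "date"))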
Note that read.table cannot parse a genuine Excel file; the following only works if the file is actually tab-delimited text that happens to carry an .xls extension:
my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header = TRUE)
Reading .xls and .xlsx files using the readxl package:
library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")
Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system -- usually already true on macOS and Linux, but it takes an extra step on Windows; see http://strawberryperl.com/). There are various caveats, and alternatives, listed on the R wiki page.
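A minimal usage sketch (hypothetical file name; assumes Perl is installed and on the PATH):
library(gdata)
my_data <- read.xls("myfile.xls", sheet = 1) #sheet can be a number or a name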
The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all -- see the sketch below], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., no comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc.), I generally have no problem with this process.
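For the multiple-worksheet case, a sketch of looping over all sheets (hypothetical file name; sheetNames() also ships with gdata):
library(gdata)
sheets <- sheetNames("myfile.xls")
#read every sheet into a named list of data.frames
all_sheets <- setNames(lapply(sheets, function(s) read.xls("myfile.xls", sheet = s)),
                       sheets)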
Just gave the package openxlsx a try today. It worked really well (and fast): http://cran.r-project.org/web/packages/openxlsx/index.html
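A minimal usage sketch (hypothetical file name):
library(openxlsx)
my_data <- read.xlsx("myfile.xlsx", sheet = 1)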
I've had good luck with XLConnect: http://cran.r-project.org/web/packages/XLConnect/index.html
Let me reiterate what @Chase recommended: use XLConnect. It is somewhat new compared to other solutions, so it is less frequently mentioned in blog posts and reference docs, but for me it's been very useful.
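A minimal usage sketch (hypothetical file name; XLConnect requires a Java installation):
library(XLConnect)
wb <- loadWorkbook("myfile.xlsx")
my_data <- readWorksheet(wb, sheet = 1) #sheet can be a number or a name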