I am analysing a dataset having 200 rows and 1200 columns, stored in a .CSV file. In order to process it, I read this file using R's `read.csv()`.
Wide data sets are typically slower to read into memory than long data sets (i.e. the transposed one). This affects many programs that read data, such as R, Python, Excel, etc., though this description is more pertinent to R:

1. A wide layout has to represent every missing cell explicitly as `NA`. This means that every column has at least as many cells as the number of rows in the csv file, whereas in a long dataset you can potentially drop the `NA` values and save some space.
2. The reader has to guess the data type of every column, so the overhead of type inference grows with the number of columns.

Since your dataset doesn't appear to contain any `NA` values, my hunch is that you're seeing the speed improvement because of the second point. You can test this theory by passing `colClasses = rep('numeric', 20)` to `read.csv` or `fread` for the 20 column data set, or `rep('numeric', 120)` for the 120 column one, which should decrease the overhead of guessing data types.
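As a quick sketch of that test (the file name and dimensions here are made up for illustration, not taken from the question), one can generate a purely numeric csv and compare `read.csv` with and without declared column classes:

```r
## Sketch: does declaring column classes speed up reading a wide file?
## (File name and dimensions are illustrative only.)
m <- 200; n <- 1200
write.csv(matrix(runif(m * n), m, n),
          file = "wide_demo.csv", row.names = FALSE, quote = FALSE)

## read.csv has to guess that every column is numeric
t_guessed  <- system.time(a <- read.csv("wide_demo.csv"))
## here the types are declared up front, so guessing is skipped
t_declared <- system.time(b <- read.csv("wide_demo.csv",
                                        colClasses = rep("numeric", n)))
t_guessed[["elapsed"]]
t_declared[["elapsed"]]
```

If type guessing is indeed the bottleneck, `t_declared` should come out noticeably smaller than `t_guessed`; the two data frames themselves are identical.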
Your question is basically: is reading a long dataset much faster than reading a wide dataset?

What I give here is not going to be the final answer, but a new starting point.

For any performance-related issue, it is always better to profile than to guess. `system.time` is good, but it only tells you the total run time, not how the time is split inside. If you have a quick glance at the source code of `read.table` (`read.csv` is merely a wrapper of `read.table`), you will see that it contains three stages:
1. `scan` is called to read in 5 rows of your data. I am not entirely sure about the purpose of this part;
2. `scan` is called to read in your complete data. Basically this will read your data column by column into a list of character strings, where each column is a "record";
3. type conversion, done either implicitly by `type.convert`, or explicitly (if you have specified column classes) by, say, `as.numeric`, `as.Date`, etc.

The first two stages are done at C level, while the final stage is done at R level with a for loop through all records.
A basic profiling tool is `Rprof` paired with `summaryRprof`. The following is a very, very simple example.
## configure size
m <- 10000
n <- 100
## a very very simple example, where all data are numeric
x <- runif(m * n)
## long and wide .csv
write.csv(matrix(x, m, n), file = "long.csv", row.names = FALSE, quote = FALSE)
write.csv(matrix(x, n, m), file = "wide.csv", row.names = FALSE, quote = FALSE)
## profiling (sample stage)
Rprof("long.out")
long <- read.csv("long.csv")
Rprof(NULL)
Rprof("wide.out")
wide <- read.csv("wide.csv")
Rprof(NULL)
## profiling (report stage)
summaryRprof("long.out")[c(2, 4)]
summaryRprof("wide.out")[c(2, 4)]
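The `c(2, 4)` indexing relies on the shape of `summaryRprof`'s return value, a four-component list; a throwaway profile (the workload is arbitrary, just enough for samples to be collected) shows the component names:

```r
## Inspect what summaryRprof actually returns.
out <- tempfile()
Rprof(out)
invisible(sum(runif(2e7)))   # arbitrary work so that samples are collected
Rprof(NULL)
s <- summaryRprof(out)
names(s)   # "by.self" "by.total" "sample.interval" "sampling.time"
```

So component 2 is the "by.total" table and component 4 is the total sampled time.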
The `c(2, 4)` extracts the "by.total" time for all R-level functions with enough samples, and the total CPU time (which may be lower than wall-clock time). The following is what I get on my Intel i5-2557M @ 1.1 GHz (turbo boost disabled), Sandy Bridge 2011.
## "long.csv"
#$by.total
# total.time total.pct self.time self.pct
#"read.csv" 7.0 100 0.0 0
#"read.table" 7.0 100 0.0 0
#"scan" 6.3 90 6.3 90
#".External2" 0.7 10 0.7 10
#"type.convert" 0.7 10 0.0 0
#
#$sampling.time
#[1] 7
## "wide.csv"
#$by.total
# total.time total.pct self.time self.pct
#"read.table" 25.86 100.00 0.06 0.23
#"read.csv" 25.86 100.00 0.00 0.00
#"scan" 23.22 89.79 23.22 89.79
#"type.convert" 2.22 8.58 0.38 1.47
#"match.arg" 1.20 4.64 0.46 1.78
#"eval" 0.66 2.55 0.12 0.46
#".External2" 0.64 2.47 0.64 2.47
#"parent.frame" 0.50 1.93 0.50 1.93
#".External" 0.30 1.16 0.30 1.16
#"formals" 0.08 0.31 0.04 0.15
#"make.names" 0.04 0.15 0.04 0.15
#"sys.function" 0.04 0.15 0.02 0.08
#"as.character" 0.02 0.08 0.02 0.08
#"c" 0.02 0.08 0.02 0.08
#"lapply" 0.02 0.08 0.02 0.08
#"sys.parent" 0.02 0.08 0.02 0.08
#"sapply" 0.02 0.08 0.00 0.00
#
#$sampling.time
#[1] 25.86
So reading the long dataset takes 7 s of CPU time, while reading the wide dataset takes 25.86 s.
It might be confusing at first glance that more functions are reported for the wide case. In fact, both the long and the wide case execute the same set of functions, but the long case is faster, so many functions take less time than the sampling interval (0.02 s) and hence cannot be measured.
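One way to make those fast functions visible again (a sketch; the interval and repetition counts are just guesses at what is small and large enough) is to shrink `Rprof`'s sampling interval from its 0.02 s default and repeat the fast read many times so samples accumulate:

```r
## Build a small long-shaped file, then profile repeated reads
## with a finer sampling interval than the 0.02s default.
write.csv(matrix(runif(1e4), 1000, 10),
          file = "long_demo.csv", row.names = FALSE, quote = FALSE)
Rprof("fine.out", interval = 0.01)
for (i in 1:200) invisible(read.csv("long_demo.csv"))
Rprof(NULL)
head(summaryRprof("fine.out")$by.total)
```

With enough accumulated samples, the helper functions that vanished from the long-case report should reappear in the "by.total" table.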
But anyway, the run time is dominated by `scan` and `type.convert` (implicit type conversion). For this example, we see that `scan` is basically all `read.csv` is doing, but unfortunately we are unable to further split that time between stage 1 and stage 2. Don't take it for granted that stage 1 is very fast just because it only reads in 5 rows; in debugging mode I actually find that stage 1 can take quite a long time.

So what should we do next?
`scan`;
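To try to split that `scan` time between stage 1 and stage 2 ourselves, we can call `scan` directly (a sketch on synthetic data; `nlines` and `skip` are standard `scan` arguments, but this two-call split is my approximation of the stages, and the flat character read simplifies `read.table`'s real per-column list):

```r
## Synthetic wide file: 20 rows, 1000 columns.
write.csv(matrix(runif(2e4), 20, 1000),
          file = "wide_demo2.csv", row.names = FALSE, quote = FALSE)

## Roughly stage 1: read only 5 rows, as character, skipping the header.
t_stage1 <- system.time(
  scan("wide_demo2.csv", what = "", sep = ",", skip = 1, nlines = 5,
       quiet = TRUE)
)

## Roughly stage 2: read the complete data.
t_stage2 <- system.time(
  full <- scan("wide_demo2.csv", what = "", sep = ",", skip = 1,
               quiet = TRUE)
)
t_stage1[["elapsed"]]; t_stage2[["elapsed"]]
```

Comparing `t_stage1` against `t_stage2` for wide versus long files would show whether the 5-row pass contributes meaningfully to the gap.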