Question
Across the web I read that I should use data.table and fread to load my data.
But when I run a benchmark, I get the following results:
Unit: milliseconds
  expr       min        lq      mean    median        uq        max neval
 test1  1.229782  1.280000  1.382249  1.366277  1.460483   1.580176    10
 test3  1.294726  1.355139  1.765871  1.391576  1.542041   4.770357    10
 test2 23.115503 23.345451 42.307979 25.492186 57.772522 125.941734    10
The code that produced them is shown below.
library(microbenchmark)
library(magrittr)  # provides %>%
loadpath <- readRDS("paths.rds")
microbenchmark(
  test1 = read.csv(paste0(loadpath, "data.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE, colClasses = "character"),
  test2 = data.table::fread(paste0(loadpath, "data.csv"), sep = ";"),
  test3 = read.csv(paste0(loadpath, "data.csv")),
  times = 10
) %>%
  print(order = "min")
I understand that fread() should be faster than read.csv(), because read.csv() first reads the rows into memory as character and then tries to convert them to data types such as integer and factor. fread(), on the other hand, simply reads everything as character.
If this is true, shouldn't test2 be faster than test3?
Can someone explain why I do not achieve a speed-up, or at least the same speed, with test2 as with test1? :)
Answer 1:
The significant performance advantage of data.table::fread becomes clear once you consider larger files. Here is a fully reproducible example.
Let's generate a CSV file consisting of 10^5 rows and 100 columns
if (!file.exists("test.csv")) {
  set.seed(2017)
  df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
  write.csv(df, "test.csv", quote = FALSE)
}
We run a microbenchmark analysis (note that this may take a couple of minutes depending on your hardware):
library(microbenchmark)
res <- microbenchmark(
  read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
  fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
  times = 10)
res
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max
#  read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
#     fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531
library(ggplot2)
autoplot(res)
Answer 2:
If you look into the functions, you can see that fread does more checks than read.csv. If the file you are reading is small, it takes more time to do the checks and the preparation for reading than to actually read the data.
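As a rough illustration of the kind of work involved, the sketch below reads a tiny semicolon-separated file without telling fread the separator or the column types. When left to its defaults, fread infers both from the file rather than reading everything as character. The file contents and variable names are made up for this example, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical three-column file, written to a temporary path for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score", "1;a;0.5", "2;b;1.5"), tmp)

# fread is given neither the separator nor the column types; it detects both.
dt <- fread(tmp)
fread_types <- vapply(dt, function(x) class(x)[1], character(1))
print(fread_types)

# read.csv must be told the separator; colClasses = "character" skips conversion.
df <- read.csv(tmp, sep = ";", colClasses = "character")
read_csv_types <- vapply(df, class, character(1))
print(read_csv_types)

unlink(tmp)
```

Passing sep and colClasses explicitly to fread removes some of this guessing, although other setup work remains.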
data.table is much faster for big datasets.
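One way to see this on your own machine is to time both readers on a very small and a larger file. The following is a minimal sketch, not a rigorous benchmark: time_readers is a hypothetical helper name, the file sizes are arbitrary, and a single system.time() call per reader is noisy, so microbenchmark with several repetitions is the better tool for real measurements. It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical helper: write an n_rows x n_cols numeric CSV to a temporary
# file, then time one read with each function (elapsed seconds).
time_readers <- function(n_rows, n_cols = 10) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  df <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
  write.csv(df, path, row.names = FALSE)
  c(
    rows     = n_rows,
    read.csv = system.time(read.csv(path))[["elapsed"]],
    fread    = system.time(fread(path))[["elapsed"]]
  )
}

set.seed(2017)
# One tiny file and one larger file; sizes chosen only for illustration.
timings <- do.call(rbind, lapply(c(10, 1e5), time_readers))
print(timings)
```

The exact numbers depend on hardware and package versions, so compare the two rows with each other, not with figures from another machine.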
Source: https://stackoverflow.com/questions/51765374/read-csv-faster-than-data-tablefread