I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implem
I assume we are comparing to read.csv with all known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.
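To make the "known advice" concrete, here is a minimal sketch in base R. The file, its columns and the variable names are made up for illustration; only read.csv and its colClasses and nrows arguments come from the discussion above.

```r
# Write a small illustrative CSV to a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:1000,
                   value = seq(0.5, 500, by = 0.5),
                   label = rep(c("a", "b"), 500),
                   stringsAsFactors = FALSE)
write.csv(demo, csv_path, row.names = FALSE)

# Default call: read.csv has to work out each column's type itself.
untuned <- read.csv(csv_path)

# Tuned call: column types and the row count are supplied up front.
tuned <- read.csv(csv_path,
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 1000)

str(tuned)
```

On a file this small the difference is not measurable; the point is only to show which arguments the comparison assumes have been set.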
So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...
They are both written in C so it's not that.
There isn't one reason in particular, but essentially fread memory maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.
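Base R has no memory-mapping function, so the mmap side cannot be shown without extra packages. The connection side can be sketched, though. The helper below (count_eol_chunked is a made-up name, and this is not how read.csv is actually implemented) pulls fixed-size chunks through a binary connection and counts newline bytes, which gives a feel for the buffer-at-a-time style:

```r
# Hypothetical helper: count newline bytes by reading fixed-size chunks
# through a binary connection, one buffer at a time.
count_eol_chunked <- function(path, chunk_bytes = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    buf <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break          # end of file reached
    n <- n + sum(buf == as.raw(10L))      # 0x0A is "\n"
  }
  n
}

# Tiny demo file; a 4-byte chunk size forces several reads.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), demo_path)
n_eol <- count_eol_chunked(demo_path, chunk_bytes = 4L)
print(n_eol)
```

Every chunk is copied into a fresh buffer before it can be inspected. With a memory map the file's bytes are addressed directly, so that copy step goes away.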
If you run fread with verbose=TRUE it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a much better guess of the column types (although in this case the top 5 rows were enough).
> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes ( first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+ last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
13.420s ( 31%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
3.210s ( 7%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
1.310s ( 3%) Allocation of 10000000x6 result (xMB) in RAM
25.580s ( 59%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.040s ( 0%) Changing na.strings to NA
43.560s Total
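The "first, middle and last 5 rows" idea in the output above can be imitated in base R. This is only a rough sketch under my own assumptions (a comma-separated file with a header line and at least one data row); guess_types is a made-up helper, and it reads the whole file with readLines, which fread does not do.

```r
# Hypothetical helper: guess column types from the first, middle and last
# n data rows of a comma-separated file that has a header line.
guess_types <- function(path, n = 5) {
  lines <- readLines(path)              # acceptable for a demo only
  header <- lines[1]
  body <- lines[-1]
  m <- length(body)
  mid_start <- max(1, floor(m / 2) - n %/% 2)
  idx <- unique(c(seq_len(min(n, m)),                        # first n rows
                  seq(mid_start, min(m, mid_start + n - 1)), # middle n rows
                  seq(max(1, m - n + 1), m)))                # last n rows
  sampled <- read.csv(text = c(header, body[idx]), stringsAsFactors = FALSE)
  vapply(sampled, function(col) class(col)[1], character(1))
}

# Demo: column x looks like integer until the very last row.
types_path <- tempfile(fileext = ".csv")
writeLines(c("id,x", paste(1:100, c(1:99, "2.5"), sep = ",")), types_path)

top_only <- vapply(read.csv(types_path, nrows = 5),
                   function(col) class(col)[1], character(1))
sampled_guess <- guess_types(types_path)

print(top_only)       # guess based on the top 5 rows only
print(sampled_guess)  # guess based on first, middle and last 5 rows
```

Here the top-only guess calls x an integer column, while the sampled guess sees the 2.5 in the last row and picks numeric. That is the kind of late surprise that looking at the middle and end of the file is meant to catch early.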
NB: these timings are from my very slow netbook with no SSD. Both the absolute and relative times of each step will vary widely from machine to machine. For example, if you rerun fread a second time you may notice the time to mmap is much less because your OS has cached the file from the previous run.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000 # i.e. my slow netbook
BogoMIPS: 1995.01
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1