Question
I am trying to read a large (~700Mb) .csv file into R.
The file contains an array of integers less than 256, with a header row and 2 header columns.
I use:
trainSet <- read.csv(trainFileName)
This eventually barfs with:
Loading Data...
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 145 Kb
Execution halted
Looking at the memory usage, it conks out at about 3Gb usage on a 6Gb machine with zero page file usage at the time of the crash, so there may be another way to fix it.
If I use:
trainSet <- read.csv(trainFileName, header=TRUE, nrows=100)
classes <- sapply(trainSet, class)
I can see that all the columns are being loaded as "integer" which I think is 32 bits.
Clearly using 3Gb to load part of a 700Mb .csv file is far from efficient. I wonder if there's a way to tell R to use 8-bit numbers for the columns? This is what I've done in the past in Matlab and it worked a treat; however, I can't seem to find any mention of an 8-bit type in R.
Does it exist? And how would I tell read.csv to use it?
Thanks in advance for any help.
Answer 1:
The narrow answer is that the add-on package ff allows you to use a more compact representation.
The downside is that the different representation prevents you from passing the data to standard functions.
So you may need to rethink your approach: maybe sub-sampling the data, or getting more RAM.
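For what it is worth, here is a minimal sketch of the ff route (this assumes the ff package is installed; trainFileName is the same path as in the question, and read.csv.ffdf() is ff's reader that keeps the data on disk rather than in RAM):
library(ff)
# Build an on-disk ffdf object instead of an in-memory data.frame,
# so only small chunks are held in RAM at any one time.
trainSet <- read.csv.ffdf(file = trainFileName, header = TRUE,
                          colClasses = "integer")
# ff also offers genuinely compact storage modes, e.g. a 1-byte signed
# integer vector (vmode "byte" holds values in -128..127):
smallInts <- ff(vmode = "byte", length = 1e6)
As noted above, the catch is that many standard functions will not accept these objects directly.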
Answer 2:
Q: Can you tell R to use 8-bit numbers?
A: No. (Edit: See Dirk's comments below. He's smarter than I am.)
Q: Will more RAM help?
A: Maybe. Assuming a 64 bit OS and a 64 bit instance of R are the starting point, then "Yes", otherwise "No".
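A quick way to check which build you are running (both values come from base R):
.Machine$sizeof.pointer   # 8 on a 64-bit build of R, 4 on a 32-bit build
R.version$arch            # e.g. "x86_64"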
Implicit question A: Will a .csv dataset that is 700 MB be 700 MB when read in by read.csv?
A: Maybe. If it really is all integers, it may be smaller or larger. It will take 4 bytes for each integer, and if most of your integers were in the range of -9 to 10, they might actually "expand" in size when stored as 4 bytes each. At the moment you are only using 1-3 bytes per value, so you would expect roughly a 50% increase in size. You would want to use colClasses="integer" in the read function. Otherwise the values may get stored as factor or as 8-byte "numeric" if there are any data-input glitches.
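As a rough illustration of those per-element costs (a sketch; the exact numbers printed by object.size will vary a little):
n <- 1e6
vals <- sample.int(255L, n, replace = TRUE)
object.size(vals)                 # stored as integer: ~4 bytes per value
object.size(as.numeric(vals))     # stored as double ("numeric"): ~8 bytes per value
object.size(as.character(vals))   # stored as character: larger still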
Implicit question B: If you get the data into the workspace, will you be able to work with it?
A: Only maybe. You need, at a minimum, three times as much memory as your largest object, because of the way R copies on assignment, even if it is a copy to its own name.
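If you want to see when those copies actually happen, base R's tracemem() will report each duplication of a traced vector (a minimal sketch; the vector here is purely illustrative):
x <- runif(1e6)
tracemem(x)     # prints a message each time this vector gets duplicated
y <- x          # no copy yet; both names refer to the same vector
y[1] <- 0       # modifying y forces a full copy, which tracemem() reports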
Answer 3:
Not trying to be snarky, but the way to fix this is documented in ?read.csv:
These functions can use a surprising amount of memory when reading large files. There is extensive discussion in the ‘R Data Import/Export’ manual, supplementing the notes here. Less memory will be used if ‘colClasses’ is specified as one of the six atomic vector classes. This can be particularly so when reading a column that takes many distinct numeric values, as storing each distinct value as a character string can take up to 14 times as much memory as storing it as an integer. Using ‘nrows’, even as a mild over-estimate, will help memory usage.
This example takes a while to run because of I/O, even with my SSD, but there are no memory issues:
R> # In one R session
R> x <- matrix(sample(256,2e8,TRUE),ncol=2)
R> write.csv(x,"700mb.csv",row.names=FALSE)
R> # In a new R session
R> x <- read.csv("700mb.csv", colClasses=c("integer","integer"),
+ header=TRUE, nrows=1e8)
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 173632 9.3 350000 18.7 350000 18.7
Vcells 100276451 765.1 221142070 1687.2 200277306 1528.0
R> # Max memory used ~1.5Gb
R> print(object.size(x), units="Mb")
762.9 Mb
Source: https://stackoverflow.com/questions/12271274/does-r-support-8bit-variables