Question
I'm trying to import a very large dataset (101 GB) from a text file using read.table.ffdf in package ff. The dataset has >285 million records, but I am only able to read in the first 169,457,332 rows. The dataset is tab-separated with 44 variable-width columns. I've searched Stack Overflow and other message boards and have tried many fixes, but I'm still consistently only able to import the same number of records.
Here's my code:
relFeb2016.test <- read.table.ffdf(
  x = NULL,
  file = "D:/eBird/ebd_relFeb-2016.txt",
  fileEncoding = "",
  nrows = -1,
  first.rows = NULL,
  next.rows = NULL,
  header = TRUE,
  sep = "\t",
  skipNul = TRUE,
  fill = TRUE,
  quote = "",
  comment.char = "",
  na.strings = "",
  levels = NULL,
  appendLevels = TRUE,
  strip.white = TRUE,
  blank.lines.skip = FALSE,
  FUN = "read.table",
  transFUN = NULL,
  asffdf_args = list(),
  BATCHBYTES = getOption("ffbatchbytes"),
  VERBOSE = FALSE,
  colClasses = c(
    "factor", "numeric",
    "factor", "factor", "factor", "factor", "factor", "factor", "factor",
    "factor", "factor", "factor", "factor", "factor", "factor", "factor",
    "factor", "factor", "factor", "factor", "factor", "factor",
    "numeric", "numeric", "Date",
    "factor", "factor", "factor", "factor", "factor",
    "factor", "factor", "factor", "factor", "factor",
    "numeric", "numeric", "numeric",
    "factor", "factor", "numeric", "factor", "factor"
  )
)
Here's what I've tried:
- Added skipNul=TRUE to bypass null characters that I know exist in the data.
- Added quote="" and comment.char="" to bypass quote marks, pound signs, and other characters that I know exist in the data.
- Added na.strings="" and fill=TRUE because many fields are left blank.
- Tried reading it in with UTF-16 encoding (encoding="UTF-16LE") in case the special characters were still a problem, though EmEditor reports the file as UTF-8 unsigned.
- More than tripled my memory limit from ~130000 using memory.limit(size=500000).
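One thing I haven't tried yet is making the chunking explicit and turning on the progress output, so I can see exactly which chunk the import stops in. A minimal sketch (untested; the 5-million-row chunk size is an arbitrary guess, and col_types is just the colClasses vector from the call above written compactly):

library(ff)

# Same column types as in the call above, written compactly with rep().
col_types <- c("factor", "numeric", rep("factor", 20), "numeric", "numeric", "Date",
               rep("factor", 10), "numeric", "numeric", "numeric",
               "factor", "factor", "numeric", "factor", "factor")

relFeb2016.chunked <- read.table.ffdf(
  file         = "D:/eBird/ebd_relFeb-2016.txt",
  header       = TRUE,
  sep          = "\t",
  quote        = "",
  comment.char = "",
  na.strings   = "",
  fill         = TRUE,
  skipNul      = TRUE,
  colClasses   = col_types,
  first.rows   = 5e6,   # explicit size of the first chunk
  next.rows    = 5e6,   # explicit size of every later chunk
  VERBOSE      = TRUE   # print per-chunk progress so the failing chunk is visible
)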
Here's what I've ruled out:
- My data is not fixed-width, so I can't use laf_open_fwf in LaF, which solved a similar problem described here: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html
- I can't use bigmemory because my data includes a variety of data types (factor, date, integer, numeric).
- There's nothing special about that last imported record that should cause the import to abort.
Because it consistently reads in the same number of records each time, and it's always a block of the first 169+ million records, I don't think the problem can be traced to special characters, which occur earlier in the file.
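To double-check that, here's a minimal sketch (untested) of a plain base-R pass over the file that would count the physical lines independently of read.table and capture the raw text on either side of row 169,457,332, so I can confirm the file really contains >285 million lines and see whether anything unusual sits at the boundary:

con <- file("D:/eBird/ebd_relFeb-2016.txt", open = "r")
boundary <- 169457332 + 1        # last imported record, +1 for the header line
chunk_size <- 1e6
total <- 0
around_boundary <- character(0)

repeat {
  lines <- readLines(con, n = chunk_size, skipNul = TRUE)
  if (length(lines) == 0) break
  idx <- seq.int(total + 1, total + length(lines))   # line numbers of this chunk
  keep <- which(abs(idx - boundary) <= 2)            # lines within 2 of the boundary
  if (length(keep) > 0) around_boundary <- c(around_boundary, lines[keep])
  total <- total + length(lines)
}
close(con)

total              # total physical lines in the file (including the header)
around_boundary    # raw text just before and after where the import stops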
Is there an upper limit on the number of records that can be imported using read.table.ffdf? Can anyone recommend an alternative solution? Thanks!
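Related to the upper-limit question, one diagnostic I'm considering is feeding the same file to plain read.table from an open connection in fixed-size chunks and simply counting how many data rows it can parse; if this also stops near 169,457,332, the limit would seem to come from read.table or the file itself rather than from the ffdf layer. A minimal sketch (untested; the 1-million-row chunk size is arbitrary, and all columns are read as character just for counting):

con <- file("D:/eBird/ebd_relFeb-2016.txt", open = "r")
invisible(readLines(con, n = 1))   # discard the header line
rows_parsed <- 0

repeat {
  chunk <- tryCatch(
    read.table(con, header = FALSE, sep = "\t", quote = "", comment.char = "",
               na.strings = "", fill = TRUE, skipNul = TRUE,
               colClasses = "character", nrows = 1e6),
    error = function(e) e          # read.table errors at end of input
  )
  if (inherits(chunk, "error") || NROW(chunk) == 0) break
  rows_parsed <- rows_parsed + nrow(chunk)
}
close(con)

rows_parsed   # how many data rows plain read.table can parse from the file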
ETA: No error messages are returned. I'm working on a server running Windows Server 2012 R2, with 128GB RAM and >1TB available on the drive.
Source: https://stackoverflow.com/questions/36973221/row-limit-in-read-table-ffdf