Row limit in read.table.ffdf?

Question


I'm trying to import a very large dataset (101 GB) from a text file using read.table.ffdf in package ff. The dataset has >285 million records, but I am only able to read in the first 169,457,332 rows. The dataset is tab-separated with 44 variable-width columns. I've searched Stack Overflow and other message boards and have tried many fixes, but I still consistently end up importing exactly the same number of records.

Here's my code:

relFeb2016.test <- read.table.ffdf(
    x = NULL, file = "D:/eBird/ebd_relFeb-2016.txt", fileEncoding = "",
    nrows = -1, first.rows = NULL, next.rows = NULL,
    header = TRUE, sep = "\t", skipNul = TRUE, fill = TRUE,
    quote = "", comment.char = "", na.strings = "",
    levels = NULL, appendLevels = TRUE,
    strip.white = TRUE, blank.lines.skip = FALSE,
    FUN = "read.table", transFUN = NULL, asffdf_args = list(),
    BATCHBYTES = getOption("ffbatchbytes"), VERBOSE = FALSE,
    colClasses = c("factor", "numeric",
                   "factor", "factor", "factor", "factor", "factor",
                   "factor", "factor", "factor", "factor", "factor",
                   "factor", "factor", "factor", "factor", "factor",
                   "factor", "factor", "factor", "factor", "factor",
                   "numeric", "numeric", "Date",
                   "factor", "factor", "factor", "factor", "factor",
                   "factor", "factor", "factor", "factor", "factor",
                   "numeric", "numeric", "numeric",
                   "factor", "factor", "numeric", "factor", "factor"))
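
In case it helps pin down where the import stalls, here is a sketch of the same call trimmed to its non-default arguments, but with explicit chunk sizes and VERBOSE = TRUE so that read.table.ffdf reports its progress chunk by chunk. The 1e6 chunk sizes are arbitrary choices, and ebd_classes is just a compact restatement of the 43 classes written out in the colClasses vector above:

library(ff)

# Compact restatement of the colClasses vector from the call above.
ebd_classes <- c("factor", "numeric",
                 rep("factor", 20),
                 "numeric", "numeric", "Date",
                 rep("factor", 10),
                 rep("numeric", 3),
                 "factor", "factor", "numeric", "factor", "factor")

# Same import, but chunked explicitly and run verbosely so each appended
# block of rows is reported -- this shows roughly where the read stops.
relFeb2016.verbose <- read.table.ffdf(
    file = "D:/eBird/ebd_relFeb-2016.txt",
    header = TRUE, sep = "\t", quote = "", comment.char = "",
    na.strings = "", fill = TRUE, skipNul = TRUE,
    strip.white = TRUE, blank.lines.skip = FALSE,
    colClasses = ebd_classes,
    first.rows = 1e6,   # rows read on the first pass
    next.rows  = 1e6,   # rows appended per subsequent chunk
    VERBOSE    = TRUE)  # print progress/timings for each chunk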

Here's what I've tried:

  1. Added skipNul=TRUE to bypass null characters that I know exist in the data.

  2. Added quote="" and comment.char="" to bypass quote marks, pound signs, and other characters that I know exist in the data.

  3. Added na.strings="" and fill=TRUE because many fields are left blank.

  4. Tried reading it in with UTF-16 encoding (encoding="UTF-16LE") in case the special characters were still a problem, though EmEditor reports it as UTF-8 unsigned.

  5. More than tripled my memory limit from ~130,000 MB using memory.limit(size=500000).

Here's what I've ruled out:

  1. My data is not fixed-width, so I can't use laf_open_fwf in the LaF package, which solved a similar problem described here: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html

  2. I can't use bigmemory because my data includes a variety of data types (factor, date, integer, numeric)

  3. There's nothing special about that last imported record that should cause the import to abort (a minimal sketch of one way to spot-check the lines around the cutoff follows this list).

  4. Because it consistently reads in the same number of records each time, and it's always a block of the first 169+ million records, I don't think the problem can be traced to special characters, which occur earlier in the file.
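
To make point 3 above concrete, here is a minimal sketch of one way to spot-check the raw lines around the cutoff. It assumes the readr package; the skip offset treats the header as line 1 of the file, and the byte checks are only illustrative:

library(readr)

cutoff <- 169457332   # last row that read.table.ffdf manages to import
lines  <- read_lines("D:/eBird/ebd_relFeb-2016.txt",
                     skip  = cutoff - 2,   # header is line 1, so start a couple
                     n_max = 6)            # of rows before the cutoff

nchar(lines, type = "bytes")                   # any abnormally long/short line?
lengths(strsplit(lines, "\t", fixed = TRUE))   # each row should split into 44 fields
lapply(lines, function(x) charToRaw(substr(x, 1, 200)))   # odd bytes: 1a (SUB), 22 ("), 23 (#)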

Is there an upper limit on the number of records that can be imported using read.table.ffdf? Can anyone recommend an alternative solution? Thanks!
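
To give a sense of the kind of alternative I mean, here is a minimal sketch using readr::read_tsv_chunked, which streams the file in blocks and hands each block to a callback. The chunk size, the row-counting callback, and reading every column as character are placeholder choices, and quote/special-character handling is glossed over:

library(readr)

# Placeholder callback: record how many rows arrive in each chunk.
count_chunk <- function(chunk, pos) data.frame(start = pos, rows = nrow(chunk))

counts <- read_tsv_chunked(
    "D:/eBird/ebd_relFeb-2016.txt",
    callback   = DataFrameCallback$new(count_chunk),   # row-bind per-chunk results
    chunk_size = 1e6,                                   # arbitrary block size
    col_types  = cols(.default = col_character()),      # defer type conversion
    na         = "")

sum(counts$rows)   # does the streamed total reach the full >285 million rows?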

ETA: No error messages are returned. I'm working on a server running Windows Server 2012 R2, with 128GB RAM and >1TB available on the drive.

Source: https://stackoverflow.com/questions/36973221/row-limit-in-read-table-ffdf
