fread(): reading table with \r\r\n as newline symbol

问题

I have tab-delimited tables in text files where all lines end with \r\r\n (0x0D 0x0D 0x0A). If I try to read such file with fread(), it says

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

but I am not downloading these files, I already have them.

So far I came to the solution which first reads the file with read.table() (it treats \r\r\n combination as a single end-of-line character), then converts resulting data.frame by data.table():

mydt <- data.table(read.table(myfilename, header = T, sep = '\t', fill = T))

but I am wondering if there's any way to avoid slow read.table() and use fast fread() instead.

回答1:

I suggest using the GNU utility tr to get rid of those unnecessary \r characters. e.g.

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6

If you are using Windows and do not have the tr utility, you can get it here.

Added:

I did some comparisons of three methods, using a 100,000 x 5 sample cvs dataset.

OPcsv is the "slow" read.table method
freadScan is a method that discards the extra \r characters in pure R
freadtr calls GNU tr through the shell using fread() directly.

The third method is by far the fastest.

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


# function that translates the extra \r characters in R only
fread2 <- function(filename) {
    tmp <- scan(file = filename, what = "character", quiet = TRUE)
    # remove empty lines caused by \r
    tmp <- tmp[tmp != ""]
    # paste lines back together together with \n character
    tmp <- paste(tmp, collapse = "\n")
    fread(tmp)
}

# OP function using read.csv that is slow
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

require(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d \'\\r\' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100

来源：https://stackoverflow.com/questions/33339656/fread-reading-table-with-r-r-n-as-newline-symbol

标签

performance

data.table

line-endings