fread(): reading table with \r\r\n as newline symbol

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-21 17:47:25

Question


I have tab-delimited tables in text files where every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it fails with:

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

But I am not downloading these files; I already have them.

So far my workaround is to read the file with read.table() first (it treats the \r\r\n combination as a single end of line) and then convert the resulting data.frame with data.table():

mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))

But I am wondering whether there is a way to avoid the slow read.table() and use the fast fread() instead.


Answer 1:


I suggest using the GNU utility tr to get rid of those unnecessary \r characters. For example:

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6

If you are using Windows and do not have the tr utility, you will need to install a Windows port of it first.
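
If installing tr is not an option, the same stripping can be done in base R by working on the raw bytes of the file. The following is only a minimal sketch, assuming the file fits in memory; strip_cr is a hypothetical helper name I made up for illustration, not something data.table provides. Like tr -d, it removes every \r, including any that appear inside quoted fields.

```r
# Hypothetical helper (not part of data.table): copy a file to a temporary
# file with every carriage return byte (0x0D) removed, and return the new path.
strip_cr <- function(path) {
    # read the whole file as raw bytes so R does no newline translation
    bytes <- readBin(path, what = "raw", n = file.size(path))
    # keep everything except \r
    bytes <- bytes[bytes != as.raw(0x0D)]
    out <- tempfile(fileext = ".txt")
    writeBin(bytes, out)
    out
}

# same sample file as above, with \r\r\n line endings
cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
clean <- strip_cr("test.csv")

# fread() can now read the cleaned copy (requires the data.table package)
if (requireNamespace("data.table", quietly = TRUE)) {
    print(data.table::fread(clean))
}
```

I have not benchmarked this variant, so I make no claim about how it compares with the tr approach on large files.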

Added:

I compared three methods using a 100,000 x 5 sample CSV dataset.

  • OPcsv is the "slow" read.table method
  • freadScan is a method that discards the extra \r characters in pure R
  • freadtr passes a shell command that runs GNU tr directly to fread()

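The third method is the fastest: by median time in the benchmark below, the read.table method took about 1.37 times as long and the pure-R scan method about 1.67 times as long.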

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a,b,c,d,e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


# function that discards the extra \r characters using R only
fread2 <- function(filename) {
    # scan() splits on whitespace by default, so each row comes back as one
    # string as long as no field contains a space
    tmp <- scan(file = filename, what = "character", quiet = TRUE)
    # drop any empty strings
    tmp <- tmp[tmp != ""]
    # paste the rows back together with the \n character
    tmp <- paste(tmp, collapse = "\n")
    fread(tmp)
}

# OP's method: read.table() wrapped in data.table(), which is slow
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

library(data.table)
library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d '\\r' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
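
The freadtr call above relies on fread() treating its first argument as a shell command. Newer data.table releases also accept the command through an explicit cmd argument, which makes the intent clearer. A minimal sketch, assuming a data.table version that has cmd (1.11.6 or later, to the best of my knowledge) and that tr is available on the PATH; test_cmd.csv is just an illustrative file name:

```r
library(data.table)

# small sample file with \r\r\n line endings
cat("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n", file = "test_cmd.csv")

# run tr through the shell and let fread() parse its output
# (assumes tr is available on the PATH)
dt <- fread(cmd = "tr -d '\\r' < test_cmd.csv")
print(dt)
```

The double backslash matters here: R turns "\\r" into the two characters \r, which tr itself then interprets as a carriage return.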


Source: https://stackoverflow.com/questions/33339656/fread-reading-table-with-r-r-n-as-newline-symbol
