Question
I have tab-delimited tables in text files where every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it says:
Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.
But I am not downloading these files; I already have them.
So far my solution is to first read the file with read.table() (which treats the \r\r\n combination as a single end-of-line), then convert the resulting data.frame with data.table():
mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))
But I am wondering whether there is any way to avoid the slow read.table() and use the fast fread() instead.
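To confirm that a file really ends its lines with 0x0D 0x0D 0x0A, the raw bytes can be inspected directly. The following is only a small sketch; the file name peek_demo.txt is made up for illustration, and the file is written byte-for-byte so the result does not depend on the platform's text mode.

```r
# Write a tiny tab-delimited file whose lines end in \r\r\n (hypothetical name)
writeBin(charToRaw("x\ty\r\r\n1\t2\r\r\n"), "peek_demo.txt")

# Read the first bytes of the file without any text-mode translation
head_bytes <- readBin("peek_demo.txt", what = "raw", n = 64)

# Find the first line feed and keep it together with the two bytes before it
lf <- which(head_bytes == as.raw(0x0A))[1]
ending <- head_bytes[(lf - 2):lf]
print(ending)
```

If the three bytes shown are 0d 0d 0a, the file has the doubled carriage return that fread() complains about.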
Answer 1:
I suggest using the GNU utility tr to get rid of those unnecessary \r characters. For example:
library(data.table)
cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") :
## Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.
system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6
If you are using Windows and do not have the tr utility, you can get it here.
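Where installing tr is not an option, the same clean-up can be done on the raw bytes in R itself. The helper below is a minimal sketch, not part of data.table: fread_strip_cr and the file name test_crcrlf.csv are hypothetical names chosen for this example. It assumes the file fits in memory and that no field legitimately contains a carriage return, since every 0x0D byte is dropped.

```r
library(data.table)

# Hypothetical helper (not part of data.table): drop every carriage-return
# byte from a file, then let fread() parse the cleaned copy.
fread_strip_cr <- function(filename, ...) {
  # Read the whole file as raw bytes so no text-mode translation happens
  bytes <- readBin(filename, what = "raw", n = file.size(filename))
  # Discard all 0x0D bytes, turning \r\r\n into plain \n
  clean <- bytes[bytes != as.raw(0x0D)]
  # Write the cleaned bytes to a temporary file in binary mode
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeBin(clean, tmp)
  # Extra arguments such as sep or header are passed through to fread()
  fread(tmp, ...)
}

# Build a small \r\r\n file byte-for-byte and read it back
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), "test_crcrlf.csv")
dt <- fread_strip_cr("test_crcrlf.csv")
print(dt)
```

Working on raw bytes avoids both the shell and the platform's text-mode handling; the cost is one extra copy of the file in memory and one temporary file on disk.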
Added:
I compared three methods, using a 100,000 x 5 sample CSV dataset.
- OPcsv is the "slow" read.table method
- freadScan is a method that discards the extra \r characters in pure R
- freadtr calls GNU tr through the shell using fread() directly
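The third method is the fastest: by median time, the read.table() method takes about 1.37 times as long and the pure-R scan method about 1.67 times as long.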
# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
  sample.txt <- paste0(sample.txt,
                       paste(round(runif(5)*100), collapse = ","),
                       delim)
}
cat(sample.txt, file = "sample.csv")
# function that strips the extra \r characters using R only
fread2 <- function(filename) {
  tmp <- scan(file = filename, what = "character", quiet = TRUE)
  # remove empty lines caused by \r
  tmp <- tmp[tmp != ""]
  # paste the lines back together with the \n character
  tmp <- paste(tmp, collapse = "\n")
  fread(tmp)
}
# OP's slow method, which uses read.table
readcsvMethod <- function(myfilename)
  data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))
require(microbenchmark)
microbenchmark(OPcsv     = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr   = fread("tr -d '\\r' < sample.csv"),
               unit = "relative")
## Unit: relative
##       expr      min       lq     mean   median       uq      max neval
##      OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##  freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##    freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
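More recent data.table releases also accept the shell command through a dedicated cmd argument of fread(), instead of as the first positional input. The sketch below assumes such a version is installed and that tr is available on the PATH; the file name cmd_demo.csv is made up for the example.

```r
library(data.table)

# Write a small \r\r\n file byte-for-byte (hypothetical file name)
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), "cmd_demo.csv")

# Let tr delete every carriage return and have fread() parse its output;
# "\\r" in the R string reaches tr as the two characters \r
dt_cmd <- fread(cmd = "tr -d '\\r' < cmd_demo.csv")
print(dt_cmd)
```

Naming the argument makes it explicit that the string is a command to run rather than a file name or literal data.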
Source: https://stackoverflow.com/questions/33339656/fread-reading-table-with-r-r-n-as-newline-symbol