Fast read of different types of data with the same command, better separator guessing [duplicate]

Submitted by 怎甘沉沦 on 2020-01-04 14:18:20

Question


I have LD data, sometimes as a raw output file from PLINK as below (note the spaces used to make the output pretty, including leading and trailing spaces):

write.table(read.table(text="
 CHR_A     BP_A          SNP_A  CHR_B         BP_B          SNP_B           R2 
 1    154834183      rs1218582      1    154794318      rs9970364    0.0929391 
 1    154834183      rs1218582      1    154795033     rs56744813      0.10075 
 1    154834183      rs1218582      1    154797272     rs16836414     0.106455 
 1    154834183      rs1218582      1    154798550    rs200576863    0.0916789 
 1    154834183      rs1218582      1    154802379     rs11264270     0.176911 ",sep="x"),
          "Type1.txt",col.names=FALSE,row.names=FALSE,quote=FALSE)  

Or as a nicely tab-separated file:

write.table(read.table(text="
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 154834183 rs1218582 1 154794318 rs9970364 0.0929391
1 154834183 rs1218582 1 154795033 rs56744813 0.10075
1 154834183 rs1218582 1 154797272 rs16836414 0.106455
1 154834183 rs1218582 1 154798550 rs200576863 0.0916789
1 154834183 rs1218582 1 154802379 rs11264270 0.176911", sep=" "),
            "Type2.txt",col.names=FALSE,row.names=FALSE,quote=FALSE,sep="\t")

read.csv works for both types of data:

read.csv("Type1.txt", sep="")
read.csv("Type2.txt", sep="")

fread works only for Type2:

fread("Type1.txt")
fread("Type2.txt")

The files are big, with millions of rows, so read.csv is not an option. Is there a way to make fread guess better? Any other package or function suggestions?

I could use readLines to guess the type of file, or tidy up the file with a system call and then call fread, but this adds overhead I am trying to avoid.
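For illustration, the readLines idea could look like the sketch below. It uses only base R; the helper name guess_layout and the labels "tab" and "padded" are made up for this example and are not part of any package. It assumes a file whose first line contains no tab is the space-padded PLINK layout.

```r
# Hypothetical helper: classify a file by inspecting only its first line.
guess_layout <- function(path) {
  first <- readLines(path, n = 1L)
  if (length(first) == 0L) stop("file is empty: ", path)
  # Assumption: any tab in the header means the tab-separated layout.
  if (grepl("\t", first, fixed = TRUE)) "tab" else "padded"
}

# Small demonstration with two temporary files.
padded <- tempfile(fileext = ".txt")
writeLines(" CHR_A     BP_A          SNP_A ", padded)
tabbed <- tempfile(fileext = ".txt")
writeLines("CHR_A\tBP_A\tSNP_A", tabbed)

guess_layout(padded)
guess_layout(tabbed)
```

The check reads a single line, so it does not scan the whole file; the cost that remains is whatever tidy-up step the "padded" case then needs.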

Edit: sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Answer 1:


Fixed in the development version, v1.9.5. Either upgrade to the development version or wait a while for it to reach CRAN as v1.9.6:

require(data.table) # v1.9.5+
ans <- fread("Type1.txt")
#    CHR_A      BP_A     SNP_A CHR_B      BP_B       SNP_B        R2
# 1:     1 154834183 rs1218582     1 154794318   rs9970364 0.0929391
# 2:     1 154834183 rs1218582     1 154795033  rs56744813 0.1007500
# 3:     1 154834183 rs1218582     1 154797272  rs16836414 0.1064550
# 4:     1 154834183 rs1218582     1 154798550 rs200576863 0.0916789
# 5:     1 154834183 rs1218582     1 154802379  rs11264270 0.1769110

fread() has gained strip.white (default=TRUE), among other new arguments and bug fixes. Please see the README file on the project page for more info.
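A small self-contained check of this behaviour might look as follows. It is a sketch that assumes data.table v1.9.6 or later is installed; the two temporary files are trimmed-down stand-ins for Type1.txt and Type2.txt, not the original data.

```r
library(data.table)  # assumes v1.9.6 or later

# Space-padded sample with leading and trailing spaces (like Type1.txt).
padded <- tempfile(fileext = ".txt")
writeLines(c(
  " CHR_A     BP_A          SNP_A           R2 ",
  " 1    154834183      rs1218582    0.0929391 ",
  " 1    154834183      rs1218582      0.10075 "
), padded)

# The same rows, tab-separated (like Type2.txt).
tabbed <- tempfile(fileext = ".txt")
writeLines(c(
  "CHR_A\tBP_A\tSNP_A\tR2",
  "1\t154834183\trs1218582\t0.0929391",
  "1\t154834183\trs1218582\t0.10075"
), tabbed)

a <- fread(padded, strip.white = TRUE)  # TRUE is already the default
b <- fread(tabbed)

# Both layouts should parse to the same table.
all.equal(a, b)
```

Passing strip.white explicitly is only for readability here; since it defaults to TRUE, a plain fread(padded) should behave the same way.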


Types are recognised correctly as well.

sapply(ans, class)
#       CHR_A        BP_A       SNP_A       CHR_B        BP_B       SNP_B          R2 
#   "integer"   "integer" "character"   "integer"   "integer" "character"   "numeric" 



Answer 2:


I don't think fread has that ability natively. The system command option would work, however, and the extra copying cost is usually well worth it:

fread("powershell \"cat Type1.txt | % { $_ -replace ' +',',' } | % { $_ -replace '^,|,$','' }\"")
#   CHR_A      BP_A     SNP_A CHR_B      BP_B       SNP_B        R2
#1:     1 154834183 rs1218582     1 154794318   rs9970364 0.0929391
#2:     1 154834183 rs1218582     1 154795033  rs56744813 0.1007500
#3:     1 154834183 rs1218582     1 154797272  rs16836414 0.1064550
#4:     1 154834183 rs1218582     1 154798550 rs200576863 0.0916789
#5:     1 154834183 rs1218582     1 154802379  rs11264270 0.1769110



Answer 3:


You could try the readr package. It is available on CRAN or on GitHub.

Read the vignettes to see whether it will help you. I find it reads most CSVs correctly, including dates, with no need to specify stringsAsFactors = FALSE.

But do read the comparison with fread().



Source: https://stackoverflow.com/questions/31069439/fast-read-different-type-of-data-with-same-command-better-seperator-guessing
