data.table::fread doesn't like missing values in first column

天涯浪人 2020-12-20 19:09

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?

Consider this trivial example where I have a table of values,

1 Answer
  • 2020-12-20 19:38

    I believe this is the same bug that I reported here.

    The most recent version that I know will work with this type of input is Rev. 1180. You could check out and build that version by adding @1180 to the end of the svn checkout command.

    svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/@1180
    

    If you're not familiar with checking out and building packages, see here

    But a lot of great features, bug fixes, and enhancements have been implemented since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So a better solution is to check out the current development version, replace its R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then rebuild the package.

    You can find those files online without checking them out here (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):

    fread.R:
    http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable

    fread.c:
    http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
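
The replace-and-rebuild steps described above could be sketched as a shell script. This is only a rough outline under several assumptions that are not part of the answer: `svn`, `curl` and `R` are on the PATH, the R-Forge repository is still reachable, and the package source sits in the `pkg/` subdirectory of the checkout (suggested by the URLs above, but not verified). The `DRY_RUN` switch and the function names are hypothetical; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: rebuild the development data.table with the older fread sources.
# Assumptions: svn, curl and R are installed; the package lives in pkg/.

REPO='svn://svn.r-forge.r-project.org/svnroot/datatable/'
VIEWVC='http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg'
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rebuild_with_old_fread() {
    # 1. Check out the current development version into ./datatable
    run svn checkout "$REPO" datatable &&
    # 2. Overwrite the two fread sources with the older revisions
    run curl -fsSL -o datatable/pkg/R/fread.R \
        "$VIEWVC/R/fread.R?revision=988&root=datatable" &&
    run curl -fsSL -o datatable/pkg/src/fread.c \
        "$VIEWVC/src/fread.c?revision=1159&root=datatable" &&
    # 3. Build and install the patched package from the source directory
    run R CMD INSTALL datatable/pkg
}

rebuild_with_old_fread
```

Running it with `DRY_RUN=0` executes the commands instead of printing them. The URLs are quoted so the shell does not try to expand the `*` characters.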

    Once you've rebuilt the package, you'll be able to read your tsv file.

    > fread("12\t876\t19\n23\t39\t\n\t15\t20")
       V1  V2 V3
    1: 12 876 19
    2: 23  39 NA
    3: NA  15 20
    

    The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.

    > fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
    Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") : 
      Not positioned correctly after testing format of header row. ch=','
    

    With newer versions of fread, you would get this

    > fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
          A       B       C
    1:  1.2 Foo"Bar a"b"c"d
    2: fo"o     bar   b,az"
    

    So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.
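
To decide which case applies to a given file, a quick scan before calling fread() may help. The helper below is a hypothetical sketch, not part of data.table: it counts lines whose first field is empty and lines where a field contains a double quote other than a surrounding pair. It splits naively on the separator, so a quoted field that itself contains the separator is miscounted; treat the numbers as a rough hint only.

```shell
#!/bin/sh
# Hypothetical helper: report which of the two problem cases a file contains.
#   $1 = file to scan, $2 = field separator (default: tab)
check_fread_cases() {
    file=$1
    sep=${2:-$(printf '\t')}
    awk -F"$sep" '
        # Non-blank line whose first field is empty (missing first-column value)
        NF > 0 && $1 == "" { empty_first++ }
        {
            for (i = 1; i <= NF; i++) {
                f = $i
                # Drop one surrounding pair of quotes, if present
                if (f ~ /^".*"$/) f = substr(f, 2, length(f) - 2)
                # Any quote left over sits in the middle of the field
                if (index(f, "\"") > 0) { quoted++; break }
            }
        }
        END {
            printf "empty_first_column=%d embedded_quotes=%d\n", empty_first, quoted
        }
    ' "$file"
}

# Usage with the tab-separated input from the example above
tmp=$(mktemp)
printf '12\t876\t19\n23\t39\t\n\t15\t20\n' > "$tmp"
check_fread_cases "$tmp"    # prints: empty_first_column=1 embedded_quotes=0
rm -f "$tmp"
```

For a comma-separated file, pass the separator as the second argument, e.g. `check_fread_cases data.csv ,` (the file name here is only a placeholder).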
