R - readRDS() & load() fail to give identical data.tables as the original

野趣味 2021-02-13 13:44

Background

I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that wi

4 Answers
  • 2021-02-13 14:21

    Probably, this has to do with pointers:

    > attributes(aDT)
    $names
    [1] "a" "b"
    
    $row.names
     [1]  1  2  3  4  5  6  7  8  9 10
    
    $class
    [1] "data.table" "data.frame"
    
    $.internal.selfref
    <pointer: 0x0000000000390788>
    
    > attributes(bDT)
    $names
    [1] "a" "b"
    
    $row.names
     [1]  1  2  3  4  5  6  7  8  9 10
    
    $class
    [1] "data.table" "data.frame"
    
    $.internal.selfref
    <pointer: (nil)>
    
    > attributes(bDF)
    $names
    [1] "a" "b"
    
    $row.names
     [1]  1  2  3  4  5  6  7  8  9 10
    
    $class
    [1] "data.frame"
    
    > attributes(aDF)
    $names
    [1] "a" "b"
    
    $row.names
     [1]  1  2  3  4  5  6  7  8  9 10
    
    $class
    [1] "data.frame"
    

    You can take a closer look at what's going on with the .Internal(inspect(.)) command:

    .Internal(inspect(aDT))
    
     .Internal(inspect(bDT))
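
The attribute listings above can be reproduced with a short round trip. This is a minimal sketch, not the original poster's script: the temporary file path is an assumption, and `truelength()` is used here only as an extra hint that the reloaded table has lost its over-allocation.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])

# Assumed location: a throwaway file in the session's temp directory
path <- tempfile(fileext = ".rds")
saveRDS(aDT, path)
bDT <- readRDS(path)

# The self-reference pointer of the original and of the reloaded table
ptr_a <- attr(aDT, ".internal.selfref")
ptr_b <- attr(bDT, ".internal.selfref")
print(ptr_a)
print(ptr_b)

# Allocated length of the column pointer vector for each table
print(truelength(aDT))
print(truelength(bDT))
```

If the pointer of `bDT` prints as nil, the difference reported by `identical()` comes from this attribute rather than from the column data.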
    
  • 2021-02-13 14:24

    The solution is to call setDT() after load() or readRDS():

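    load("aDT2.RData")   # restores aDT2; a file written by save() must be read with load(), not readRDS()
    setDT(aDT2)          # re-establishes the internal self reference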
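
For the readRDS() case, the same fix can be wrapped in a small reader function. This is only a sketch: `read_rds_dt` is a hypothetical helper name, not part of data.table, and the temporary file path is an assumption.

```r
library(data.table)

# Hypothetical helper: read an .rds file and, if it holds a data.table,
# repair its internal self reference before returning it.
read_rds_dt <- function(path) {
  x <- readRDS(path)
  if (is.data.table(x)) setDT(x)  # returns early if the self reference is already valid
  x
}

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
path <- tempfile(fileext = ".rds")  # assumed throwaway location
saveRDS(aDT, path)

bDT <- read_rds_dt(path)
same <- identical(aDT, bDT)
print(same)
```

Putting the setDT() call inside the reader means that callers replacing CSV intermediates with rds files do not have to remember the extra step.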
    

    source: Adding new columns to a data.table by-reference within a function not always working

  • 2021-02-13 14:30

    The newly loaded data.table doesn't know the pointer value of the already loaded one. You could tell it with

    attributes(bDT)$.internal.selfref <- attributes(aDT)$.internal.selfref
    identical( aDT, bDT, ignore.environment = T )
    # [1] TRUE
    

    A data.frame doesn't carry this attribute, probably because data frames are not modified in place.
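
Since a data.frame has no self-reference attribute, one way to compare only the contents is to convert both tables first. This is a sketch under the assumption that the pointer is the only thing that differs; the temporary file path is also an assumption.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
path <- tempfile(fileext = ".rds")  # assumed throwaway location
saveRDS(aDT, path)
bDT <- readRDS(path)

# as.data.frame() returns a copy without .internal.selfref,
# so this comparison looks at names, classes and column data only
same_as_df <- identical(as.data.frame(aDT), as.data.frame(bDT))
print(same_as_df)
```

This leaves both tables untouched, which may be preferable to copying a pointer from one object to another.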

  • 2021-02-13 14:37

    I happened to find a way that resolves the issue (disclaimer: it's rather inelegant, but it works!): adding and then deleting a dummy column in the loaded data.table makes identical() return TRUE. I have also successfully replaced the csv intermediate files with rds files in my own code.

    To be honest, I don't understand enough of the inner workings of R or data.table to know why it works, so any explanations and/or more elegant solutions would be welcome.

    library( data.table )
    
    # Round trip through saveRDS() / readRDS()
    aDT <- data.table( a=1:10, b=LETTERS[1:10] )
    saveRDS( aDT, file = "aDT.rds")
    bDT <- readRDS( file = "aDT.rds" )
    identical( aDT, bDT, ignore.environment = TRUE )  # Gives FALSE
    
    # Add, then delete, a dummy column in the loaded table
    bDT[ , aaa := NA ]; bDT[ , aaa := NULL ]
    identical( aDT, bDT, ignore.environment = TRUE )  # Now gives TRUE
    
    
    # Using the add-del-col 'trick' works with save() / load() too
    aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
    save( aDT2, file = "aDT2.RData")
    bDT2 <- aDT2; rm( aDT2 )   # keep the original under another name
    load( file = "aDT2.RData" )
    identical( aDT2, bDT2, ignore.environment = TRUE )  # Gives FALSE
    
    aDT2[ , aaa := NA ]; aDT2[ , aaa := NULL ]
    identical( aDT2, bDT2, ignore.environment = TRUE )  # Now gives TRUE
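
A plausible explanation for the trick, offered as an assumption rather than a confirmed account: `:=` notices that the loaded table's self reference is invalid and re-allocates the table, which resets the pointer as a side effect. If that is right, calling `setalloccol()` directly should have the same effect without a dummy column. The sketch below tries both routes on two separately loaded copies; the temporary file path is an assumption.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
path <- tempfile(fileext = ".rds")  # assumed throwaway location
saveRDS(aDT, path)

# Route 1: the add-then-delete trick from the answer above
viaTrick <- readRDS(path)
viaTrick[, aaa := NA]
viaTrick[, aaa := NULL]

# Route 2: re-allocate the column pointer vector explicitly
viaAlloc <- readRDS(path)
setalloccol(viaAlloc)

trick_ok <- identical(aDT, viaTrick)
alloc_ok <- identical(aDT, viaAlloc)
print(c(trick = trick_ok, alloc = alloc_ok))
```

If both comparisons agree, the dummy column itself is incidental and the re-allocation is what matters.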
    