How to delete a row by reference in data.table?

南方客 2020-11-22 16:07

My question is related to assignment by reference versus copying in data.table. I want to know if one can delete rows by reference, similar to

         


        
6 answers
  • 2020-11-22 16:37

    Good question. data.table can't delete rows by reference yet.

    data.table can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert and delete. A row delete would use memmove in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row-store database (a typical SQL database, for example), which is better suited to fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.

    On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; e.g., a growing time series.


    It's filed as an issue: Delete rows by reference.
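
    The column over-allocation mentioned above can be observed directly. This is a minimal sketch, assuming the data.table helpers truelength() and address(); the table and column names are illustrative only:

    ```r
    library(data.table)

    DT <- data.table(a = 1:3, b = 4:6)

    # length() counts the columns in use; truelength() counts the allocated
    # column-pointer slots, which is larger because of over-allocation.
    cols_in_use <- length(DT)
    slots_allocated <- truelength(DT)

    addr_before <- address(DT)
    DT[, c := 7:9]      # add a column by reference
    DT[, b := NULL]     # delete a column by reference

    # The table object itself was not reallocated by either operation.
    same_address <- identical(address(DT), addr_before)
    ```

    No equivalent spare capacity exists for rows in this sketch, which is why removing rows currently means building a new table.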

  • 2020-11-22 16:45

    The approach I have taken to keep memory use close to that of in-place deletion is to subset one column at a time and delete the original column as I go. It is not as fast as a proper C memmove solution, but memory use is all I care about here. Something like this:

    DT = data.table(col1 = 1:1e6)
    cols = paste0('col', 2:100)
    for (col in cols){ DT[, (col) := 1:1e6] }
    keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
    DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
    for (col in cols){
      DT.subset[, (col) := DT[[col]][keep.idxs]]
      DT[, (col) := NULL] #delete
    }
    
  • 2020-11-22 16:45

    Here is a working function based on @vc273's answer and @Frank's feedback.

    delete <- function(DT, del.idxs) {             # please note 'del.idxs' vs. 'keep.idxs'
      keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)  # row indexes to keep
      cols <- names(DT)
      DT.subset <- data.table(DT[[1]][keep.idxs])  # this is the subsetted table
      setnames(DT.subset, cols[1])
      for (col in cols[-1]) {                      # cols[-1] is empty for a one-column table
        DT.subset[, (col) := DT[[col]][keep.idxs]]
        DT[, (col) := NULL]                        # delete the original column to free memory
      }
      return(DT.subset)
    }
    

    An example of its usage:

    dat <- delete(dat,del.idxs)   ## Pls note 'del.idxs' instead of 'keep.idxs'
    

    Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

    > dim(dat)
    [1] 1419393      25
    > system.time(dat <- delete(dat,del.idxs))
       user  system elapsed 
       0.23    0.02    0.25 
    > dim(dat)
    [1] 1404715      25
    > 
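
    For readers without the "dat" table above, here is a small self-contained check of the same column-at-a-time idea. It is only a sketch: delete_rows and the toy table are hypothetical names made up for illustration, and the result is compared against ordinary (copying) negative indexing.

    ```r
    library(data.table)

    # Compact copy of the column-at-a-time approach (hypothetical name).
    delete_rows <- function(DT, del.idxs) {
      keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)
      cols <- names(DT)
      DT.subset <- data.table(DT[[1]][keep.idxs])
      setnames(DT.subset, cols[1])
      for (col in cols[-1]) {
        DT.subset[, (col) := DT[[col]][keep.idxs]]
        DT[, (col) := NULL]  # drop the old column before copying the next one
      }
      DT.subset
    }

    dat <- data.table(id = 1:10, x = letters[1:10], y = (1:10) / 2)
    expected <- dat[-c(2, 5, 9)]        # ordinary row removal, makes a full copy
    dat <- delete_rows(dat, c(2, 5, 9))

    rows_left <- nrow(dat)
    same_ids <- identical(dat$id, expected$id)
    ```

    Note that the function strips all but the first column from the table passed in, so any other variable still pointing at the original table will see it emptied out.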
    

    PS. Since I am new to SO, I could not add a comment to @vc273's thread :-(

  • 2020-11-22 16:50

    Here are some strategies I have used. I believe a .ROW function may be coming. None of the approaches below are fast; they go a little beyond plain subsetting or filtering. I tried to think like a DBA who just wants to clean up data. As noted above, you can select or remove rows in data.table:

    data(iris)
    iris <- data.table(iris)
    
    iris[3] # Select row three
    
    iris[-3] # Remove row three
    
    You can also use .SD to select or remove rows:
    
    iris[,.SD[3]] # Select row three
    
    iris[, .SD[3:6], by = .(Species)] # Select rows 3 - 6 for each Species
    
    iris[,.SD[-3]] # Remove row three
    
    iris[, .SD[-(3:6)], by = .(Species)] # Remove rows 3 - 6 for each Species
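
    Since the .SD forms above are not fast, one commonly used alternative is to collect the row numbers per group with .I and subset once. This is a sketch, not part of the original answer; dt, keep and res are illustrative names:

    ```r
    library(data.table)

    dt <- as.data.table(iris)

    # Within each Species, .I holds the row numbers of that group in dt;
    # dropping positions 3 to 6 leaves the rows to keep.
    keep <- dt[, .I[-(3:6)], by = Species]$V1

    res <- dt[keep]     # still a copy, but a single subset of the whole table
    n_rows <- nrow(res)
    ```

    This keeps the original column order, whereas the .SD version with by moves Species to the front.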
    

    Note: .SD creates a subset of the original data and lets you do quite a bit of work in j or in a subsequent data.table call. See https://stackoverflow.com/a/47406952/305675. Here I order the irises by Sepal.Length (descending), keep the rows with Sepal.Length above a specified minimum, select the top three rows (by Sepal.Length) for each Species, and return all accompanying data:

    iris[order(-Sepal.Length)][Sepal.Length > 3, .SD[1:3], by = .(Species)]
    

    The approaches above all reorder a data.table sequentially when removing rows. You can transpose a data.table and remove or replace the old rows which are now transposed columns. When using ':=NULL' to remove a transposed row, the subsequent column name is removed as well:

    m_iris <- data.table(t(iris))[,V3:=NULL] # V3 column removed
    
    d_iris <- data.table(t(iris))[,V3:=V2] # V3 column replaced with V2
    

    When you transpose back to the original orientation, you may want to restore the column names from the original data.table and, in the case of deletion, the column classes as well. Transposing goes through a character matrix, so every column of the result comes back as character.

    d_iris <- data.table(t(d_iris))   # transpose back: 150 rows, 5 columns
    setnames(d_iris, names(iris))
    
    m_iris <- data.table(t(m_iris))   # transpose back: 149 rows, 5 columns
    setnames(m_iris, names(iris))
    

    You may just want to remove duplicate rows which you can do with or without a Key:

    d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]     
    
    d_iris[!duplicated(Key),]
    
    d_iris[!duplicated(paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)),]  
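
    As an alternative to building a pasted key, data.table's own unique() and duplicated() methods compare whole rows (or the columns named in by). A minimal sketch, with dt, dedup and dup_rows as illustrative names:

    ```r
    library(data.table)

    dt <- as.data.table(iris)

    dedup <- unique(dt)               # keep the first occurrence of each row
    dup_rows <- dt[duplicated(dt)]    # the rows that unique() dropped

    # Restrict the comparison to some columns with 'by':
    dedup_by <- unique(dt, by = c("Species", "Sepal.Length"))
    ```

    This also avoids a weakness of paste0 without a separator, where different rows can in principle paste to the same string.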
    

    It is also possible to add an incremental counter with '.I'. You can then search for duplicated keys or fields and remove them by removing the record with the counter. This is computationally expensive, but has some advantages since you can print the lines to be removed.

    d_iris[,I:=.I,] # add a counter field
    
    d_iris[,Key:=paste0(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species)]
    
    for(i in d_iris[duplicated(Key),I]) {print(i)} # See lines with duplicated Key or Field
    
    for(i in d_iris[duplicated(Key),I]) {d_iris <- d_iris[!I == i,]} # Remove lines with duplicated Key or any particular field.
    

    You can also just fill a row with 0s or NAs and then use an i query to delete them:

     X 
       x v foo
    1: c 8   4
    2: b 7   2
    
    X[1] <- c(0)
    
    X
       x v foo
    1: 0 0   0
    2: b 7   2
    
    X[2] <- c(NA)
    X
        x  v foo
    1:  0  0   0
    2: NA NA  NA
    
    X <- X[x != 0,]
    X <- X[!is.na(x),]
    
  • 2020-11-22 16:53

    Instead of trying to set to NULL, try setting to NA (matching the NA type of each column, e.g. NA_character_ for a character first column):

    set(DT,1:2, 1:3 ,NA_character_)
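
    A slightly fuller sketch of this idea, assuming a toy table with one character and one integer column (all names are illustrative). Each column gets the NA of its own type, and the marked rows are filtered out afterwards:

    ```r
    library(data.table)

    DT <- data.table(x = c("a", "b", "c", "d"), v = 1:4)
    addr_before <- address(DT)

    # Mark rows 1 and 2 as deleted, by reference, one column at a time.
    set(DT, i = 1:2, j = "x", value = NA_character_)
    set(DT, i = 1:2, j = "v", value = NA_integer_)

    marked_in_place <- identical(address(DT), addr_before)

    # Actually dropping the marked rows still creates a new table.
    DT2 <- DT[!is.na(x)]
    ```

    The marking step works by reference; only the final filter copies, so it can be postponed until many rows have been marked.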
    
  • 2020-11-22 16:56

    This topic is still interesting to many people (me included).

    What about this? I took the code described previously and used assign to replace the variable in globalenv. It would be better to capture the original environment, but at least in globalenv it is memory efficient and acts like a change by reference.

    delete <- function(DT, del.idxs) 
    { 
      # name of the variable the caller passed in, e.g. "DT"
      varname <- deparse(substitute(DT))
    
      keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)
      cols <- names(DT)
      DT.subset <- data.table(DT[[1]][keep.idxs])
      setnames(DT.subset, cols[1])
    
      for (col in cols[-1])   # cols[-1] is empty for a one-column table
      {
        DT.subset[, (col) := DT[[col]][keep.idxs]]
        DT[, (col) := NULL]   # delete
      }
    
      # rebind the caller's variable name (only correct if it lives in globalenv)
      assign(varname, DT.subset, envir = globalenv())
      return(invisible())
    }
    
    DT = data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
    delete(DT, 3)
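
    On capturing the original environment: one possible variant uses parent.frame() so the variable is rebound wherever the function was called from, not only in globalenv. This is a sketch under that assumption; delete_in_caller and the wrapper f are hypothetical names:

    ```r
    library(data.table)

    delete_in_caller <- function(DT, del.idxs) {
      envir <- parent.frame()                 # environment of the caller
      varname <- deparse(substitute(DT))      # name of the caller's variable

      keep.idxs <- setdiff(seq_len(nrow(DT)), del.idxs)
      cols <- names(DT)
      DT.subset <- data.table(DT[[1]][keep.idxs])
      setnames(DT.subset, cols[1])
      for (col in cols[-1]) {
        DT.subset[, (col) := DT[[col]][keep.idxs]]
        DT[, (col) := NULL]
      }

      assign(varname, DT.subset, envir = envir)  # rebind in the caller
      invisible(NULL)
    }

    # Works for a table that only exists inside another function.
    f <- function() {
      local_dt <- data.table(x = 1:5, y = letters[1:5])
      delete_in_caller(local_dt, c(1, 3))
      local_dt
    }
    res <- f()
    ```

    Like the globalenv version, this only works when the first argument is a plain variable name; an expression such as a list element would not deparse to something assign can rebind.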
    