Efficiently selecting the top N rows for each unique value of a column in a data.frame

Asked by 太阳男子 on 2021-01-05 04:00 · 2 answers · 1615 views

I am trying to take a subset of a data frame based on the occurrence of a value. This is best explained with the example given below. This question is closely related to: Sel

2 Answers
  • 2021-01-05 04:54

    Two solutions spring to mind. plyr::ddply is designed for exactly this, but using a data.table will be waaaaaay faster.

    You want to split a data.frame into chunks by Product, remove the earliest 25% of rows from each chunk (after sorting by Date), and recombine the chunks into a data.frame. This can be accomplished in one simple line...

    require( plyr )
    ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #  Product Something       Date
    #1 1000001    100005 2011-01-01
    #2 1000001    100002 2011-01-02
    #3 1000001    100006 2011-01-02
    #4 1000001    100004 2011-01-04
    #5 1000002    100007 2011-01-01
    #6 1000002    100003 2011-01-04
    #7 1000003    100002 2011-01-02
    #8 1000003    100008 2011-01-04
    
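    For comparison, the same split-apply-combine step can be written in base R with split(), lapply() and do.call() (a sketch using a small stand-in for the question's input data.frame, since the original data isn't reproduced here):

```r
# Hypothetical stand-in for the question's 'input' data.frame
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)

# Sort by Product and Date, split into per-Product chunks,
# drop the earliest 25% of rows in each chunk, and recombine
input  <- input[order(input$Product, input$Date), ]
result <- do.call(rbind, lapply(split(input, input$Product),
  function(x) x[-seq_len(ceiling(nrow(x) * 0.25)), ]))
result  # 3 of 4 rows survive per Product
```

    This avoids the plyr dependency at the cost of readability; for large data the data.table approaches below are still the faster option.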

    data.table solution

    For data.table you will need the latest development version from R-Forge (because negative subscripts are not yet implemented in the CRAN version of data.table). Make sure you use the following install.packages call to get the latest version...

    install.packages( "data.table" , repos="http://r-forge.r-project.org" )
    require( data.table )
    DT <- data.table( input )
    
    #  Sort by Product then Date very quickly
    setkeyv( DT , c( "Product" , "Date" ) )
    
    #  Return the bottom 75% of rows (i.e. not the earliest)
    DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] 
    #   Product Something       Date
    #1: 1000001    100005 2011-01-01
    #2: 1000001    100002 2011-01-02
    #3: 1000001    100006 2011-01-02
    #4: 1000001    100004 2011-01-04
    #5: 1000002    100007 2011-01-01
    #6: 1000002    100003 2011-01-04
    #7: 1000003    100002 2011-01-02
    #8: 1000003    100008 2011-01-04
    

    A better way to use data.table

    You could more easily do this (so you don't require the development version of data.table)...

    DT[ ,  .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ] 
    

    And you can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with up to 9e5 products (groups)...

    set.seed(1)
    Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
    dates <- sample( 1:20 , 2e6 , repl = TRUE )
    Date <- as.Date( Sys.Date() + dates )
    DT <- data.table( Product = Product , Date = Date )
    
    system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` ,  -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
    #   user  system elapsed 
    #  14.65    0.03   14.74 
    

    Update: The best way to use data.table!

    So thanks to @Arun (who is now an author of the data.table package) we now have the best way to use data.table, which is to use .I, an integer vector of the row indices within each group. We subset it in j, removing the first 25% of records with -(1:ceiling(.N*.25)), and then subset the whole table with the resulting row indices. This is ~4-5 times faster than my .SD method above. Amazing stuff!

    system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
       user  system elapsed 
       3.02    0.00    3.03
    
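    For readers on current CRAN packages, the same per-group trimming can also be expressed with dplyr (a sketch, assuming dplyr is installed; a small stand-in is used for the question's input data.frame):

```r
library(dplyr)

# Hypothetical stand-in for the question's 'input' data.frame
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)

# Keep the latest 75% of rows per Product, i.e. drop the earliest 25%
result <- input %>%
  arrange(Product, Date) %>%
  group_by(Product) %>%
  filter(row_number() > ceiling(n() * 0.25)) %>%
  ungroup()
```

    The grouped filter with row_number() mirrors the -(1:ceiling(.N*.25)) subscript in the data.table version.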
  • 2021-01-05 04:54

    Here is a way using mapply and your input and table_input:

        #your code
        #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
        #colnames(input) <- c( "Product" , "Something" ,"Date")
        #input <- as.data.frame(input)
        #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
    
        #Sort based on date, I want to leave out the entries with the oldest dates.
        #input <- input[ with( input, order(Date)), ]
    
        #Create number of items I want to select
        #table_input <- as.data.frame(table(input$Product))
        #table_input$twentyfive <- ceiling( table_input$Freq*0.25  )
    
        #function to "mapply" on "table_input"
        fun = function(p, d) { grep(p, input$Product)[1:d] }
    
        #subset "input"
        input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
    
           Product Something       Date
        1  1000001    100001 2011-01-01
        3  1000001    100003 2011-01-01
        7  1000002    100002 2011-01-01
        11 1000003    100003 2011-01-01
    
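    One caveat with the grep-based helper above: grep() treats its first argument as a regular expression and matches substrings, so a Product id that is a prefix of another (e.g. 100000 inside 1000001) could over-select. An exact-match variant (a sketch; it builds its own small input and table_input, since the question's data isn't reproduced here):

```r
# Hypothetical stand-ins for 'input' and 'table_input'
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)

# which(... == p) matches whole values, unlike grep's substring match
fun_exact <- function(p, d) which(input$Product == p)[seq_len(d)]
result <- input[unlist(mapply(fun_exact, as.character(table_input$Var1),
                              table_input$twentyfive)), ]
result  # first 25% of rows per Product, as in the mapply answer
```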

    I also called system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:

        #SimonO101's code
        #require( plyr )
        #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
        #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
        #require( data.table )
        #DT <- data.table( input )
        #setkeyv( DT , c( "Product" , "Date" ) )
        #DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ]
    
        > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
           user  system elapsed 
           5.29    0.00    5.29 
        > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
          user  system elapsed 
          43.48    0.03   44.04 
        > system.time(replicate(10000,  DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] ))                        
          user  system elapsed 
          34.30    0.01   34.50 
    

    BUT: SimonO101's alternatives do not produce the same result as mapply, because I used mapply with the table_input you posted; I don't know whether this plays any role in the comparison. Also, the comparison may have been poorly set up by me. I only did it because of the speed issue you mentioned. I'd really like @SimonO101 to look at this in case I'm talking nonsense.
