Efficiently selecting the top N rows for each unique value of a column in a data.frame

Asked by 太阳男子 on 2021-01-05 04:00 · 2 answers · 1615 views

I am trying to take a subset of a data frame based on the occurrence of a value. This is best explained with the example given below. This question is closely related to: Sel

2 Answers
  • 2021-01-05 04:54

    Two solutions spring to mind. plyr::ddply is designed for exactly this, but using a data.table will be waaaaaay faster.

    You want to split a data.frame into chunks by Product, remove the earliest 25% of rows from each chunk (after sorting by Date), and recombine the chunks into a data.frame. This can be accomplished in one simple line...

    require( plyr )
    ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #  Product Something       Date
    #1 1000001    100005 2011-01-01
    #2 1000001    100002 2011-01-02
    #3 1000001    100006 2011-01-02
    #4 1000001    100004 2011-01-04
    #5 1000002    100007 2011-01-01
    #6 1000002    100003 2011-01-04
    #7 1000003    100002 2011-01-02
    #8 1000003    100008 2011-01-04
    
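    For comparison, the same split-apply-combine step can be written in base R with split(), lapply() and do.call() (a sketch using a small stand-in for the question's input data.frame, since the original data isn't reproduced here):

```r
# Hypothetical stand-in for the question's 'input' data.frame
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)

# Sort by Product and Date, split into per-Product chunks,
# drop the earliest 25% of rows in each chunk, and recombine
input  <- input[order(input$Product, input$Date), ]
result <- do.call(rbind, lapply(split(input, input$Product),
  function(x) x[-seq_len(ceiling(nrow(x) * 0.25)), ]))
result  # 3 of 4 rows survive per Product
```

    This avoids the plyr dependency at the cost of readability; for large data the data.table approaches below are still the faster option.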

    data.table solution

    For data.table you will need the latest development version from R-Forge (because negative subscripts are not yet implemented in the CRAN version of data.table). Make sure you use the following install.packages call to get the latest version...

    install.packages( "data.table" , repos="http://r-forge.r-project.org" )
    require( data.table )
    DT <- data.table( input )
    
    #  Sort by Product then Date very quickly
    setkeyv( DT , c( "Product" , "Date" ) )
    
    #  Return the bottom 75% of rows (i.e. not the earliest)
    DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] 
    #   Product Something       Date
    #1: 1000001    100005 2011-01-01
    #2: 1000001    100002 2011-01-02
    #3: 1000001    100006 2011-01-02
    #4: 1000001    100004 2011-01-04
    #5: 1000002    100007 2011-01-01
    #6: 1000002    100003 2011-01-04
    #7: 1000003    100002 2011-01-02
    #8: 1000003    100008 2011-01-04
    

    A better way to use data.table

    You could more easily do this (so you don't require the development version of data.table)...

    DT[ ,  .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ] 
    

    And you can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with up to 9e5 products (groups)...

    set.seed(1)
    Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
    dates <- sample( 1:20 , 2e6 , repl = TRUE )
    Date <- as.Date( Sys.Date() + dates )
    DT <- data.table( Product = Product , Date = Date )
    
    system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` ,  -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
    #   user  system elapsed 
    #  14.65    0.03   14.74 
    

    Update: The best way to use data.table!

    So thanks to @Arun (who is now an author of the data.table package) we now have the best way to use data.table, which is to use .I, an integer vector of the row indices within each group. We subset it in j, removing the first 25% of records with -(1:ceiling(.N*.25)), and then subset the whole table with the resulting row indices. This is ~4-5 times faster than my .SD method above. Amazing stuff!

    system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
       user  system elapsed 
       3.02    0.00    3.03
    
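    For readers on current CRAN packages, the same per-group trimming can also be expressed with dplyr (a sketch, assuming dplyr is installed; a small stand-in is used for the question's input data.frame):

```r
library(dplyr)

# Hypothetical stand-in for the question's 'input' data.frame
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)

# Keep the latest 75% of rows per Product, i.e. drop the earliest 25%
result <- input %>%
  arrange(Product, Date) %>%
  group_by(Product) %>%
  filter(row_number() > ceiling(n() * 0.25)) %>%
  ungroup()
```

    The grouped filter with row_number() mirrors the -(1:ceiling(.N*.25)) subscript in the data.table version.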
  • 2021-01-05 04:54

    Here is a way using mapply and your input and table_input:

        #your code
        #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
        #colnames(input) <- c( "Product" , "Something" ,"Date")
        #input <- as.data.frame(input)
        #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
    
        #Sort based on date, I want to leave out the entries with the oldest dates.
        #input <- input[ with( input, order(Date)), ]
    
        #Create number of items I want to select
        #table_input <- as.data.frame(table(input$Product))
        #table_input$twentyfive <- ceiling( table_input$Freq*0.25  )
    
        #function to "mapply" on "table_input"
        fun = function(p, d) { grep(p, input$Product)[1:d] }
    
        #subset "input"
        input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
    
           Product Something       Date
        1  1000001    100001 2011-01-01
        3  1000001    100003 2011-01-01
        7  1000002    100002 2011-01-01
        11 1000003    100003 2011-01-01
    
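    One caveat with the grep-based helper above: grep() treats its first argument as a regular expression and matches substrings, so a Product id that is a prefix of another (e.g. 100000 inside 1000001) could over-select. An exact-match variant (a sketch; it builds its own small input and table_input, since the question's data isn't reproduced here):

```r
# Hypothetical stand-ins for 'input' and 'table_input'
input <- data.frame(
  Product   = rep(c(1000001, 1000002), each = 4),
  Something = 100001:100008,
  Date      = as.Date("2011-01-01") + rep(0:3, 2)
)
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)

# which(... == p) matches whole values, unlike grep's substring match
fun_exact <- function(p, d) which(input$Product == p)[seq_len(d)]
result <- input[unlist(mapply(fun_exact, as.character(table_input$Var1),
                              table_input$twentyfive)), ]
result  # first 25% of rows per Product, as in the mapply answer
```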

    I also called system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:

        #SimonO101's code
        #require( plyr )
        #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
        #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
        #require( data.table )
        #DT <- data.table( input )
        #setkeyv( DT , c( "Product" , "Date" ) )
        #DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ]
    
        > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
           user  system elapsed 
           5.29    0.00    5.29 
        > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
          user  system elapsed 
          43.48    0.03   44.04 
        > system.time(replicate(10000,  DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] ))                        
          user  system elapsed 
          34.30    0.01   34.50 
    

    BUT: SimonO101's alternatives do not produce the same result as mapply, because I used mapply with the table_input you posted; I don't know whether this plays any role in the comparison. Also, the comparison may have been poorly set up by me. I only did it because of the speed issue you mentioned. I'd really like @SimonO101 to look at this in case I'm talking nonsense.
