Efficiently selecting top number of rows for each unique value of a column in a data.frame

后端未结

关注

 2  1616

太阳男子 2021-01-05 04:00

I am trying to take a subset of a data frame, based on the occurence of a value. This is best explained in an example, given below. This question has a high relation to: Sel

2条回答

栀梦 (楼主)

2021-01-05 04:54

Here is a way using mapply and your input and table_input:

    #your code
    #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
    #colnames(input) <- c( "Product" , "Something" ,"Date")
    #input <- as.data.frame(input)
    #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")

    #Sort based on date, I want to leave out the entries with the oldest dates.
    #input <- input[ with( input, order(Date)), ]

    #Create number of items I want to select
    #table_input <- as.data.frame(table(input$Product))
    #table_input$twentyfive <- ceiling( table_input$Freq*0.25  )

    #function to "mapply" on "table_input"
    fun = function(p, d) { grep(p, input$Product)[1:d] }

    #subset "input"
    input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]

       Product Something       Date
    1  1000001    100001 2011-01-01
    3  1000001    100003 2011-01-01
    7  1000002    100002 2011-01-01
    11 1000003    100003 2011-01-01

I, also, called system.time and replicate to compare speed of mapply and the alternatives from SimonO101's answer:

    #SimonO101's code
    #require( plyr )
    #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
    #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
    #require( data.table )
    #DT <- data.table( input )
    #setkeyv( DT , c( "Product" , "Date" ) )
    #DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ]

    > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
       user  system elapsed 
       5.29    0.00    5.29 
    > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
      user  system elapsed 
      43.48    0.03   44.04 
    > system.time(replicate(10000,  DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] ))                        
      user  system elapsed 
      34.30    0.01   34.50

BUT: SimonO101's alternatives do not produce the same as mapply, becaused I used mapply using the table_input you posted; I don't know if this plays any role in the comparison. Also, the comparison may have been dumbly setted up by me. I just did it because of the speed issue you pointed. I'd, really, want @SimonO101 to see this in case I'm talking nonsense.

0 讨论(0)

查看其它2个回答