Efficiently selecting top number of rows for each unique value of a column in a data.frame

太阳男子 2021-01-05 04:00

I am trying to take a subset of a data frame, based on the occurrence of a value. This is best explained in an example, given below. This question has a high relation to: Sel

2 answers
  •  栀梦 (original poster)
    2021-01-05 04:54

    Here is a way using mapply and your input and table_input:

        #your code
        #input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
        #colnames(input) <- c( "Product" , "Something" ,"Date")
        #input <- as.data.frame(input)
        #input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
    
        #Sort based on date, I want to leave out the entries with the oldest dates.
        #input <- input[ with( input, order(Date)), ]
    
        #Create number of items I want to select
        #table_input <- as.data.frame(table(input$Product))
        #table_input$twentyfive <- ceiling( table_input$Freq*0.25  )
    
        #function to "mapply" over "table_input": row positions of the first d rows whose Product matches p
        fun = function(p, d) { grep(p, input$Product)[1:d] }
    
        #subset "input"
        input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
    
           Product Something       Date
        1  1000001    100001 2011-01-01
        3  1000001    100003 2011-01-01
        7  1000002    100002 2011-01-01
        11 1000003    100003 2011-01-01
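
    A variation on the same selection, in case it is useful: a minimal sketch that compares Product values for equality instead of pattern-matching them with grep, using ave from base R. It rebuilds the input from the commented code above; the names rank_in_group, group_size, keep and result are my own and not part of the original answer.

    ```r
    # Same data as the question's "input", built directly as a data frame.
    input <- data.frame(
      Product   = rep(c("1000001", "1000002", "1000003"), times = c(6, 3, 3)),
      Something = c(100001, 100002, 100003, 100004, 100005, 100006,
                    100002, 100003, 100007, 100002, 100003, 100008),
      Date      = as.Date(rep(c("2011-01-01", "2011-01-02",
                                "2011-01-01", "2011-01-04"), times = 3)),
      stringsAsFactors = FALSE
    )

    # Sort by date; ties keep their original relative order.
    input <- input[order(input$Date), ]

    # Position of each row inside its Product group (1, 2, 3, ...).
    rank_in_group <- ave(seq_len(nrow(input)), input$Product, FUN = seq_along)

    # Size of the group each row belongs to, repeated for every row.
    group_size <- ave(seq_len(nrow(input)), input$Product, FUN = length)

    # Keep the first ceiling(25%) rows of every group.
    keep   <- rank_in_group <= ceiling(group_size * 0.25)
    result <- input[keep, ]
    result
    ```

    On this data it selects rows 1, 3, 7 and 11, the same rows as the mapply call. Because groups are compared for equality, a product ID that happens to be a substring of another one cannot pull in extra rows.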
    

    I also used system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:

        #SimonO101's code
        #require( plyr )
        #ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
        #install.packages( "data.table" , repos="http://r-forge.r-project.org" )
        #require( data.table )
        #DT <- data.table( input )
        #setkeyv( DT , c( "Product" , "Date" ) )
        #DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ]
    
        > system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
           user  system elapsed 
           5.29    0.00    5.29 
        > system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
          user  system elapsed 
          43.48    0.03   44.04 
        > system.time(replicate(10000,  DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] ))                        
          user  system elapsed 
          34.30    0.01   34.50 
    

    However, SimonO101's alternatives do not produce the same result as mapply, because I ran mapply with the table_input you posted; I don't know whether this affects the comparison. The comparison may also have been set up poorly by me. I only did it because of the speed issue you pointed out. I would really like @SimonO101 to see this in case I'm talking nonsense.
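
    One difference can be read directly off the code: grep(p, input$Product)[1:d] keeps the first ceiling(25%) rows of each product, while x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] and tail( .SD , -ceiling( nrow(.SD) * .25 ) ) drop those rows and keep the rest. A minimal sketch on a made-up four-row frame (x, n, first_quarter and without_first are illustrative names only) that could be used to check this before comparing timings:

    ```r
    # Made-up group of four rows, assumed to be already sorted by date.
    x <- data.frame(id = 1:4)
    n <- ceiling(nrow(x) * 0.25)   # 25% of 4 rows, rounded up: 1

    # Keeps the first n rows (what the mapply version selects).
    first_quarter <- x[seq_len(n), , drop = FALSE]

    # Drops the first n rows (what the negative-index and tail() versions select).
    without_first <- x[-seq_len(n), , drop = FALSE]

    first_quarter$id
    without_first$id
    identical(first_quarter, without_first)
    ```

    If the two selections are meant to be equivalent, one side has to be flipped before the timings say anything about speed alone.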
