I am trying to take a subset of a data frame, based on the occurrence of a value. This is best explained with an example, given below. This question is closely related to: Sel
Here is a way using mapply with your input and table_input:
#your code
#input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
#colnames(input) <- c( "Product" , "Something" ,"Date")
#input <- as.data.frame(input)
#input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
#input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
#table_input <- as.data.frame(table(input$Product))
#table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#function to "mapply" on "table_input"
fun = function(p, d) { grep(p, input$Product)[1:d] }
#subset "input"
input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
   Product Something       Date
1  1000001    100001 2011-01-01
3  1000001    100003 2011-01-01
7  1000002    100002 2011-01-01
11 1000003    100003 2011-01-01
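One caveat about fun: grep treats p as a regular expression, so it would also match any Product value that merely contains p. Below is a minimal sketch of an exact-match variant. It assumes the data is rebuilt with explicit column types rather than via the matrix above, and fun_exact is a name I made up for illustration; it is not a claim about how the original call should have been written.

```r
# Rebuild the example data with explicit types (assumption: same values as the question).
input <- data.frame(
  Product   = rep(c("1000001", "1000002", "1000003"), times = c(6, 3, 3)),
  Something = c(100001:100006, 100002, 100003, 100007, 100002, 100003, 100008),
  Date      = as.Date(rep(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04"), 3)),
  stringsAsFactors = FALSE
)

# Sort by date, as in the question.
input <- input[order(input$Date), ]

# Number of rows to select per product (25%, rounded up).
table_input <- as.data.frame(table(input$Product), stringsAsFactors = FALSE)
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)

# Hypothetical variant of fun: exact equality instead of pattern matching,
# and seq_len(d) so that d = 0 would select nothing.
fun_exact <- function(p, d) which(input$Product == p)[seq_len(d)]

idx <- unlist(mapply(fun_exact, table_input$Var1, table_input$twentyfive,
                     SIMPLIFY = FALSE, USE.NAMES = FALSE))
result <- input[idx, ]
print(result)
```

On this data it should select the same four rows as the grep version; the difference only matters when one product ID is a substring of another.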
I also called system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:
#SimonO101's code
#require( plyr )
#ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
#install.packages( "data.table" , repos="http://r-forge.r-project.org" )
#require( data.table )
#DT <- data.table( input )
#setkeyv( DT , c( "Product" , "Date" ) )
#DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
> system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
user system elapsed
5.29 0.00 5.29
> system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
user system elapsed
43.48 0.03 44.04
> system.time(replicate(10000, DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ] ))
user system elapsed
34.30 0.01 34.50
BUT: SimonO101's alternatives do not produce the same result as mapply, because I ran mapply with the table_input you posted; I don't know whether this plays any role in the comparison. The comparison may also have been set up poorly by me. I only ran it because of the speed issue you pointed out. I'd really like @SimonO101 to see this in case I'm talking nonsense.
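Reading the two snippets side by side, the difference seems to be that the mapply call selects the first 25% of rows per product, while the ddply and data.table calls drop them and return the rest. A minimal sketch of how the two could be made comparable, under these assumptions: the data is rebuilt inline with explicit types, the per-group drop is written in base R as a stand-in for the ddply call (not SimonO101's actual code), and first_rows, kept_mapply and kept_split are hypothetical names.

```r
# Rebuild the example data with explicit types (assumption: same values as the question).
input <- data.frame(
  Product   = rep(c("1000001", "1000002", "1000003"), times = c(6, 3, 3)),
  Something = c(100001:100006, 100002, 100003, 100007, 100002, 100003, 100008),
  Date      = as.Date(rep(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04"), 3)),
  stringsAsFactors = FALSE
)
input <- input[order(input$Date), ]

table_input <- as.data.frame(table(input$Product), stringsAsFactors = FALSE)
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)

# Positions of the first d rows of product p (hypothetical helper).
first_rows <- function(p, d) which(input$Product == p)[seq_len(d)]
idx <- unlist(mapply(first_rows, table_input$Var1, table_input$twentyfive,
                     SIMPLIFY = FALSE, USE.NAMES = FALSE))

# mapply route, inverted: drop the selected rows instead of keeping them.
kept_mapply <- input[-idx, ]

# Base R stand-in for the per-group drop: within each product,
# remove the first ceiling(25%) row positions and keep the rest.
keep_pos <- unlist(lapply(split(seq_len(nrow(input)), input$Product),
                          function(i) i[-seq_len(ceiling(length(i) * 0.25))]),
                   use.names = FALSE)
kept_split <- input[sort(keep_pos), ]

# TRUE when both routes keep the same rows in the same order.
print(identical(rownames(kept_mapply), rownames(kept_split)))
```

If that reading is right, a fairer timing would compare input[-idx, ] against the other two calls, since all three would then return the same set of rows (possibly in a different order).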