I am trying to take a subset of a data frame, based on the occurrence of a value. This is best explained in the example given below. This question is closely related to: Sel
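To make the snippets below runnable, here is a sketch reconstructing the example input data frame from the commented-out code in the mapply answer further down (Product and Something are kept numeric here rather than factors, which is all the examples need):

```r
# Example data, rebuilt from the commented code in the mapply answer below;
# Product and Something are numeric for simplicity.
input <- data.frame(
  Product   = rep(c(1000001, 1000002, 1000003), c(6, 3, 3)),
  Something = c(100001, 100002, 100003, 100004, 100005, 100006,
                100002, 100003, 100007, 100002, 100003, 100008),
  Date      = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01",
                        "2011-01-04", "2011-01-01", "2011-01-02",
                        "2011-01-01", "2011-01-04", "2011-01-01",
                        "2011-01-02", "2011-01-01", "2011-01-04"))
)
# Sort by date, as in the original code
input <- input[order(input$Date), ]
```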
Two solutions spring to mind. plyr::ddply is designed for your needs, but using a data.table will be waaaaaay faster.

You want to take a data.frame, split it up into chunks by Product, remove the earliest 25% of rows of each chunk (with each chunk sorted by date), and recombine the results into a data.frame. This can be accomplished in one simple line...
require( plyr )
ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
# Product Something Date
#1 1000001 100005 2011-01-01
#2 1000001 100002 2011-01-02
#3 1000001 100006 2011-01-02
#4 1000001 100004 2011-01-04
#5 1000002 100007 2011-01-01
#6 1000002 100003 2011-01-04
#7 1000003 100002 2011-01-02
#8 1000003 100008 2011-01-04
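For comparison, the same split-apply-combine idea can be sketched in base R with split/lapply/do.call, with no packages needed. The tiny data frame here is a stand-in, not the poster's data:

```r
# Base-R sketch of the split-apply-combine step: split by Product,
# drop the earliest ceiling(25%) rows of each date-sorted chunk,
# then rbind the chunks back together.
input <- data.frame(
  Product = rep(c("A", "B"), c(4, 2)),
  Date    = as.Date("2011-01-01") + c(0, 1, 2, 3, 0, 1)
)
input   <- input[order(input$Product, input$Date), ]
chunks  <- split(input, input$Product)
trimmed <- lapply(chunks, function(x) x[-seq_len(ceiling(nrow(x) * 0.25)), ])
result  <- do.call(rbind, trimmed)
# "A" keeps 3 of its 4 rows, "B" keeps 1 of its 2 rows
```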
data.table solution

For data.table you will need the latest development version from r-forge (because the use of a negative subscript is not yet implemented in the CRAN version of data.table). Make sure you use the repos argument in the following install.packages call to get the latest version...
install.packages( "data.table" , repos="http://r-forge.r-project.org" )
require( data.table )
DT <- data.table( input )
# Sort by Product then Date very quickly
setkeyv( DT , c( "Product" , "Date" ) )
# Keep the latest 75% of rows in each group (i.e. drop the earliest 25%)
DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
# Product Something Date
#1: 1000001 100005 2011-01-01
#2: 1000001 100002 2011-01-02
#3: 1000001 100006 2011-01-02
#4: 1000001 100004 2011-01-04
#5: 1000002 100007 2011-01-01
#6: 1000002 100003 2011-01-04
#7: 1000003 100002 2011-01-02
#8: 1000003 100008 2011-01-04
Alternatively, you can do this more easily (so you don't require the development version of data.table)...
DT[ , .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ]
And you can also use lapply in the j argument (I was worried about my use of .SD), and this runs in ~14 seconds on a data.table of 2e6 rows with up to 9e5 products (groups)...
set.seed(1)
Product <- sample( 1:9e5 , 2e6 , repl = TRUE )
dates <- sample( 1:20 , 2e6 , repl = TRUE )
Date <- as.Date( Sys.Date() + dates )
DT <- data.table( Product = Product , Date = Date )
system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` , -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
# user system elapsed
# 14.65 0.03 14.74
So thanks to @Arun (who is now an author of the data.table package) we now have the best way to use data.table: use .I, which is an integer vector of all the row indices. Subset it in [ by removing the first 25% of records with -(1:ceiling(.N*.25)), then perform a single subset using these row indices to get the final table. This is ~4-5 times faster than using my .SD method above. Amazing stuff!
system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
user system elapsed
3.02 0.00 3.03
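The reason .I helps is that each group only computes row indices, and the one expensive subset happens at the end. The same "collect indices per group, subset once" idea can be sketched in base R with tapply; this is a rough analogue for illustration, not a substitute for the data.table speed:

```r
# Sketch: build the per-group indices of the earliest 25% of rows,
# then do a single subset with the combined index vector.
input <- data.frame(
  Product = rep(c("A", "B"), c(4, 2)),
  Date    = as.Date("2011-01-01") + c(0, 1, 2, 3, 0, 1)
)
input <- input[order(input$Product, input$Date), ]
drop  <- unlist(tapply(seq_len(nrow(input)), input$Product,
                       function(i) i[seq_len(ceiling(length(i) * 0.25))]))
result <- input[-drop, ]   # one subset, analogous to DT[...$V1]
```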
Here is a way using mapply and your input and table_input:
#your code
#input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
#colnames(input) <- c( "Product" , "Something" ,"Date")
#input <- as.data.frame(input)
#input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
#input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
#table_input <- as.data.frame(table(input$Product))
#table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#function to "mapply" on "table_input"
fun = function(p, d) { grep(p, input$Product)[1:d] }
#subset "input"
input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]
Product Something Date
1 1000001 100001 2011-01-01
3 1000001 100003 2011-01-01
7 1000002 100002 2011-01-01
11 1000003 100003 2011-01-01
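One caveat: grep() treats its first argument as a regular-expression pattern, so it can over-match when one product ID is a substring of another (it happens to be fine for the posted data). A small sketch of an exact-match variant using which(), on stand-in data:

```r
# Same mapply approach, but with exact matching via which() instead of
# grep()'s pattern matching; the small data frame is just a stand-in.
input <- data.frame(
  Product   = c("A", "A", "A", "A", "B", "B"),
  Something = 101:106
)
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling(table_input$Freq * 0.25)
fun <- function(p, d) which(input$Product == p)[seq_len(d)]
# rows kept: the first ceiling(25%) of each product's rows
earliest <- input[unlist(mapply(fun, as.character(table_input$Var1),
                                table_input$twentyfive)), ]
```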
I also called system.time and replicate to compare the speed of mapply with the alternatives from SimonO101's answer:
#SimonO101's code
#require( plyr )
#ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
#install.packages( "data.table" , repos="http://r-forge.r-project.org" )
#require( data.table )
#DT <- data.table( input )
#setkeyv( DT , c( "Product" , "Date" ) )
#DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ]
> system.time(replicate(10000, input[unlist(mapply(fun, table_input$Var1, table_input$twentyfive)),]))
user system elapsed
5.29 0.00 5.29
> system.time(replicate(10000, ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )))
user system elapsed
43.48 0.03 44.04
> system.time(replicate(10000, DT[ , tail( .SD , -ceiling( nrow(.SD) * .25 ) ) , by = Product ] ))
user system elapsed
34.30 0.01 34.50
BUT: SimonO101's alternatives do not produce the same result as mapply, because I used mapply with the table_input you posted; I don't know if this plays any role in the comparison. Also, I may have set up the comparison poorly. I only did it because of the speed issue you pointed out. I'd really like @SimonO101 to look at this in case I'm talking nonsense.