I have a performance issue I need help with. Please bear with me for the explanation:
I have a database of known Car Vin# and years (only first 4 lines of ~5,000 sh
You can do this relatively easily with data.table:
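library(data.table)

# vinDB alternates name/vin rows; carFile holds a vin on line 2 of every 4
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins <- carFile[seq(2, nrow(carFile), 4), ]

# Key the lookup table by vin, then join and count matches per name
dt <- data.table(vin.names, vin.vins, key="vin.vins")
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]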
#         vin.names NumTimesFound
#  1:     Ford 2014            15
#  2: Chrysler 1998            10
#  3:       GM 1998             9
#  4:     Ford 1998            11
#  5:   Toyota 2000            12
# ---
# 75:   Toyota 2007             7
# 76: Chrysler 1995             4
# 77:   Toyota 2010             5
# 78:   Toyota 2008             1
# 79:       GM 1997             5
The main thing to understand is that with J(car.vins) we are creating a one-column data.table of the vins to match (J is just shorthand for data.table, so long as you use it within a data.table). By using that data.table inside dt, we are joining the list of vins to the list of cars, because we keyed dt by "vin.vins" in the prior step. The last argument tells data.table to group the joined set by vin.names, and the middle argument says that we want the number of instances, .N, for each group (.N is a special data.table variable).
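To make the join-then-count step concrete, here is a minimal sketch on made-up data. The three-row lookup table and the four vins below are hypothetical and only there for illustration; the second call uses on= to name the join column explicitly, which should give the same counts as the keyed J() form.

```r
library(data.table)

# Hypothetical toy lookup table: one row per known vin, keyed by vin
dt <- data.table(
  vin.names = c("Ford 2014", "GM 1998", "Ford 2014"),
  vin.vins  = c("AAA", "BBB", "CCC"),
  key = "vin.vins"
)

# Hypothetical vins read from the car file (repeats allowed)
car.vins <- c("AAA", "CCC", "BBB", "AAA")

# Keyed join on vin.vins, then count the matched rows per vin.names
res <- dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]

# Same join with the join column spelled out via on=
res2 <- dt[data.table(vin.vins = car.vins),
           list(NumTimesFound = .N), by = vin.names, on = "vin.vins"]

print(res)
```

Here "AAA" appears twice and "CCC" once, and both belong to "Ford 2014", so that group is counted three times. By default, vins with no match in dt are kept and end up grouped under an NA name; adding nomatch = 0L to the call should drop them instead.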
Also, I made some junk data to run this on. In the future, please provide data like this.
set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
cars <- paste(sample(makes, 500, replace=TRUE), sample(years, 500, replace=TRUE))
vins <- replicate(500, paste0(sample(LETTERS, 16), collapse=""))
# Interleave so rows alternate: name, vin, name, vin, ...
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
# Four lines per car with the vin on line 2: junk, vin, junk, junk
carFile <- data.frame(
  c(rep("junk", 1000), sample(vins, 1000, replace=TRUE), rep("junk", 2000))[order(rep(1:1000, 4))]
)
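The order(rep(...)) indexing in the junk data is what interleaves the blocks so that the seq() calls in the answer pick out the right rows. A small base R sketch of the same trick, using hypothetical three-item blocks:

```r
# Two hypothetical blocks of three items each, stored back to back
x <- c("name1", "name2", "name3", "vin1", "vin2", "vin3")

# rep(1:3, 2) labels each item with its record number: 1 2 3 1 2 3.
# order() sorts by that label and leaves ties in their original order,
# so each name lands directly before its vin.
idx <- order(rep(1:3, 2))
interleaved <- x[idx]

# Odd positions recover the names, even positions the vins
odd  <- interleaved[seq(1, length(interleaved), 2)]
even <- interleaved[seq(2, length(interleaved), 2)]

print(interleaved)
```

The same idea with rep(1:1000, 4) produces the four-line records in carFile, which is why the vins are read back with seq(2, nrow(carFile), 4).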