Accelerate performance and speed of string match in R

I have a performance issue I need help with. Please bear with me for the explanation:

I have a database of known Car Vin# and years (only first 4 lines of ~5,000 sh

1 Answer

情歌与酒

2021-01-07 16:12

You can do this relatively easily with data.table:

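# vinDB alternates name/VIN rows; carFile holds a VIN on the 2nd of every 4 rows.
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins <- carFile[seq(2, nrow(carFile), 4), ]

library(data.table)
# Key on the VIN column so that J() joins against it.
dt <- data.table(vin.names, vin.vins, key="vin.vins")
# Join the car VINs to the database, then count matches per car name.
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]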
#         vin.names NumTimesFound
#  1:     Ford 2014            15
#  2: Chrysler 1998            10
#  3:       GM 1998             9
#  4:     Ford 1998            11
#  5:   Toyota 2000            12
# ---                            
# 75:   Toyota 2007             7
# 76: Chrysler 1995             4
# 77:   Toyota 2010             5
# 78:   Toyota 2008             1
# 79:       GM 1997             5

The key point is that J(car.vins) creates a one-column data.table holding the VINs to match (J is shorthand for data.table, and it only works inside a data.table's [ ] call). Passing that table as the first argument to dt[...] joins the car VINs against the database, because dt was keyed by "vin.vins" in the previous step. The last argument groups the joined rows by vin.names, and the middle argument counts the rows in each group with .N (a special data.table variable).
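To see the two steps separately, here is a minimal sketch on toy data. The three-letter "VINs" and the handful of rows are made up for illustration and are not part of the question's data; only the column names are reused from the code above.

```r
library(data.table)

# Hypothetical toy database: three known VINs and the car each belongs to.
dt <- data.table(
  vin.names = c("Ford 2014", "GM 1998", "Ford 2014"),
  vin.vins  = c("AAA", "BBB", "CCC"),
  key = "vin.vins"
)

# Hypothetical VINs read from the car file; "AAA" appears twice.
car.vins <- c("AAA", "CCC", "AAA", "BBB")

# Step 1: the join on its own. One row per element of car.vins,
# with vin.names looked up through the key.
joined <- dt[J(car.vins)]
print(joined)

# Step 2: the same join, followed by a row count per vin.names.
counts <- dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]
print(counts)
```

In this toy case every VIN in car.vins exists in dt. With real data, a VIN missing from the database would be expected to come back with an NA name under the default join settings; passing nomatch = 0L drops such rows instead.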

Also, I made some junk data to run this on. In the future, please provide data like this.

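set.seed(1)  # fixed seed; exact draws still depend on the R version
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
# 500 random "make year" names and 500 random 16-letter stand-in VINs
cars <- paste(sample(makes, 500, replace=TRUE), sample(years, 500, replace=TRUE))
vins <- replicate(500, paste0(sample(LETTERS, 16), collapse=""))
# Interleave names and VINs: name, VIN, name, VIN, ...
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
# One VIN per block of four rows, in the second position
carFile <- 
  data.frame(
    c(rep("junk", 1000), sample(vins, 1000, replace=TRUE), rep("junk", 2000))[order(rep(1:1000, 4))]
  )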
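The interleaving idiom used for the junk data, indexing c(a, b) by order(rep(...)), may not be obvious at first glance. A small sketch on three made-up rows shows how it produces the alternating name/VIN layout that the seq(1, n, 2) and seq(2, n, 2) extraction relies on. The values are hypothetical, and stringsAsFactors = FALSE is an addition here so the column stays character on any R version.

```r
# Hypothetical miniature of the junk data: three names, three VINs.
cars <- c("Ford 2014", "GM 1998", "Toyota 2000")
vins <- c("AAA", "BBB", "CCC")
n <- length(cars)

# rep(1:3, 2) is 1 2 3 1 2 3; order() keeps ties in their original order,
# so the index comes out as 1 4 2 5 3 6 and the two vectors alternate.
idx <- order(rep(seq_len(n), 2))
interleaved <- c(cars, vins)[idx]
print(interleaved)

# Same single-column layout as vinDB, then the odd/even row extraction.
vinDB <- data.frame(interleaved, stringsAsFactors = FALSE)
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins  <- vinDB[seq(2, nrow(vinDB), 2), ]
```

Because vinDB has a single column, subsetting its rows returns a plain vector, which is why vin.names and vin.vins can be passed straight to data.table().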
