Accelerate performance and speed of string match in R

你的背包 2021-01-07 15:40

I have a performance issue I need help with. Please bear with me for the explanation:

I have a database of known Car Vin# and years (only first 4 lines of ~5,000 sh

1 Answer
  • 2021-01-07 16:12

    You can do this relatively easily with data.table:

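    # vinDB alternates name/VIN rows; carFile holds a VIN on every 4th row, starting at row 2.
    # Both are one-column data frames, so `[rows, ]` returns a plain vector.
    vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
    vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
    car.vins <- carFile[seq(2, nrow(carFile), 4), ]
    
    library(data.table)
    # Key the lookup table by VIN so the join below can use a binary search.
    dt <- data.table(vin.names, vin.vins, key="vin.vins")
    # Join the observed VINs to the key, then count rows per make/year.
    dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]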
    #         vin.names NumTimesFound
    #  1:     Ford 2014            15
    #  2: Chrysler 1998            10
    #  3:       GM 1998             9
    #  4:     Ford 1998            11
    #  5:   Toyota 2000            12
    # ---                            
    # 75:   Toyota 2007             7
    # 76: Chrysler 1995             4
    # 77:   Toyota 2010             5
    # 78:   Toyota 2008             1
    # 79:       GM 1997             5    
    

    The main thing to understand is that J(car.vins) creates a one-column data.table holding the VINs to match (J is just shorthand for data.table, as long as you use it inside a data.table's brackets). Passing that data.table as the first argument to dt[...] joins the list of VINs to the list of cars, because we keyed dt by "vin.vins" in the prior step. The last argument, by=vin.names, groups the joined rows by vin.names, and the middle argument asks for the number of rows in each group, .N (a special data.table variable).
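
    To make the join and the grouped count easier to follow, here is a minimal sketch on hand-made data. The three-letter VINs and the object names joined, counts and check are made up for illustration and are not part of the question's data; the base R lines at the end are only a cross-check, not a replacement for the data.table approach.

```r
library(data.table)

# Hypothetical toy lookup table: three known VINs, two of them "Ford 2014".
dt <- data.table(
  vin.names = c("Ford 2014", "GM 1998", "Ford 2014"),
  vin.vins  = c("AAA", "BBB", "CCC"),
  key = "vin.vins"
)

# Hypothetical observed VINs, standing in for the ones pulled from carFile.
car.vins <- c("AAA", "CCC", "BBB", "AAA", "AAA")

# Step 1: the join alone. J(car.vins) is matched against the key column,
# giving one output row per observed VIN.
joined <- dt[J(car.vins)]

# Step 2: the same join plus the grouped count, as in the answer.
counts <- dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]

# Base R cross-check: look up each observed VIN, then tabulate the names.
check <- table(dt$vin.names[match(car.vins, dt$vin.vins)])

print(joined)
print(counts)
print(check)
```

    With this input, "Ford 2014" should be counted 4 times (three hits on AAA, one on CCC) and "GM 1998" once, and the base R table should agree with the data.table result.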

    Also, I made some junk data to run this on. In the future, please provide data like this.

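    set.seed(1)
    makes <- c("Toyota", "Ford", "GM", "Chrysler")
    years <- 1995:2014
    # 500 random make/year labels and 500 random 16-letter VINs
    cars <- paste(sample(makes, 500, replace=TRUE), sample(years, 500, replace=TRUE))
    vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
    # Interleave so each name row is followed by its VIN row
    vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
    # 1000 records of 4 rows each; the VIN is the 2nd row of every record
    carFile <-
      data.frame(
        c(rep("junk", 1000), sample(vins, 1000, replace=TRUE), rep("junk", 2000))[order(rep(1:1000, 4))]
      )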
    