Imperfect string match using data.table

灰色年华 2020-12-21 07:32

OK, so I posted a question a while back about writing an R function to accelerate string matching of large text files. I had my eyes opened to 'data.table' and my que

2 Answers
  • 2020-12-21 07:33

    Thanks for the answer to your own question on partial matching. Here's the complete code that I got to work on my own machine (including the reproducible example provided by BrodieG in your linked post). I had to change lapply to sapply.

    library(data.table)
    set.seed(1)
    makes <- c("Toyota", "Ford", "GM", "Chrysler")
    years <- 1995:2014
    cars <- paste(sample(makes, 500, replace=TRUE), sample(years, 500, replace=TRUE))
    vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
    vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])               
    carFile <- data.frame(c(rep("junk", 1000), sample(vins, 1000, replace=TRUE), rep("junk", 2000))[order(rep(1:1000, 4))])
    
    vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
    vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
    car.vins <- carFile[seq(2, nrow(carFile), 4), ]
    
    #Add some errors to car.vins strings
    s <- sample(length(car.vins),100)
    car.vins.err <- as.character(car.vins)
    car.vins.err[s] <- gsub("A","B",car.vins.err[s])
    s <- sample(length(car.vins.err),100)
    car.vins.err[s] <- gsub("E","F",car.vins.err[s])
    s <- sample(length(car.vins.err),100)
    car.vins.err[s] <- gsub("I","J",car.vins.err[s])
    car.vins.err <- as.factor(car.vins.err)
    
    dt <- data.table(vin.names, vin.vins, key="vin.vins")
    dt1 <- dt[J(car.vins), list(NumTimesFound=.N), keyby=vin.names]
    dt1.err <- dt[J(car.vins.err), list(NumTimesFound=.N), keyby=vin.names]
    dt2 <- dt[sapply(car.vins, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby=vin.names]
    dt2.err <- dt[sapply(car.vins.err, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby="vin.names"]
    
    dt1[dt1.err][dt2.err]
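
    As a side note on the lapply-to-sapply change, here is a minimal sketch of how the two differ when wrapped around agrep. The vectors `known`, `observed` and `observed_miss` are made-up toy data, not taken from the post, and max.distance is written as a list because that is the form documented in ?agrep. This only illustrates the return types; it does not claim to explain exactly why lapply failed in the code above.

    ```r
    # Toy data (hypothetical, for illustration only)
    known    <- c("ABCDEFGH", "QRSTUVWX")
    observed <- c("ABCDEFGH", "QRSTUVWY")   # second string has one substitution

    limits <- list(cost = 2, all = 2)       # documented list form of max.distance

    # lapply always returns a list, one element per pattern
    as_list <- lapply(observed, agrep, x = known, max.distance = limits, value = TRUE)

    # sapply simplifies to a character vector when every pattern
    # yields exactly one match
    as_vec <- sapply(observed, agrep, x = known, max.distance = limits, value = TRUE)

    # If any pattern has no match, sapply cannot simplify and
    # falls back to a list
    observed_miss <- c("ABCDEFGH", "ZZZZZZZZ")
    with_miss <- sapply(observed_miss, agrep, x = known, max.distance = limits, value = TRUE)

    is.list(as_list)      # TRUE
    is.character(as_vec)  # TRUE
    is.list(with_miss)    # TRUE
    ```

    So the sapply version only gives a plain character vector when each string has exactly one approximate match; with unmatched or multiply matched strings the result is a list again, which is worth checking on real data.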
    
  • 2020-12-21 07:34

    I finally got it.

    The agrep-function has a value-option that needs to be altered from FALSE (default) to TRUE:

    dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = c(cost=2, all=2), value = TRUE)
             , .(NumTimesFound = .N)
             , by = vin.names]
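
    To see what the value option changes, here is a small standalone sketch based on the example in ?agrep. The vector name `candidates` is just for illustration. With value = FALSE you get positions back; with value = TRUE you get the matching strings themselves, which is presumably what the lookup against the VIN column needs.

    ```r
    candidates <- c("1 lazy", "1", "1 LAZY")

    # Default (value = FALSE): integer positions of the matching elements
    idx <- agrep("laysy", candidates, max.distance = 2)

    # value = TRUE: the matching strings themselves
    hits <- agrep("laysy", candidates, max.distance = 2, value = TRUE)

    idx   # 1
    hits  # "1 lazy"
    ```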
    

    Note: the max.distance components (cost, all, insertions, deletions, substitutions) can be tuned to control how much generalized Levenshtein edit distance is tolerated. 'agrep' is a fascinating function!
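
    For anyone who wants to play with max.distance, here is a short sketch adapted from the examples in ?agrep. The variable names are my own. It shows the default tolerance letting one substitution through, the same search with substitutions forbidden, and adist as a way to check the edit distance between two strings.

    ```r
    # Default max.distance = 0.1 (a fraction of the pattern length):
    # "lasy" still matches "lazy" with one substitution
    default_hit <- agrep("lasy", "1 lazy 2")

    # Forbid substitutions: only the exact occurrence of "lasy" matches
    strict_hit <- agrep("lasy", c(" 1 lazy 2", "1 lasy 2"),
                        max.distance = list(substitutions = 0))

    # adist() reports the generalized Levenshtein distance directly
    d <- adist("lasy", "lazy")[1, 1]

    default_hit  # 1
    strict_hit   # 2
    d            # 1
    ```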

    Thanks again for all the help!
