Looping grepl() through data.table (R)

Posted by 浪尽此生 on 2019-11-28 00:31:09

Question


I have a dataset stored as a data.table DT that looks like this:

print(DT)
   category            industry
1: administration      admin
2: nurse practitioner  truck
3: trucking            truck
4: administration      admin
5: warehousing         nurse
6: warehousing         admin
7: trucking            truck
8: nurse practitioner  nurse         
9: nurse practitioner  truck 

I would like to reduce the table to only the rows where the industry matches the category. My general approach is to build a regex from the template '^{{IND}}[a-z ]+$', substituting each row's DT$industry for {{IND}} with infuse() (from the infuser package), and then test it against the same row's DT$category with grepl(). I struggled to find a sleek data.table solution that would loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:

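library(data.table)
library(infuser)

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT$industry[i]
    categ <- DT$category[i]  # was d.daily[i]$category, an object not defined here
    # fill {{IND}} with this row's industry, then test this row's category
    if (grepl(infuse(IND = ind, template), categ)) {
        set(DT, i, "match", TRUE)  # update row i by reference
    }
}
# keep the matching rows and drop the helper column
DT <- DT[match == TRUE][, match := NULL]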
print(DT)
   category            industry
1: administration      admin
2: trucking            truck
3: administration      admin
4: trucking            truck
5: nurse practitioner  nurse

However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.


Answer 1:


Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

This uses the current idiom for subsetting by group, thanks to @eddi.


Comments. These might help further:

  • If you have many rows with the same industry-category combo, try by=.(industry,category).

  • Try something else in the place of grep (like the options in Ken and Richard's answers).
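To make the grouped idiom above more concrete, here is a minimal sketch that splits it into its two steps. It assumes the sample data from the question and swaps in grepl(..., fixed = TRUE) for grep(), which is one possible choice rather than the answer's exact call; the names idx, idx2 and res are only illustrative.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Inner step: inside each group, `industry` is a single value and .I holds
# the row numbers (within the whole table) of that group's rows, so
# grepl() runs once per industry rather than once per row.
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)], by = industry]$V1

# Variant from the first comment: group on both columns, so grepl() runs
# once per distinct industry-category pair.
idx2 <- DT[, .I[grepl(industry, category, fixed = TRUE)],
           by = .(industry, category)]$V1

# Outer step: subset by the collected row numbers. The numbers come back
# in group order, so sort() is used here to restore the original row order.
res <- DT[sort(idx)]
print(res)
```

Without the sort(), the rows come back grouped by industry instead of in their original order, which may or may not matter for the use case.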




Answer 2:


As long as the match is always based on the start of the category string, then this works just fine:

DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
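As a possible alternative for the same prefix test, base R (3.3.0 and later) provides startsWith(), which takes two character vectors. The sketch below uses plain vectors holding the question's sample values and checks that both approaches agree; inside a data.table the equivalent call would presumably be DT[startsWith(category, industry)].

```r
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing",
              "trucking", "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse",
              "admin", "truck", "nurse", "truck")

# The answer's approach: compare the leading nchar(industry) characters
keep_substring <- substring(category, 1, nchar(industry)) == industry

# Same test with startsWith(), compared element by element
keep_starts <- startsWith(category, industry)

print(which(keep_starts))
print(identical(keep_substring, keep_starts))
```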



Answer 3:


You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.

DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse 

Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.

library(stringr)
DT[str_detect(category, fixed(industry))]

Or a base R option is to run grepl() through mapply():

DT[mapply(grepl, industry, category, fixed = TRUE)]

Or you could get the same result with Vectorize(grepl).

DT[Vectorize(grepl)(industry, category, fixed = TRUE)]

All of these produce the same result.
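One caveat worth spelling out: the fixed-string options above test whether industry occurs anywhere in category, whereas the regex in the question is anchored at the start. On the sample data the two agree, but they can differ in general. The sketch below rebuilds the question's pattern with paste0() instead of infuse() (a substitution made here for brevity) and compares the two tests on a made-up value, "long haul trucking", that is not part of the original data.

```r
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing",
              "trucking", "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse",
              "admin", "truck", "nurse", "truck")

# One anchored pattern per row, mirroring '^{{IND}}[a-z ]+$'.
# Assumes industry contains no regex metacharacters; otherwise they
# would need escaping first.
patterns <- paste0("^", industry, "[a-z ]+$")
keep_anchored <- mapply(grepl, patterns, category, USE.NAMES = FALSE)

# Fixed-string containment, as in the mapply() option above
keep_fixed <- mapply(grepl, industry, category, fixed = TRUE, USE.NAMES = FALSE)

print(identical(keep_anchored, keep_fixed))

# Hypothetical value where the two tests disagree
contains_truck <- grepl("truck", "long haul trucking", fixed = TRUE)
starts_with_truck <- grepl("^truck[a-z ]+$", "long haul trucking")
print(c(contains = contains_truck, anchored = starts_with_truck))
```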

Data:

DT <- structure(list(category = c("administration", "nurse practitioner", 
"trucking", "administration", "warehousing", "warehousing", "trucking", 
"nurse practitioner", "nurse practitioner"), industry = c("admin", 
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
-9L))
setDT(DT)


Source: https://stackoverflow.com/questions/33699122/looping-grepl-through-data-table-r
