Question
I have a dataset stored as a data.table DT that looks like this:
print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck
I would like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to match each row of DT$category against the regex '^{{IND}}[a-z ]+$', with the corresponding row of DT$industry inserted in place of {{IND}} using infuse(). I struggled to find a sleek data.table solution that would loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:
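library(data.table)
library(infuser)  # provides infuse()

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
  ind <- DT[i]$industry
  categ <- DT[i]$category
  # Fill this row's industry into the template, then test this row's category
  if (grepl(infuse(IND = ind, template), categ)) {
    DT[i]$match <- TRUE
  }
}
DT <- DT[match == TRUE]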
print(DT)
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse
However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.
Answer 1:
Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group, thanks to @eddi .
Comments. These might help further:
If you have many rows with the same industry-category combo, try by=.(industry,category).
Try something else in the place of grep (like the options in Ken and Richard's answers).
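As a rough sketch of the first comment, assuming the sample data from the question, grouping by both columns means each distinct industry-category pair is tested only once. Swapping grep for grepl(..., fixed = TRUE) and the sort() step are choices made for this illustration, not part of the answer above.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Within each group both grouping columns have length 1, so grepl() returns
# a single TRUE/FALSE and .I[...] keeps either every row number of the
# group or none of them.
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)],
          by = .(industry, category)]$V1

# Row numbers come back group by group, so sort them to restore the
# original row order before subsetting.
res <- DT[sort(idx)]
print(res)
```

This only pays off when the number of distinct pairs is much smaller than the number of rows; with mostly unique pairs, the grouping overhead may well outweigh the saved pattern matches.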
Answer 2:
As long as the match is always based on the start of the category string, then this works just fine:
DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
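The same prefix comparison can probably be written more directly with base R's startsWith(), which is vectorized over both of its arguments (available from R 3.3.0). A minimal sketch, assuming the sample data from the question:

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# startsWith(x, prefix) compares element-wise: row i of category is
# checked against row i of industry, with no regex involved.
res <- DT[startsWith(category, industry)]
print(res)
```

Like the substring() version, this also keeps rows where category equals industry exactly, whereas the regex in the question requires at least one further character after the industry.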
Answer 3:
You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.
DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.
library(stringr)
DT[str_detect(category, fixed(industry))]
Or a base R option is to run grepl() through mapply():
DT[mapply(grepl, industry, category, fixed = TRUE)]
Or you could get the same result with Vectorize(grepl).
DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
All of these produce the same result.
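One caveat worth checking against your own data: fixed matching finds the industry anywhere in the category, while the regex in the question is anchored at the start. The two agree on the sample data but can differ elsewhere. The sketch below uses two made-up rows (they are not from the question) to show the difference, with startsWith() standing in for the anchored match.

```r
# Hypothetical rows for illustration only; not part of the question's data.
category <- c("trucking", "long-haul trucking")
industry <- c("truck", "truck")

# Row-wise fixed match anywhere in the string, as in the mapply() option.
# unname() drops the names that mapply() copies from its first argument.
anywhere <- unname(mapply(grepl, industry, category, fixed = TRUE))

# Row-wise match restricted to the start of the string.
at_start <- startsWith(category, industry)

print(data.frame(category, industry, anywhere, at_start))
```

If only start-of-string matches should count, the prefix-based options are likely the safer choice.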
Data:
DT <- structure(list(category = c("administration", "nurse practitioner",
"trucking", "administration", "warehousing", "warehousing", "trucking",
"nurse practitioner", "nurse practitioner"), industry = c("admin",
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse",
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA,
-9L))
setDT(DT)
Source: https://stackoverflow.com/questions/33699122/looping-grepl-through-data-table-r