Question
Assume 900+ company names are pasted together, separated by pipes, to form a single regex pattern, "firm.pat".
firm.pat <- str_c(firms$firm, collapse = "|")
With a data frame called "bio" that has a large character variable named "comment" (250 rows, each with 100+ words), I would like to replace all the company names with blanks. Both a gsub call and a str_replace_all call return the same mysterious error.
bio$comment <- gsub(pattern = firm.pat, x = bio$comment, replacement = "")
Error in gsub(pattern = firm.pat, x = bio$comment, replacement = "") :
assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
library(stringr)
bio$comment <- str_replace_all(bio$comment, firm.pat, "")
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
traceback() did not enlighten me.
> traceback()
4: gsub("aaronson rappaport|adams reese|adelson testan|adler pollock|ahlers cooney|ahmuty demers|akerman|akin gump|allen kopet|allen matkins|alston bird|alston hunt|alvarado smith|anderson kill|andrews kurth|archer
# hundreds of lines of company names omitted here
lties in all 50 states and washington, dc. results are compiled through a peer-review survey in which thousands of lawyers in the u.s. confidentially evaluate their professional peers."
), fixed = FALSE, ignore.case = FALSE, perl = FALSE)
3: do.call(f, compact(args))
2: re_call("gsub", string, pattern, replacement)
1: str_replace_all(bio$comment, firm.pat, "")
Three other posts on SO mention this cryptic error: one makes a passing reference and cites two other oblique references, but none of them discusses it.
I know this question lacks reproducible code, but even so, how do I find out what the error means? Even better, how do I avoid triggering it? The error does not seem to occur with smaller numbers of companies, but I can't detect a pattern or threshold. I am running Windows 8 and RStudio, with updated versions of every package.
Thank you.
Answer 1:
I had the same problem with a pattern consisting of hundreds of manufacturers' names. My guess is that the pattern is too long, so I split it into two or more patterns, and that works well.
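# Split the names into two halves with whole-number sizes, so that an odd count still puts every name in exactly one half.
ml <- length(firms$firm)
half <- ceiling(ml / 2)
xyz <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(head(firms$firm, n = half), collapse = "|")), "", bio$comment, perl = TRUE)
xyz <- gsub(sprintf("(*UCP)\\b(%s)\\b", paste(tail(firms$firm, n = ml - half), collapse = "|")), "", xyz, perl = TRUE)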
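The two-way split above can be generalized to any number of chunks. What follows is only a sketch under stated assumptions: the helper name remove_names_chunked, the default chunk size of 200, and the sample data are all made up for illustration, and the names are assumed to contain no regex metacharacters. Because perl = TRUE routes the pattern through PCRE rather than TRE, the TRE assertion from the question should not come up on this path, although PCRE may have limits of its own for very large patterns.

```r
# Hypothetical helper: remove every name in `names` from `text`,
# compiling at most `chunk_size` alternatives per regex.
remove_names_chunked <- function(text, names, chunk_size = 200) {
  # Longest names first, so a longer name is tried before a shorter
  # name that happens to be its prefix.
  names <- names[order(nchar(names), decreasing = TRUE)]
  # Assign each name to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(names) / chunk_size)
  for (chunk in split(names, chunk_id)) {
    pattern <- sprintf("\\b(%s)\\b", paste(chunk, collapse = "|"))
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  text
}

# Made-up sample data standing in for firms$firm and bio$comment.
firms <- data.frame(firm = c("akerman", "akin gump", "alston bird"),
                    stringsAsFactors = FALSE)
bio <- data.frame(comment = c("worked at akin gump before alston bird",
                              "no firm mentioned here"),
                  stringsAsFactors = FALSE)

cleaned <- remove_names_chunked(bio$comment, firms$firm, chunk_size = 2)
print(cleaned)
```

If a given chunk size still fails, lowering chunk_size is the obvious knob to try. The removal leaves the surrounding spaces in place, so a follow-up pass to collapse repeated whitespace may be wanted.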
Answer 2:
You can use mgsub in the qdap package, which is an extension to gsub that handles vectors of patterns and replacements.
Please refer to this Answer
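For readers without qdap installed, the multi-pattern idea can be approximated in base R. This is a rough stand-in and not qdap's implementation: remove_fixed and the sample vectors are hypothetical, and each name is treated as a literal string (fixed = TRUE), so no large regex is ever compiled.

```r
# Hypothetical base-R stand-in for the multi-pattern idea: one
# fixed-string gsub() per name instead of one huge alternation.
remove_fixed <- function(text, patterns) {
  # Longest patterns first, so a short name cannot clip a longer one.
  patterns <- patterns[order(nchar(patterns), decreasing = TRUE)]
  for (p in patterns) {
    text <- gsub(p, "", text, fixed = TRUE)
  }
  text
}

firm_names <- c("adams reese", "akerman", "akin gump")  # made-up sample
comments <- c("formerly of adams reese and akerman", "solo practitioner")

result <- remove_fixed(comments, firm_names)
print(result)

# With qdap installed, the call suggested above would look roughly like:
# result <- qdap::mgsub(firm_names, "", comments)
```

Fixed-string matching has no notion of word boundaries, so a name that appears inside a longer word would also be removed. If that matters, the regex approach with \\b from Answer 1 is the safer choice.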
Source: https://stackoverflow.com/questions/28684438/in-regex-mystery-error-assertion-tree-num-tags-num-tags-failed-in-execut