stringr

In regex, mystery Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634

做~自己de王妃 submitted on 2019-12-05 16:14:11
Assume 900+ company names pasted together with the pipe separator to form a regex pattern, "firm.pat":

    firm.pat <- str_c(firms$firm, collapse = "|")

With a data frame called "bio" that has a large character variable named "comment" (250 rows, each with 100+ words), I would like to replace all the company names with blanks. Both a gsub call and a str_replace_all call return the same mysterious error:

    bio$comment <- gsub(pattern = firm.pat, x = bio$comment, replacement = "")
    Error in gsub(pattern = firm.pat, x = bio$comment, replacement = "") : assertion 'tree->num_tags == num_tags' failed in
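The preview is cut off before any diagnosis, so the cause of the assertion is not established here. As a minimal sketch, assuming the company names are meant to be matched literally, one way to avoid compiling a single 900-way alternation is to remove the names one at a time with fixed = TRUE. The data frames below are small hypothetical stand-ins for the poster's "firms" and "bio", and remove_literals is an illustrative helper, not a function from any package.

```r
# Hypothetical stand-ins for the poster's `firms` and `bio` data frames.
firms <- data.frame(firm = c("Acme Inc.", "Globex (UK)", "Initech"),
                    stringsAsFactors = FALSE)
bio <- data.frame(comment = c("I worked at Acme Inc. and Initech.",
                              "Globex (UK) hired me."),
                  stringsAsFactors = FALSE)

# Remove each name as a literal string (fixed = TRUE, so no regex is compiled).
# Longest names go first so that a short name cannot break up a longer name
# that happens to contain it.
remove_literals <- function(text, literals) {
  literals <- literals[order(nchar(literals), decreasing = TRUE)]
  for (lit in literals) {
    text <- gsub(lit, "", text, fixed = TRUE)
  }
  text
}

bio$comment <- remove_literals(bio$comment, firms$firm)
print(bio$comment)
```

Because the names are treated as literals, characters such as "." and "(" in a company name need no escaping. The trade-off is one pass over the text per name instead of a single pass.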

How to extract everything until first occurrence of pattern

本秂侑毒 submitted on 2019-12-05 12:38:27
Question: I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore. What I've tried:

    str_extract("L0_123_abc", ".+?(?<=_)")
    > "L0_"

Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend so that I can get the information between the 1st and 2nd underscores and the information after the 3rd underscore.

Answer 1: To get L0, you may use

    > library(stringr)
    > str_extract("L0_123_abc", "[^_]+")
    [1
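For the "easy to extend" part of the question, a sketch in base R (chosen so it runs without extra packages; it is not the quoted answer's method) is to split on the underscore once and index the pieces. The variable names and the second input string are illustrative.

```r
x <- c("L0_123_abc", "L1_456_def")

# Everything before the first underscore: delete from the first "_" onward.
first_piece <- sub("_.*$", "", x)

# Split on literal underscores, then pick pieces by position.
pieces <- strsplit(x, "_", fixed = TRUE)
second_piece <- vapply(pieces, function(p) p[2], character(1))
third_piece  <- vapply(pieces, function(p) p[3], character(1))

print(first_piece)
print(second_piece)
print(third_piece)
```

Indexing past the last piece yields NA rather than an error, so inputs with fewer underscores still produce a vector of the right length.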

Create new variables based upon specific values

安稳与你 submitted on 2019-12-05 10:37:20
I read up on regular expressions and Hadley Wickham's stringr and dplyr packages but can't figure out how to get this to work. I have library circulation data in a data frame, with the call number as a character variable. I'd like to take the initial capital letters and make them a new variable, and make the digits between the letters and the period a second new variable.

    Call_Num
    HV5822.H4 C47 Circulating Collection, 3rd Floor
    QE511.4 .G53 1982 Circulating Collection, 3rd Floor
    TL515 .M63 Circulating Collection, 3rd Floor
    D753 .F4 Circulating Collection, 3rd Floor
    DB89.F7 D4 Circulating Collection
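A minimal sketch of one way to do this in base R, assuming every call number starts with one or more capital letters followed directly by digits, as in the sample rows. The column names class_letters and class_number are made up for illustration.

```r
circ <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection"),
  stringsAsFactors = FALSE
)

# Leading capital letters: keep only the first captured group.
circ$class_letters <- sub("^([A-Z]+).*$", "\\1", circ$Call_Num)

# Digits that immediately follow those letters (stops at "." or a space).
circ$class_number <- sub("^[A-Z]+([0-9]+).*$", "\\1", circ$Call_Num)

print(circ[, c("class_letters", "class_number")])
```

A row that does not fit the assumed shape is returned unchanged by sub, so it is worth checking for values that still contain spaces or punctuation.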

str_count with overlapping substrings

非 Y 不嫁゛ submitted on 2019-12-05 08:56:11
I am trying to count the number of appearances of a substring within a character vector. For example:

    lookin <- c("babababa", "bellow", "ra;baba")
    searchfor <- "aba"
    str_count(lookin, searchfor)

returns 2 0 1. However, I want it to return 3 0 1, but it isn't picking up the middle 'aba' in the first item, since it is partially used by the first match (I think). I found this question but couldn't figure out how to use that with a vector having multiple items.

Try:

    str_count(lookin, paste0("(?=", searchfor, ")"))
    [1] 3 0 1

which, as answered in your link, uses a lookahead to match all instances.
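The same lookahead idea can be sketched in base R with gregexpr and perl = TRUE. The lookahead matches an empty string at every position where the substring starts, so overlapping occurrences are each counted. This assumes searchfor contains no regex metacharacters; otherwise it would need escaping first.

```r
lookin <- c("babababa", "bellow", "ra;baba")
searchfor <- "aba"

# Zero-width lookahead: matches at each start position of `searchfor`.
pattern <- paste0("(?=", searchfor, ")")
hits <- gregexpr(pattern, lookin, perl = TRUE)

# gregexpr reports -1 when there is no match, so count positive positions only.
overlap_count <- vapply(hits, function(m) sum(m > 0), integer(1))
print(overlap_count)
```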

Extract last word in a string after comma if there are multiple words else the first word

老子叫甜甜 submitted on 2019-12-05 08:20:32
I have data where the words are as follows:

    location <- c("xyz, sss, New Zealand", "USA", "Pris,France")
    id <- c(1, 2, 3)
    df <- data.frame(location, id)

I would like to extract the country name from the data. The tricky part is that if I extract just the last word, I will have only one record (France).

    library(stringr)
    df$country <- word(df$location, -1)

Any ideas on how to extract country data from this data?

    id location              country
    1  xyz, sss, New Zealand New Zealand
    2  USA                   USA
    3  Pris,France           France

You can try sub:

    df$country <- sub('.*,\\s*', '', df$location)
    df$country
    #[1] "New Zealand" "USA" "France"

Or
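The preview ends before the second suggestion, so what it proposed is unknown. As a separate sketch (not the truncated answer), the same result can be had without a regex by splitting on commas and keeping the last piece; the name parts is illustrative.

```r
location <- c("xyz, sss, New Zealand", "USA", "Pris,France")
id <- c(1, 2, 3)
df <- data.frame(location, id, stringsAsFactors = FALSE)

# Split on literal commas, keep the last piece, and trim surrounding spaces.
parts <- strsplit(df$location, ",", fixed = TRUE)
df$country <- trimws(vapply(parts, function(p) p[length(p)], character(1)))
print(df$country)
```

A string with no comma splits into a single piece, so "USA" is returned as-is.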

Extract text in parentheses in R

无人久伴 submitted on 2019-12-05 06:55:56
Two related questions. I have vectors of text data such as

    "a(b)jk(p)" "ipq" "e(ijkl)"

and want to easily separate them into a vector containing the text OUTSIDE the parentheses:

    "ajk" "ipq" "e"

and a vector containing the text INSIDE the parentheses:

    "bp" "" "ijkl"

Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab the text "pre/post" the parentheses and need a smarter solution.

Text outside the parentheses:

    > x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
    > gsub("\\([^()]*\\)", "", x)
    [1] "ajk
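The preview stops before the "inside" half. One possible sketch in base R, assuming parentheses are never nested, collects every parenthesised chunk with lookarounds and pastes the chunks together per element. This is an illustration, not the remainder of the quoted answer.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Outside: delete each "(...)" group.
outside <- gsub("\\([^()]*\\)", "", x)

# Inside: match text preceded by "(" and followed by ")", any number of times.
chunks <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Concatenate the chunks of each element; no chunks gives an empty string.
inside <- vapply(chunks, paste, character(1), collapse = "")

print(outside)
print(inside)
```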

stringr str_extract capture group capturing everything

旧城冷巷雨未停 submitted on 2019-12-05 04:46:45
I'm looking to extract the year from a string. This always comes after an 'X' and before a ".", followed by a string of other characters. Using stringr's str_extract I'm trying the following:

    year = str_extract(string = 'X2015.XML.Outgoing.pounds..millions.' , pattern = 'X(\\d{4})\\.')

I thought the brackets would define the capture group, returning 2015, but I actually get the complete match X2015. Am I doing this correctly? Why am I not trimming "X" and "."?

The capture group is irrelevant in this case. The function str_extract will return the whole match including characters before and after the
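Two ways to get only the digits can be sketched in base R: lookarounds, so that "X" and "." are required but not part of the match, or a backreference that keeps just the captured group. Within stringr itself, str_match is the function that returns capture groups (the full match in the first column, groups after it). The variable names below are illustrative.

```r
s <- "X2015.XML.Outgoing.pounds..millions."

# Lookarounds: "X" must precede and "." must follow, but neither is returned.
year_lookaround <- regmatches(s, regexpr("(?<=X)\\d{4}(?=\\.)", s, perl = TRUE))

# Backreference: match the whole string, keep only the captured group.
year_backref <- sub("^X(\\d{4})\\..*$", "\\1", s)

print(year_lookaround)
print(year_backref)
```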

Meaning of regular expressions like - \\d , \\D, ^ , $ etc [duplicate]

删除回忆录丶 submitted on 2019-12-05 03:12:24
Question: This question already has an answer here: Reference - What does this regex mean? (1 answer). Closed 3 years ago.

What do these expressions mean? Where can I learn about their usage?

    \\d \\D \\s \\S \\w \\W \\t \\n ^ $ \ | etc.

I need to use the stringr package and I have absolutely no idea how to use these.

Answer 1: From ?regexp, in the Extended Regular Expressions section: The caret ‘^’ and the dollar sign ‘$’ are metacharacters that respectively match the empty string at the beginning and
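A few of these can be tried directly at the console. The sketch below uses base R on a made-up sample string; the backslashes are doubled because R string literals consume one level of escaping before the regex engine sees the pattern.

```r
x <- "Room 42\tis open"   # contains a space, two digits, and a tab

digits_masked   <- gsub("\\d", "#", x)   # \d: any digit
nondigits_gone  <- gsub("\\D", "", x)    # \D: anything that is not a digit
spaces_marked   <- gsub("\\s", "_", x)   # \s: whitespace (space, tab, ...)
nonword_gone    <- gsub("\\W", "", x)    # \W: anything that is not a word character

starts_with_room <- grepl("^Room", x)    # ^ anchors at the start of the string
ends_with_open   <- grepl("open$", x)    # $ anchors at the end of the string
shut_or_open     <- grepl("shut|open", x)  # | means "either side"

print(digits_masked)
print(nondigits_gone)
print(spaces_marked)
print(nonword_gone)
```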

R regex gsub separate letters and numbers

丶灬走出姿态 submitted on 2019-12-04 21:16:21
Question: I have a string that mixes letters and numbers:

    "The sample is 22mg"

I'd like to split strings where a number is immediately followed by a letter, like this:

    "The sample is 22 mg"

I've tried this:

    gsub('[0-9]+[[aA-zZ]]', '[0-9]+ [[aA-zZ]]', 'This is a test 22mg')

but am not getting the desired results. Any suggestions?

Answer 1: You need to use capturing parentheses in the regular expression and group references in the replacement. For example:

    gsub('([0-9])([[:alpha:]])', '\\1 \\2', 'This is a
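Two things go wrong in the attempted call: the replacement argument is inserted as literal text rather than interpreted as a pattern, and, in ASCII ordering, a range like A-z also covers punctuation such as "[" and "_". A short sketch of the capture-group approach applied to a vector (the second input is an invented example):

```r
x <- c("The sample is 22mg", "Take 5ml twice")

# Group 1: a digit; group 2: the letter right after it.
# The replacement puts a space between the two captured characters.
spaced <- gsub("([0-9])([[:alpha:]])", "\\1 \\2", x)
print(spaced)
```

Only the digit-letter boundary is touched, so multi-digit numbers such as 22 stay intact.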

What is the difference between paste/paste0 and str_c?

徘徊边缘 submitted on 2019-12-04 09:35:27
I don't seem to see a difference between paste / paste0 and str_c for combining a single vector into a single string, multiple strings into one string, or multiple vectors into a single string. While I was writing the question I found this: https://www.rdocumentation.org/packages/stringr/versions/1.3.1/topics/str_c . The community example from richie@datacamp.com says the difference is that str_c treats blanks as blanks (not as NAs) and recycles more appropriately. Any other differences?

paste0(..., collapse = NULL) is a wrapper for paste(..., sep = "", collapse = NULL) , which means there
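Two differences that are easy to check concern missing values and zero-length inputs. The sketch below shows the base R side unconditionally and runs the stringr side only if the package is installed; the comments on the stringr lines describe the behaviour I would expect from current stringr releases and should be treated as an assumption to verify locally.

```r
# Base R: NA is converted to the string "NA", and a zero-length argument
# is treated as "" when other arguments have length.
base_na    <- paste0("a", NA)
base_empty <- paste0("a", character(0))
print(base_na)
print(base_empty)

if (requireNamespace("stringr", quietly = TRUE)) {
  # Expected: a missing value propagates, giving NA rather than "aNA".
  print(stringr::str_c("a", NA))
  # Expected: a zero-length input yields a zero-length result.
  print(stringr::str_c("a", character(0)))
}
```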