Question
I have the following data frame:
df <- data.frame(city=c("in London", "in Manchester city", "in Sao Paolo"))
I am using str_extract to return the word after 'in' in a separate column.
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
This works fine for me in 95% of cases. However, there are cases like "Sao Paolo" above where my regex would return "Sao" rather than the city name.
Can someone please help me with amending it to capture either:
1) everything to the end of the text string I am extracting from? OR
2) where there is more than one word after 'in', then return that too
Many thanks.
Answer 1:
To match the rest of the string after the first "in" followed by a whitespace character, you can use
(?<=in\\s).+
The lookbehind matches the preposition "in" with a whitespace character after it, but does not return it inside the match, since lookbehinds are zero-width assertions.
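For reference, here is a minimal sketch of that pattern applied to the data frame from the question. The column name onlyCity and the stringsAsFactors = FALSE argument are my own choices for illustration, not part of the original answer.

```r
library(stringr)

df <- data.frame(city = c("in London", "in Manchester city", "in Sao Paolo"),
                 stringsAsFactors = FALSE)

# The lookbehind requires "in" plus one whitespace character just before the match;
# .+ then consumes everything up to the end of the string.
df$onlyCity <- str_extract(df$city, "(?<=in\\s).+")

print(df$onlyCity)
# "London" "Manchester city" "Sao Paolo"
```

Because .+ is greedy and does not stop at spaces, multi-word names such as "Sao Paolo" come back whole.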
Answer 2:
Does this one liner do it for you?
unlist(lapply(strsplit(c("in London", "in Sao Paulo", "in Manchester City"), "in "), function(x) x[2]))
[1] "London" "Sao Paulo" "Manchester City"
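One caveat worth checking against your own data: strsplit splits on every occurrence of "in ", not only the leading one, so a name that itself contains "in " gets cut. The sketch below uses a made-up input ("in Berlin city") to show this, and compares it with an anchored sub() call, which is my suggested alternative rather than part of the answer above.

```r
cities <- c("in London", "in Sao Paulo", "in Berlin city")

# Splitting on every "in " also cuts inside "Berlin city".
split_result <- unlist(lapply(strsplit(cities, "in "), function(x) x[2]))
print(split_result)
# "London" "Sao Paulo" "Berl"

# Anchoring with ^ removes only the "in " at the start of each string.
anchored_result <- sub("^in ", "", cities)
print(anchored_result)
# "London" "Sao Paulo" "Berlin city"
```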
Answer 3:
You can try this:
library(stringr)
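# Match everything after a leading "in" and one whitespace character.
# (A character class such as [^in ] would also skip any i, n, or space at the start of the city name.)
df$onlyCity <- str_extract(df$city, '(?<=^in\\s).+')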
df
                city        onlyCity
1          in London          London
2 in Manchester city Manchester city
3       in Sao Paolo       Sao Paolo
Answer 4:
gsub("^in[ ]*(.*$)", "\\1", df$city)
[1] "London" "Manchester city" "Sao Paolo"
This assumes that your strings start with "in", followed by any number of spaces (it won't fail with more than one), followed by the text of interest, which is captured from the first non-space character up to the end of the string.
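To illustrate the two properties described above, here is a small sketch with inputs I made up: one string with several spaces after "in" and one without the "in" prefix at all. When the pattern does not match, gsub returns the input unchanged, whereas str_extract would return NA for that element, so which behaviour you want depends on how you intend to handle such rows.

```r
x <- c("in London", "in    Sao Paolo", "Manchester city")

# [ ]* absorbs any run of spaces after the leading "in";
# the capture group keeps the rest of the string.
result <- gsub("^in[ ]*(.*$)", "\\1", x)

print(result)
# "London" "Sao Paolo" "Manchester city"
```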
Source: https://stackoverflow.com/questions/34844745/using-stringr-to-extract-one-or-multiple-words-from-text-string-in-r