Using stringr to extract one or multiple words from text string in R

Question

I have the following data frame:

df <- data.frame(city=c("in London", "in Manchester city", "in Sao Paolo"))

I am using str_extract to return the word after 'in' in a separate column.

library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')

This works fine for me in 95% of cases. However, there are cases like "Sao Paolo" above where my regex would return "Sao" rather than the city name.

Can someone please help me with amending it to capture either:

1) everything to the end of the text string I am extracting from? OR

2) where there is more than one word after 'in', then return that too

Many thanks.

Answer 1:

To match the rest of the string after the first "in" followed by a space, you can use

(?<=in\\s).+

The lookbehind matches the preposition "in" with a whitespace character after it, but does not include it in the match, since lookbehinds are zero-width assertions.
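As a minimal sketch of the lookbehind approach applied to the data frame from the question: the column names after_in and stripped are illustrative assumptions, and the str_remove variant is an alternative not proposed in the original answer.

```r
library(stringr)

# Sample data from the question; stringsAsFactors = FALSE keeps city as character
df <- data.frame(city = c("in London", "in Manchester city", "in Sao Paolo"),
                 stringsAsFactors = FALSE)

# Lookbehind: the match starts right after the first "in" plus one whitespace
# character, and .+ then runs to the end of the string.
df$after_in <- str_extract(df$city, "(?<=in\\s).+")

# Alternative (an assumption, not from the answer): remove a leading "in"
# together with the whitespace that follows it.
df$stripped <- str_remove(df$city, "^in\\s+")

df$after_in
df$stripped
```

Both columns should contain "London", "Manchester city" and "Sao Paolo" for this sample. The str_remove variant is anchored with ^, so it only touches an "in" at the very start of the string.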

Answer 2:

Does this one-liner do it for you?

unlist(lapply(strsplit(c("in London", "in Sao Paulo", "in Manchester City"), "in "), function(x) x[2]))
[1] "London"          "Sao Paulo"       "Manchester City"

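Answer 3:

You can try this. Note that [^in ] is a character class (any single character other than "i", "n" or a space), not the literal prefix "in ". The match therefore starts at the first such character. This works for the sample data because the city names start with capital letters, but a lowercase name such as "in nice" would be truncated to "ce".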

library(stringr)
df$onlyCity <- str_extract(df$city, '[^in ](.)*')
df
                city        onlyCity
1          in London          London
2 in Manchester city Manchester city
3       in Sao Paolo       Sao Paolo

Answer 4:

gsub("^in[ ]*(.*$)", "\\1", df$city)
[1] "London"          "Manchester city" "Sao Paolo"

This assumes that your strings start with "in", followed by any number of spaces (it won't fail with more than one), followed by the text of interest, which is captured from the first non-space character up to the end of the string.
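The behaviour described above can be checked with a small sketch. The input vector x is hypothetical (extra spaces after "in", one string without the prefix, and one lowercase word that merely begins with "in"), and the [ ]+ variant is an assumption added for comparison, not part of the original answer.

```r
# Hypothetical inputs, not taken from the question
x <- c("in London", "in    Sao Paolo", "Manchester city", "inverness")

# Pattern from the answer: [ ]* allows zero or more spaces after "in"
res_star <- sub("^in[ ]*(.*$)", "\\1", x)

# Variant requiring at least one space, so words that merely start
# with "in" are left unchanged
res_plus <- sub("^in[ ]+(.*$)", "\\1", x)

res_star
res_plus
```

With [ ]* the last element becomes "verness", because zero spaces are allowed after "in". With [ ]+ it stays "inverness". Strings that do not start with "in" are returned unchanged in both cases.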

Source: https://stackoverflow.com/questions/34844745/using-stringr-to-extract-one-or-multiple-words-from-text-string-in-r

Tags

regex

stringr