REGEX in R: extracting words from a string

走了就别回头了 2021-01-05 15:38

I guess this is a common problem, and I found quite a lot of webpages about it, including some from SO, but I failed to understand how to implement it.

I am new to REGEX, and

2 answers
  • 2021-01-05 16:18

    You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.


    There are two problems with your gsub approach:

    1. You used single backslashes (\). R requires you to escape those, because the backslash is itself the escape character in R strings. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1: the two typed characters are stored as a single backslash.

    2. You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
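
    A minimal sketch of both points, in case it helps to see them in isolation. The sample string and the name first_word are assumptions for illustration only (the string from the question is not shown here), and the pattern is deliberately simpler than the one needed for the actual task.

    ```r
    # "\\" in source code is stored as one backslash, so it counts as 1 character.
    nchar("\\")

    # writeLines() prints the pattern the way the regex engine receives it: \S+
    writeLines("\\S+")

    # Assumed sample input, not taken from the question.
    z <- "I love stack overflow"

    # Group 1 captures the first run of non-whitespace; ".*" consumes the rest.
    # The replacement "\\1" puts back only what group 1 captured.
    first_word <- sub("^(\\S+).*", "\\1", z, perl = TRUE)
    first_word
    ```

    Because the pattern matches the whole string and the replacement is only group 1, everything outside the group is dropped.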

    You should have tried something like:

    sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
    # [1] "I love stack"
    

    This is essentially saying:

    • Work from the start of the contents of "z".
    • Start creating group 1.
    • Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
    • End group 1 there.
    • Then, just return the contents of group 1 (\1) from "z".
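
    The "one less than the number you are after" rule can be wrapped in a small function. This is only a sketch: first_n_words is a hypothetical helper, not part of base R, and the sample string is an assumption chosen to be consistent with the outputs shown in this answer.

    ```r
    # Hypothetical helper: keep the first n whitespace-separated words of x.
    first_n_words <- function(x, n) {
      stopifnot(n >= 1)
      # (n - 1) repeats of "word + whitespace", then one final word.
      pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1)
      # If x has fewer than n words the pattern does not match,
      # and sub() returns x unchanged.
      sub(pattern, "\\1", x, perl = TRUE)
    }

    # Assumed sample input.
    z <- "I love stack overflow. It is such a cool site"
    first_n_words(z, 3)
    first_n_words(z, 1)
    first_n_words("one two", 5)
    ```

    Building the pattern with sprintf() keeps the repeat count in one place, so the pattern itself stays identical to the one explained above.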

    To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.

    sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
    # [1] "a cool site"
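
    One thing to watch with the end-anchored version: the pattern expects the string to end with a word, so trailing whitespace stops it from matching, and sub() then returns the input unchanged. A small sketch, assuming a sample string and a hypothetical wrapper named last_three:

    ```r
    # Hypothetical wrapper around the pattern above.
    last_three <- function(x) {
      sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", x, perl = TRUE)
    }

    # Assumed sample input.
    z <- "I love stack overflow. It is such a cool site"
    padded <- paste0(z, " ")   # same text with one trailing space

    last_three(z)               # the last three words
    last_three(padded)          # no match, so the input comes back unchanged
    last_three(trimws(padded))  # trimming first restores the expected result
    ```

    Calling trimws() before the substitution is one possible way to handle this; allowing \\s* before the $ in the pattern would be another.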
    
  • 2021-01-05 16:27

    For getting the first four words.

    library(stringr)
    str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")
    

    For getting the last four.

    str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
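
    If you would rather not load stringr, roughly the same extraction can be done in base R with regexpr() and regmatches(). This is a swapped-in technique, not what the answer above uses, and the sample string plus the names first_four and last_four are assumptions for illustration.

    ```r
    # Assumed sample input.
    x <- "I love stack overflow. It is such a cool site"

    # regexpr() locates the first match; regmatches() pulls out the matched text.
    # perl = TRUE is needed for the (?:...) and (?=...) constructs.
    first_four <- regmatches(x, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", x, perl = TRUE))
    last_four  <- regmatches(x, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", x, perl = TRUE))

    first_four
    last_four
    ```

    One difference to keep in mind: str_extract() returns NA for an element with no match, while regmatches() simply drops it, so the result can be shorter than the input vector.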
    