I guess this is a common problem, and I found quite a lot of web pages, including some from SO, but I failed to understand how to implement it.
I am new to REGEX, and
You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
You used single backslashes (\). R requires you to escape those since they are special characters. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1.
You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
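Both points can be checked at the console. This is only a small sketch: the email-style string is a made-up example, not something from your question.

```r
# "\\" in R source code is one backslash character in the actual string.
nchar("\\")             # 1

# writeLines() shows what the regex engine really receives.
writeLines("\\S+\\s+")  # prints \S+\s+

# The part matched inside (...) is available as "\\1" in the replacement.
# Here the whole match is replaced by just the captured group.
sub("(\\w+)@.*", "\\1", "user@example.com")
# [1] "user"
```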
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
Work from the start of "z" (^) and open a capturing group.
Match a run of non-whitespace followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
Close the group, match the rest of the string (.*), and return just the contents of the group (\1) from "z".
To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
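If you need a different number of words, the "one less than the number you want" rule can be wrapped in a pair of small helpers. This is just a sketch: first_words() and last_words() are names I made up, and the value of "z" is assumed to be a sentence consistent with the outputs shown above.

```r
# Hypothetical helpers: build the pattern with n - 1 as the repeat count.
first_words <- function(s, n) {
  stopifnot(n >= 1)
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1)
  sub(pattern, "\\1", s, perl = TRUE)
}

last_words <- function(s, n) {
  stopifnot(n >= 1)
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1)
  sub(pattern, "\\1", s, perl = TRUE)
}

# Assumed sample input.
z <- "I love stack overflow it is such a cool site"

first_words(z, 3)
# [1] "I love stack"
last_words(z, 3)
# [1] "a cool site"
first_words(z, 4)
# [1] "I love stack overflow"
```

Keep in mind that sub() returns its input unchanged when the pattern does not match, so a string with fewer than n words comes back as is.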
To get the first four words:
library(stringr)
str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")
To get the last four:
str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
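If you would rather not load stringr, a rough base R counterpart is regmatches() with regexpr(). This is a sketch under the assumption that "x" holds a sentence like the one below; perl = TRUE is needed because the second pattern uses a lookahead.

```r
# Assumed sample input.
x <- "I love stack overflow it is such a cool site"

# First four words: extract the match instead of substituting.
regmatches(x, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", x, perl = TRUE))
# [1] "I love stack overflow"

# Last four words: the lookahead (?=\\s*$) anchors the match at the end
# without including trailing whitespace in the result.
regmatches(x, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", x, perl = TRUE))
# [1] "such a cool site"
```

One difference to be aware of: regmatches() drops elements that have no match, whereas str_extract() returns NA for them, so the stringr version is easier to use on a whole column.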