Question
I have a single long string variable with 3 observations. I am trying to create a field, prob, that holds a specific substring extracted from each long string. The code and the warning message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 " " an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Answer 1:
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
  mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
                                   gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
#                                                    aa         prob
# 1                                                 ...           NA
# 2                                                 ...           NA
# 3 The probability of being a carrier is 0.0002422359  0.0002422359
# 4                      an BRCA1 carrier 0.0001061067  0.0001061067
# 5                        an BRCA2 carrier 0.00013612  0.0001361200
# 6                                                 ...           NA
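For readers who would rather not load dplyr, the same extraction can be sketched in base R. This is not the answer's code: the names hit and pattern are made up for illustration, sub() stands in for gsub() (the pattern is anchored, so each string has at most one match), and perl = TRUE is an assumption so that the lazy .*? follows PCRE rules. The sample data is trimmed to four rows.

```r
# Sample data modelled on the question; stringsAsFactors = FALSE keeps
# the column as character on older R versions too.
aa <- data.frame(
  aa = c("...",
         "The probability of being a carrier is 0.0002422359 ",
         " an BRCA1 carrier 0.0001061067 ",
         " an BRCA2 carrier 0.00013612 "),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: capture the trailing number in group 1.
pattern <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"

# Rows that mention a probability or a BRCA1/BRCA2 carrier.
hit <- grepl("(probability|BRCA[12] carrier)", aa$aa)

# Start with NA everywhere, then fill in only the matching rows.
aa$prob <- NA_real_
aa$prob[hit] <- as.numeric(sub(pattern, "\\1", aa$aa[hit], perl = TRUE))
```

Assigning with aa$prob <- ... works here because aa is a data.frame. On a plain character vector, $<- coerces the vector to a list, which is what the warning in the question reports.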
Regex walk-through:
- ^ and $ are the beginning and end of the string, respectively; \\b is a word boundary; none of these "consume" any characters, they just mark beginnings and endings
- . means any one character
- ? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group
- \\s is blank space, including spaces and tabs
- [0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] is all letters, [0-9A-F] is the hexadecimal digits, etc.
- (...) is a saved group; it's not uncommon in a group to use | as an "or"; this group is used later in the replacement= part of gsub as a numbered group, so \\1 recalls the first group from the pattern
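A few one-line checks may make these pieces concrete. The strings are made-up test inputs, not data from the question; grepl() reports whether a pattern matches, and sub() shows a numbered group being recalled in the replacement.

```r
# Anchors: ^ and $ pin the match to the start and end of the string.
anchors <- grepl("^abc$", c("abc", "xabc"))                    # TRUE FALSE

# . stands for exactly one character.
dot <- grepl("^a.c$", c("abc", "ac"))                          # TRUE FALSE

# Quantifiers apply to the item just before them.
optional  <- grepl("^colou?r$", c("color", "colour", "colouur"))  # TRUE TRUE FALSE
zero_more <- grepl("^ab*$", c("a", "abbb"))                    # TRUE TRUE
one_more  <- grepl("^ab+$", c("a", "abbb"))                    # FALSE TRUE

# \\s matches blank space such as a space or a tab.
blank <- grepl("\\s", c("a b", "a\tb", "ab"))                  # TRUE TRUE FALSE

# Character classes: here, one or more hexadecimal digits.
hex <- grepl("^[0-9A-F]+$", c("1F", "G1"))                     # TRUE FALSE

# Groups with | as "or"; \\2 recalls the second group.
# Strings that do not match are returned unchanged.
second <- sub("^(cat|dog)-([0-9]+)$", "\\2", c("cat-12", "dog-7", "cow-3"))
# "12" "7" "cow-3"
```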
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1       ^^^^^^^^^^^^^^^^^^
2    ^^^
3 ^^^
4                         ^^^^
1. This is the "number" part, which allows one or more digits, an optional decimal point, and zero or more digits. It is saved as group 1.
2. The word boundary guarantees that we include the leading digits; without it, it is possible (depending on a few things) for "12.345" to be parsed as "2.345".
3. Anything before the number-like string.
4. Some or no blank space after the number.
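To see why the "anything before" part is the lazy .*? rather than a plain .*, the sketch below runs both variants on one of the question's strings. extract_number() is a hypothetical helper, and perl = TRUE is an assumption made so the result follows PCRE's backtracking rules; the answer itself uses R's default engine.

```r
# Hypothetical helper: replace the whole (anchored) match with group 1.
extract_number <- function(x, pattern) sub(pattern, "\\1", x, perl = TRUE)

x <- " an BRCA1 carrier 0.0001061067 "

# The answer's pattern: lazy prefix plus a word boundary.
lazy <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"

# Variant for comparison: greedy prefix, no word boundary.
greedy <- "^.*([0-9]+\\.?[0-9]*)\\s*$"

lazy_result   <- extract_number(x, lazy)    # "0.0001061067"
greedy_result <- extract_number(x, greedy)  # "7"
```

The greedy prefix swallows as much as it can and gives back only the single digit that group 1 requires, so the number is truncated to its last digit. The lazy prefix stops at the first position from which the rest of the pattern can match.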
Grouped logically, in an organized way
Regex isn't unique to R; it's a parsing language that R (and most other programming languages) supports in one way or another.
Source: https://stackoverflow.com/questions/66002140/extract-part-of-word-into-a-field-from-a-long-string-using-r