extract part of word into a field from a long string using R

Submitted by 人盡茶涼 on 2021-02-12 11:40:11

Question


I have a single long string variable with 3 observations. I am trying to create a field, prob, that holds a specific piece extracted from each long string. The code and the resulting message are below.

data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 " " an BRCA2 carrier 0.00013612 "

aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))

Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list


Answer 1:


Here is my previous answer, updated to reflect a data.frame.

library(dplyr)

aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))

aa %>%
  mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa), 
                                   gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
#                                                    aa         prob
# 1                                                 ...           NA
# 2                                                 ...           NA
# 3 The probability of being a carrier is 0.0002422359  0.0002422359
# 4                      an BRCA1 carrier 0.0001061067  0.0001061067
# 5                        an BRCA2 carrier 0.00013612  0.0001361200
# 6                                                 ...           NA
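The data.frame matters here: the warning in the question is what R emits when `$<-` is applied to a plain character vector. The following is a minimal base R sketch, not part of the original answer, that reproduces that warning and then performs the same extraction without dplyr. The names `vec`, `warn_msg`, `hit`, and `df` are illustrative assumptions.

```r
# Sketch only: `vec`, `warn_msg`, `hit`, and `df` are illustrative names.

# 1. `$<-` on a plain character vector warns about list coercion.
vec <- c(" an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ")
warn_msg <- tryCatch({
  vec$prob <- NA
  NA_character_
}, warning = function(w) conditionMessage(w))
warn_msg
# "Coercing LHS to a list"

# 2. With a data.frame column, the same extraction works in base R.
df <- data.frame(
  aa = c("...",
         "The probability of being a carrier is 0.0002422359 ",
         " an BRCA1 carrier 0.0001061067 ",
         " an BRCA2 carrier 0.00013612 "),
  stringsAsFactors = FALSE
)
hit <- grepl("(probability|BRCA[12] carrier)", df$aa)
df$prob <- NA_real_                      # default for rows without a match
df$prob[hit] <- as.numeric(
  sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", df$aa[hit])
)
```

Filling an NA column first and assigning only the matching rows avoids running the substitution on rows that should stay NA.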

Regex walk-through:

  • ^ and $ are the beginning and end of the string, respectively; \\b is a word boundary; none of these "consume" any characters, they just mark positions
  • . means any one character
  • ? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group. Placed directly after another quantifier, as in .*?, the ? instead makes that quantifier lazy (match as little as possible)
  • \\s is whitespace, including spaces and tabs
  • [0-9] is a class, meaning any one character between 0 and 9; similarly, [a-z] is any lowercase letter, [a-zA-Z] is any letter, [0-9A-F] is any (uppercase) hexadecimal digit, etc.
  • (...) is a saved group; it's not uncommon to use | inside a group as an "or"; saved groups can be recalled by number in the replacement= argument of gsub, so \\1 recalls the first group from the pattern
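The elements above can be tried one at a time. The sketch below uses made-up strings (`x` and `spellings` are illustrative, not from the question) and only base R's `grepl` and `sub`.

```r
# Illustrative inputs, not taken from the question.
x <- c("abc", "abc123", "  42  ", "3.14")

# ^ and $ anchor the match to the whole string.
anchored <- grepl("^abc$", x)
# TRUE FALSE FALSE FALSE

# [0-9]+ : one or more digits anywhere in the string.
has_digit <- grepl("[0-9]+", x)
# FALSE TRUE TRUE TRUE

# \\s* : optional whitespace around a run of digits.
padded_int <- grepl("^\\s*[0-9]+\\s*$", x)
# FALSE FALSE TRUE FALSE

# ? : the preceding character is optional.
spellings <- c("color", "colour", "colouur")
optional_u <- grepl("^colou?r$", spellings)
# TRUE TRUE FALSE

# | inside a group is an "or"; the dot is escaped to mean a literal dot.
either <- grepl("^(abc|3\\.14)$", x)
# TRUE FALSE FALSE TRUE

# Saved groups are recalled as \\1, \\2 in the replacement.
swapped <- sub("([a-z]+)([0-9]+)", "\\2\\1", x)
# "abc" "123abc" "  42  " "3.14"
```

Strings that do not match the pattern are returned unchanged by `sub`, which is why the answer guards the extraction with `grepl` first.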

So grouped and summarized:

  "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1         ^^^^^^^^^^^^^^^^^^
2      ^^^
3   ^^^
4                           ^^^^
  1. This is the "number" part, that allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
  2. The word boundary guarantees that we include leading digits (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
  3. Anything before the number-like string.
  4. Some or no blank space after the number.
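To see how leading digits can be lost, here is a small sketch comparing the pattern above with a hypothetical variant that uses a greedy `.*` prefix and no word boundary. It assumes `perl = TRUE` (the PCRE engine) so that the behaviour is the familiar backtracking one; the answer's own code uses R's default engine. The input string and variable names are made up for illustration.

```r
# Illustrative input, not from the question.
s <- "value 12.345  "

# Pattern from the answer: lazy prefix plus word boundary.
full <- sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", s, perl = TRUE)
full
# "12.345"

# Hypothetical variant: greedy prefix, no word boundary.
# The greedy .* swallows as much as it can and leaves the group
# only the single digit it strictly needs.
greedy <- sub("^.*([0-9]+\\.?[0-9]*)\\s*$", "\\1", s, perl = TRUE)
greedy
# "5"
```

How many digits are lost depends on the exact variant, which is consistent with the "depending on a few things" caveat in point 2.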


Regex isn't unique to R; it's a pattern-matching language that R (and most other programming languages) supports in one way or another.



Source: https://stackoverflow.com/questions/66002140/extract-part-of-word-into-a-field-from-a-long-string-using-r
