Question
Suppose I have a data frame like this
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
It looks like this
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse, to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both), I would like to extract 0. Essentially, I would like the final result to look like this
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using stuff like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got so lost.
Answer 1:
We can use a regex lookahead in str_extract to extract one or more digits (\\d+) that are followed by a space and 'vis' or 'data file(s)', putting the results into two columns
library(dplyr)
library(stringr)
library(tidyr)  # provides replace_na()
df %>%
transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
# rows without a match are NA; turn them into 0
mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 1
#5 0 0
#6 9 28
#7 1 1
In the first case, the pattern matches one or more digits (\\d+) that are followed by a positive lookahead ((?= ...)), which requires a space and then 'vis' without including them in the match. For the second column, it extracts the digits that are followed by a space and 'data file' or 'data files'. Rows with no match return NA, which is then replaced by 0.
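To see the lookahead in isolation, here is a minimal sketch on single strings. The variable names (s, viz, files, missing) are only for illustration and are not part of the answer above.

```r
library(stringr)

s <- "This script outputs 9 visualizations and 28 data files."

# Digits immediately followed by " vis"; the lookahead text is checked but not returned
viz <- str_extract(s, "\\d+(?= vis)")
viz
# "9"

# Digits immediately followed by " data file"; the trailing "s" is optional
files <- str_extract(s, "\\d+(?= data files?)")
files
# "28"

# No digits precede " vis" in this string, so the result is NA
missing <- str_extract("This script outputs 5 data files.", "\\d+(?= vis)")
missing
# NA
```

Because str_extract returns character values (and NA when nothing matches), the answer wraps it in as.numeric and then replaces the NAs with 0.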
Answer 2:
You could use the unglue package for a readable solution, since there is a limited number of possible patterns, and then replace the NAs with 0:
library(unglue)
patterns <-
c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
"This script outputs {viz} visualization{=s{0,1}}.",
"This script outputs {files} data file{=s{0,1}}.")
res <- unglue_unnest(df, x, patterns, convert = TRUE)
res[is.na(res)] <- 0
res
#> viz files
#> 1 10 0
#> 2 1 0
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1
Answer 3:
A base R approach ...
# Capture the digits right before each keyword; the lazy .*? keeps multi-digit numbers whole
df$viz <- suppressWarnings(as.numeric(sub(".*?(\\d+) visualization.*", "\\1", df$x, perl = TRUE)))
df$files <- suppressWarnings(as.numeric(sub(".*?(\\d+) data file.*", "\\1", df$x, perl = TRUE)))
# Without a match, sub() returns the input unchanged, so as.numeric() gives NA
df[is.na(df)] <- 0
df
# x viz files
# 1 This script outputs 10 visualizations. 10 0
# 2 This script outputs 1 visualization. 1 0
# 3 This script outputs 5 data files. 0 5
# 4 This script outputs 1 data file. 0 1
# 5 This script doesn't output any visualizations or data files 0 0
# 6 This script outputs 9 visualizations and 28 data files. 9 28
# 7 This script outputs 1 visualization and 1 data file. 1 1
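As a variation on the base R idea, the number can also be pulled out directly with regexpr and regmatches using a Perl-style lookahead, so there is no need to rely on as.numeric failing for unmatched rows. This is only a sketch: extract_count is a hypothetical helper name, and the input is a shortened version of the example data.

```r
# Hypothetical helper: the number right before `keyword`, or 0 when there is none
extract_count <- function(text, keyword) {
  pattern <- paste0("\\d+(?= ", keyword, ")")
  m <- regexpr(pattern, text, perl = TRUE)   # -1 where nothing matches
  out <- numeric(length(text))               # start with 0 for every element
  out[m != -1] <- as.numeric(regmatches(text, m))
  out
}

x <- c("This script outputs 10 visualizations.",
       "This script outputs 5 data files.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.")

res <- data.frame(viz = extract_count(x, "visualization"),
                  files = extract_count(x, "data file"))
res
```

Here regmatches returns only the matched substrings, which is why they are assigned back to the positions where regexpr found a match.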
Source: https://stackoverflow.com/questions/59515177/r-tidyr-regex-extract-ordered-numbers-from-character-column