R tidyr regex: extract ordered numbers from character column

Submitted by 瘦欲@ on 2020-01-04 05:18:07

Question


Suppose I have a data frame like this

df <- data.frame(x=c("This script outputs 10 visualizations.", 
                     "This script outputs 1 visualization.", 
                     "This script outputs 5 data files.", 
                     "This script outputs 1 data file.", 
                     "This script doesn't output any visualizations or data files", 
                     "This script outputs 9 visualizations and 28 data files.", 
                     "This script outputs 1 visualization and 1 data file."))

It looks like this

                                                            x
1                      This script outputs 10 visualizations.
2                        This script outputs 1 visualization.
3                           This script outputs 5 data files.
4                            This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6     This script outputs 9 visualizations and 28 data files.
7        This script outputs 1 visualization and 1 data file.

Is there a simple way, possibly using the Tidyverse, to extract the number of visualizations and the number of data files for each row? When there are no visualizations (or no data files, or both), I would like to extract 0. Essentially, I would like the final result to look like this:

    viz   files
1    10       0
2     1       0
3     0       5
4     0       1
5     0       0
6     9      28
7     1       1

I tried using stuff like

str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")

but I got so lost.


Answer 1:


We can use a regex lookahead in str_extract to pull out one or more digits (\\d+) that are followed by a space and 'vis' (for visualizations) or by a space and 'data file'/'data files' (for files), using one pattern per output column

library(dplyr)
library(stringr)
library(tidyr)   # provides replace_na()
df %>% 
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)   # rows with no match are NA; turn them into 0
#  viz files
#1  10     0
#2   1     0
#3   0     5
#4   0     1
#5   0     0
#6   9    28
#7   1     1

In the first pattern, \\d+ matches one or more digits, and the lookahead (?= vis) requires that they be followed by a space and 'vis' without including that text in the match. The second pattern does the same for a space followed by 'data file' or 'data files'. Rows with no match return NA, which replace_na (from tidyr) turns into 0.
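To see what the lookahead does in isolation, here is a minimal base R sketch on a single string (row 6 of the question's data). It is not part of the answer above: it swaps str_extract for regmatches() and regexpr(), which need perl = TRUE for lookahead support, and the variable names are made up for illustration.

```r
# One example string, copied from row 6 of the question's data frame
s <- "This script outputs 9 visualizations and 28 data files."

# perl = TRUE enables lookahead in base R's regex functions
viz_match   <- regmatches(s, regexpr("\\d+(?= vis)", s, perl = TRUE))
files_match <- regmatches(s, regexpr("\\d+(?= data files?)", s, perl = TRUE))

viz_match    # "9"  (the digits only; " vis" is checked but not captured)
files_match  # "28"

# A string without a count gives a zero-length result here, not NA
none <- "This script doesn't output any visualizations or data files"
no_match <- regmatches(none, regexpr("\\d+(?= vis)", none, perl = TRUE))
length(no_match)  # 0
```

Note the difference from str_extract, which returns NA for a non-matching string and therefore keeps the result the same length as the input.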




Answer 2:


You could use the unglue package for a readable solution, since there is only a limited number of possible patterns, and then replace the NAs with 0:

library(unglue)
patterns <-
  c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
    "This script outputs {viz} visualization{=s{0,1}}.",
    "This script outputs {files} data file{=s{0,1}}.")

res <- unglue_unnest(df, x, patterns, convert = TRUE) 
res[is.na(res)] <- 0
res
#>   viz files
#> 1  10     0
#> 2   1     0
#> 3   0     5
#> 4   0     1
#> 5   0     0
#> 6   9    28
#> 7   1     1



Answer 3:


A base R approach ...

# sub() returns the input unchanged when there is no match, so as.numeric()
# yields NA (with a coercion warning) for those rows; they become 0 below.
df$viz <- as.numeric(sub(".*This script outputs (\\d+) visualization.*", "\\1", df$x))
# The space before (\\d+) stops the greedy .* from swallowing leading digits (28 -> 8)
df$files <- as.numeric(sub(".* (\\d+) data file.*", "\\1", df$x))
df[is.na(df)] <- 0

df
#                                                             x viz files
# 1                      This script outputs 10 visualizations.  10     0
# 2                        This script outputs 1 visualization.   1     0
# 3                           This script outputs 5 data files.   0     5
# 4                            This script outputs 1 data file.   0     1
# 5 This script doesn't output any visualizations or data files   0     0
# 6     This script outputs 9 visualizations and 28 data files.   9    28
# 7        This script outputs 1 visualization and 1 data file.   1     1
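Because sub() hands back the whole input when nothing matches, the approach above relies on as.numeric() failing (and warning) for those rows. As a hedged alternative, the sketch below stays in base R but uses regexpr() with a lookahead and fills in 0 directly. count_before() is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# count_before(): hypothetical helper that returns the number appearing
# right before " <label>" in each string, or 0 when there is none
count_before <- function(x, label) {
  pattern <- paste0("\\d+(?= ", label, ")")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- numeric(length(x))               # starts as all zeros
  hit <- m > 0
  out[hit] <- as.numeric(regmatches(x, m))  # regmatches() returns matches only
  out
}

x <- c("This script outputs 10 visualizations.",
       "This script outputs 1 visualization.",
       "This script outputs 5 data files.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.",
       "This script outputs 1 visualization and 1 data file.")

res <- data.frame(viz   = count_before(x, "visualization"),
                  files = count_before(x, "data file"))
res
```

This avoids the coercion warnings and the separate NA replacement step, at the cost of a small helper function.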


Source: https://stackoverflow.com/questions/59515177/r-tidyr-regex-extract-ordered-numbers-from-character-column
