I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thou
The easiest way is with stringr:
> library(stringr)
> # optional minus sign, then digits with optional "." or "," separators, then " Votes"
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text, regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub, which strips the trailing space and word:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a base R version (no stringr needed) that extracts the number before the word "Votes", even if it contains commas or periods:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text))))
[1] "558,586"
If you want the label too, just drop the gsub part:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text)))
[1] "558,586 Votes"
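As an alternative to matching the label and then stripping it with gsub, a lookahead can require " Votes" after the number without including it in the match. This is only a sketch: the sample `text` below is made up (the original vector isn't shown in the question), and the pattern assumes the number starts with a digit and is followed by exactly one space.

```r
# Hypothetical sample input; the real `text` is not shown in the question.
text <- "Rated 1 of 15 in 202 reviews, 558,586 Votes"

# A digit followed by any run of digits, commas, or periods,
# matched only when " Votes" comes right after.
# The lookahead (?= Votes) needs perl = TRUE and is not part of the match.
m <- regmatches(text, regexpr("[[:digit:]][[:digit:],.]*(?= Votes)", text, perl = TRUE))
m
# [1] "558,586"
```

regexpr returns only the first match per element; swap in gregexpr (and unlist) if a string can contain more than one "N Votes".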
And if you want to pull out all the numbers:
> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*", text)))
[1] "1" "15" "202" "558,586"
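Note that all of these return character strings. If you need an actual number afterwards, one possible follow-up is to drop the commas and convert. This assumes the comma is a thousands separator, as the question describes; the variable names are just for illustration.

```r
# The string extracted by the calls shown earlier
votes_chr <- "558,586"

# Remove the thousands separators, then convert to numeric
votes_num <- as.numeric(gsub(",", "", votes_chr, fixed = TRUE))
votes_num
# [1] 558586
```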