I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thou
You may use:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
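Since the question mentions a character vector, here is a minimal sketch of the same call applied to several elements at once. The sample strings and the helper name clean_votes are hypothetical, made up for illustration; only the "558,586 Votes" form comes from the question.

```r
# Hypothetical sample vector (not from the original data)
x <- c("Question 1   text? 558,586 Votes",
       "Question 2 text? 12 Votes",
       "  No count here  ")

# Hypothetical helper wrapping the call shown above; gsub() is vectorized,
# so every element of the vector is processed in one call
clean_votes <- function(v) {
  trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", v))
}

cleaned <- clean_votes(x)
print(cleaned)
```

Digits that are not followed by "Votes" (the "1" and "2" after "Question") should be left alone, because the second alternative only matches when "Votes" comes right after the number and optional whitespace.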
Details
(\\s){2,} - matches 2 or more whitespace chars while capturing the last occurrence, which is reinserted using the \1 placeholder in the replacement pattern
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0+ whitespace chars
(Votes) - Group 2 (restored in the output using the \2 placeholder): a Votes substring.

Note that trimws will remove any leading/trailing whitespace.
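To see how the two alternatives behave on their own, the following sketch runs the same pattern on short strings. The inputs "a   b" and "x 1,234 Votes" are made up for illustration; the third is a fragment of the text above.

```r
pattern <- "(\\s){2,}|\\d[0-9,]*\\s*(Votes)"

# First alternative: a run of 2+ whitespace chars is replaced with the
# single whitespace char captured in Group 1 (Group 2 is empty here)
collapsed <- gsub(pattern, "\\1\\2", "a   b")

# Second alternative: the number and the whitespace after it are dropped,
# and "Votes" is put back from Group 2 (Group 1 is empty here)
no_count <- gsub(pattern, "\\1\\2", "x 1,234 Votes")

# Digits that are not followed by "Votes" do not match either alternative
untouched <- gsub(pattern, "\\1\\2", "NO. 1 Amendment to Title 15")

print(c(collapsed, no_count, untouched))
```

In each match only one of the two groups participates, so the replacement "\\1\\2" effectively inserts whichever group was captured and an empty string for the other.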