Remove all punctuation except underline between characters in R with POSIX character class

Question

I would like to use R to remove all underscores except those between words. In other words, the code should remove underscores at the beginning or end of a word. The result should be 'hello_world and hello_world'. I want to use the pre-built POSIX character classes. Right now I have learned to exclude particular characters with the following code, but I don't know how to use the word boundary sequences.

test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl = TRUE)

Answer 1:

You can use

gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)

Details:

[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
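To see what each alternative contributes, the pattern can be applied piece by piece. This is only a minimal sketch: the input string x, with its extra comma and exclamation mark, is made up for illustration and does not come from the question.

```r
# Hypothetical input with extra punctuation, for illustration only
x <- "hello_world, and _hello_world_!"

# Alternative 1: any punctuation other than "_"
step1 <- gsub("[^_[:^punct:]]", "", x, perl = TRUE)
step1
#[1] "hello_world and _hello_world_"

# Alternatives 2 and 3: underscores that touch a word boundary
step2 <- gsub("_+\\b|\\b_+", "", step1, perl = TRUE)
step2
#[1] "hello_world and hello_world"

# The combined pattern does both in a single pass
combined <- gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", x, perl = TRUE)
identical(combined, step2)
#[1] TRUE
```

Note that _ counts as a word character, which is why \b never matches between an underscore and a letter and the inner underscores survive.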

Answer 2:

One way that avoids a hand-written regex is to split the string on spaces and call trimws on each piece with the whitespace argument set to _ (this argument requires R 3.6.0 or later), i.e.

paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
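If the same cleanup is needed for several strings at once, the split-and-trim idea could be wrapped in a small helper. The function name trim_underscores and the second input string are hypothetical, chosen here for illustration; the sketch assumes words are separated by single spaces.

```r
# Hypothetical helper: trim leading/trailing "_" from each space-separated token
trim_underscores <- function(x) {
  tokens <- strsplit(x, " ", fixed = TRUE)  # one character vector per input string
  vapply(
    tokens,
    function(tok) paste(trimws(tok, whitespace = "_"), collapse = " "),
    character(1)
  )
}

trim_underscores(c("hello_world and _hello_world_", "__a_b__ c_"))
#[1] "hello_world and hello_world" "a_b c"
```

Unlike the single-underscore regex approaches, trimws strips whole runs of underscores at the edges of a token.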

Answer 3:

We can remove every underscore that has a word boundary on either side. We use positive lookahead and lookbehind assertions to find such underscores. To remove underscores at the start and end of the string we use trimws.

test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
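Since \b is itself a zero-width assertion, wrapping it in a lookbehind or lookahead should not change what is matched. The following sketch compares the two spellings on the question's input; the variable names are made up for illustration.

```r
test <- "hello_world and _hello_world_"

# Pattern from the answer: \b wrapped in lookaround assertions
with_lookaround <- gsub("(?<=\\b)_|_(?=\\b)", "", test, perl = TRUE)

# Same idea with \b written directly
plain_boundary <- gsub("\\b_|_\\b", "", test, perl = TRUE)

identical(with_lookaround, plain_boundary)
#[1] TRUE
```

Either pattern removes only one underscore per boundary, so a run such as "__hello" at the start of the string still relies on trimws to be cleaned up completely.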

Answer 4:

You could use:

test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output

[1] "hello_world and hello_world"

Explanation of regex:

(?<![^\\W])  assert that what precedes is a non word character OR the start of the input
_            match an underscore to remove
|            OR
_            match an underscore to remove, followed by
(?![^\\W])   assert that what follows is a non word character OR the end of the input
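The double negative [^\W] is the same set as \w, so the pattern can presumably also be written with \w directly. The sketch below checks both spellings on a few made-up inputs (the vector x is hypothetical and not part of the original answer).

```r
# Hypothetical inputs: edge underscores, an inner underscore, a lone underscore
x <- c("_a_", "a_b", "a _ b")

double_negative <- gsub("(?<![^\\W])_|_(?![^\\W])", "", x, perl = TRUE)
direct          <- gsub("(?<!\\w)_|_(?!\\w)", "", x, perl = TRUE)

identical(double_negative, direct)
#[1] TRUE

direct
#[1] "a"    "a_b"  "a  b"
```

The lone underscore in the third string is removed as well, leaving two spaces behind, because it has no word character on either side.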

Source: https://stackoverflow.com/questions/64135363/remove-all-punctuation-except-underline-between-characters-in-r-with-posix-chara

Tags

posix

gsub