I am working on cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1)
does not match all punctuation.
The punctuation character is probably outside the ASCII range. By default, [[:punct:]]
matches only ASCII punctuation characters, but you can extend the class to Unicode with the (*UCP)
directive. That alone is not enough: you also need to tell the regex engine to read the target string as UTF-8 with (*UTF)
(otherwise a multibyte character is seen as several one-byte characters). So:
gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)
Note: these two directives exist only in Perl mode (perl = TRUE) and must be at the very beginning of the pattern.
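To make the difference concrete, here is a minimal sketch that compares the plain class with the directive-prefixed one. The helper name strip_punct and the sample strings are assumptions for illustration, not part of the original question, and the call to enc2utf8() is an extra precaution so that (*UTF) always sees valid UTF-8.

```r
# Hypothetical helper for illustration; the name strip_punct is an assumption.
strip_punct <- function(x) {
  # enc2utf8() converts the input to UTF-8 so that (*UTF) reads valid UTF-8.
  # (*UCP) extends [[:punct:]] to Unicode; both directives must lead the pattern.
  gsub("(*UCP)(*UTF)[[:punct:]]", "", enc2utf8(x), perl = TRUE)
}

# Assumed sample data: ASCII punctuation plus an en dash, curly quotes, an ellipsis.
X1 <- c("Hello, world!", "na\u00efve \u2013 \u201cquoted\u201d\u2026")

# Plain class in Perl mode, for comparison.
print(gsub("[[:punct:]]", "", X1, perl = TRUE))

# Unicode-aware class with both directives.
print(strip_punct(X1))
```

Printing both results side by side should show which characters only the second call removes.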
Note 2: you can get the same result like this:
gsub("(*UTF)\\pP+", "", X1, perl = TRUE)
Because \pP
is shorthand for all Unicode punctuation characters, (*UCP)
becomes unnecessary.
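One caveat worth checking on your own data: \pP covers only the Unicode punctuation categories, while the ASCII [[:punct:]] class also includes characters such as $ and +, which Unicode classifies as symbols (\pS). The sketch below, using an assumed sample string and hypothetical variable names, shows one way to remove both categories if that is what you need.

```r
# Assumed sample string: a colon, a dollar sign, an en dash, a percent sign, an ellipsis.
x <- "price: $5 \u2013 10% off\u2026"

# Punctuation only: the dollar sign is a currency symbol, so it is left in place.
no_punct <- gsub("(*UTF)\\pP+", "", x, perl = TRUE)

# Punctuation and symbols: the dollar sign is removed as well.
no_punct_sym <- gsub("(*UTF)[\\pP\\pS]+", "", x, perl = TRUE)

print(no_punct)
print(no_punct_sym)
```

Whether symbols should be stripped depends on the data, so treat the second pattern as an option rather than a replacement.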