Why doesn't R gsub (or regexp) for punctuation match all punctuation?

后端未结

I am working on cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1) is not matching all punctuation.

1 Answer

野趣味

2021-01-20 16:18
Probably the punctuation character is outside the ASCII range. By default, [[:punct:]] contains only ASCII punctuation characters. You can extend the class to Unicode with the (*UCP) directive, but that alone is not enough: you also need to tell the regex engine to read the target string as a UTF-encoded string with (*UTF), otherwise a multibyte character will be seen as several one-byte characters. So:
```
gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)
```
Note: these two directives exist only in Perl mode (perl = TRUE) and must be at the very beginning of the pattern.
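
As a quick check of the pattern above, here is a minimal sketch on a made-up string. The sample text and the variable name `cleaned` are assumptions for illustration, not the asker's data. The `\u` escapes keep the source ASCII-only and make R mark the string as UTF-8.

```
# Hypothetical sample text: curly quotes (U+201C, U+201D), an ellipsis (U+2026),
# a curly apostrophe (U+2019) and an em dash (U+2014), all outside the ASCII range.
X1 <- "Hello, \u201cworld\u201d\u2026 it\u2019s R\u2014really!"

# (*UCP) extends [[:punct:]] to Unicode; (*UTF) makes PCRE read X1 as UTF-8.
cleaned <- gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)
print(cleaned)
```

Both the ASCII comma and exclamation mark and the non-ASCII characters should be gone from `cleaned`, leaving only letters and spaces.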

Note 2: you can do nearly the same thing like this:
```
gsub("(*UTF)\\pP+", "", X1, perl = TRUE)
```
Because \pP is a shorthand for all Unicode punctuation characters, (*UCP) becomes unnecessary.
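
One difference is worth checking before swapping one pattern for the other. As far as the PCRE documentation describes it, \pP covers only the Unicode punctuation categories, while [[:punct:]] under (*UCP) also matches ASCII characters in the symbol categories, such as + and =. A small sketch with a made-up string (the names `s`, `with_pP` and `with_punct` are hypothetical):

```
# Hypothetical input: "+" and "=" are Unicode symbols (category S),
# while "!" is punctuation (category P).
s <- "a+b=c!"

# \pP matches category P only, so the symbols are expected to survive.
with_pP <- gsub("(*UTF)\\pP+", "", s, perl = TRUE)

# [[:punct:]] with (*UCP) also covers the ASCII symbols.
with_punct <- gsub("(*UCP)(*UTF)[[:punct:]]", "", s, perl = TRUE)

print(c(with_pP, with_punct))
```

If symbols should be stripped along with punctuation, the [[:punct:]] form (or a class such as `[\\pP\\pS]`) is probably the closer match to the ASCII behavior.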