Why doesn't R's gsub (or regexp) for punctuation match all punctuation?

有刺的猬 2021-01-20 15:51

I am cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1) does not match all punctuation.

1 answer
  •  野趣味 (original poster)
    2021-01-20 16:18

    The punctuation character is probably outside the ASCII range. By default, [[:punct:]] matches only ASCII punctuation characters. You can extend the class to Unicode with the (*UCP) directive, but that alone is not enough: you also need to tell the regex engine to read the target string as UTF-8 with (*UTF) (otherwise a multibyte character will be seen as several one-byte characters). So:

    gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)
    

    Note: these two directives exist only in Perl mode (perl = TRUE) and must be at the very beginning of the pattern.
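
A minimal sketch of the difference, assuming a made-up sample string (`x`, `ascii_only`, and `unicode_aware` are illustrative names, not from the question). The non-ASCII punctuation is written with `\u` escapes so the script itself stays ASCII:

```r
# Hypothetical sample: ASCII punctuation mixed with curly quotes (U+201C/U+201D),
# an ellipsis (U+2026), and a fullwidth comma (U+FF0C).
x <- "Hello, \u201cworld\u201d\u2026 ok\uff0c done!"

# Without (*UCP), the PCRE class is expected to cover ASCII punctuation only,
# so the non-ASCII characters would normally be left in place.
ascii_only <- gsub("[[:punct:]]", "", x, perl = TRUE)

# With both directives at the start of the pattern, the class also covers
# Unicode punctuation.
unicode_aware <- gsub("(*UCP)(*UTF)[[:punct:]]", "", x, perl = TRUE)

print(ascii_only)
print(unicode_aware)  # "Hello world ok done"
```

Comparing the two printed results on your own data is a quick way to check whether stray non-ASCII punctuation is what the original pattern was missing.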

    Note 2: you can do the same like this:

    gsub("(*UTF)\\pP+", "", X1, perl = TRUE)
    

    Because \pP is shorthand for all Unicode punctuation characters, (*UCP) becomes unnecessary.
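
One caveat worth checking on your own data: `\pP` covers only the Unicode punctuation category (P), while characters such as `$`, `+`, and `=` belong to the symbol category (S). The sketch below uses a made-up string `y`; the `[\\pP\\pS]` variant is an addition for illustration, not part of the answer above.

```r
# Hypothetical sample: "+", "=" and "$" are Unicode symbols (category S),
# while "," "!" and the inverted exclamation mark (U+00A1) are punctuation (P).
y <- "a+b = $5, \u00a1hola!"

# \pP removes only category-P characters; the symbols survive.
only_p <- gsub("(*UTF)\\pP+", "", y, perl = TRUE)

# Adding \pS to the character class removes symbols as well.
p_and_s <- gsub("(*UTF)[\\pP\\pS]+", "", y, perl = TRUE)

print(only_p)   # "a+b = $5 hola"
print(p_and_s)  # "ab  5 hola"
```

Which pattern is appropriate depends on whether symbols count as "punctuation" for the cleanup at hand.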
