How to use regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like Chinese?

Question

How do I strip punctuation from ASCII and UTF-8 encoded strings in R without messing up the original UTF-8 characters, specifically Chinese?

library(stringi)
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')

results in:

Longchamp Le Pliage ��背�� 小

but the desired result should be:

Longchamp Le Pliage 肩背包 小

I'm looking to remove all the CJK Symbols and Punctuation as well as ASCII punctuation.

@akrun, sessionInfo() is as follows:

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252

Answer 1:

How Chinese characters (hanzi) display varies by platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, and some of the hanzi are just being displayed wrong even though their underlying code points are correct. Try this:

library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)

If the text displays correctly on the plot, then the string is properly encoded underneath and the problem is only in how the R console displays it. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:

no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)

to see whether stri_replace_all_regex is in fact doing what you expect.
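If plotting is inconvenient, the same check can be done without relying on how anything renders. The following is only a sketch under a few assumptions: the string is written with \u escapes (my choice, so that the encoding of the source file cannot interfere), and the variable names escaped and expected are made up for illustration. It uses enc2utf8 and Encoding from base R plus stri_escape_unicode from stringi, which prints non-ASCII characters as ASCII escape sequences that look the same on any console.

```r
library(stringi)

# Same string as above, written with \u escapes so the source file's
# encoding cannot affect it (assumption: equivalent to the literal hanzi)
my_text <- "Longchamp Le Pliage \u80a9\u80cc\u5305 (\u5c0f)"

# Inspect the declared encoding, then make sure the string is UTF-8
Encoding(my_text)
my_text <- enc2utf8(my_text)

# Strip everything in the Unicode punctuation category
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")

# Show the result as ASCII-only escapes, independent of console rendering
escaped <- stri_escape_unicode(no_punct)
print(escaped)

# Compare against the desired result, code point by code point
expected <- "Longchamp Le Pliage \u80a9\u80cc\u5305 \u5c0f"
print(identical(no_punct, expected))
```

If the comparison comes back TRUE, the hanzi survived the replacement intact and the garbled characters are purely a console display issue.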

Source: https://stackoverflow.com/questions/32451109/how-to-use-regex-to-strip-punctuation-without-tainting-utf-8-or-utf-16-encoded-t

Tags

regex

unicode

data.table

stringi