How to match all internationalized text?

眼角桃花 2021-01-22 05:49

I'm on a search-and-destroy mission for anything Amazon finds distasteful. In the past I've dealt with this by using iconv to convert from "UTF-8" to "latin1"

2 Answers
  • 2021-01-22 06:29

    I looped a bit through iconvlist() and found this (among other combinations):

    test <- "Gwena\xeblle M"
    iconv(test, "CP1163", "UTF-8")
    # [1] "Gwenaëlle M"
    

    I realize this is not what you asked for, but it might be possible to find the correct source encoding this way.
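
    The loop over iconvlist() is not shown in the answer, so here is a minimal sketch of how it might look. The `target` string and the helper name `decodes_to_target` are assumptions made for illustration. The set of encodings returned by iconvlist() also varies by platform, so the candidates found will differ between systems.

    ```r
    # Byte string from the answer above; \xeb is a single non-ASCII byte.
    test <- "Gwena\xeblle M"
    # Assumed expected result, written with a \u escape so it is UTF-8.
    target <- "Gwena\u00eblle M"

    # Hypothetical helper: TRUE if decoding `test` from `enc` yields `target`.
    decodes_to_target <- function(enc) {
      out <- tryCatch(
        suppressWarnings(iconv(test, from = enc, to = "UTF-8")),
        error = function(e) NA_character_  # unsupported conversions are skipped
      )
      length(out) == 1L && !is.na(out) && identical(out, target)
    }

    # Keep every encoding name that reproduces the expected text.
    candidates <- Filter(decodes_to_target, iconvlist())
    head(candidates)
    ```

    Many single-byte encodings map 0xEB to "ë", so several candidates are usually returned. A match only shows that an encoding is consistent with this one sample, not that it is the true source encoding.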

  • 2021-01-22 06:41

    I believe this pattern should work:

    pat <- "[\x80-\xFF]"
    
    test <- c("Gwena\xeblle M", "\x92","\xe4","\xe1","\xeb") 
    gsub(pat, "", test, perl=TRUE)
    # [1] "Gwenalle M" ""           ""           ""           ""     
    

    Explanation:

    This works because the character class "[\x00-\xFF]" would match every single-byte character of the form \x##. The first half of that range, bytes 0 to 127 (00 to 7F in hex), are the ASCII characters. So it is the second half, bytes 128 to 255 (80 to FF in hex), that you want to search out and destroy.
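
    To make the byte ranges concrete, the sketch below inspects the raw bytes of the sample string and keeps only those below 0x80. It illustrates the explanation using base R's charToRaw() and rawToChar(). It is not the regex approach itself, and the variable names are chosen only for this example.

    ```r
    # Sample string from the answer; \xeb is one byte with value 235.
    x <- "Gwena\xeblle M"

    # View the string as integer byte values (0..255).
    bytes <- as.integer(charToRaw(x))

    # Bytes 0x80..0xFF (128..255) are the non-ASCII half of the range.
    non_ascii <- bytes >= 0x80
    bytes[non_ascii]
    # [1] 235

    # Rebuild the string from the ASCII bytes only.
    cleaned <- rawToChar(as.raw(bytes[!non_ascii]))
    cleaned
    # [1] "Gwenalle M"
    ```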
