unicode characters conversion in R

后端未结

关注

 3  702

I have this MTST column, which when printed yields

 [1] \"G


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  清歌不尽        
                
              
                            
                2020-12-03 20:10
              
            
            
                                                                       
Here's another alternative for recovering the true uncode character encoded in a string (borrowed from this question). Here we carefully match the form <U+[hex]> and unroll that hex value into a properly sized unicode character with some bit manipulation.

trueunicode <- function(x) {
    packuni<-Vectorize(function(cp) {
        bv <- intToBits(cp)
        maxbit <- tail(which(bv!=as.raw(0)),1)
        if(maxbit < 8) {
            rawToChar(as.raw(codepoint))
        } else if (maxbit < 12) {
            rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:11], as.raw(c(0,1,1))), "raw")))
        } else if (maxbit < 17){
            rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:12], as.raw(c(0,1)), bv[13:16], as.raw(c(0,1,1,1))), "raw")))    
        } else {
           stop("too many bits")
        }
    })
    m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
    codes <- regmatches(x,m)
    chars <- lapply(codes, function(x) {
        codepoints <- strtoi(paste0("0x", substring(x,4,7)))
        packuni(codepoints)

    })
    regmatches(x,m) <- chars
    Encoding(x)<-"UTF-8"
    x
}


using the sample

input <- c("<U+0391>G<U+03A1><U+0399><U+039D><U+0399><U+039F>", "<U+0391>G<U+03A7><U+0399><U+0391><U+039B><U+039F>S","<U+0391><U+0399>G<U+0399><U+039D><U+0391>", "<U+0391><U+0399>G<U+0399><U+039F>", "<U+0391><U+0399><U+0394><U+0397><U+03A8><U+039F>S","<U+0391><U+039A><U+03A4><U+0399><U+039F>(<U+03A0><U+03A1><U+0395><U+0392><U+0395><U+0396><U+0391>)")


you get

trueunicode(input)
# [1] "ΑGΡΙΝΙΟ"        "ΑGΧΙΑΛΟS"       "ΑΙGΙΝΑ"         "ΑΙGΙΟ"         
# [5] "ΑΙΔΗΨΟS"        "ΑΚΤΙΟ(ΠΡΕΒΕΖΑ)"

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天涯浪人        
                
              
                            
                2020-12-03 20:12
              
            
            
                                                                       
What you have there looks like plain 7-bit ASCII characters with some attempt at encoding Unicode code-points by wrapping some of them thus: <U+abcd>.

This is not a recognised encoding for Unicode, as far as I can tell, partly because how would you put a real < in your text? I suppose every < could be <U+jklm> where jklm is the code for an angle bracket... But ick.

So, first, try and get a UTF-8 encoded string from whatever generated this ascii-encoded mess!

However... after some serious hair pulling...

stringi to the rescue! Where 'MTST' is your vector of stuff, first convert the angle bracket notation to backslash-u and then use stri_unescape_unicode:

> require(stringi)
> greek2=gsub(">","", gsub("<U\\+","\\\\u",MTST))
> stri_unescape_unicode(greek2)
[1] "ΑGΡΙΝΙΟ                                 "
[2] "ΑGΧΙΑΛΟS                                "
[3] "ΑΙGΙΝΑ                                  "
[4] "ΑΙGΙΟ                                   "
[5] "ΑΙΔΗΨΟS                                 "
[6] "ΑΚΤΙΟ(ΠΡΕΒΕΖΑ)                          "


all the way up to

[123] "FΥΧΤΙΑ                                  "
[124] "ΧΑΛΚΙΔΑ                                 "
[125] "ΧΑΝΙΑ                                   "
[126] "ΧΙΟS                                    "
[127] "ΧΡΥSΟΥΠΟΛΗ_ΚΑΒΑΛΑ                       "
[128] "OΡΕΟΙ                                   "


once I fixed the bizarrely missing comma and quote mark in your "dput" data (edited your question for you).
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2020-12-03 20:13
              
            
            
                                                                       
I've written a convenient, general, and internally slightly wonky function in base R that works well for this purpose. Here it is:

dsub <- function(input,re,f=function(s,d) paste0(s,c(if (length(d)==0L) NULL else paste0('<',d,'>'),''),collapse='')) {
    splits <- strsplit(input,re,perl=T);
    delims <- lapply(strsplit(gsub(paste0('(',re,')'),'.\\1',input,perl=T),paste0('\\.(?=',re,')'),perl=T),function(x) sub(paste0('^(',re,').*'),'\\1',x[-1],perl=T))
    lapply(1:length(splits), function(i) { s <- splits[[i]]; d <- delims[[i]]; f(c(s,if (length(s)==length(d)) '' else NULL),d); } );
};


The idea behind the function is to provide a more powerful variation of strsplit() which allows you to not just split the input strings into fields, but have a lambda called once for each input string, which takes both the field list (I called it splits or s in my code) and the delimiters that delimited each field (called it delims or d).

Importantly, the final field is never delimited, so s is always one element longer than d. Now, it should be noted that that's not how strsplit() normally behaves; it actually doesn't return a final empty string field if the final delimiter in the input string comprised the end of said string, but I've sort of "patched" that behavior in my dsub() function for the sake of consistency; for every call to the lambda f(), it is guaranteed that s will be one element longer than d.

A second quirk is related to the way I extracted the delimiters; that was easier said than done. I used strsplit() again, but made the regex a zero-width lookahead assertion to preserve the delimiter content, and then, after the split, I called sub() to strip off everything after the delimiter. Now, strsplit() behaves weirdly when you use an entirely zero-width regex that matches multiple characters; I think what happens is that it matches the regex twice in the same spot, and then splits up the first and subsequent characters across adjacent returned fields. To solve that I added a dummy char before every instance of the delimiter and then matched that char (non-zero-width, just prior to the lookahead assertion) as part of the split regex, which naturally strips it off.

Anyway, here's a simple demo that shows how you can use dsub() with the lambda calling intToUtf8() to do this kind of "Unicode interpolation":

input <- c('Luc TR<U+00c9>HAN','aa<U+00ca>bb<U+00cb>cc','<U+00CC><U+00Cd>','','  ');
re <- '<U\\+([0-9a-fA-F]{4})>';
f <- function(s,d) paste0(s,c(if (length(d)==0L) NULL else intToUtf8(paste0('0x',sub(re,'\\1',d)),multiple=T),''),collapse='');
do.call(c,dsub(input,re,f));
## [1] "Luc TRÉHAN" "aaÊbbËcc"   "ÌÍ"         ""           "  "


And using the rather extensive example data given in this question:

input <- c("<U+0391>G<U+03A1><U+0399><..."); ## (excerpted)
do.call(c,dsub(input,re,f));
##   [1] "ΑGΡΙΝΙΟ                                 " "ΑGΧΙΑΛΟS                                " "ΑΙGΙΝΑ                                  " "ΑΙGΙΟ                                   "
##   [5] "ΑΙΔΗΨΟS                                 " "ΑΚΤΙΟ(ΠΡΕΒΕΖΑ)                          " "ΑΛΕΞΑΝΔΡΟΥΠΟΛΗ                          " "ΑΛΙΑΡΤΟS                                "
##   [9] "ΑΝΑΒΡΥΤΑ                                " "ΑΝΔΡΑΒΙΔΑ                               " "ΑΝOGΕΙΑ                                 " "ΑΡΑΞΟS                                  "
##  [13] "ΑΡΑΧOΒΑ                                 " "ΑΡGΟS(ΠΥΡGΕΛΑ)                          " "ΑΡGΟSΤΟΛΙ                               " "ΑΡΤΑ (ΠΟΛΗ)                             "
##  [17] "ΑΡΤΑ (FΙΛΟTΕΗ)                          " "ΑSΤΕΡΟSΚΟΠΕΙΟ                           " "ΑSΤΡΟS                                  " "ΑSΤΥΠΑΛΑΙΑ                              "
##  [21] "ΒΑΜΟS                                   " "ΒΕΛΟ (ΚΟΡΙΝTΙΑS)                        " "ΒΟΛΟS                                   " "ΒΥΤΙΝΑ                                  "
##  [25] "GΟΡΤΥS                                  " "GΥTΕΙΟ                                  " "ΔΕSFΙΝΑ                                 " "ΔΙΑΒΟΛΙΤSΙ                              "
##  [29] "ΔΟΜΟΚΟS                                 " "ΔΡΑΜΑ                                   " "ΕΔΕSSΑ                                  " "ΕΛΕΥSΙΝΑ                                "
##  [33] "ΕΛΛΗΝΙΚΟ aeρ                            " "ΖΑΚΥΝTΟS                                " "ΖΑΚΥΝTΟS_ΠΟΛΗ                           " "ΖΑΡΟS                                   "
##  [37] "ΗΡΑΚΛΕΙΟ                                " "TΑSΟS                                   " "TΗΡΑ (SΑΝΤΟΡΙΝΗ"                          "ΙΕΡΑΠΕΤΡΑ                               "
##  [41] "ΙΚΑΡΙΑ_Α/Δ                              " "ΙOΑΝΝΙΝΑ                                " "ΚΑΒΑΛΑ (ΠΟΛΗ)                           " "ΚΑΒΑΛΑ(ΑΜΥGΔΑΛΕOΝΑS)                    "
##  [45] "ΚΑΛΑΒΡΥΤΑ                               " "ΚΑΛΑΜΑΤΑ                                " "ΚΑΛΑΜΠΑΚΑ                               " "ΚΑΡΔΙΤSΑ                                "
##  [49] "ΚΑΡΠΑTΟS_Α/Δ                            " "ΚΑΡΠΑTΟS_ΠΟΛΗ                           " "ΚΑΡΠΕΝΗSΙ                               " "ΚΑΡΥSΤΟS                                "
##  [53] "ΚΑSΟS                                   " "ΚΑSΤΕΛΛΙ                                " "ΚΑSΤΟΡΙΑ                                " "ΚΕΡΚΥΡΑ                                 "
##  [57] "ΚΟΖΑΝΗ                                  " "ΚΟΜΟΤΗΝΗ                                " "ΚΟΝΙΤSΑ                                 " "ΚΟΡΙΝTΟS                                "
##  [61] "ΚΥTΗΡΑ_Α/Δ                              " "ΚΥΜΗ                                    " "ΚOS                                     " "ΚOS_ΠΟΛΗ                                "
##  [65] "ΛΑΜΙΑ                                   " "ΛΑΡΙSΑ                                  " "ΛΕΡΟS                                   " "ΛΕΥΚΑΔΑ (ΝΗSΙ)                          "
##  [69] "ΛΕOΝΙΔΙΟ                                " "ΛΗΜΝΟS                                  " "ΛΙΔOΡΙΚΙ                                " "ΜΑΚΕΔΟΝΙΑ                               "
##  [73] "ΜΑΡΑTOΝΑS                               " "ΜΕTOΝΗ                                  " "ΜΕSΟΛΟGGΙ                               " "ΜΗΛΟS_ΑΜS                               "
##  [77] "ΜΥΚΟΝΟS                                 " "ΜΥΤΙΛΗΝΗ                                " "ΝΑΞΟS                                   " "ΝΑΥΠΑΚΤΟS                               "
##  [81] "ΝΑΥΠΛΙΟ                                 " "ΝΕΑ FΙΛΑΔΕΛFΕΙΑ                         " "ΞΑΝTΗ                                   " "ΟΡΕSΤΙΑΔΑ                               "
##  [85] "ΠΑΙΑΝΙΑ                                 " "ΠΑΛΑΙΟΧOΡΑ                              " "ΠΑΡΟS_Α/Δ                               " "ΠΑΤΡΑ                                   "
##  [89] "ΠΕΙΡΑΙΑS                                " "ΠΟΛΥGΥΡΟS                               " "ΠΟΤΙΔΑΙΑ                                " "ΠΤΟΛΕΜΑΙΔΑ                              "
##  [93] "ΠΥΡGΟS                                  " "ΡΑFΗΝΑ                                  " "ΡΕTΥΜΝΟ                                 " "ΡΟΔΟS                                   "
##  [97] "SΑΜΟS                                   " "SΕΔΕS                                   " "SΕΡΡΕS                                  " "SΗΤΕΙΑ                                  "
## [101] "SΚΙΑTΟS                                 " "SΚΟΤΙΝΑ                                 " "SΚΥΡΟS                                  " "SΟΥΔΑ                                   "
## [105] "SΟΥFΛΙ                                  " "SΠΑΡΤΗ                                  " "SΠΑΤΑ(ΒΕΝΙΖΕΛΟS)                        " "SΠΕΤSΕS                                 "
## [109] "SΤΕFΑΝΙ (ΚΟΡΙΝTΙΑS)                     " "SΥΚΥOΝΑ                                 " "SΥΡΟS_Α/Δ                               " "ΤΑΝΑGΡΑ                                 "
## [113] "ΤΑΤΟΙ (ΔΕΚΕΛΕΙΑ)                        " "ΤΖΕΡΜΙΑΔΕS                              " "ΤΡΙΚΑΛΑ ΗΜΑTΕΙΑS                        " "ΤΡΙΚΑΛΑ TΕSSΑΛΙΑS                       "
## [117] "ΤΡΙΠΟΛΗ                                 " "ΤΥΜΠΑΚΙ                                 " "ΤΥΡΙΝTΑ                                 " "FΑΡSΑΛΑ                                 "
## [121] "FΛOΡΙΝΑ                                 " "FΟΥΡΝΗ                                  " "FΥΧΤΙΑ                                  " "ΧΑΛΚΙΔΑ                                 "
## [125] "ΧΑΝΙΑ                                   " "ΧΙΟS                                    " "ΧΡΥSΟΥΠΟΛΗ_ΚΑΒΑΛΑ                       " "OΡΕΟΙ                                   "

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复