What happens under the hood when bytes are converted to String in Java?


I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.


                      
4 answers

别那么骄傲 (2021-01-17 18:36)
                                                                       
There is a line in the documentation of the constructor:


  This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.


This is definitely the culprit here, as -3 (0xFD) is not a valid byte in UTF-8. By the way, if you are really interested, you can always attach the JDK sources (src.zip, which covers the classes in rt.jar) and debug into the constructor.
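To see the replacement behavior next to a strict alternative, here is a minimal sketch. It assumes the truncated call in the question names UTF-8 and uses `StandardCharsets.UTF_8` for it; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Lenient decoding: the String constructor silently substitutes the
    // charset's replacement string for malformed input.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: a CharsetDecoder configured with REPORT throws
    // a CharacterCodingException instead of replacing anything.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        // Three chars come back; the last one is the replacement character.
        System.out.println(decodeLenient(bytes).length());
        try {
            decodeStrict(bytes);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```

If you would rather fail fast than get replacement characters, the strict variant is probably the one you want.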
                                                                        
                                                        
            
            
              
                
执念已碎 (2021-01-17 18:54)
                                                                       
The encoded values you are getting, [-17, -65, -67], correspond to Unicode code point U+FFFD. If you look up that code point, the Unicode specification tells you that U+FFFD is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode." And as others have pointed out, -3 (0xFD) without any follow-up code units is broken UTF-8, so this character is appropriate.
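A short round trip reproduces those values. This is only a sketch: it assumes UTF-8 on both the decode and the encode side, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // The third char is the replacement character.
        System.out.println(Integer.toHexString(decoded.charAt(2))); // fffd
        System.out.println(Arrays.toString(roundTrip(bytes)));      // [1, 2, -17, -65, -67]
    }
}
```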
                                                                        
                                                        
            
            
              
                
-上瘾入骨i (2021-01-17 18:58)
                                                                       
In Java, byte is signed, so byte values above 127 appear as negative numbers. The ones you used (-3 = 0xFD, -32 = 0xE0) are not valid UTF-8 on their own (0xFD never is, and 0xE0 is a lead byte that needs two continuation bytes), so each is converted to Unicode code point U+FFFD REPLACEMENT CHARACTER, which is encoded back to UTF-8 as 0xEF = -17, 0xBF = -65, 0xBD = -67.

You cannot expect that random byte values are correctly interpreted as UTF-8 text.
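The mapping between the signed values and the hex values can be checked with `Byte.toUnsignedInt`. The helper name below is made up for this sketch.

```java
class SignedByteDemo {
    // Format a signed Java byte as its unsigned two-digit hex value.
    static String toHex(byte b) {
        return String.format("0x%02X", Byte.toUnsignedInt(b));
    }

    public static void main(String[] args) {
        byte[] values = {-3, -32, -17, -65, -67};
        for (byte b : values) {
            // Prints lines such as "-3 = 0xFD"
            System.out.println(b + " = " + toHex(b));
        }
        // Casting the unsigned value back yields the signed one.
        System.out.println((byte) 0xFD); // -3
    }
}
```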
                                                                        
                                                        
            
            
              
                
不思量自难忘° (2021-01-17 19:03)
                                                                       
Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point, the form of every byte indicating how many other bytes follow for the same code point.

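Refer to this table of UTF-8 byte patterns:

Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F       0xxxxxxx
U+0080 to U+07FF       110xxxxx   10xxxxxx
U+0800 to U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The original UTF-8 definition also had 5-byte (111110xx) and 6-byte (1111110x) lead bytes; the current standard no longer allows them.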

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.

Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with Unicode codepoint U+FFFD REPLACEMENT CHARACTER. In UTF-8, codepoint U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), represented in Java as -17, -65, -67.
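The byte-by-byte reasoning above can be sketched in code. This is a simplified classifier by bit pattern only, not the JDK's decoder: a real decoder also rejects 0xC0, 0xC1 and 0xF5 to 0xF7, and checks the continuation bytes. The class and method names are made up.

```java
class Utf8LeadByteDemo {
    // Returns the sequence length a byte announces in current UTF-8, or -1
    // for a continuation byte (10xxxxxx) and for the old 5- and 6-byte lead
    // patterns (0xF8..0xFF), which are no longer allowed.
    static int announcedLength(byte b) {
        int u = b & 0xFF;                  // view the signed byte as 0..255
        if ((u & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((u & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        for (byte b : bytes) {
            // Prints the 8-bit binary form and the announced length.
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
            System.out.println(b + " (" + bits + ") -> " + announcedLength(b));
        }
    }
}
```

For {1, 2, -3} the first two bytes announce length 1 and stand alone, while -3 yields -1, which is the byte the decoder replaces.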
                                                                        
                                                        
            
            
              
                