Convert Unicode Escape to Hebrew text

抹茶落季 2021-01-20 19:46

I have the following text in a JSON file:

\"\\u00d7\\u0090\\u00d7\\u0097\\u00d7\\u0095\\u00d7\\u0096\\u00d7\\u00aa 
\\u00d7\\u00a4\\u00d7\\u0095\\u00d7\\u009         


        
1 Answer
  • 2021-01-20 20:16

    This string does not "represent" Hebrew text (at least not as Unicode code points, UTF-16, UTF-8, or any other well-known encoding). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs (U+00D7), currency signs (U+00A4), and C1 control characters (the U+0090 to U+009C range).
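
    To see this for yourself, here is a small sketch that prints the code point and Unicode name of each character. The helper name describe and the shortened sample (first word only) are my own choices for illustration, not something from the question.

```python
import unicodedata

# The first word of the example, as a JSON parser would see it.
broken = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa"

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME'; unnamed characters (e.g. controls) get '<control>'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}"

for ch in broken:
    print(describe(ch))
```

    Every second line of the output is U+00D7 MULTIPLICATION SIGN, and most of the others are unnamed control characters, so there is no Hebrew letter in sight.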

    It looks like the original character data has been encoded and decoded several times with mismatched encodings somewhere along the way.
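
    One plausible way to end up with exactly this string is sketched below. This is only a guess at the pipeline (the real one is unknown): the Hebrew text is encoded as UTF-8, the bytes are misread as Latin-1, and the result is serialized with Python's json module, which escapes non-ASCII characters by default.

```python
import json

# The Hebrew text that the recovery below produces.
original = "אחוזת פולג"

# Step 1: encode correctly as UTF-8, then misread those bytes as Latin-1.
# Each Hebrew letter (2 bytes in UTF-8) turns into 2 unrelated characters.
mojibake = original.encode("utf-8").decode("latin-1")

# Step 2: serialize with the default ensure_ascii=True, which writes every
# non-ASCII character as a \uXXXX escape.
serialized = json.dumps(mojibake)
print(serialized)
```

    The printed value is character for character the JSON string literal from the question, which is why undoing these two steps recovers the text.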

    Assuming that this is literally what is saved in your JSON file:

    "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
    

    you can recover the Hebrew text as follows:

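    (jsonInput                        # raw text, still containing literal \uXXXX
      .encode('latin-1')              # str -> bytes (the raw text is plain ASCII)
      .decode('raw_unicode_escape')   # turn each \uXXXX into one character
      .encode('latin-1')              # characters U+0000..U+00FF -> same byte values
      .decode('utf-8')                # read those bytes as the UTF-8 they really are
    )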
    

    For the above example (with the surrounding double quotes stripped), it gives:

    'אחוזת פולג'
    

    If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the deserializer already interprets the escape sequences for you. That is, once the text element has been loaded by a JSON deserializer, it should be sufficient to encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text stores each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
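
    A minimal sketch of that shorter route, assuming the file holds nothing but the string literal shown earlier (the variable names raw_json, broken and fixed are mine):

```python
import json

# The file content exactly as shown above, with literal backslash-u escapes.
raw_json = r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'

# json.loads already turns every \uXXXX escape into one character.
broken = json.loads(raw_json)

# Every character is in U+0000..U+00FF, so latin-1 maps it to the byte with
# the same value; those bytes are the original UTF-8 data.
fixed = broken.encode("latin-1").decode("utf-8")
print(fixed)
```

    This prints the same Hebrew text as the four-step version, without the raw_unicode_escape detour.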

    I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 step might not work properly any more. Please don't apply this transformation to your whole JSON file unless the JSON itself contains only ASCII; otherwise it would only make everything worse.
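
    One thing you could try for mixed files, as an untested-on-your-data sketch rather than a guaranteed fix, is to parse the JSON first and then attempt the round trip string by string, keeping any string for which it fails. The helpers repair_text and repair_json and the sample document are hypothetical names and data of mine.

```python
import json
from typing import Any

def repair_text(s: str) -> str:
    """Try the latin-1 -> utf-8 round trip; return s unchanged if it fails."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # UnicodeEncodeError: s has characters above U+00FF (already real text).
        # UnicodeDecodeError: the bytes are not valid UTF-8 (different damage,
        # or genuine Latin-1 text), so leave the string alone.
        return s

def repair_json(value: Any) -> Any:
    """Recursively apply repair_text to every string in a parsed JSON value."""
    if isinstance(value, str):
        return repair_text(value)
    if isinstance(value, list):
        return [repair_json(v) for v in value]
    if isinstance(value, dict):
        return {repair_text(k): repair_json(v) for k, v in value.items()}
    return value

# Hypothetical sample: one broken string, one valid Hebrew string, one number.
data = json.loads(r'{"name": "\u00d7\u0090\u00d7\u0097", "ok": "\u05d0\u05d7", "id": 7}')
print(repair_json(data))
```

    Pure ASCII strings and strings that already contain real Hebrew pass through unchanged. The heuristic can still misfire on legitimate Latin-1 text that happens to form valid UTF-8 byte sequences, so check the result before overwriting anything.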
