utf-16

Transform a UTF-8 string to UCS-2, replacing invalid characters, in Java

懵懂的女人 submitted on 2020-12-15 04:55:48
Question: I have a string in UTF-8: "Red🌹🌹Röses". I need it converted to valid UCS-2 (or fixed-size UTF-16BE without a BOM, which is the same thing), so that the output is "Red Röses", since "🌹" is outside the UCS-2 range. What I have tried:

    @Test
    public void testEncodeProblem() throws CharacterCodingException {
        String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
        ByteBuffer input = ByteBuffer.wrap(in.getBytes());
        CharsetDecoder utf8Decoder = StandardCharsets.UTF_16BE.newDecoder();
        utf8Decoder
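A minimal sketch of one way to do this, assuming each supplementary code point should become a single space (the question's expected output hints at a replacement; dropping the character or using U+FFFD are alternatives, and the class name Ucs2Filter is made up for illustration):

    import java.nio.charset.StandardCharsets;

    public class Ucs2Filter {
        public static void main(String[] args) {
            String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
            StringBuilder sb = new StringBuilder(in.length());
            // Walk code points, not chars, so surrogate pairs are seen as one unit.
            in.codePoints().forEach(cp ->
                    sb.appendCodePoint(cp <= 0xFFFF ? cp : ' ')); // non-BMP -> space
            // UTF-16BE is a fixed two bytes per char and writes no BOM, so once the
            // supplementary code points are gone the bytes are valid UCS-2.
            byte[] ucs2 = sb.toString().getBytes(StandardCharsets.UTF_16BE);
            System.out.println(sb);          // Red  Röses
            System.out.println(ucs2.length); // 2 bytes per remaining character
        }
    }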

How is a const std::wstring encoded, and how can it be changed to UTF-16?

不羁的心 submitted on 2020-12-12 09:41:58
Question: I created this minimal working C++ example to compare the bytes (via their hex representations) in a std::string and a std::wstring when a string with German non-ASCII characters is defined in either type.

    #include <iostream>
    #include <iomanip>
    #include <string>

    int main(int, char**) {
        std::wstring wstr = L"äöüß";
        std::string str = "äöüß";
        for (unsigned char c : str) {
            std::cout << std::setw(2) << std::setfill('0') << std::hex
                      << static_cast<unsigned short>(c) << ' ';
        }
        std::cout << std::endl;
    }

How to convert a UTF-8 string to UTF-16

旧时模样 submitted on 2020-08-23 08:56:20
Question: I'm getting a UTF-8 string by processing a request sent by a client application, but the string is really UTF-16. What can I do when what lands in my local string is each letter followed by a \0 character? I need to convert that string into UTF-16. Sample received string: S\0a\0m\0p\0l\0e (as UTF-8). What I want is: Sample (UTF-16).

    FileItem item = (FileItem) iter.next();
    String field = "";
    String value = "";
    if (item.isFormField()) {
        try {
            value = item.getString();
            System.out.println("====" + value);
        }

Answer 1: The
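The "S\0a\0..." pattern is exactly what little-endian UTF-16 text looks like when read byte by byte. A minimal sketch of decoding the raw bytes with the charset they are actually in, assuming the payload really is BOM-less UTF-16LE (with Commons FileUpload you could obtain the bytes via item.get(), or pass an encoding name to item.getString(...)):

    import java.nio.charset.StandardCharsets;

    public class Utf16LeDecode {
        public static void main(String[] args) {
            // The bytes as they arrive on the wire: each ASCII letter followed by 0x00.
            byte[] raw = { 'S', 0, 'a', 0, 'm', 0, 'p', 0, 'l', 0, 'e', 0 };
            // Decode with the charset the data is actually in,
            // not the one it was labeled with.
            String value = new String(raw, StandardCharsets.UTF_16LE);
            System.out.println(value); // Sample
        }
    }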

How does UTF-16 achieve self-synchronization?

梦想的初衷 submitted on 2020-06-29 05:09:15
Question: I know that UTF-16 is a self-synchronizing encoding scheme. I also read the Wikipedia article below but did not quite get it: Self-Synchronizing Code. Can you please explain it to me with a UTF-16 example?

Answer 1: In UTF-16, characters outside the BMP are represented using a surrogate pair, in which the first code unit (CU) lies between 0xD800 and 0xDBFF and the second between 0xDC00 and 0xDFFF. Each of the CUs carries 10 bits of the code point. Characters in the BMP are encoded as themselves. Now the synchronization
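To make the resynchronization concrete, here is a small Java sketch (Java strings are sequences of UTF-16 code units): when you land on an arbitrary code-unit index, a single range check on that one unit tells you whether you are at a character boundary or in the middle of a surrogate pair, with no need to read any earlier context.

    public class Utf16Resync {
        public static void main(String[] args) {
            String s = "Red\uD83C\uDF39Röses";  // "Red🌹Röses"
            char[] units = s.toCharArray();
            int i = 4;                           // lands on the low surrogate \uDF39
            // A low surrogate can only ever be the second unit of a pair,
            // so back up one unit to reach the character boundary.
            if (Character.isLowSurrogate(units[i])) {
                i--;
            }
            int cp = Character.codePointAt(units, i);
            System.out.printf("boundary at %d, code point U+%X%n", i, cp); // 3, U+1F339
        }
    }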

How do UTF-16 and UTF-8 conversions happen?

不打扰是莪最后的温柔 submitted on 2020-06-28 04:42:18
Question: I'm somewhat confused about the conversion of Unicode code points to UTF-16, and I'm looking for someone who can explain it to me in the simplest way possible. For a character like "𐒌" we get:

    d801dc8c --> UTF-16
    0001048c --> UTF-32
    f090928c --> UTF-8
    66700    --> decimal value

So the UTF-16 hexadecimal value converts to the binary "11011000 00000001 11011100 10001100", which is "3624000652" in decimal. My question is: how do we get this value in hexadecimal, and how can we convert it back to the
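The hexadecimal form falls straight out of the surrogate-pair arithmetic. A short Java sketch reproducing the question's numbers (the class name is illustrative): subtract 0x10000, split the remaining 20 bits into two 10-bit halves, and add each half to its surrogate base.

    public class SurrogateMath {
        public static void main(String[] args) {
            int cp = 0x1048C;                // "𐒌", decimal 66700
            int v = cp - 0x10000;            // leaves a 20-bit value: 0x0048C
            int high = 0xD800 + (v >>> 10);  // top 10 bits    -> 0xD801
            int low  = 0xDC00 + (v & 0x3FF); // bottom 10 bits -> 0xDC8C
            System.out.printf("UTF-16: %04X %04X%n", high, low); // D801 DC8C
            // Converting back: undo the offsets and re-join the two halves.
            int back = ((high - 0xD800) << 10 | (low - 0xDC00)) + 0x10000;
            System.out.printf("back to U+%X (decimal %d)%n", back, back); // U+1048C, 66700
        }
    }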

Deprecated header <codecvt> replacement

戏子无情 submitted on 2020-04-18 05:47:58
Question: A bit of background: my task required converting a UTF-8 XML file to UTF-16 (with a proper header, of course). So I searched for the usual ways of converting UTF-8 to UTF-16 and found out that one should use the templates from <codecvt>. But now that it is deprecated, I wonder what the new common way of doing the same task is. (I don't mind using Boost at all, but other than that I prefer to stay as close to the standard library as possible.)

Answer 1: The std::codecvt template from <locale> itself isn't

A Brief Look at Unicode Encoding

孤者浪人 submitted on 2020-03-06 13:36:29
Contents
1. Overview
2. ASCII encoding
3. Historical issues
4. Unicode
4-1. The Unicode encoding scheme
4-2. About the BOM
5. UTF-8
6. UTF-16

1. Overview

Most readers are already fairly familiar with ASCII, but how are Unicode, UTF-8, and UTF-16 encoded? What is the relationship among them, and how do they relate to ASCII? This article answers these questions.

2. ASCII encoding

When learning C in school, you pick up some of the machinery inside a computer: all information is ultimately represented as a string of binary digits, each bit being either 0 or 1, and through different combinations of 0s and 1s you can represent everything in the world, a bit like the Chinese notion of Taiji: "Taiji begets the two forms, the two forms beget the four phenomena, the four phenomena beget the eight trigrams."

In a computer, one byte corresponds to 8 binary digits, and since each digit has two states (0 and 1), one byte can represent 256 distinct states. If each of those 256 states is assigned a symbol, one byte of data can represent 256 characters. The Americans accordingly drew up an encoding (essentially a dictionary) describing the correspondence between English characters and these 8-bit binary numbers; it is called ASCII.

ASCII defines 128 characters in total; for example, the uppercase letter A is 65 (that is decimal; the binary is 0100 0001). These 128 characters use only the lower 7 of the 8 bits
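The "A is 65, binary 0100 0001" example above is easy to verify; a tiny Java check (the class name is made up for illustration):

    public class AsciiDemo {
        public static void main(String[] args) {
            char a = 'A';
            System.out.println((int) a);                   // 65
            System.out.println(Integer.toBinaryString(a)); // 1000001 (fits in 7 bits)
        }
    }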

iOS - Unicode Encoding

走远了吗. submitted on 2020-03-06 13:35:52
1. Origins

To unify encodings, the major industry players sat down together to assign codes to all of the world's characters, staying as compatible as possible with existing character sets; the result is Unicode. Unicode uses 21 bits and can encode over a million characters, though nowhere near that many have actually been assigned. In U+XXXX, the XXXX is the code point: the character's numeric representation in Unicode.

The code space is divided into 17 planes, each holding 65,536 characters. Plane 0 is called the Basic Multilingual Plane (BMP) and covers almost every character you will ever encounter, except emoji. The other planes are called supplementary planes and are mostly empty.

2. UTF-32, UTF-16, UTF-8

Once the rules for which character maps to which number are fixed, the next question is how to store those numbers, and this is where UTF-32, UTF-16, and UTF-8 split off. They are simply three implementations of Unicode.

3. UTF-32

Unicode needs 21 bits, so just store every character in 4 bytes and you cannot go wrong; that is UTF-32. Because it is extremely wasteful, practically nobody uses it.

4. UTF-16

UTF-16 sits between UTF-32 and UTF-8, combining the traits of fixed-length and variable-length encodings. UTF-16 stores a character in either 2 bytes or 4 bytes. Specifically: characters in the basic plane occupy 2 bytes, and characters in the supplementary planes occupy 4 bytes. In other words, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+10000 to U+10FFFF).
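The 2-bytes-versus-4-bytes rule is easy to observe from Java, whose strings also use UTF-16 code units internally; a small sketch (UTF-16BE is chosen here so that no BOM bytes inflate the count):

    import java.nio.charset.StandardCharsets;

    public class Utf16Lengths {
        public static void main(String[] args) {
            String bmp  = "\u4E2D";       // 中, a BMP character
            String rose = "\uD83C\uDF39"; // 🌹, a supplementary-plane character
            System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length);  // 2
            System.out.println(rose.getBytes(StandardCharsets.UTF_16BE).length); // 4
        }
    }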