surrogate-pairs

How to remove surrogate characters in Java?

喜欢而已 submitted on 2019-11-28 18:53:41
I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database. I have written the following method for now, and I am curious to know whether there is a more direct and optimal way to handle this. Thanks in advance for your help. public static String removeSurrogates(String query) { StringBuffer sb = new StringBuffer(); for (int i = 0; i < query.length() - 1; i++) { char firstChar = query.charAt(i); char nextChar = query.charAt(i+1); if
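The method in the excerpt is cut off, so the following is not the asker's code. It is a minimal sketch of one possible approach, which walks the string by code point and keeps only BMP characters that are not lone surrogates. The class name `SurrogateStripper` and the method name `removeSupplementary` are made up for illustration.

```java
final class SurrogateStripper {
    private SurrogateStripper() {}

    /** Returns the text with surrogate pairs and lone surrogates dropped. */
    static String removeSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);          // reads a full pair when one is present
            int width = Character.charCount(cp);   // 1 for BMP, 2 for supplementary
            // Keep only BMP code points that are not unpaired surrogates.
            if (width == 1 && !Character.isSurrogate((char) cp)) {
                sb.appendCodePoint(cp);
            }
            i += width;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F64F is encoded as the pair D83D DE4F in UTF-16.
        System.out.println(removeSupplementary("a\uD83D\uDE4Fb")); // prints "ab"
    }
}
```

Stepping by `Character.charCount` means the last character of the input is also examined, which a loop bounded by `length() - 1` would skip.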

Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

半世苍凉 submitted on 2019-11-28 10:54:25
Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16), but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane). To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively. I'm looking for a way to split a JavaScript string by code point, whether the code points require one or two JavaScript "characters" (code units).
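The question is about JavaScript, but the same idea can be sketched in Java, whose strings are also sequences of UTF-16 code units. The sketch below is an illustration of the technique, not a JavaScript answer. It splits a string into one element per code point, so a surrogate pair stays together, while combining marks (grapheme clusters) are deliberately not merged. `CodePointSplitter` and `split` are hypothetical names, and `String.codePoints()` requires Java 8 or later.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointSplitter {
    private CodePointSplitter() {}

    /** Splits text into one string per Unicode code point. */
    static List<String> split(String text) {
        return text.codePoints()                                   // IntStream of code points
                .mapToObj(cp -> new String(Character.toChars(cp))) // 1 or 2 chars each
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a", U+1F64F (a surrogate pair), "b": three code points, four code units.
        System.out.println(split("a\uD83D\uDE4Fb").size()); // prints 3
    }
}
```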

Detecting and Retrieving codepoints and surrogates from a Delphi String

て烟熏妆下的殇ゞ submitted on 2019-11-28 09:21:46
I am trying to better understand surrogate pairs and the Unicode implementation in Delphi. If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I get back 8. This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively: Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates. If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some

Java charAt used with characters that have two code units

假如想象 submitted on 2019-11-27 22:57:56
From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true. String sentence = "ℤ is the set of integers"; if (sentence.charAt(1) == ' ') System.out.println("sentence.charAt(1) returns a space"); Why? I'm using JDK SE 1.7.0_09 on Ubuntu 12.10, if it's relevant.
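One thing worth checking here is the code point itself: ℤ is U+2124, which lies inside the BMP and so occupies a single UTF-16 code unit. The sketch below contrasts it with a supplementary character, using U+1D546 (𝕆) as an example chosen for illustration. `CharAtDemo` and `isSurrogateAt` are hypothetical names.

```java
final class CharAtDemo {
    private CharAtDemo() {}

    /** True when the char at the given index is half of a surrogate pair. */
    static boolean isSurrogateAt(String s, int index) {
        return Character.isSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        String bmp = "\u2124 is the set of integers";         // starts with U+2124
        String supp = "\uD835\uDD46 is the set of octonions"; // starts with U+1D546

        System.out.println(bmp.charAt(1) == ' ');             // prints true
        System.out.println(isSurrogateAt(supp, 1));           // prints true
        // Index of the second code point, skipping the whole pair:
        System.out.println(supp.offsetByCodePoints(0, 1));    // prints 2
    }
}
```

For a supplementary first character, `charAt(1)` returns the low surrogate, and `offsetByCodePoints` is one way to find where the next code point starts.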

Java Can't Open a File with Surrogate Unicode Values in the Filename?

半城伤御伤魂 submitted on 2019-11-27 15:06:06
I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is: "草鷗外.gif", which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif. If I create a file from this filename, I can't open it because I get a FileNotFound exception. Even using this on the folder containing the file will fail: File[] files = folder.listFiles(); for (File file

Python: Find equivalent surrogate pair from non-BMP unicode char

半腔热情 submitted on 2019-11-27 14:29:06
The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP Unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that. You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular
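The arithmetic behind that manual replacement does not depend on the language. As a sketch, here it is in Java rather than Python: subtract 0x10000 from the code point, put the top 10 bits on 0xD800 and the bottom 10 bits on 0xDC00. `SurrogatePairs` and `toSurrogatePair` are hypothetical names. Java's built-in `Character.toChars` gives the same result and is used in `main` only as a cross-check.

```java
import java.util.Arrays;

final class SurrogatePairs {
    private SurrogatePairs() {}

    /** Computes the UTF-16 surrogate pair for a supplementary code point. */
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a non-BMP code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F64F);
        System.out.println(Integer.toHexString(pair[0]) + " "
                + Integer.toHexString(pair[1]));                          // prints d83d de4f
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F64F))); // prints true
    }
}
```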

How to remove surrogate characters in Java?

心已入冬 submitted on 2019-11-27 12:27:38
Question: I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database. I have written the following method for now, and I am curious to know whether there is a more direct and optimal way to handle this. Thanks in advance for your help. public static String removeSurrogates(String query) { StringBuffer sb = new StringBuffer(); for (int i = 0;

How to use unicode in Android resource?

China☆狼群 submitted on 2019-11-27 11:19:17
I want to use this Unicode character in my resource file. But whatever I do, I end up with a dalvikvm crash (tested with Android 2.3 and 4.2.2): W/dalvikvm( 8797): JNI WARNING: input is not valid Modified UTF-8: illegal start byte 0xf0 W/dalvikvm( 8797): string: '📡' W/dalvikvm( 8797): in Landroid/content/res/StringBlock;.nativeGetString:(II)Ljava/lang/String; (NewStringUTF) E/dalvikvm( 8797): VM aborting F/libc ( 8797): Fatal signal 11 (SIGSEGV) at 0xdeadd00d (code=1), thread 8797 (cz.ipex...) I tried these versions in my resource file: <string name="geolocation_icon" translatable="false">📡</string>

Java charAt used with characters that have two code units

馋奶兔 submitted on 2019-11-27 04:37:16
Question: From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true. String sentence = "ℤ is the set of integers"; if (sentence.charAt(1) == ' ') System.out.println(

What are the most common non-BMP Unicode characters in actual use? [closed]

若如初见. submitted on 2019-11-26 19:40:51
In your experience, which Unicode characters, code points, or ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would have expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most of my work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far. UPDATE: I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to
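The "4 bytes in UTF-8 or surrogates in UTF-16" property is easy to check for any given code point. A small sketch follows, using Gothic letter ahsa (U+10330) as the non-BMP sample and 草 (U+8349) as a BMP sample. `EncodedWidth`, `utf8Bytes` and `utf16Units` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class EncodedWidth {
    private EncodedWidth() {}

    /** Number of bytes the code point occupies in UTF-8. */
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of 16-bit code units the code point occupies in UTF-16. */
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes(0x10330));  // prints 4
        System.out.println(utf16Units(0x10330)); // prints 2
        System.out.println(utf8Bytes(0x8349));   // prints 3
        System.out.println(utf16Units(0x8349));  // prints 1
    }
}
```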