surrogate-pairs | 易学教程

How to remove surrogate characters in Java?


I am getting surrogate characters in text that I am saving to MySQL 5.1. Because MySQL 5.1 cannot store these characters, I want to remove the surrogate pairs in a Java method before saving the text to the database. I have written the following method for now, and I am curious whether there is a more direct and optimal way to handle this. Thanks in advance for your help. public static String removeSurrogates(String query) { StringBuffer sb = new StringBuffer(); for (int i = 0; i < query.length() - 1; i++) { char firstChar = query.charAt(i); char nextChar = query.charAt(i+1); if
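A minimal sketch of one way to approach this, assuming the goal is simply to drop every UTF-16 surrogate code unit (paired or unpaired) and keep all other characters. The class name SurrogateFilter and the method stripSurrogates are hypothetical and do not come from the question.

```java
final class SurrogateFilter {

    /**
     * Returns a copy of text without any surrogate code units.
     * Supplementary characters (two code units each) disappear entirely,
     * and stray unpaired surrogates are dropped as well.
     */
    static String stripSurrogates(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.isSurrogate is true for both high and low surrogates.
            if (!Character.isSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F64F written as its surrogate pair; the surrounding spaces are kept.
        System.out.println(stripSurrogates("pray \uD83D\uDE4F now"));
    }
}
```

Because each code unit is tested on its own, the loop does not need to look ahead at the next character, and the last character of the string is handled like any other.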

Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)


Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16), but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane). To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively. I'm looking for how to split a JavaScript string by code point, whether the code points require one or two JavaScript "characters" (code units).

Detecting and Retrieving codepoints and surrogates from a Delphi String


I am trying to better understand surrogate pairs and Unicode implementation in Delphi. If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back, 8. This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates. If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some

Java charAt used with characters that have two code units


From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true. String sentence = "ℤ is the set of integers"; if (sentence.charAt(1) == ' ') System.out.println("sentence.charAt(1) returns a space"); Why? I'm using JDK SE 1.7.0_09 on Ubuntu 12.10, if it's relevant.
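A likely explanation, sketched under the assumption that the string literal really contains ℤ (U+2124, DOUBLE-STRUCK CAPITAL Z): that character lies inside the BMP, so it occupies a single UTF-16 code unit and charAt(1) is the space. A supplementary character behaves the way the book describes. U+1D546 is used here purely as an illustration, and the class and method names are hypothetical.

```java
final class CodeUnitDemo {

    /** True when the second UTF-16 code unit of s is a space. */
    static boolean secondUnitIsSpace(String s) {
        return s.charAt(1) == ' ';
    }

    public static void main(String[] args) {
        // U+2124 is in the BMP: one code unit.
        String bmp = "\u2124 is the set of integers";
        // U+1D546 is a supplementary character: two code units (a surrogate pair).
        String supplementary =
                new String(Character.toChars(0x1D546)) + " is the set of octonions";

        System.out.println(Character.charCount(0x2124));      // 1
        System.out.println(Character.charCount(0x1D546));     // 2
        System.out.println(secondUnitIsSpace(bmp));           // true
        System.out.println(secondUnitIsSpace(supplementary)); // false

        // To index by code point rather than code unit, convert the offset first.
        int index = supplementary.offsetByCodePoints(0, 1);
        System.out.println(supplementary.charAt(index) == ' '); // true
    }
}
```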

Java Can't Open a File with Surrogate Unicode Values in the Filename?


I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is: "草鷗外.gif" which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif If I create a file from this filename, I can't open it because I get a FileNotFound exception. Even using this on the folder containing the file will fail: File[] files = folder.listFiles(); for (File file

Python: Find equivalent surrogate pair from non-BMP unicode char


The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that. You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular
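The arithmetic behind the conversion is the same in any language. It is sketched here in Java (the language most of the questions on this page use) rather than Python, as an illustration of the UTF-16 encoding rule and not as the answer's own code. The class and method names are hypothetical.

```java
final class SurrogateMath {

    /**
     * Splits a supplementary code point (U+10000..U+10FFFF) into its
     * UTF-16 high and low surrogate code units.
     */
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F64F);
        // Prints the two code units in hex: D83D DE4F
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```

In Java itself the built-in Character.highSurrogate and Character.lowSurrogate (or Character.toChars) give the same result; the manual version only makes the bit manipulation visible.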


How to use unicode in Android resource?


I want to use this Unicode character in my resource file. But whatever I do, I end up with a dalvikvm crash (tested with Android 2.3 and 4.2.2): W/dalvikvm( 8797): JNI WARNING: input is not valid Modified UTF-8: illegal start byte 0xf0 W/dalvikvm( 8797): string: '📡' W/dalvikvm( 8797): in Landroid/content/res/StringBlock;.nativeGetString:(II)Ljava/lang/String; (NewStringUTF) E/dalvikvm( 8797): VM aborting F/libc ( 8797): Fatal signal 11 (SIGSEGV) at 0xdeadd00d (code=1), thread 8797 (cz.ipex...) I tried these versions in my resource file: <string name="geolocation_icon" translatable="false">📡</string>


What are the most common non-BMP Unicode characters in actual use? [closed]


In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far. UPDATE I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to
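The scanning tools mentioned in the question are not shown. As a rough sketch of how such a scan could work, the following Java code counts the supplementary (non-BMP) code points in a string, grouped by Unicode block. The class name, method name, and the "UNASSIGNED" label are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class NonBmpCensus {

    /** Counts code points above U+FFFF in text, keyed by Unicode block name. */
    static Map<String, Integer> countByBlock(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(Character::isSupplementaryCodePoint)
            .forEach(cp -> {
                // UnicodeBlock.of returns null for code points outside any known block.
                Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
                String name = (block == null) ? "UNASSIGNED" : block.toString();
                counts.merge(name, 1, Integer::sum);
            });
        return counts;
    }

    public static void main(String[] args) {
        // Two Gothic letters (U+10330, U+10331) mixed with BMP text.
        String sample = "abc"
                + new String(Character.toChars(0x10330))
                + new String(Character.toChars(0x10331));
        System.out.println(countByBlock(sample));
    }
}
```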