utf | 易学教程

What are surrogate characters in UTF-8?


Question: I have a strange validation program that validates whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I

What is a surrogate pair?


Question: I came across this code in a JavaScript open source project.

    validator.isLength = function (str, min, max) {
        // match surrogate pairs in string or declare an empty array if none found in string
        var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
        // subtract the surrogate pairs string length from main string length
        var len = str.length - surrogatePairs.length;
        // now compare string length with min and max ... also make sure max is defined(in other words, max param is
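The subtraction in that snippet can be reproduced outside JavaScript. The following is a minimal Python sketch, not the project's code: the helper names `utf16_length` and `count_surrogate_pairs` are made up for illustration, and the sample string is an assumption. It shows that the number of UTF-16 code units minus the number of surrogate pairs equals the number of code points.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's str.length reports."""
    # Every UTF-16 code unit is exactly two bytes in the little-endian encoding.
    return len(text.encode("utf-16-le")) // 2


def count_surrogate_pairs(text: str) -> int:
    """Each code point above U+FFFF is stored as one surrogate pair in UTF-16."""
    return sum(1 for ch in text if ord(ch) > 0xFFFF)


# Hypothetical sample: 'a', one character outside the BMP (U+1F600), 'b'.
sample = "a\U0001F600b"

units = utf16_length(sample)           # code units, as JavaScript would count
pairs = count_surrogate_pairs(sample)  # surrogate pairs in the UTF-16 form
code_points = units - pairs            # same subtraction as in the snippet
```

Python strings are indexed by code point, so `len(sample)` already gives the value that the JavaScript code has to compute by subtraction.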


Python zipfile module - zipfile.write() file with turkish chars in filename


Question: On my system there are many Word documents and I want to zip them using the Python module zipfile. I have found this solution to my problem, but on my system there are files which contain German umlauts and Turkish characters in their filenames. I have adapted the method from the solution like this, so it can process German umlauts in the filenames:

    def zipdir(path, ziph):
        for root, dirs, files in os.walk(path):
            for file in files:
                current_file = os.path.join(root, file)
                print "Adding to

Java or Scala. How to convert characters like \x22 into String


Question: I have a string that looks like this: {\x22documentReferer\x22:\x22http:\x5C/\x5C/pikabu.ru\x5C/freshitems.php\x22} How could I convert this into readable JSON? I've found various slow solutions, like the one here using regex. I have already tried: URL.decode, StringEscapeUtils, JSON.parse (from different libraries). For example, Python has a simple solution like decode from 'string_escape'. The linked possible duplicate applies to Python, and my question is about Java or Scala. Working but also very slow

Character showing up as diamond question mark only at end of line (Python>Text)


Question: I'm working on a Python file that reads a text file containing Japanese characters (UTF-8), takes some of the text, and writes it into a new UTF-8 text file. The problem I'm coming across is that, for some reason, whenever the Japanese character だ appears at the end of a line in the original input file, it comes out as a diamond question mark in the output file. Instances of だ before the end of a line read perfectly fine and the original input file has it reading perfectly fine even if it's at

Difference between readAsBinaryString and readAsText using FileReader


Question: So as an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back when I read it using FileReader.readAsText(blob), which is expected. But when I use FileReader.readAsBinaryString(blob), I get the result \xcf\x80 instead, which doesn't seem to have any visible correlation with the pi character. What's going on? (This probably has something to do with the way UTF-8/16 is encoded...)

Answer 1: Oh well, if that's all you needed... :) CF80
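The two bytes in the question are the UTF-8 encoding of U+03C0. The sketch below reproduces them in Python rather than with the FileReader API. It assumes the file is stored as UTF-8 and uses Latin-1 decoding as a stand-in for a "binary string", which exposes one character per byte; the variable names are illustrative.

```python
# The character from the question.
pi = "\u03c0"

# UTF-8 stores U+03C0 as two bytes: 0xCF 0x80.
utf8_bytes = pi.encode("utf-8")

# A "binary string" maps every byte to the character with the same code
# (0 to 255). Latin-1 decoding has exactly that effect.
binary_string = utf8_bytes.decode("latin-1")

# Reversing the byte-to-character mapping and decoding as UTF-8 recovers pi.
round_trip = binary_string.encode("latin-1").decode("utf-8")
```

Reading as text applies the UTF-8 decoding step, so the two bytes collapse back into a single character. Reading as a binary string skips that step, so the two bytes appear as two separate characters.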

Control configure set Apache Spark UTF encoding for writing as saveAsTextFile


Question: So how does one tell Spark which UTF encoding to use when calling saveAsTextFile(path)? Of course, if it's known that all the Strings are UTF-8, it will save space on disk by 2x! (assuming the default UTF is 16, like Java)

Answer 1: saveAsTextFile actually uses Text from Hadoop, which is encoded as UTF-8.

    def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
      this.map(x => (NullWritable.get(), new Text(x.toString)))
        .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
    }
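The 2x saving mentioned in the question depends on the text. As a rough illustration in plain Python (it does not call Spark or Hadoop, and the helper `encoded_sizes` and both sample strings are assumptions), UTF-8 halves the size of ASCII text compared with UTF-16, but it can be larger for scripts such as Japanese kana:

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# ASCII: one byte per character in UTF-8, two in UTF-16.
ascii_sizes = encoded_sizes("hello")

# Hiragana: three bytes per character in UTF-8, two in UTF-16.
kana_sizes = encoded_sizes("こんにちは")
```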

HTML Unicode Issue: How to display special characters


Question: Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc.), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?

Answer 1: A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document
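One way such an error can arise is sketched below in Python. It assumes, purely for illustration, that the page was saved as Windows-1252 while being declared as UTF-8; the question does not confirm this, and the variable names are hypothetical.

```python
# Assumption: the file on disk was saved as Windows-1252, not UTF-8.
# In that encoding the em dash (U+2014) is the single byte 0x97.
em_dash_bytes = "\u2014".encode("cp1252")

# Decoding that byte as UTF-8 fails, because 0x97 cannot start a UTF-8
# sequence. With errors="replace" the decoder substitutes U+FFFD, which is
# the question mark in a lozenge.
decoded_as_utf8 = em_dash_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in works.
decoded_as_cp1252 = em_dash_bytes.decode("cp1252")
```

Under that assumption, the fix is to make the declared encoding match the bytes actually stored, for example by re-saving the file as UTF-8.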

fatal error: high- and low-surrogate code points are not valid Unicode scalar values [duplicate]


Question: This question already has answers here: How can I generate a random unicode character in Swift? Sometimes initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values. What is this error, why does it occur, and how can I prevent it in the future?

Answer 1: Background: UTF-16 represents a sequence of Unicode characters ("code points") as a sequence of 16-bit
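The value 57292 is 0xDFCC, which lies in the low-surrogate range DC00-DFFF. The following Python sketch (not Swift, and not the linked answer's code) shows the range check and one way to draw random values that skip the surrogate block. The function names are made up for illustration.

```python
import random


def is_unicode_scalar(value: int) -> bool:
    """A scalar value is any code point except the surrogates U+D800..U+DFFF."""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


def random_scalar(rng: random.Random) -> int:
    """Draw code points until one falls outside the surrogate block."""
    while True:
        candidate = rng.randint(0, 0x10FFFF)
        if is_unicode_scalar(candidate):
            return candidate


# The value from the question, shown in hex to make the range visible.
failing_value_hex = hex(57292)
failing_value_ok = is_unicode_scalar(57292)
```

Rejection sampling keeps the example short. Only 2048 of the 1,114,112 code points are surrogates, so retries are rare.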