utf

What are surrogate characters in UTF-8?

烈酒焚心 submitted on 2019-12-24 11:16:04
Question: I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regex function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I
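Both ranges in the question sit inside the UTF-16 surrogate block (U+D800 to U+DFFF), and well-formed UTF-8 never contains those code points. A minimal sketch of that rule follows. Python is used here only for illustration (the question itself concerns PHP), and the constant and function names are made up for this example.

```python
# Surrogate code points are reserved for UTF-16 and are not valid in UTF-8.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # U+D800..U+DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # U+DC00..U+DFFF


def is_surrogate(code_point: int) -> bool:
    """Return True if the code point lies in either surrogate range."""
    return code_point in HIGH_SURROGATES or code_point in LOW_SURROGATES


def can_encode_as_utf8(code_point: int) -> bool:
    """Return True if Python's strict UTF-8 encoder accepts the code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

With these helpers, can_encode_as_utf8(0xDC00) is False while can_encode_as_utf8(0xE000) is True, which is consistent with a UTF-8 aware regex engine rejecting the DC00-DFFF range.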

What is a surrogate pair?

大憨熊 submitted on 2019-12-24 05:36:08
Question: I came across this code in a JavaScript open source project.

    validator.isLength = function (str, min, max) {
        // match surrogate pairs in string or declare an empty array if none found in string
        var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
        // subtract the surrogate pairs string length from main string length
        var len = str.length - surrogatePairs.length;
        // now compare string length with min and max ... also make sure max is defined (in other words, max param is
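The arithmetic in that snippet can be checked outside JavaScript. The sketch below is an illustration in Python with hypothetical function names, not the project's code. It reproduces the same subtraction: the UTF-16 code unit count (what JavaScript's str.length reports) minus the number of surrogate pairs gives the number of code points.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's str.length reports."""
    return len(text.encode("utf-16-le")) // 2


def count_surrogate_pairs(text: str) -> int:
    """Characters above U+FFFF each need one surrogate pair in UTF-16."""
    return sum(1 for ch in text if ord(ch) > 0xFFFF)


def code_point_length(text: str) -> int:
    """Same subtraction as the JavaScript snippet: code units minus pairs."""
    return utf16_length(text) - count_surrogate_pairs(text)


sample = "a\U0001F600b"  # 'a', one emoji outside the BMP, 'b'
units = utf16_length(sample)
pairs = count_surrogate_pairs(sample)
points = code_point_length(sample)
```

For the sample string, units is 4, pairs is 1, and points is 3, matching Python's own len(sample), which counts code points directly.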

Python zipfile module - zipfile.write() file with turkish chars in filename

≡放荡痞女 submitted on 2019-12-24 00:39:19
Question: On my system there are many Word documents and I want to zip them using the Python module zipfile. I have found this solution to my problem, but on my system there are files which contain German umlauts and Turkish characters in their filenames. I have adapted the method from the solution like this, so it can process German umlauts in the filenames:

    def zipdir(path, ziph):
        for root, dirs, files in os.walk(path):
            for file in files:
                current_file = os.path.join(root, file)
                print "Adding to
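For comparison, here is a minimal sketch of the same walk under Python 3, which is an assumption: the excerpt uses the Python 2 print statement. In Python 3, file names are str objects, and zipfile stores non-ASCII names as UTF-8 with the corresponding flag bit set, so umlauts and Turkish letters need no manual encoding. The arcname choice is illustrative, not taken from the original.

```python
import os
import zipfile


def zipdir(path: str, archive: zipfile.ZipFile) -> None:
    """Add every file under path to the archive, keeping relative names."""
    for root, _dirs, files in os.walk(path):
        for name in files:
            current_file = os.path.join(root, name)
            # Store names relative to the zipped directory (illustrative choice).
            arcname = os.path.relpath(current_file, start=path)
            archive.write(current_file, arcname)


# Example usage (the directory name is hypothetical):
# with zipfile.ZipFile("docs.zip", "w", zipfile.ZIP_DEFLATED) as zf:
#     zipdir("docs", zf)
```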

Java or Scala. How to convert characters like \x22 into String

纵然是瞬间 submitted on 2019-12-24 00:35:27
Question: I have a string that looks like this: {\x22documentReferer\x22:\x22http:\x5C/\x5C/pikabu.ru\x5C/freshitems.php\x22} How could I convert this into readable JSON? I've found various slow solutions, like the one here using regex. I have already tried URL.decode, StringEscapeUtils, and JSON.parse (from different libraries). Python, for example, has a simple solution: decoding with 'string_escape'. The linked possible duplicate applies to Python, and my question is about Java or Scala. Working but also very slow
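To make the expected result concrete, here is a small sketch of the decoding step. It is written in Python purely for illustration, since the question asks about Java or Scala, and decode_hex_escapes is a hypothetical helper. It assumes each \xNN names a single character below U+0100; multi-byte UTF-8 spelled as several \xNN escapes would need byte-level decoding instead.

```python
import json
import re

# Matches a literal backslash, 'x', and exactly two hex digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_hex_escapes(text: str) -> str:
    r"""Replace each literal \xNN sequence with the character it names."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


raw = r"{\x22documentReferer\x22:\x22http:\x5C/\x5C/pikabu.ru\x5C/freshitems.php\x22}"
decoded = decode_hex_escapes(raw)   # now ordinary JSON text
parsed = json.loads(decoded)        # JSON's own \/ escapes are resolved here
```

After decoding, \x22 becomes a double quote and \x5C a backslash, so the JSON parser sees "http:\/\/pikabu.ru\/freshitems.php" and yields a plain URL.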

Character showing up as diamond question mark only at end of line (Python>Text)

雨燕双飞 submitted on 2019-12-23 17:10:15
Question: I'm working on a Python file that reads a text file containing Japanese characters (UTF-8), takes some of the text, and writes it into a new UTF-8 text file. The problem I'm coming across is that, for some reason, whenever the Japanese character だ appears at the end of a line in the original input file, it comes out as a diamond question mark in the output file. Instances of だ before the end of a line read perfectly fine, and the original input file has it reading perfectly fine even if it's at
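The excerpt ends before any diagnosis, so the actual cause is not known here. One thing that may be worth checking, offered only as an assumption, is whether the final byte of だ is being lost. Its UTF-8 encoding is E3 81 A0, and 0xA0 is a no-break space in Latin-1, so byte-level whitespace stripping could plausibly remove it. The sketch below only demonstrates what a lost trailing byte looks like after decoding.

```python
DA = "\u3060"  # the character だ

encoded = DA.encode("utf-8")     # three bytes: e3 81 a0
truncated = encoded[:-1]         # simulate losing the final byte 0xA0
# An incomplete sequence decodes to U+FFFD, the "diamond question mark".
recovered = truncated.decode("utf-8", errors="replace")
```

If this matches the symptom, decoding the input to text before stripping line endings (for example, opening the file with encoding="utf-8") would avoid operating on raw bytes.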

Difference between readAsBinaryString and readAsText using FileReader

自古美人都是妖i submitted on 2019-12-23 08:07:15
Question: As an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back when I read it using FileReader.readAsText(blob), which is expected. But when I use FileReader.readAsBinaryString(blob), I get the result \xcf\x80 instead, which doesn't seem to have any visible correlation with the pi character. What's going on? (This probably has something to do with the way UTF-8/16 is encoded...) Answer 1: Oh well, if that's all you needed... :) CF80
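The link between U+03C0 and the bytes CF 80 is the two-byte UTF-8 pattern 110xxxxx 10xxxxxx. The following sketch builds those bytes by hand. It uses Python for illustration, and encode_two_byte is a hypothetical helper that covers only the U+0080 to U+07FF range.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Build the 2-byte UTF-8 form 110xxxxx 10xxxxxx for U+0080..U+07FF."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)           # top 5 bits of the code point
    trail = 0b10000000 | (code_point & 0b00111111)  # low 6 bits of the code point
    return bytes([lead, trail])


PI = 0x03C0
manual = encode_two_byte(PI)               # bytes cf 80
builtin = chr(PI).encode("utf-8")          # the standard encoder agrees
# Mapping each byte to one character mimics what a binary string shows.
as_binary_string = manual.decode("latin-1")
```

Here manual and builtin are both b'\xcf\x80', and as_binary_string is the two-character string '\xcf\x80', which is the result the question describes.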

Control configure set Apache Spark UTF encoding for writing as saveAsTextFile

…衆ロ難τιáo~ submitted on 2019-12-22 13:33:13
Question: So how does one tell Spark which UTF to use when using saveAsTextFile(path)? Of course, if it's known that all the Strings are UTF-8, then it will save space on disk by 2x! (Assuming the default UTF is 16, like Java.) Answer 1: saveAsTextFile actually uses Text from Hadoop, which is encoded as UTF-8.

    def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
      this.map(x => (NullWritable.get(), new Text(x.toString)))
        .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
    }
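The "2x" figure in the question depends on the text being mostly ASCII. A quick size comparison illustrates this. The sketch is in Python rather than Scala and is independent of Spark, and encoded_sizes is a hypothetical helper.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_sizes = encoded_sizes("hello")   # ASCII: 1 byte vs 2 bytes per character
cjk_sizes = encoded_sizes("日本語")     # BMP CJK: 3 bytes vs 2 bytes per character
```

For "hello" the result is 5 bytes in UTF-8 against 10 in UTF-16, but for "日本語" it is 9 against 6, so UTF-8 is not always the smaller encoding.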

HTML Unicode Issue: How to display special characters

为君一笑 submitted on 2019-12-22 11:04:20
Question: Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc.), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue? Answer 1: A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document
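The kind of mismatch the answer describes can be reproduced in a few lines. This sketch assumes, purely for illustration, that the page's bytes were saved as windows-1252 while being declared as UTF-8; the excerpt does not say which encoding was actually involved.

```python
EM_DASH = "\u2014"

# Assumed scenario: the file was saved as windows-1252, where an em dash is 0x97.
cp1252_bytes = EM_DASH.encode("cp1252")

# Read as UTF-8, a lone 0x97 is an invalid byte and becomes U+FFFD.
misread = cp1252_bytes.decode("utf-8", errors="replace")

# Read with the encoding it was saved in, the em dash comes back intact.
correct = cp1252_bytes.decode("cp1252")
```

The usual remedy is to make the declared encoding and the actual bytes agree, for example by saving the file as UTF-8.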

fatal error: high- and low-surrogate code points are not valid Unicode scalar values [duplicate]

五迷三道 submitted on 2019-12-22 07:06:48
Question: Sometimes initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values. What is this error, why does it occur, and how can I prevent it in the future? Answer 1: Background: UTF-16 represents a sequence of Unicode characters ("code points") as a sequence of 16-bit
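The value 57292 is 0xDFCC, which falls in the low-surrogate range U+DC00 to U+DFFF, and that is why it is rejected. The sketch below shows the check and one way to draw random values that skip the surrogate block. It is written in Python rather than Swift for illustration, and both function names are hypothetical.

```python
import random

SURROGATE_START = 0xD800
SURROGATE_END = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_unicode_scalar(value: int) -> bool:
    """A scalar value is any code point that is not a surrogate."""
    in_range = 0 <= value <= MAX_CODE_POINT
    return in_range and not (SURROGATE_START <= value <= SURROGATE_END)


def random_scalar(rng: random.Random) -> int:
    """Pick uniformly among scalar values by skipping the surrogate block."""
    surrogate_count = SURROGATE_END - SURROGATE_START + 1  # 0x800 code points
    n = rng.randrange(MAX_CODE_POINT + 1 - surrogate_count)
    return n if n < SURROGATE_START else n + surrogate_count


rng = random.Random(0)
samples = [random_scalar(rng) for _ in range(1000)]
```

Every value in samples passes is_unicode_scalar, whereas is_unicode_scalar(57292) is False.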