non-ascii-characters

How to convert \xXY encoded characters to UTF-8 in Python?

☆樱花仙子☆ submitted on 2019-12-22 06:00:30

Question: I have a text which contains characters such as "\xaf" and "\xbe", which, as I understand it from this question, are ASCII-encoded characters. I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws a UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library? Sample 200 characters here.

Answer 1: Your file is already a UTF-8 encoded file.

    # saved encoding-sample to /tmp/encoding-sample
    import codecs
    fp = codecs.open("/tmp/encoding
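The answer's code excerpt breaks off above. As a minimal, self-contained sketch of the point it makes (bytes that are already valid UTF-8 just need to be decoded, not re-encoded), here is a Python 3 illustration; the sample byte string is made up:

    # Raw bytes as they would sit in the file; decoding them as UTF-8
    # yields the corresponding Unicode characters.
    raw = b"caf\xc3\xa9 \xc2\xaf\xc2\xbe"      # illustrative sample
    print(raw.decode("utf-8"))                 # café ¯¾

    # Reading a whole file the same way in Python 3:
    with open("/tmp/encoding-sample", encoding="utf-8") as fp:
        text = fp.read()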

Compare two string and ignore (but not replace) accents. PHP

僤鯓⒐⒋嵵緔 submitted on 2019-12-22 04:43:34

Question: I have (for example) two strings:

    $a = "joao";
    $b = "joão";
    if (strtoupper($a) == strtoupper($b)) { echo $b; }

I want the comparison to be true despite the accent, but I need it to ignore the accent rather than replace it, because I need it to echo "joão" and not "joao". All the answers I've seen replace "ã" with "a" instead of making the comparison true. I've been reading about normalizing, but I can't make that work either. Any ideas? Thank you.

Answer 1: Just convert the accents to their non
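The answer continues beyond this excerpt. The general idea is to normalize both strings and drop the combining accent marks only for the comparison, while echoing the untouched original. A rough Python 3 sketch of that approach (in PHP, the intl extension's Normalizer class plays the role of unicodedata.normalize here):

    import unicodedata

    def fold(s):
        # Decompose accented characters (NFD) and drop the combining marks,
        # so "joao" and "joão" compare equal; the originals stay untouched.
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()

    a, b = "joao", "joão"
    if fold(a) == fold(b):
        print(b)   # prints the original, still accented "joão"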

Solr accent removal

非 Y 不嫁゛ submitted on 2019-12-22 00:29:18

Question: I have read various threads about how to remove accents at index/query time. The current fieldType I have come up with looks like the following:

    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

After adding a couple of test documents to the index, I checked via http://localhost:8080/solr/test
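ASCIIFoldingFilterFactory maps characters outside the basic Latin range to their closest ASCII equivalents, and LowerCaseFilterFactory lower-cases the tokens, so accented and unaccented forms end up as the same indexed term. A rough Python approximation of that folding, purely to illustrate the effect (this is not Solr code):

    import unicodedata

    def fold(token):
        # Decompose, drop anything non-ASCII (the combining marks), lower-case.
        ascii_form = unicodedata.normalize("NFKD", token).encode("ascii", "ignore")
        return ascii_form.decode("ascii").lower()

    print(fold("Résumé") == fold("resume"))   # True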

Similar looking UTF8 characters for ASCII

眉间皱痕 submitted on 2019-12-22 00:13:12

Question: I'm looking for a table that contains ASCII characters and similar-looking UTF-8 characters. I know whether they look the same also depends on the font, but something generic to start with is enough.

    >>> # PY3 code:
    >>> a='H' # ascii
    >>> b='Н' # utf8
    >>> a==b
    False
    >>> ' '.join(format(ord(x), 'b') for x in a)
    '1001000'
    >>> ' '.join(format(ord(x), 'b') for x in b)
    '10000011101'
    >>> a='P' # ascii
    >>> b='Ρ' # utf8
    >>> a==b
    False
    >>> ' '.join(format(ord(x), 'b') for x in a)
    '1010000'
    >>> ' '.join
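The transcript breaks off above, but the point it demonstrates is that the look-alike glyphs are entirely different code points: 'Н' is Cyrillic and 'Ρ' is Greek, which is why the binary forms printed above differ (0x41D is the '10000011101' shown for 'Н'). A small Python 3 check makes that explicit; note that the Unicode Consortium also publishes a ready-made table of such look-alikes, confusables.txt, as part of UTS #39:

    import unicodedata

    for ch in ("H", "Н", "P", "Ρ"):
        print(ch, hex(ord(ch)), unicodedata.name(ch))

    # H 0x48  LATIN CAPITAL LETTER H
    # Н 0x41d CYRILLIC CAPITAL LETTER EN
    # P 0x50  LATIN CAPITAL LETTER P
    # Ρ 0x3a1 GREEK CAPITAL LETTER RHO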

Convert special character (i.e. Umlaut) to most likely representation in ascii [duplicate]

☆樱花仙子☆ submitted on 2019-12-21 10:31:10

Question (closed as a duplicate of "PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string"): I am looking for a method, or maybe a conversion table, that knows how to convert umlauts and special characters to their most likely representation in ASCII. Example:

    Ärger = aerger
    Bôhme = bohme
    Søren = soeren
    pjérà = pjera

Anyone any idea? Update: Apart from the good accepted answer, I also found PECL's Normalizer to be quite interesting,
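The accepted answer continues beyond this excerpt. The usual approach is an explicit transliteration table for the language-specific cases (ä becomes ae, ø becomes oe, and so on), with generic accent stripping as a fallback for everything else. A Python sketch of that idea; the mapping below is illustrative, not exhaustive:

    import unicodedata

    # Hand-picked transliterations that plain accent stripping would get wrong.
    TRANSLIT = {
        "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
        "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
        "ø": "oe", "Ø": "Oe", "å": "aa", "Å": "Aa",
    }

    def to_ascii(text):
        text = "".join(TRANSLIT.get(c, c) for c in text)
        # Fallback: decompose and drop any remaining non-ASCII code points.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
        return text.lower()

    for word in ("Ärger", "Bôhme", "Søren", "pjérà"):
        print(word, "->", to_ascii(word))
    # Ärger -> aerger, Bôhme -> bohme, Søren -> soeren, pjérà -> pjera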

How to remove non-ascii characters from XML data

青春壹個敷衍的年華 submitted on 2019-12-20 06:17:32

Question: I have some XML data in the following format. My application is supposed to read it using an XmlReader and do some processing on it. However, for that to happen, I need to remove or replace the first portion of each line, specifically the <��� prefix:

    <���<XML>....data....</XML>
    <���<XML>....data....</XML
    <���<XML>....data....</XML>

and so on. I tried the following after looking at some posts on SO, but no success so far. Any help will be appreciated!

    private static Regex
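The regex in the question is cut off above. One way to sketch the clean-up, here in Python rather than the C# of the question, is to drop everything on each line that precedes the real root element instead of trying to enumerate the garbled bytes themselves (the sample line is reconstructed with U+FFFD replacement characters):

    import re

    line = "<\ufffd\ufffd\ufffd<XML>....data....</XML>"

    # Remove the shortest prefix that ends right before "<XML>".
    cleaned = re.sub(r"^.*?(?=<XML>)", "", line)
    print(cleaned)   # <XML>....data....</XML>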

grep/regex can't find accented word

蓝咒 submitted on 2019-12-20 01:07:30

Question: I'm trying to build a regex that picks out the words in a file whose letters all come from a given set of characters. My problem is that the regex can't find accented words, and my text file has a lot of accented words. My command line is:

    cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
    cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt

And the content of the file is:

    carroça éra éssa roça roco rato onça orça roca

How can I fix it?

Answer 1: If your file is
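The answer continues beyond this excerpt. A common cause is that grep matches the bracket expression byte by byte unless both the locale and the file are UTF-8, so the two bytes of an accented letter are treated as two separate class members. For comparison only (this is not a grep fix), the same filter in Python 3, where the accented letter is a single character in the pattern:

    import re

    # "é" is one character in a Python 3 str, so it sits in the character
    # class just like "r" and "a" do.
    pattern = re.compile(r"[éra]{1,4}")

    words = ["carroça", "éra", "éssa", "roça", "roco", "rato", "onça", "orça", "roca"]
    print([w for w in words if pattern.fullmatch(w)])   # ['éra']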

Node JS crypto, cannot create hmac on chars with accents

情到浓时终转凉″ submitted on 2019-12-18 19:49:43

Question: I am having an issue generating the correct signature in Node.js (using the crypto module) when the text I am trying to sign has accented characters (such as ä, ï, ë):

    generateSignature = function (str, secKey) {
        var hmac = crypto.createHmac('sha1', secKey);
        var sig = hmac.update(str).digest('hex');
        return sig;
    };

This function will return the correct HMAC signature if 'str' contains no accented characters. If there are accented characters present in the text, it will not return the
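The answer continues beyond this excerpt. The usual culprit is that an HMAC is computed over bytes, and accented characters map to different byte sequences depending on the encoding chosen; older Node versions interpreted string input to hmac.update() as 'binary' (latin-1) by default rather than UTF-8. A Python 3 sketch that makes the dependence on encoding visible (the key and message are made up):

    import hashlib
    import hmac

    key = b"secret-key"                 # illustrative key
    message = "Höchstwert ëï"           # text containing accented characters

    utf8_sig = hmac.new(key, message.encode("utf-8"), hashlib.sha1).hexdigest()
    latin1_sig = hmac.new(key, message.encode("latin-1"), hashlib.sha1).hexdigest()

    # Different byte encodings of the same text give different signatures.
    print(utf8_sig == latin1_sig)       # False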

Regex accent insensitive?

删除回忆录丶 submitted on 2019-12-17 18:52:59

Question: I need a regex in a C# program. I have to capture a filename with a specific structure. I used the \w character class, but the problem is that this class doesn't match any accented character. So how can I do this? I don't want to just list the most commonly used accented letters in my pattern, because in theory any accent can be put on any letter. So I thought maybe there is a syntax to say we want it case insensitive (or a class which takes accents into account), or a Regex option which allows me to be case
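The question is cut off above. A common way to express "any letter, accented or not" in .NET regexes is the Unicode letter property \p{L}, rather than listing accented characters by hand. For reference only, Python 3's re module already treats \w as Unicode-aware on str input, which illustrates the same idea:

    import re

    # In Python 3, \w on a str matches accented letters out of the box.
    print(re.findall(r"\w+", "fiancée café Bôhme"))
    # ['fiancée', 'café', 'Bôhme']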

Remove non-ASCII non-printable characters from a String

蓝咒 submitted on 2019-12-17 10:26:06

Question: I get user input that includes non-ASCII and non-printable characters, such as \xc2d \xa0 \xe7 \xc3\ufffdd \xc3\ufffdd \xc2\xa0 \xc3\xa7 \xa0\xa0. For example:

    email : abc@gmail.com\xa0\xa0
    street : 123 Main St.\xc2\xa0

Desired output:

    email : abc@gmail.com
    street : 123 Main St.

What is the best way to remove them using Java? I tried the following, but it doesn't seem to work:

    public static void main(String args[]) throws UnsupportedEncodingException {
        String s = "abc@gmail\\xe9.com";
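The code in the question is cut off above. The core of most answers to this is a single substitution that deletes everything outside the printable ASCII range (0x20 through 0x7E); the same character-class pattern works with Java's String.replaceAll. Sketched here in Python 3:

    import re

    def strip_non_printable_ascii(s):
        # Keep only printable ASCII: space (0x20) through tilde (0x7E).
        return re.sub(r"[^\x20-\x7E]", "", s)

    print(strip_non_printable_ascii("abc@gmail.com\xa0\xa0"))   # abc@gmail.com
    print(strip_non_printable_ascii("123 Main St.\xc2\xa0"))    # 123 Main St.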