diacritics | 易学教程

Replace diacritic characters with “equivalent” ASCII in PHP?

Question: Related questions: How to replace characters in a Java String? How to replace special characters with their equivalent (such as " á " for " a") in C#? As in the questions above, I'm looking for a reliable, robust way to reduce any Unicode character to a near-equivalent ASCII character using PHP. I really want to avoid rolling my own lookup table. For example (stolen from the first referenced question): Gračišće becomes Gracisce

Answer 1: The iconv module can do this, more specifically, the iconv() function: $str =
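For comparison, the same kind of folding can be sketched in Python without a hand-written lookup table. This is not the iconv() approach from the answer: it swaps in Unicode NFKD decomposition followed by dropping every non-ASCII code point. The function name `to_ascii` is made up for illustration.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Fold accented Latin letters to plain ASCII (hypothetical helper)."""
    # NFKD splits a letter such as "č" into "c" plus a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks, and also any
    # other character that has no ASCII decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Gračišće"))  # Gracisce
```

One caveat of this sketch: letters with no decomposition (for example "đ" or "ø") are deleted rather than transliterated, so it is only a near-equivalent for text made of base letters plus accents.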

Javascript Regex + Unicode Diacritic Combining Characters

Question: I want to match the character 'ẹ́' from the African Yoruba language. It is usually made by combining an 'é' with '\u0323', the combining dot below. I found that 'é\u0323'.match(/[é]\u0323/) works, but 'ẹ́'.match(/[é]\u0323/) does not. I don't want to match just e; I want to match all combinations. Right now, my solution involves enumerating all combinations, like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/ Could there not be a
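One way to avoid enumerating every combination is to normalize the input to NFD first, so that each accented letter becomes a base letter followed by combining marks in a canonical order, and then match the base letter plus any run of marks. The sketch below shows the idea in Python rather than JavaScript. The helper name `is_e_with_marks` is hypothetical, and it assumes the relevant marks all fall in the Combining Diacritical Marks block (U+0300 to U+036F).

```python
import re
import unicodedata

# A base "e" followed by zero or more combining diacritical marks.
E_WITH_MARKS = re.compile("[eE][\u0300-\u036f]*")


def is_e_with_marks(text: str) -> bool:
    """Return True if text is an 'e' carrying any stack of combining marks."""
    # NFD decomposes precomposed letters and puts marks in canonical order,
    # so differently typed versions of the same character compare equal.
    decomposed = unicodedata.normalize("NFD", text)
    return E_WITH_MARKS.fullmatch(decomposed) is not None


precomposed = "\u1eb9\u0301"   # e-with-dot-below, then combining acute
typed_other = "\u00e9\u0323"   # e-with-acute, then combining dot below
print(is_e_with_marks(precomposed), is_e_with_marks(typed_other))
```

Both spellings decompose to the same NFD sequence, which is why a single pattern covers them.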

removing accent and special characters [duplicate]

Question: This question already has answers here. Closed 7 years ago. Possible duplicates: What is the best way to remove accents in a python unicode string? Python and character normalization. I would like to remove accents, turn all characters to lowercase, and delete any numbers and special characters. Example: Frédér8ic@ --> frederic Proposal: def remove_accents(data): return ''.join(x for x in unicodedata.normalize('NFKD', data) if \ unicodedata.category(x)[0] == 'L').lower() Is there any better
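The proposal quoted in the question, laid out as a runnable snippet with the import it needs. The logic is unchanged: decompose with NFKD, keep only code points whose Unicode category starts with "L" (letters), then lowercase.

```python
import unicodedata


def remove_accents(data: str) -> str:
    # NFKD turns "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", data)
    # Keep letters only; combining marks (Mn), digits (Nd), punctuation
    # and whitespace all fall outside the "L" categories and are dropped.
    letters = (ch for ch in decomposed if unicodedata.category(ch).startswith("L"))
    return "".join(letters).lower()


print(remove_accents("Frédér8ic@"))  # frederic
```

Note that spaces are removed too, since they are not letters, so this fits single tokens better than whole sentences.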

What are the unicode ranges for Hindi accented characters?

Question: I'm trying to gather a Unicode list of all the 'o'-like shapes in the Hindi character set. In fact, a list of any characters (in any language) that make use of separate characters to indicate an accent would be better. I intend to use this Unicode list in a RegExp. I have been trying to edit a list of character ranges by outputting them in an Input TextField, but editing this text causes weird issues (the keyboard cursor isn't placed on the correct character, selections suddenly disappear /

How do i replace accents (german) in .NET

Question: I need to replace accented characters in a string with their English equivalents, for example ä = ae, ö = oe, Ö = Oe, ü = ue. I know how to strip them from a string, but I was not aware of a way to replace them. Please let me know if you have any suggestions. I am coding in C#.

Answer 1: If you need to use this on larger strings, multiple calls to Replace() can get inefficient pretty quickly. You may be better off rebuilding your string character-by-character: var map = new Dictionary<char, string>() { { 'ä', "ae" }, { 'ö',
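The character-by-character rebuild described in the answer can be sketched as follows, in Python rather than C#. The map contains only the four pairs given in the question; the names `UMLAUT_MAP` and `replace_umlauts` are made up for illustration, and the NFC normalization step is an added assumption so that decomposed input (a base letter plus a combining diaeresis) is also handled.

```python
import unicodedata

# Only the pairs listed in the question; extend as needed.
UMLAUT_MAP = {"ä": "ae", "ö": "oe", "Ö": "Oe", "ü": "ue"}


def replace_umlauts(text: str) -> str:
    # NFC recombines "u" + combining diaeresis into the single character "ü",
    # so the dictionary lookup sees one code point per umlaut.
    composed = unicodedata.normalize("NFC", text)
    # Single pass over the string; unmapped characters are kept as they are.
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in composed)


print(replace_umlauts("Öl für König"))  # Oel fuer Koenig
```

A single pass with a dictionary lookup avoids scanning the whole string once per replacement pair, which is the inefficiency the answer warns about.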

Get all possible string sequences where each element comes from different set in R

Question: Assume I have 4 vectors of character elements: s1 <- c("o", "ó") s2 <- c("c", "ć") s3 <- c("o", "ó") s4 <- c("z", "ź", "ż") I want to build 4-element vectors covering all possible combinations of elements from s1, s2, s3, s4, such that in each result vector the 1st, 2nd, 3rd and 4th elements come from s1, s2, s3, s4, respectively. For example, I would like to get the following result vectors: [1] "o", "c", "o", "z" [1] "ó", "c", "o", "z" [1] "o", "ć", "o", "z" ... [ My general
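What the question describes is a Cartesian product of the four sets. A minimal sketch in Python (not R), using the same four sets, looks like this:

```python
from itertools import product

s1 = ["o", "ó"]
s2 = ["c", "ć"]
s3 = ["o", "ó"]
s4 = ["z", "ź", "ż"]

# Each tuple takes its 1st element from s1, its 2nd from s2, and so on.
combos = list(product(s1, s2, s3, s4))

print(len(combos))          # 2 * 2 * 2 * 3 combinations
print("".join(combos[0]))   # first combination joined into one string
```

`itertools.product` varies the last set fastest, so the order of the results differs from the listing in the question, which varies the first set fastest; the set of combinations is the same.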

German 'ue' -> 'u' conversion in Lucene

Question: I have two questions regarding the handling of German umlauts in Lucene. First, I'm trying to find a way to convert German umlauts written as 'ue', 'ae', etc. to the folded forms 'u', 'a' and so on. This is done by GermanAnalyzer (and the German2StemFilter it uses), but unfortunately it also does stemming, which is undesirable in my case. Is there any other filter that does only the 'ue' -> 'u' conversion? Second, is there any filter that converts 'ü' -> 'ue' (NOT 'u', as ASCIIFoldingFilter does)? What I'm

Extend Endeca's diacritic folding mapping

Question: We have an index with mixed Greek and English data for an ATG-Endeca application. The indexed Greek data contains words with accents. If the search terms are written without accents, they don't match any data (or they match only because autocorrection maps the unaccented character to the accented one, which is not the desired functionality). The Dgidx flag --diacritic folding configuration doesn't include a mapping for Greek characters (https://docs.oracle.com/cd/E29584_01/webhelp/mdex

mysql query select like with diacritic Turkish letters

Question: I have a token table in Turkish; its default collation is utf8_general_ci. The server runs FreeBSD and the MySQL version is 5.6.15. I want to run: select * from tokens where type like 'âmâ'; or select * from tokens where type='âmâ'; These queries should return a single row for 'âmâ' (it means 'blind' in Turkish also). Instead I get four rows: result 1 "amâ" means 'but'; result 2 "ama" means 'but'; result 3 "âma" means 'blind'; result 4 "âmâ" means 'blind'. That is not what I want. I tried different

replace only matches the beginning of the string

Question: I'm trying to write a function to replace the Romanian diacritic letters ( ĂÂÎȘȚ ) with their Latin letter equivalents ( AAIST , respectively). SQL Server's replace function deals with Ă , Â , and Î just fine. It seems to have a weird problem with Ș and Ț , though: they are only replaced if they are found at the beginning of the string. For example: select replace(N'Ș', N'Ș', N'-') -- '-' # OK select replace(N'ȘA', N'Ș', N'-') -- '-A' # OK select replace(N'AȘ', N'Ș', N'-') -- 'AȘ' # WHAT??