I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the lett
TL;DR - C# string extension method
I think the best way to preserve the meaning of the string is to convert the accented characters instead of stripping them. The example crème brûlée illustrates this well: stripping gives crme brle, whereas converting gives creme brulee.
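To make the difference concrete, here is a minimal sketch contrasting the two approaches. It is not the gist's code: the class name `FoldingSketch`, the methods `StripNonAscii` and `FoldSample`, and the tiny mapping table are all hypothetical names made up for illustration, and the table covers only a handful of characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class FoldingSketch
{
    // Hypothetical, deliberately tiny mapping table (assumption for illustration).
    // A real folding table covers far more characters than this.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['\u00E0'] = "a",  // à
        ['\u00E7'] = "c",  // ç
        ['\u00E8'] = "e",  // è
        ['\u00E9'] = "e",  // é
        ['\u00FB'] = "u",  // û
        ['\u0153'] = "oe", // œ
    };

    // Stripping: drop every character outside the ASCII range.
    public static string StripNonAscii(string input) =>
        new string(input.Where(c => c <= 127).ToArray());

    // Folding: replace each mapped character with its ASCII equivalent
    // and pass everything else through unchanged.
    public static string FoldSample(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (var c in input)
        {
            sb.Append(Map.TryGetValue(c, out var ascii) ? ascii : c.ToString());
        }
        return sb.ToString();
    }
}
```

With the input "crème brûlée", `StripNonAscii` returns "crme brle" and `FoldSample` returns "creme brulee". Note that this sketch leaves unmapped non-ASCII characters in place, which is only one possible design choice.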
I checked out Alexander's comment above and saw the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:
var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength);
// "creme brulee"
The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol) so I made a gist and attributed the authors:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// <list type="bullet">
/// <item><description>C1 Controls and Latin-1 Supplement: http://www.unicode.org/charts/PDF/U0080.pdf</description></item>
/// <item><description>Latin Extended-A: http://www.unicode.org/charts/PDF/U0100.pdf</description></item>
/// <item><description>Latin Extended-B: http://www.unicode.org/charts/PDF/U0180.pdf</description></item>
/// <item><description>Latin Extended Additional: http://www.unicode.org/charts/PDF/U1E00.pdf</description></item>
/// <item><description>Latin Extended-C: http://www.unicode.org/charts/PDF/U2C60.pdf</description></item>
/// <item><description>Latin Extended-D: http://www.unicode.org/charts/PDF/UA720.pdf</description></item>
/// <item><description>IPA Extensions: http://www.unicode.org/charts/PDF/U0250.pdf</description></item>
/// <item><description>Phonetic Extensions: http://www.unicode.org/charts/PDF/U1D00.pdf</description></item>
/// <item><description>Phonetic Extensions Supplement: http://www.unicode.org/charts/PDF/U1D80.pdf</description></item>
/// <item><description>General Punctuation: http://www.unicode.org/charts/PDF/U2000.pdf</description></item>
/// <item><description>Superscripts and Subscripts: http://www.unicode.org/charts/PDF/U2070.pdf</description></item>
/// <item><description>Enclosed Alphanumerics: http://www.unicode.org/charts/PDF/U2460.pdf</description></item>
/// <item><description>Dingbats: http://www.unicode.org/charts/PDF/U2700.pdf</description></item>
/// <item><description>Supplemental Punctuation: http://www.unicode.org/charts/PDF/U2E00.pdf</description></item>
/// <item><description>Alphabetic Presentation Forms: http://www.unicode.org/charts/PDF/UFB00.pdf</description></item>
/// <item><description>Halfwidth and Fullwidth Forms: http://www.unicode.org/charts/PDF/UFF00.pdf</description></item>
/// </list>
/// <para/>
/// See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
/// <para/>
/// For example, 'à' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents. For example,
    /// accents are removed from accented characters.
    /// </summary>
    /// <param name="input">The string of characters to fold</param>
    /// <param name="length">The length of the folded return string</param>
    /// <returns>The folded string</returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // Body omitted here because of its size.
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}
Hope that helps someone else. This is the most robust solution I've found!