C# How to replace Microsoft's Smart Quotes with straight quotation marks?

花落未央 2020-12-08 00:50

My post below asked what the curly quotation marks were and why my app wouldn't work with them. My question now is: how can I replace them when my program comes across them?

12 answers
  • 2020-12-08 01:19

    I have a whole great big... program... that does precisely this. You can rip out the script and use it at your leisure. It does all sorts of replacements and is located at http://bitbucket.org/nesteruk/typografix

  • 2020-12-08 01:20

    Just chiming in: I did this with Regex.Replace, handling a few characters at a time, grouped by what I'm replacing them with:

            // Requires "using System.Text.RegularExpressions;" and must be declared inside a static class.
            public static string ReplaceWordChars(this string text)
            {
                var s = text;
                // smart single quotes and apostrophe, single low-9 quotation mark, single high-reversed-9 quotation mark, prime
                s = Regex.Replace(s, "[\u2018\u2019\u201A\u201B\u2032]", "'");
                // smart double quotes, double low-9 quotation mark, double prime
                s = Regex.Replace(s, "[\u201C\u201D\u201E\u2033]", "\"");
                // ellipsis
                s = Regex.Replace(s, "\u2026", "...");
                // en and em dashes
                s = Regex.Replace(s, "[\u2013\u2014]", "-");
                // horizontal bar
                s = Regex.Replace(s, "\u2015", "-");
                // double low line
                s = Regex.Replace(s, "\u2017", "-");
                // modifier letter circumflex
                s = Regex.Replace(s, "\u02C6", "^");
                // single left-pointing angle quotation mark
                s = Regex.Replace(s, "\u2039", "<");
                // single right-pointing angle quotation mark
                s = Regex.Replace(s, "\u203A", ">");
                // small tilde and non-breaking space
                s = Regex.Replace(s, "[\u02DC\u00A0]", " ");
                // one half
                s = Regex.Replace(s, "[\u00BD]", "1/2");
                // one quarter
                s = Regex.Replace(s, "[\u00BC]", "1/4");
                // bullet
                s = Regex.Replace(s, "[\u2022]", "*");
                // degree sign
                s = Regex.Replace(s, "[\u00B0]", " degrees");
    
                return s;
            }
    

    Also a few more replacements in there.
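
    As a possible variation on the approach above, the separate passes can be folded into one regex pass with a lookup table. This is only a sketch: the names `WordCharSketch`, `Map`, `Pattern` and `ReplaceWordCharsOnce` are made up for illustration, and the table covers just a handful of the characters listed above.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class WordCharSketch
    {
        // Hypothetical lookup table; extend it with whatever characters you need.
        private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
        {
            { "\u2018", "'" }, { "\u2019", "'" },   // smart single quotes
            { "\u201C", "\"" }, { "\u201D", "\"" }, // smart double quotes
            { "\u2026", "..." },                    // ellipsis
            { "\u2013", "-" }, { "\u2014", "-" }    // en dash, em dash
        };

        // One character class that covers every key in the table.
        private static readonly Regex Pattern =
            new Regex("[\u2018\u2019\u201C\u201D\u2026\u2013\u2014]", RegexOptions.Compiled);

        public static string ReplaceWordCharsOnce(string text)
        {
            // Each match is a single character, so its value is a key in Map.
            return Pattern.Replace(text, m => Map[m.Value]);
        }

        public static void Main()
        {
            // prints "It's done..." (including the straight double quotes)
            Console.WriteLine(ReplaceWordCharsOnce("\u201CIt\u2019s done\u2026\u201D"));
        }
    }
    ```

    The character class and the table have to be kept in sync by hand; a key missing from `Map` would throw a KeyNotFoundException.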

  • 2020-12-08 01:21

    Note that what you have is inherently a corrupt CSV file. Indiscriminately replacing all typographer's quotes with straight quotes won't necessarily fix your file. For all you know, some of the typographer's quotes were supposed to be there, as part of a field's value. Replacing them with straight quotes might not leave you with a valid CSV file, either.

    I don't think there is an algorithmic way to fix a file that is corrupt in the way you describe. Your time might be better spent investigating how you come to have such invalid files in the first place, and then putting a stop to it. Is someone using Word to edit your data files, for instance?
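
    To help with that investigation, something like the following could report which lines contain typographer's quotes before you decide whether to replace anything. It is a rough sketch: `QuoteAudit` and `LinesWithCurlyQuotes` are hypothetical names, and it only looks for the four most common curly quotes.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class QuoteAudit
    {
        // Returns the 1-based numbers of the lines that contain a typographic quote.
        public static List<int> LinesWithCurlyQuotes(IEnumerable<string> lines)
        {
            char[] curly = { '\u2018', '\u2019', '\u201C', '\u201D' };
            var hits = new List<int>();
            int lineNumber = 0;
            foreach (string line in lines)
            {
                lineNumber++;
                if (line.IndexOfAny(curly) >= 0)
                    hits.Add(lineNumber);
            }
            return hits;
        }

        public static void Main()
        {
            var sample = new[]
            {
                "id,name",
                "1,\u201CSmith\u201D",
                "2,\"Jones\""
            };
            Console.WriteLine(string.Join(",", LinesWithCurlyQuotes(sample))); // prints 2
        }
    }
    ```

    For a real file, the result of File.ReadLines(path) can be passed in place of the sample array.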

  • 2020-12-08 01:21

    Using Nick and Barbara's answers, here is example code with performance stats for 1,000,000 loops on my machine:

    // Strings are immutable, so the result of each Replace must be assigned back.
    string input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
    input = input.Replace('\u2013', '-'); // en dash
    input = input.Replace('\u2014', '-'); // em dash
    input = input.Replace('\u2015', '-'); // horizontal bar
    input = input.Replace('\u2017', '_'); // double low line
    input = input.Replace('\u2018', '\''); // left single quotation mark
    input = input.Replace('\u2019', '\''); // right single quotation mark
    input = input.Replace('\u201a', ','); // single low-9 quotation mark
    input = input.Replace('\u201b', '\''); // single high-reversed-9 quotation mark
    input = input.Replace('\u201c', '\"'); // left double quotation mark
    input = input.Replace('\u201d', '\"'); // right double quotation mark
    input = input.Replace('\u201e', '\"'); // double low-9 quotation mark
    input = input.Replace("\u2026", "..."); // horizontal ellipsis
    input = input.Replace('\u2032', '\''); // prime
    input = input.Replace('\u2033', '\"'); // double prime
    
    

    Time: 958.1011 milliseconds

    input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
    var inputArray = input.ToCharArray();
    for (int i = 0; i < inputArray.Length; i++)
    {
        switch (inputArray[i])
        {
            case '\u2013': // en dash
                inputArray[i] = '-';
                break;
            case '\u2014': // em dash
                inputArray[i] = '-';
                break;
            case '\u2015': // horizontal bar
                inputArray[i] = '-';
                break;
            case '\u2017': // double low line
                inputArray[i] = '_';
                break;
            case '\u2018': // left single quotation mark
                inputArray[i] = '\'';
                break;
            case '\u2019': // right single quotation mark
                inputArray[i] = '\'';
                break;
            case '\u201a': // single low-9 quotation mark
                inputArray[i] = ',';
                break;
            case '\u201b': // single high-reversed-9 quotation mark
                inputArray[i] = '\'';
                break;
            case '\u201c': // left double quotation mark
                inputArray[i] = '\"';
                break;
            case '\u201d': // right double quotation mark
                inputArray[i] = '\"';
                break;
            case '\u201e': // double low-9 quotation mark
                inputArray[i] = '\"';
                break;
            case '\u2026': // horizontal ellipsis (one array slot holds a single '.', not "...")
                inputArray[i] = '.';
                break;
            case '\u2032': // prime
                inputArray[i] = '\'';
                break;
            case '\u2033': // double prime
                inputArray[i] = '\"';
                break;
        }
    }
    input = new string(inputArray);
    

    Time: 362.0858 milliseconds
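
    One limitation of the char-array version is that a one-character slot cannot hold "...", so the ellipsis collapses to a single dot. A single pass over a StringBuilder keeps the one-loop shape while allowing multi-character replacements. This is a sketch only and was not part of the timings above; `SinglePassSketch` and `Normalize` are hypothetical names.

    ```csharp
    using System;
    using System.Text;

    public static class SinglePassSketch
    {
        public static string Normalize(string input)
        {
            var sb = new StringBuilder(input.Length);
            foreach (char c in input)
            {
                switch (c)
                {
                    case '\u2013': // en dash
                    case '\u2014': // em dash
                    case '\u2015': // horizontal bar
                        sb.Append('-');
                        break;
                    case '\u2018': // left single quotation mark
                    case '\u2019': // right single quotation mark
                    case '\u201b': // single high-reversed-9 quotation mark
                    case '\u2032': // prime
                        sb.Append('\'');
                        break;
                    case '\u201c': // left double quotation mark
                    case '\u201d': // right double quotation mark
                    case '\u201e': // double low-9 quotation mark
                    case '\u2033': // double prime
                        sb.Append('"');
                        break;
                    case '\u2026': // horizontal ellipsis expands to three characters
                        sb.Append("...");
                        break;
                    default:
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString();
        }

        public static void Main()
        {
            // prints "wait..." - she said (including the straight double quotes)
            Console.WriteLine(Normalize("\u201Cwait\u2026\u201D \u2014 she said"));
        }
    }
    ```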

  • 2020-12-08 01:22

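    If the above doesn't work, the text may have been decoded with the wrong encoding, so that each smart quote shows up as its three UTF-8 bytes (octal 342 200 230 and so on), one character per byte. C# has no octal escapes, so the bytes are written as \u escapes here, assuming each byte was decoded as its own character (as ISO-8859-1 does). Try this for smart single quotes:

    text = text.Replace("\u00E2\u0080\u0098", "'");
    text = text.Replace("\u00E2\u0080\u0099", "'");
    

    Try this as well for smart double quotes:

    text = text.Replace("\u00E2\u0080\u009C", "\"");
    text = text.Replace("\u00E2\u0080\u009D", "\"");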
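
    For background, octal 342 200 230 is the UTF-8 byte sequence E2 80 98 for U+2018, so seeing it usually means the text was decoded with the wrong encoding somewhere upstream. The sketch below reproduces that effect and reverses it, assuming the wrong decoder was ISO-8859-1; with a different wrong encoding, such as Windows-1252, the garbled characters would differ. `MojibakeSketch`, `Garble` and `Repair` are hypothetical names.

    ```csharp
    using System;
    using System.Text;

    public static class MojibakeSketch
    {
        private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Simulates the bad decode: UTF-8 bytes read as one character per byte.
        public static string Garble(string s)
        {
            return Latin1.GetString(Encoding.UTF8.GetBytes(s));
        }

        // Reverses it: recover the original bytes, then decode them as UTF-8.
        public static string Repair(string s)
        {
            return Encoding.UTF8.GetString(Latin1.GetBytes(s));
        }

        public static void Main()
        {
            byte[] utf8 = Encoding.UTF8.GetBytes("\u2018");
            Console.WriteLine(BitConverter.ToString(utf8)); // prints E2-80-98

            string garbled = Garble("\u2018");
            Console.WriteLine(garbled.Length);              // prints 3

            Console.WriteLine(Repair(garbled) == "\u2018"); // prints True
        }
    }
    ```

    If this is what happened to your data, reading the file with the correct encoding in the first place is likely a cleaner fix than replacing the garbled sequences afterwards.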
    
  • 2020-12-08 01:32

    When I encountered this problem I wrote an extension method to the String class in C#.

    public static class StringExtensions
    {
        public static string StripIncompatableQuotes(this string s)
        {
            if (!string.IsNullOrEmpty(s))
                return s.Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201c', '\"').Replace('\u201d', '\"');
            else
                return s;
        }
    }
    

    This simply replaces the silly 'smart quotes' with normal quotes.

    [EDIT] Fixed to also support replacement of 'double smart quotes'.
