Parsing CSV File enclosed with quotes in C#

一整个雨季 2021-01-21 05:50

I've seen lots of samples of parsing CSV files, but this one is kind of an annoying file...

So how do you parse this kind of CSV?

"1",1/2/2010,"The sample ("adas

10 Answers
  •  盖世英雄少女心
    2021-01-21 06:22

    The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
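
    For the conventional case, a minimal sketch of TextFieldParser usage might look like the following. The wrapper names (TextFieldParserSketch, ParseLine) are made up for illustration; only the parser members from Microsoft.VisualBasic.FileIO are real. It assumes one record per string and returns null when the parser rejects the line, which is likely to happen with unescaped inner quotes like the ones in the question.

    ```csharp
    using System;
    using System.IO;
    using Microsoft.VisualBasic.FileIO;

    public static class TextFieldParserSketch
    {
        // Hypothetical helper: parses one conventional CSV line.
        // Returns null if TextFieldParser considers the line malformed.
        public static string[] ParseLine(string line)
        {
            using (var parser = new TextFieldParser(new StringReader(line)))
            {
                parser.TextFieldType = FieldType.Delimited;
                parser.SetDelimiters(",");
                parser.HasFieldsEnclosedInQuotes = true;

                try
                {
                    return parser.ReadFields();
                }
                catch (MalformedLineException)
                {
                    // Non-conforming quoting ends up here rather than being guessed at.
                    return null;
                }
            }
        }
    }
    ```

    Calling TextFieldParserSketch.ParseLine("a,\"b,c\",d") should give three fields, with the quoted comma kept inside the second one.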

    The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.

    So I rewrote for this case as a string extension. I think this is close.

    Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.

    The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.
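
    To make the "usual number of columns" idea a bit more concrete, here is a rough sketch of the first half only: find the most common field count and flag rows that deviate from it. The names (ColumnCountSketch, UsualColumnCount, SuspectRows) are hypothetical, the rows are assumed to have been split already, and nothing here attempts the landmark-based repair described above.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ColumnCountSketch
    {
        // Returns the most common field count (the "usual" width).
        // Ties go to the smaller count. Throws if rows is empty.
        public static int UsualColumnCount(IEnumerable<string[]> rows)
        {
            return rows
                .GroupBy(r => r.Length)
                .OrderByDescending(g => g.Count())
                .ThenBy(g => g.Key)
                .First()
                .Key;
        }

        // Returns the zero-based indexes of rows whose field count
        // differs from the usual one; these are candidates for a closer look.
        public static List<int> SuspectRows(IList<string[]> rows)
        {
            int usual = UsualColumnCount(rows);
            var suspects = new List<int>();

            for (int i = 0; i < rows.Count; i++)
            {
                if (rows[i].Length != usual)
                {
                    suspects.Add(i);
                }
            }

            return suspects;
        }
    }
    ```

    A row that a stray quote split into too many or too few fields would show up in SuspectRows; deciding how to stitch it back together is the "serious crazy logic" part.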

    But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!

    Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.

    namespace YourFavoriteNamespace 
    {
        using System;
        using System.Collections.Generic;
        using System.Text;
    
        public static class Extensions
        {
            public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"', 
                bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
            {
                Queue<string> qReturn = new Queue<string>();
                StringBuilder stringBuilder = new StringBuilder();
    
                bool bInEscapeVal = false;
    
                for (int i = 0; i < valToSplit.Length; i++)
                {
                    if (!bInEscapeVal)
                    {
                        // Escape values must come immediately after a split.
                        // abc,"b,ca",cab has an escaped comma.
                        // abc,b"ca,c"ab does not.
                        if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || i == 0 || splittingChar == valToSplit[i - 1]))
                        {
                            bInEscapeVal = true;    // not capturing escapeChar as part of value; easy enough to change if need be.
                        }
                        else if (splittingChar == valToSplit[i])
                        {
                            qReturn.Enqueue(stringBuilder.ToString());
                            stringBuilder = new StringBuilder();
                        }
                        else
                        {
                            stringBuilder.Append(valToSplit[i]);
                        }
                    }
                    else
                    {
                        // Can't use switch b/c we're comparing to a variable, I believe.
                        if (escapeChar == valToSplit[i])
                        {
                            // Repeated escape always reduces to one escape char in this logic.
                            // So if you wanted "I'm ""double quote"" crazy!" to come out with 
                            // the double double quotes, you're toast.
                            if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
                            {
                                i++;
                                stringBuilder.Append(escapeChar);
                            }
                            else if (!strictEscapeToSplitEvaluation)
                            {
                                bInEscapeVal = false;
                            }
                            // *** STINKY CONDITION ***  
                            // Kinda defense, since only `", ` really makes sense.
                            else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
                                valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
                            {
                                i = i+2;
                                stringBuilder.Append("\", ");
                            }
                            // *** EO STINKY CONDITION ***  
                            else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
                            {
                                bInEscapeVal = false;
                            }
                            else
                            {
                                stringBuilder.Append(escapeChar);
                            }
                        }
                        else
                        {
                            stringBuilder.Append(valToSplit[i]);
                        }
                    }
                }
    
                // NOTE: The `captureEndingNull` flag is not tested.
                // Catch null final entry?  "abc,cab,bca," could be four entries, with the last an empty string.
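                // The Length check keeps an empty input string from throwing on the index below.
                if ((captureEndingNull && valToSplit.Length > 0 && splittingChar == valToSplit[valToSplit.Length - 1]) || (stringBuilder.Length > 0))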
                {
                    qReturn.Enqueue(stringBuilder.ToString());
                }
    
                return qReturn;
            }
        }
    }
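
    For comparison, the "normal" quote-aware splitter mentioned above might look roughly like this. It is only a sketch with made-up names (ConventionalCsvSketch, SplitLine), assuming the usual convention that a doubled quote inside a quoted field means one literal quote. It has no strict mode and no STINKY CONDITION, so it will mangle the line from the question.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class ConventionalCsvSketch
    {
        // Hypothetical helper: splits one line on the delimiter,
        // ignoring delimiters that appear inside quoted sections.
        public static List<string> SplitLine(string line, char delimiter = ',', char quote = '"')
        {
            var fields = new List<string>();
            var current = new StringBuilder();
            bool inQuotes = false;

            for (int i = 0; i < line.Length; i++)
            {
                char c = line[i];

                if (inQuotes)
                {
                    if (c != quote)
                    {
                        current.Append(c);
                    }
                    else if (i + 1 < line.Length && line[i + 1] == quote)
                    {
                        // Doubled quote inside a quoted section: keep one literal quote.
                        current.Append(quote);
                        i++;
                    }
                    else
                    {
                        inQuotes = false;
                    }
                }
                else if (c == quote)
                {
                    // Lenient: any quote outside a quoted section opens one.
                    inQuotes = true;
                }
                else if (c == delimiter)
                {
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    current.Append(c);
                }
            }

            // The last field is always added, so a trailing delimiter yields an empty final field.
            fields.Add(current.ToString());
            return fields;
        }
    }
    ```

    With this, abc,"b,ca",cab splits into abc, b,ca and cab, which is the tedious-but-trivial behavior; everything beyond that in the extension above exists to cope with quotes that don't follow the rules.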
    

    Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)

    [Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.

    If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.

    Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.

    Enjoy. ;^)
