Parsing CSV File enclosed with quotes in C#

一整个雨季 2021-01-21 05:50

I've seen lots of samples of parsing CSV files, but this one is a rather annoying file...

So how do you parse this kind of CSV?

"1",1/2/2010,"The sample ("adas

10 answers
  • 2021-01-21 06:22

    The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
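
    For the conventional case, a minimal sketch of how TextFieldParser is typically used might look like this. The `TextFieldParserSketch` class and `ReadRows` method are made-up names for illustration, and the sketch assumes well-formed input with comma delimiters; on a line like the one in the question the parser may throw a MalformedLineException instead of returning fields.

    ```csharp
    using System.Collections.Generic;
    using System.IO;
    using Microsoft.VisualBasic.FileIO;

    public static class TextFieldParserSketch
    {
        // Hypothetical helper: parses CSV text into rows of fields.
        // Quoted fields may contain commas; the enclosing quotes are removed.
        public static List<string[]> ReadRows(string csvText)
        {
            var rows = new List<string[]>();
            using (var parser = new TextFieldParser(new StringReader(csvText)))
            {
                parser.TextFieldType = FieldType.Delimited;
                parser.SetDelimiters(",");
                parser.HasFieldsEnclosedInQuotes = true;

                while (!parser.EndOfData)
                {
                    rows.Add(parser.ReadFields());
                }
            }
            return rows;
        }
    }
    ```

    Calling `TextFieldParserSketch.ReadRows("\"a\",\"b,c\",d")` should give one row with the three fields a, b,c and d.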

    The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.

    So I rewrote for this case as a string extension. I think this is close.

    Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.

    The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.

    But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!

    Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.

    namespace YourFavoriteNamespace 
    {
        using System;
        using System.Collections.Generic;
        using System.Text;
    
        public static class Extensions
        {
            public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"', 
                bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
            {
                Queue<string> qReturn = new Queue<string>();
                StringBuilder stringBuilder = new StringBuilder();
    
                bool bInEscapeVal = false;
    
                for (int i = 0; i < valToSplit.Length; i++)
                {
                    if (!bInEscapeVal)
                    {
                        // Escape values must come immediately after a split.
                        // abc,"b,ca",cab has an escaped comma.
                        // abc,b"ca,c"ab does not.
                        if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || i == 0 || splittingChar == valToSplit[i - 1]))
                        {
                            bInEscapeVal = true;    // not capturing escapeChar as part of value; easy enough to change if need be.
                        }
                        else if (splittingChar == valToSplit[i])
                        {
                            qReturn.Enqueue(stringBuilder.ToString());
                            stringBuilder = new StringBuilder();
                        }
                        else
                        {
                            stringBuilder.Append(valToSplit[i]);
                        }
                    }
                    else
                    {
                        // Can't use switch b/c we're comparing to a variable, I believe.
                        if (escapeChar == valToSplit[i])
                        {
                            // Repeated escape always reduces to one escape char in this logic.
                            // So if you wanted "I'm ""double quote"" crazy!" to come out with 
                            // the double double quotes, you're toast.
                            if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
                            {
                                i++;
                                stringBuilder.Append(escapeChar);
                            }
                            else if (!strictEscapeToSplitEvaluation)
                            {
                                bInEscapeVal = false;
                            }
                            // *** STINKY CONDITION ***  
                            // Kinda defense, since only `", ` really makes sense.
                            else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
                                valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
                            {
                                i = i+2;
                                stringBuilder.Append("\", ");
                            }
                            // *** EO STINKY CONDITION ***  
                            else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
                            {
                                bInEscapeVal = false;
                            }
                            else
                            {
                                stringBuilder.Append(escapeChar);
                            }
                        }
                        else
                        {
                            stringBuilder.Append(valToSplit[i]);
                        }
                    }
                }
    
                // NOTE: The `captureEndingNull` flag is not tested.
                // Catch null final entry?  "abc,cab,bca," could be four entries, with the last an empty string.
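                // The Length check keeps an empty input string from indexing out of range.
                if ((captureEndingNull && valToSplit.Length > 0 && splittingChar == valToSplit[valToSplit.Length - 1]) || (stringBuilder.Length > 0))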
                {
                    qReturn.Enqueue(stringBuilder.ToString());
                }
    
                return qReturn;
            }
        }
    }
    

    Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)

    [Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.

    If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.

    Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
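
    A stripped-down sketch of just that strict rule, for illustration. `StrictSplitSketch.SplitStrict` is a hypothetical name, not the extension above: it assumes a quote only opens a value right after a splitter (or at the start of the line) and only closes it right before a splitter (or at the end of the line). It does not collapse doubled quotes and has no STINKY handling, so a `",` inside a quoted value still ends that value early.

    ```csharp
    using System.Collections.Generic;
    using System.Text;

    public static class StrictSplitSketch
    {
        public static List<string> SplitStrict(string line, char delimiter = ',', char quote = '"')
        {
            var fields = new List<string>();
            var current = new StringBuilder();
            bool inQuotes = false;
            bool atFieldStart = true;

            for (int i = 0; i < line.Length; i++)
            {
                char c = line[i];
                if (inQuotes)
                {
                    // A quote closes the value only if a delimiter or the end of the line follows.
                    bool closes = c == quote && (i + 1 == line.Length || line[i + 1] == delimiter);
                    if (closes) { inQuotes = false; }
                    else { current.Append(c); }
                }
                else if (c == quote && atFieldStart)
                {
                    // A quote opens a value only at the very start of a field.
                    inQuotes = true;
                    atFieldStart = false;
                }
                else if (c == delimiter)
                {
                    fields.Add(current.ToString());
                    current.Clear();
                    atFieldStart = true;
                }
                else
                {
                    // Any other character, including a mid-field quote, is plain data.
                    current.Append(c);
                    atFieldStart = false;
                }
            }

            fields.Add(current.ToString());
            return fields;
        }
    }
    ```

    With this rule, `abc,bc"a,ca"b` splits into abc, bc"a and ca"b, and `"1",1/2/2010,"The sample ("adasdad") asdada","AK"` splits into four values with the inner quotes kept as they are.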

    Enjoy. ;^)

  • 2021-01-21 06:23

    I don't see how you could if each line is different. This line is malformed CSV. Quotes contained within a value must be doubled, as shown below. I can't even tell for sure where the values should be terminated.

    "1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"
    

    Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.
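
    For illustration only (this is not the code referred to above), here is a minimal sketch of parsing one well-formed line, where a doubled quote inside a quoted value stands for a single quote. `CsvLineSketch.ParseLine` is a made-up name, and the sketch assumes comma separators and no line breaks inside values.

    ```csharp
    using System.Collections.Generic;
    using System.Text;

    public static class CsvLineSketch
    {
        public static List<string> ParseLine(string line)
        {
            var fields = new List<string>();
            var current = new StringBuilder();
            bool inQuotes = false;

            for (int i = 0; i < line.Length; i++)
            {
                char c = line[i];
                if (inQuotes)
                {
                    if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                    {
                        // Doubled quote inside a quoted value: keep one quote.
                        current.Append('"');
                        i++;
                    }
                    else if (c == '"')
                    {
                        // Single quote: the quoted value ends here.
                        inQuotes = false;
                    }
                    else
                    {
                        current.Append(c);
                    }
                }
                else if (c == '"')
                {
                    inQuotes = true;
                }
                else if (c == ',')
                {
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    current.Append(c);
                }
            }

            fields.Add(current.ToString());
            return fields;
        }
    }
    ```

    Fed the corrected line shown above, this should return five values: 1, 1/2/2010, The sample ("adasdad") asdada, I was pooping in the door "Stinky", so I'll be damn, and AK.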

  • 2021-01-21 06:28

    I found a way to parse this malformed CSV: I looked for a pattern and found one. I first replace every "," (quote, comma, quote) with a placeholder character such as ¤, and then split on that character...

    from this:

    "Annoying","CSV File","poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"
    

    to this:

    "Annoying¤CSV File¤poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"
    

    then split it:

    ArrayA[0]: "Annoying // the stray quote is trimmed with Replace("\"", ""), same as for ArrayA[3]
    ArrayA[1]: CSV File
    ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
    ArrayA[3]: yeah!"
    

    After splitting it, I replace every ", and ," in ArrayA[2] with ¤ and then split it again

    from this

    ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
    

    to this

    ArrayA[2]: poop@mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby
    

    then split it again and would turn to this

    ArrayB[0]: poop@mypants.com
    ArrayB[1]: 1999,01-20-2001
    ArrayB[2]: oh,boy
    ArrayB[3]: 01-20-2001
    ArrayB[4]: yeah baby
    

    And lastly... I split ArrayB[1] on , to separate the year from the date, which gives ArrayC

    It's tedious but there's no other way to do it...
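
    Roughly, the steps above could be sketched like this. `PlaceholderSplitSketch.SplitSample` is a hypothetical name; the sketch is hard-wired to the column layout of the sample line and assumes ¤ never occurs in the data.

    ```csharp
    public static class PlaceholderSplitSketch
    {
        // The placeholder character (¤); assumed never to occur in the data.
        private const char Marker = '\u00A4';

        public static string[] SplitSample(string line)
        {
            string marker = Marker.ToString();

            // Step 1: replace every "," (quote, comma, quote) and split.
            string[] arrayA = line.Replace("\",\"", marker).Split(Marker);

            // Step 2: in the mixed chunk, replace ", and ," and split again.
            string[] arrayB = arrayA[2]
                .Replace("\",", marker)
                .Replace(",\"", marker)
                .Split(Marker);

            // Step 3: the year and the first date are still joined by a plain comma.
            string[] arrayC = arrayB[1].Split(',');

            // Reassemble in the original column order, trimming the stray
            // quotes left on the first and last values.
            return new[]
            {
                arrayA[0].Replace("\"", ""), arrayA[1], arrayB[0],
                arrayC[0], arrayC[1], arrayB[2], arrayB[3], arrayB[4],
                arrayA[3].Replace("\"", "")
            };
        }
    }
    ```

    For the sample line this yields nine values: Annoying, CSV File, poop@mypants.com, 1999, 01-20-2001, oh,boy, 01-20-2001, yeah baby and yeah!. Any row with a different mix of quoted and unquoted columns would need its own variant of steps 2 and 3.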

  • 2021-01-21 06:28

    You might want to give CsvReader a try. It handles quoted strings fine, so you will just have to remove the leading and trailing quotes.

    It will fail if your strings contain a comma. To avoid this, the quotes need to be doubled, as said in other answers.
