CSV Parsing

攒了一身酷 2020-12-17 04:45

I am trying to parse CSV in C#. I used a regular expression to find each "," separator, and I read the string only if the number of matches equals the number of headers.

Now this

13 Answers
  • 2020-12-17 05:38

    There's an oft-quoted saying:

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

    Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.

    Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:

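        // Requires: using System.Collections.Generic;
        public IEnumerable<string> SplitCSV(string line)
        {
            int index = 0;          // position of the current character
            int start = 0;          // position where the current field begins
            bool inString = false;  // true while inside a double-quoted value

            foreach (char c in line)
            {
                switch (c)
                {
                    case '"':
                        // Every double quote toggles the "inside a string" state.
                        inString = !inString;
                        break;

                    case ',':
                        // A comma only ends a field when it is outside quotes.
                        if (!inString)
                        {
                            yield return line.Substring(start, index - start);
                            start = index + 1;
                        }
                        break;
                }
                index++;
            }

            // Always emit the last field, so a trailing comma yields an empty field.
            yield return line.Substring(start, index - start);
        }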
    

    Standard caveat - untested code, there may be off-by-one errors.

    Limitations

    • The quotes around a value aren't removed automatically.
      To do this, check each value for enclosing quotes just before it is yielded.

    • Single quotes aren't supported in the same way as double quotes.
      You could add a separate boolean inSingleQuotedString, rename the existing boolean to inDoubleQuotedString, and treat both the same way. (You can't make the existing boolean do double duty, because the string has to end with the same quote that started it.)

    • Whitespace isn't removed automatically.
      Some tools add whitespace around the commas in CSV files to "prettify" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.
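
    The first and third limitations could be handled by a small post-processing step applied to each value the splitter returns. The following is only a sketch: the class name CsvFieldCleanup and the method Clean are hypothetical, and the handling of doubled quotes ("" standing for one literal quote) is an assumption about the input files, not something the answer above specifies.

    ```csharp
    using System;

    public static class CsvFieldCleanup
    {
        // Hypothetical helper: strips one pair of enclosing double quotes and,
        // if requested, the whitespace around the value.
        public static string Clean(string field, bool trim = false)
        {
            if (field == null) throw new ArgumentNullException(nameof(field));

            string s = trim ? field.Trim() : field;

            // Only unquote when the value both starts and ends with a quote.
            if (s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"')
            {
                s = s.Substring(1, s.Length - 2);
                // Assumption: a literal quote inside a quoted value is written as "".
                s = s.Replace("\"\"", "\"");
            }
            return s;
        }
    }
    ```

    Trimming is off by default because, as noted above, the code cannot tell intentional whitespace from formatting whitespace; the caller has to decide.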

  • 2020-12-17 05:40

    I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.

    Just for the sake of exercise, here is a quick and dirty C# procedure:

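    // Requires: using System; using System.Collections.Generic;
    //           using System.Linq; using System.Text;
    public static List<string> SplitCSV(string line)
    {
        if (string.IsNullOrEmpty(line))
            throw new ArgumentException("Line must not be null or empty.", nameof(line));

        List<string> result = new List<string>();

        bool inQuote = false;
        StringBuilder val = new StringBuilder();

        // Split on every comma, then glue back the pieces that belong
        // to the same quoted value.
        foreach (var t in line.Split(','))
        {
            // An odd number of quotes means this piece opens or closes a quoted value.
            bool oddQuotes = t.Count(c => c == '"') % 2 == 1;

            if (inQuote)
            {
                // Still inside a quoted value: restore the comma that Split removed.
                val.Append(',').Append(t);
                if (oddQuotes)
                {
                    inQuote = false;
                    result.Add(val.ToString());
                    val.Clear();
                }
            }
            else if (oddQuotes)
            {
                inQuote = true;
                val.Append(t);
            }
            else
            {
                result.Add(t);
            }
        }

        // An unterminated quote leaves a partial value; keep it rather than drop it.
        if (inQuote)
            result.Add(val.ToString());

        // Remove the enclosing quotes, but only from values that have them.
        for (int i = 0; i < result.Count; i++)
        {
            string t = result[i];
            if (t.Length >= 2 && t[0] == '"' && t[t.Length - 1] == '"')
                result[i] = t.Substring(1, t.Length - 2);
        }

        return result;
    }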
    
  • 2020-12-17 05:41

    CSV can get trickier than you might think once you deal with multi-line fields, quoted values, different delimiters*, and so on. Perhaps consider a pre-rolled answer? I use this, and it works very well.

    * Remember that some locales use [tab] as the C in CSV.
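
    To illustrate the delimiter problem, here is a rough sketch of how a reader might guess the delimiter from the first line of a file. It is a heuristic, not what any particular library does: the class name DelimiterSniffer, the method Guess, and the default candidate list (comma, semicolon, tab) are all assumptions made for this example.

    ```csharp
    using System;

    public static class DelimiterSniffer
    {
        // Hypothetical heuristic: pick the candidate that occurs most often
        // outside double quotes in the first line; fall back to ','.
        public static char Guess(string firstLine, params char[] candidates)
        {
            if (firstLine == null) throw new ArgumentNullException(nameof(firstLine));
            if (candidates == null || candidates.Length == 0)
                candidates = new[] { ',', ';', '\t' };

            var counts = new int[candidates.Length];
            bool inString = false;

            foreach (char c in firstLine)
            {
                // Quotes toggle the state; delimiters inside quotes do not count.
                if (c == '"') { inString = !inString; continue; }
                if (inString) continue;

                int i = Array.IndexOf(candidates, c);
                if (i >= 0) counts[i]++;
            }

            // Find the most frequent candidate (the first one wins a tie).
            int best = 0;
            for (int i = 1; i < counts.Length; i++)
                if (counts[i] > counts[best]) best = i;

            return counts[best] > 0 ? candidates[best] : ',';
        }
    }
    ```

    A guess like this can still be wrong on single-column files or free-text headers, which is one more reason a well-tested library is the safer choice.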

  • 2020-12-17 05:43

    FileHelpers supports multiline fields.

    You could parse files like these:

    a,"line 1
    line 2
    line 3"
    b,"line 1
    line 2
    line 3"
    

    Here is the datatype declaration:

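    // Requires: using FileHelpers;
    [DelimitedRecord(",")]
    public class MyRecord
    {
        public string field1;

        // Quotes are optional when reading, and a quoted value may span lines.
        [FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
        public string field2;
    }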
    

    Here is the usage:

    static void Main()
    {
        // The generic engine returns a typed array, so no cast is needed.
        var engine = new FileHelperEngine<MyRecord>();
        MyRecord[] res = engine.ReadFile("file.csv");
    }
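
    For readers curious about what multiline support involves, the sketch below groups physical lines into logical records by counting double quotes. It is not how FileHelpers is implemented; the class name CsvRecordReader and the method ReadRecords are hypothetical, and the code assumes that quotes only ever appear as field delimiters or as doubled quotes inside a quoted value.

    ```csharp
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;

    public static class CsvRecordReader
    {
        // Hypothetical helper: a record is complete once it contains an
        // even number of double quotes; until then, keep appending lines.
        public static IEnumerable<string> ReadRecords(TextReader reader)
        {
            var record = new StringBuilder();
            int quotes = 0;
            bool pending = false;
            string line;

            while ((line = reader.ReadLine()) != null)
            {
                // Restore the line break that ReadLine removed inside a quoted value.
                if (pending) record.Append('\n');
                record.Append(line);
                quotes += line.Count(c => c == '"');

                if (quotes % 2 == 0)
                {
                    yield return record.ToString();
                    record.Clear();
                    quotes = 0;
                    pending = false;
                }
                else
                {
                    pending = true;
                }
            }

            // Unterminated quote at end of input: return what was collected.
            if (pending) yield return record.ToString();
        }
    }
    ```

    Passing a StringReader over the sample file above should produce two records, each still containing its embedded line breaks; each record can then be handed to a field splitter.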
    
  • 2020-12-17 05:46

    See the link "Regex fun with CSV" at:

    http://snippets.dzone.com/posts/show/4430

  • 2020-12-17 05:47

    CSV is a great example of code reuse. No matter which CSV parser you choose, don't choose your own. See "Stop Rolling Your Own CSV Parser".
