Dealing with fields containing unescaped double quotes with TextFieldParser

佛祖请我去吃肉 2021-01-04 09:56

I am trying to import a CSV file using TextFieldParser. A particular CSV file is causing me problems due to its nonstandard formatting. The CSV in question has its fields

6 Answers
  • 2021-01-04 10:13

    It may be easier to just do this manually, and it would certainly give you more control:

    Edit: For your clarified example, I still suggest handling the parsing manually:

    using System;
    using System.IO;
    
    string[] csvFile = File.ReadAllLines(pathToCsv);
    foreach (string line in csvFile)
    {
        // Find the first comma in the line:
        // everything before it is the row number,
        // everything after it is the row value.
        int firstCommaIndex = line.IndexOf(',');
        if (firstCommaIndex < 0)
        {
            continue; // no comma on this line, nothing to split
        }
    
        // Note: this Substring overload takes (startIndex, length),
        // so a length of firstCommaIndex excludes the comma itself.
        string row = line.Substring(0, firstCommaIndex);
        string rowValue = line.Substring(firstCommaIndex + 1).Trim();
    
        Console.WriteLine("This line was parsed as:\n{0},{1}",
                row, rowValue);
    }
    

    For a generic CSV that does not allow commas in the fields:

    using System;
    using System.IO;
    
    string[] csvFile = File.ReadAllLines(pathToCsv);
    foreach (string line in csvFile)
    {
        // A plain split is only safe when no field contains a comma.
        string[] fields = line.Split(',');
        Console.WriteLine("This line was parsed as:\n{0},{1}",
                fields[0], fields[1]);
    }
    
  • 2021-01-04 10:23

    Working solution:

    using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
    {
        csvReader.SetDelimiters(new string[] { "," });
        // Treat quotes as ordinary characters so unescaped quotes do not trip the parser.
        csvReader.HasFieldsEnclosedInQuotes = false;
        // The first row holds the column names.
        string[] colFields = csvReader.ReadFields();
    
        while (!csvReader.EndOfData)
        {
            string[] fieldData = csvReader.ReadFields();
            for (int i = 0; i < fieldData.Length; i++)
            {
                if (fieldData[i] == "")
                {
                    fieldData[i] = null;
                }
                else if (fieldData[i].Length >= 2
                         && fieldData[i][0] == '"'
                         && fieldData[i][fieldData[i].Length - 1] == '"')
                {
                    // Strip the wrapping quotes by hand.
                    fieldData[i] = fieldData[i].Substring(1, fieldData[i].Length - 2);
                }
            }
            // csvData (presumably a DataTable) is declared outside this snippet.
            csvData.Rows.Add(fieldData);
        }
    }
    
  • 2021-01-04 10:25

    Set HasFieldsEnclosedInQuotes = true on the TextFieldParser object before you start reading the file.

  • 2021-01-04 10:26

    I agree with Hans Passant's advice that it is not your responsibility to parse malformed data. However, in keeping with the Robustness Principle, someone faced with this situation may still want to handle specific kinds of malformed data. The code below works on the data set given in the question. It catches the parser error on the malformed line, checks the first character to decide whether the line is wrapped in double quotes, and if so strips the wrapping quotes and splits the fields manually.

    using (TextFieldParser parser = new TextFieldParser(reader))
    {
        parser.Delimiters = new[] { "," };
    
        while (!parser.EndOfData)
        {
            string[] fields = null;
            try
            {
                fields = parser.ReadFields();
            }
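            catch (MalformedLineException)
            {
                // Fall back only when the whole line looks quote-wrapped.
                if (parser.ErrorLine.StartsWith("\""))
                {
                    // Drop the outer quotes, then split on the "," between fields.
                    var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                    fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
                }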
                else
                {
                    throw;
                }
            }
            Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
        }
    }
    

    I'm sure it is possible to concoct a pathological example where this fails (e.g. commas adjacent to double-quotes within a field value) but any such examples would probably be unparseable in the strictest sense, whereas the problem line given in the question is decipherable despite being malformed.
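
    To see what the fallback does on its own, it can be pulled out into a small function. This is only a sketch: the class name, the method name and the sample lines are made up for illustration, and the sample is not the line from the question.

    ```csharp
    using System;

    static class QuoteWrappedFallback
    {
        // Hypothetical helper mirroring the catch block: assumes the line
        // starts and ends with a double quote and that fields are joined by ",".
        public static string[] SplitWrappedLine(string line)
        {
            // Drop the first and the last character (the outer quotes).
            string inner = line.Substring(1, line.Length - 2);
            // Split on the three-character sequence quote-comma-quote.
            return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
        }
    }
    ```

    For the made-up line `"1","He said "hi" to me"` this returns the two fields `1` and `He said "hi" to me`, so quotes inside a value survive. If a value itself contains the sequence `","`, it is split into extra fields, which is the pathological case mentioned above.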

  • 2021-01-04 10:27

    If you don't set HasFieldsEnclosedInQuotes = true, the parser returns too many fields whenever a quoted value contains a comma. For example:

        "Col1","Col2","Col3"
        "Test1", 100, "Test1,Test2"
        "Test2", 200, "Test22"

    This file should have 3 columns, but without the setting the second row is parsed into 4 fields, which is wrong.
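
    The difference is easy to check with a short sketch. The class and method names here are hypothetical, the row is a simplified version of the sample (no spaces after the commas), and the project needs a reference to the Microsoft.VisualBasic assembly.

    ```csharp
    using System;
    using System.IO;
    using Microsoft.VisualBasic.FileIO;

    static class QuoteSettingDemo
    {
        // Hypothetical helper: parses the first row of `text` with the
        // given HasFieldsEnclosedInQuotes setting and returns its fields.
        public static string[] ReadFirstRow(string text, bool fieldsInQuotes)
        {
            using (var parser = new TextFieldParser(new StringReader(text)))
            {
                parser.TextFieldType = FieldType.Delimited;
                parser.SetDelimiters(",");
                parser.HasFieldsEnclosedInQuotes = fieldsInQuotes;
                return parser.ReadFields();
            }
        }
    }
    ```

    Calling `ReadFirstRow("\"Test1\",100,\"Test1,Test2\"", true)` gives 3 fields, while passing `false` gives 4, because the comma inside the last value is then treated as a delimiter.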

  • 2021-01-04 10:29

    Jordan's solution is quite good, but it makes an incorrect assumption that the error line will always begin with a double-quote. My error line was this:

    170,"CMS ALT",853,,,NON_MOVEX,COM,NULL,"2014-04-25",""  204 Route de Trays"
    

    Notice the last field had extra/unescaped double quotes, but the first field was fine. So Jordan's solution didn't work. Here is my modified solution based on Jordan's:

    using (TextFieldParser parser = new TextFieldParser(new StringReader(csv))) {
        parser.Delimiters = new[] { "," };
    
        while (!parser.EndOfData) {
            string[] fields = null;
            try {
                fields = parser.ReadFields();
            } catch (MalformedLineException) {
                // SafeTrim is a helper that is not shown in this answer.
                string errorLine = SafeTrim(parser.ErrorLine);
                // Plain split: quotes stay in the values, and a comma inside quotes still splits.
                fields = errorLine.Split(',');
            }
        }
    }
    

    You may want to handle the catch block differently, but the general concept works great for me.
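
    One possible way to handle the catch block differently is a lenient splitter that only treats a double quote as the end of a quoted field when a comma or the end of the line follows it. This is a sketch, not part of the solution above: `LenientCsv` and `SplitLine` are hypothetical names, and the rule is an assumption that suits lines like my error line rather than a general CSV parser.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    static class LenientCsv
    {
        // Hypothetical helper: splits one line on commas, honouring quoted
        // fields but keeping stray quotes inside them as literal text.
        public static string[] SplitLine(string line)
        {
            var fields = new List<string>();
            var current = new StringBuilder();
            bool inQuotes = false;

            for (int i = 0; i < line.Length; i++)
            {
                char c = line[i];
                if (inQuotes)
                {
                    // A quote closes the field only before a comma or at end of line.
                    bool closes = c == '"'
                        && (i + 1 == line.Length || line[i + 1] == ',');
                    if (closes) inQuotes = false;
                    else current.Append(c);
                }
                else if (c == '"' && current.Length == 0)
                {
                    inQuotes = true; // opening quote at the start of a field
                }
                else if (c == ',')
                {
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    current.Append(c);
                }
            }

            fields.Add(current.ToString()); // last field has no trailing comma
            return fields.ToArray();
        }
    }
    ```

    In the catch block this would be used as `fields = LenientCsv.SplitLine(parser.ErrorLine);`. On my error line it returns 10 fields, with `CMS ALT` unquoted and the last field kept as `"  204 Route de Trays` (the stray quote stays in the value). A quoted value that really contains a quote followed by a comma is still ambiguous and will be split there.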
