How to programmatically guess whether a CSV file is comma or semicolon delimited

滥情空心 2021-01-04 08:29

In most cases, CSV files are text files with fields delimited by commas. However, sometimes these files are semicolon-delimited instead. (Excel will use semicolon delimiter

5 Answers
  • 2021-01-04 09:06

    Depending on what you are working with, if you are guaranteed to have a header row, your approach of trying both delimiters could be the best overall practice. Once you have determined the delimiter from the header, any row further down that doesn't have the required number of columns tells you that the format isn't correct.

    Typically I would expect this to be a user-specified option on upload rather than a programmatic test.
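
    A minimal Python sketch of the header-then-validate idea, assuming the whole file fits in memory as a string. The function name detect_by_header and the candidate list are hypothetical, chosen only for illustration:

```python
import csv
import io


def detect_by_header(text, candidates=(",", ";")):
    """Guess the delimiter from the header row, then check every later row.

    Hypothetical helper: raises ValueError if no candidate splits the header
    or if a later row has a different number of columns.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")

    def header_fields(delimiter):
        # Parse only the header row with the given delimiter.
        return len(next(csv.reader([lines[0]], delimiter=delimiter), []))

    # Try every candidate on the header and keep the one giving the most fields.
    best = max(candidates, key=header_fields)
    expected = header_fields(best)
    if expected < 2:
        raise ValueError("header does not split on any candidate delimiter")

    # A later row with a different column count means the guess (or the file) is wrong.
    for number, row in enumerate(csv.reader(io.StringIO(text), delimiter=best), start=1):
        if row and len(row) != expected:
            raise ValueError(f"row {number}: {len(row)} columns, expected {expected}")
    return best
```

    If two candidates tie on the header, this sketch keeps whichever comes first in candidates, so put the delimiter you consider more likely first.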

  • 2021-01-04 09:07

    Let's say you have the following in your csv:

    title,url,date,copyright,hdurl,explanation,media_type,service_version
    

    then you can use Python's built-in csv module as follows:

    import csv
    data = "title,url,date,copyright,hdurl,explanation,media_type,service_version"
    sn = csv.Sniffer()
    delimiter = sn.sniff(data).delimiter
    

    The variable delimiter now holds ',', which is the delimiter used here. You can test this with other delimiters.
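
    Since the question is only about commas versus semicolons, the candidates can be restricted through the delimiters argument of sniff. The following is a sketch rather than a definitive recipe; the wrapper name sniff_delimiter and the sample text are made up for illustration:

```python
import csv


def sniff_delimiter(sample, candidates=",;"):
    """Return the delimiter csv.Sniffer infers from sample, or None.

    Hypothetical wrapper: only the characters in candidates are considered.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer raises csv.Error when it cannot determine a delimiter.
        return None


# Made-up sample with a header and one data row.
sample = "title;url;date\nA;http://example.com/a;2021-01-04\n"
print(sniff_delimiter(sample))
```

    For a real file you would pass the first few kilobytes of its text as the sample instead of a literal string.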

  • 2021-01-04 09:13

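    If you know the name of the first column, you can read the first line and look at the character that immediately follows that name:

    import java.io.BufferedReader;
    import java.io.FileReader;

    // Assumes filePath and firstColumnName are supplied by the caller and that
    // the header is not quoted (otherwise the next character would be a quote).
    char delimiter;
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String header = reader.readLine();
        // Position just past the end of the first column name.
        int end = header.indexOf(firstColumnName) + firstColumnName.length();
        delimiter = header.charAt(end);
    }

    The captured character is the delimiter, ',' or ';' depending on how the CSV was written, and you can use it to parse the rest of the file.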

  • 2021-01-04 09:25

    If every row should have the same number of columns, which I believe is the case with Excel, then count the columns on lines N and N+1 using both commas and semicolons. Whichever delimiter produces different counts for the two lines is wrong (it is not the format of the file). You can start at the beginning and only need to continue until one of them is proven incorrect. You don't need header lines, and you don't have to read more of the file than necessary. As long as the every-row-has-the-same-number-of-columns property holds (and you count only delimiters outside quoted fields), this cannot give you a wrong answer for the format of the file; it just might reach the end without having come to a conclusion.
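
    The elimination idea can be sketched in Python roughly as follows. The name eliminate_delimiters is hypothetical, and the counting is deliberately naive: it does not handle delimiters inside quoted fields, which a real implementation would need to do.

```python
def eliminate_delimiters(lines, candidates=(",", ";")):
    """Return the candidates whose column count was identical on every line read.

    Naive sketch: counts raw delimiter characters, ignoring CSV quoting.
    """
    alive = {d: None for d in candidates}  # delimiter -> column count seen so far
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        for d in list(alive):
            count = line.count(d) + 1
            if alive[d] is None:
                alive[d] = count
            elif alive[d] != count:
                # Column counts differ between rows: this delimiter is wrong.
                del alive[d]
        if len(alive) <= 1:
            break  # stop reading as soon as at most one candidate is left
    return list(alive)


# Decimal commas make the comma count vary, so only ';' survives.
print(eliminate_delimiters(["name;price", "foo;1,5", "bar;2"]))
```

    When more than one candidate is returned, the file ended before either was ruled out, which is the inconclusive case described above.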

  • 2021-01-04 09:27

    This is my code (it does no validation on the text). Perhaps it could help or serve as a base :-)

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using MoreLinq; // http://stackoverflow.com/questions/15265588/how-to-find-item-with-max-value-using-linq
    
    namespace HQ.Util.General.CSV
    {
        public class CsvHelper
        {
            public static Dictionary<LineSeparator, Func<string, string[]>>  DictionaryOfLineSeparatorAndItsFunc = new Dictionary<LineSeparator, Func<string, string[]>>();
    
            static CsvHelper()
            {
                DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Unknown] = ParseLineNotSeparated;
                DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Tab] = ParseLineTabSeparated;
                DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Semicolon] = ParseLineSemicolonSeparated;
                DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Comma] = ParseLineCommaSeparated;
            }
    
            // ******************************************************************
            public enum LineSeparator
            {
                Unknown = 0,
                Tab,
                Semicolon,
                Comma
            }
    
            // ******************************************************************
            public static LineSeparator GuessCsvSeparator(string oneLine)
            {
                // Count how many values each candidate separator yields on the given line.
                var separatorCounts = new List<Tuple<LineSeparator, int>>
                {
                    Tuple.Create(LineSeparator.Tab, ParseLineTabSeparated(oneLine).Length),
                    Tuple.Create(LineSeparator.Semicolon, ParseLineSemicolonSeparated(oneLine).Length),
                    Tuple.Create(LineSeparator.Comma, ParseLineCommaSeparated(oneLine).Length)
                };
    
                // The separator that yields the most values is the best guess.
                Tuple<LineSeparator, int> bestBet = separatorCounts.MaxBy((n) => n.Item2);
    
                if (bestBet != null && bestBet.Item2 > 1)
                {
                    return bestBet.Item1;
                }
    
                return LineSeparator.Unknown;
            }
    
            // ******************************************************************
            public static string[] ParseLineCommaSeparated(string line)
            {
                // CSV line parsing : From "jgr4" in http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
                var matches = Regex.Matches(line, @"\s?((?<x>(?=[,]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^,]+)),?",
                                            RegexOptions.ExplicitCapture);
    
                string[] values = (from Match m in matches
                                   select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();
    
                return values;
            }
    
            // ******************************************************************
            public static string[] ParseLineTabSeparated(string line)
            {
                var matchesTab = Regex.Matches(line, @"\s?((?<x>(?=[\t]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^\t]+))\t?",
                                RegexOptions.ExplicitCapture);
    
                string[] values = (from Match m in matchesTab
                                    select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();
    
                return values;
            }
    
            // ******************************************************************
            public static string[] ParseLineSemicolonSeparated(string line)
            {
                // CSV line parsing : From "jgr4" in http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
                var matches = Regex.Matches(line, @"\s?((?<x>(?=[;]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^;]+));?",
                                            RegexOptions.ExplicitCapture);
    
                string[] values = (from Match m in matches
                                   select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();
    
                return values;
            }
    
            // ******************************************************************
            public static string[] ParseLineNotSeparated(string line)
            {
                string [] lineValues = new string[1];
                lineValues[0] = line;
                return lineValues;
            }
    
            // ******************************************************************
            public static List<string[]> ParseText(string text)
            {
                // Accept both Windows (\r\n) and Unix (\n) line endings.
                string[] lines = text.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
                return ParseString(lines);
            }
    
            // ******************************************************************
            public static List<string[]> ParseString(string[] lines)
            {
                List<string[]> result = new List<string[]>();
    
                LineSeparator lineSeparator = LineSeparator.Unknown;
                if (lines.Any())
                {
                    lineSeparator = GuessCsvSeparator(lines[0]);
                }
    
                Func<string, string[]> funcParse = DictionaryOfLineSeparatorAndItsFunc[lineSeparator];
    
                foreach (string line in lines)
                {
                    if (string.IsNullOrWhiteSpace(line))
                    {
                        continue;
                    }
    
                    result.Add(funcParse(line));
                }
    
                return result;
            }
    
            // ******************************************************************
        }
    }
    