How to check if a CSV has a header using Python?

再見小時候 2021-01-17 22:54

I have a CSV file and I want to check if the first row has only strings in it (i.e. a header). I'm trying to avoid using any extras like pandas etc. I'm thinking I'll use a

5 answers
  • 2021-01-17 23:28

    Here is a function I use with pandas to decide whether header should be set to 'infer' or None:

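    import pandas as pd

    def identify_header(path, n=5, th=0.9):
        df1 = pd.read_csv(path, header='infer', nrows=n)
        df2 = pd.read_csv(path, header=None, nrows=n)
        # Fraction of columns whose inferred dtype is the same in both reads.
        sim = (df1.dtypes.values == df2.dtypes.values).mean()
        # Low similarity means row 0 changed the dtypes, so it is likely a header.
        return 'infer' if sim < th else None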
    

    Based on a small sample of n rows, the function compares the column dtypes inferred with and without a header row. If the dtypes match for at least the fraction th of the columns, it assumes that no header is present. I found a threshold of 0.9 to work well for my use cases. The function is also fairly fast, as it only reads a small sample of the CSV file.

  • 2021-01-17 23:30

    Python has a built-in csv module that could help, e.g.:

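    import csv
    # Open in text mode: in Python 3, csv.Sniffer needs a str sample, not bytes.
    with open('example.csv', newline='') as csvfile:
        sniffer = csv.Sniffer()
        has_header = sniffer.has_header(csvfile.read(2048))
        csvfile.seek(0)  # rewind so later reads start at the first row
        # ...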
    
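To try the Sniffer approach without a file on disk, the same call can be made on an in-memory string. This is only a small sketch: the helper name `sample_has_header` and the sample data are made up for illustration, and `has_header` is a heuristic, so its answer on real data is a guess rather than a guarantee.

```python
import csv

def sample_has_header(sample: str) -> bool:
    """Return csv.Sniffer's guess about whether `sample` starts with a header row."""
    return csv.Sniffer().has_header(sample)

# Illustrative data (hypothetical): a text column and a numeric column.
with_header = "name,age\nalice,30\nbob,25\ncarol,41\n"
without_header = "alice,30\nbob,25\ncarol,41\n"

print(sample_has_header(with_header))
print(sample_has_header(without_header))
```

The heuristic compares the first row against the rows below it, so it needs at least one column whose data rows share a consistent type or length. With only one or two rows of data, or with all-text columns of varying length, the guess is much less reliable.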
  • 2021-01-17 23:45

    I'd do something like this:

    is_header = not any(cell.isdigit() for cell in csv_table[0])
    

    Given a CSV table csv_table, grab the top (zeroth) row. Iterate through the cells and check whether any of them is a pure digit string. If so, the row is not a header. Negate that with a not in front of the whole expression.

    Results:

    In [1]: not any(cell.isdigit() for cell in ['2','1'])
    Out[1]: False
    
    In [2]: not any(cell.isdigit() for cell in ['2','gravy'])
    Out[2]: False
    
    In [3]: not any(cell.isdigit() for cell in ['gravy','gravy'])
    Out[3]: True
    
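One caveat with the `isdigit()` check is that strings such as '-3' or '1.5' are not pure digit strings, so a first row of negative or decimal numbers would still be reported as a header. A possible variation, sketched here with the hypothetical helper names `is_number` and `looks_like_header`, is to test whether each cell parses as a float instead:

```python
def is_number(cell: str) -> bool:
    """True if `cell` parses as a float (covers '42', '-3', '1.5', '1e3')."""
    # Caveat: float() also accepts 'nan' and 'inf', so such cells count as numeric.
    try:
        float(cell)
    except ValueError:
        return False
    return True

def looks_like_header(row) -> bool:
    """Treat a row as a header only if it is non-empty and no cell is numeric."""
    return bool(row) and not any(is_number(cell) for cell in row)

# The isdigit() version misses the decimal, the float-based version does not.
row = ['-1.5', 'gravy']
isdigit_result = not any(cell.isdigit() for cell in row)
float_result = looks_like_header(row)
print(isdigit_result, float_result)
```

Like the original one-liner, this only inspects the first row, so a data row made entirely of text would still be mistaken for a header.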
  • 2021-01-17 23:46

    I ran into exactly the same problem, with sniffer.has_header returning the wrong result, and even wrote a very simple checker that worked in my case:

        has_header = ''.join(next(some_csv_reader)).isalpha()
    

    I knew it wasn't perfect, but it seemed to work: it was just a simple check of whether the joined first row was alphabetic or not. Then I put it into my method and it failed. Eventually I saw the "light".
    The trouble was not with has_header but with my own code: I also wanted to detect the delimiter before parsing the actual .csv, and all the sniffing has a "cost" because each read advances the position in the CSV file.
    So for has_header to work as it should, make sure you have reset the file position before using it. In my case the method is:

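    # requires `import csv` at module level
    def _get_data(self, filename):
        sniffer = csv.Sniffer()
        with open(filename, 'rt', newline='') as csvfile:
            # Sniff the delimiter from a sample, then rewind.
            dialect = sniffer.sniff(csvfile.read(2048))
            csvfile.seek(0)
            # Check for a header on the same sample, then rewind again.
            has_header = sniffer.has_header(csvfile.read(2048))
            csvfile.seek(0)
            # The reader now starts at the first row of the file.
            training_data = csv.reader(csvfile, delimiter=dialect.delimiter)
            #has_header = ''.join(next(training_data)).isalpha()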
    
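The rewind-before-reading idea can be sketched end to end as follows. This is an illustration rather than the poster's full method: `read_csv_rows`, its `sample_size` parameter, and the temporary file contents are assumptions. The sample is read once and reused for both sniffing calls, so a single `seek(0)` is enough.

```python
import csv
import os
import tempfile

def read_csv_rows(path, sample_size=2048):
    """Return (header, rows); header is None when the Sniffer detects none."""
    with open(path, newline='') as csvfile:
        sample = csvfile.read(sample_size)       # advances the file position
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample)          # guess delimiter and quoting
        has_header = sniffer.has_header(sample)  # heuristic guess
        csvfile.seek(0)                          # rewind before real parsing
        reader = csv.reader(csvfile, dialect)
        header = next(reader) if has_header else None
        return header, list(reader)

# Usage with a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='', delete=False) as tmp:
    tmp.write("name,age\nalice,30\nbob,25\ncarol,41\n")
header, rows = read_csv_rows(tmp.name)
os.remove(tmp.name)
print(header, rows)
```

Without the `seek(0)`, the reader would start wherever the sample read stopped, which for a small file means it would see no rows at all.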
  • 2021-01-17 23:46

    I think the best way to check this is simply to read the first line of the file and match it against the string you expect, instead of using any library.

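A minimal sketch of that idea, assuming you already know the column names to expect, the file is comma-delimited, and the header contains no quoted fields. The function name `first_line_matches` and the sample columns are hypothetical; the only imports are for creating the throwaway demo file.

```python
import os
import tempfile

def normalize(cells):
    """Trim whitespace and lowercase each cell so the comparison is forgiving."""
    return [cell.strip().lower() for cell in cells]

def first_line_matches(path, expected_columns):
    """True if the first line, split on commas, equals `expected_columns`."""
    with open(path, encoding='utf-8') as f:
        first_line = f.readline().rstrip('\r\n')
    if not first_line:  # empty file or blank first line
        return False
    return normalize(first_line.split(',')) == normalize(expected_columns)

# Usage with a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("Name, Age\nalice,30\n")
matches = first_line_matches(tmp.name, ['name', 'age'])
mismatch = first_line_matches(tmp.name, ['id', 'age'])
os.remove(tmp.name)
print(matches, mismatch)
```

This only works when the expected header is known in advance; for files with unknown columns, one of the heuristic checks from the other answers is needed.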