Autodetect Presence of CSV Headers in a File

心在旅途 2020-12-25 13:23

Short question: How do I automatically detect whether a CSV file has headers in the first row?

Details: I've written a small CSV parsing engine th

6 answers
  • 2020-12-25 13:29

    In the most general sense, this is impossible. This is a valid CSV file:
    Name
    Jim
    Tom
    Bill

    Most CSV readers just take hasHeader as an option and let you pass in your own header if you want. Even in the one case you might think you can detect, text headers above numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
    M
    3
    5
    7

    You will process this incorrectly. Worst of all, you will lose the best car!
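    A minimal sketch of that explicit-option design in Python, using only the standard csv module. The function name read_rows, its has_header and header parameters, and the col0, col1, ... fallback names are illustrative assumptions, not any particular library's API.

```python
import csv
import io


def read_rows(text, has_header, header=None):
    """Split CSV text into (column_names, data_rows).

    The caller states whether row 1 is a header; nothing is guessed.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if has_header and rows:
        return rows[0], rows[1:]
    # No header row in the file: use the caller's names, or generate some.
    width = len(rows[0]) if rows else 0
    names = list(header) if header else [f"col{i}" for i in range(width)]
    return names, rows


# The BMW column is read intact once the caller says there is no header.
names, data = read_rows("M\n3\n5\n7\n", has_header=False, header=["series"])
```

    Because the decision is handed to the caller, the "M" row stays in the data instead of being swallowed as a column name.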

  • 2020-12-25 13:29

    If your CSV has a header like this:

    ID, Name, Email, Date
    1, john, john@john.com, 12 jan 2020

    then running filter_var($str, FILTER_VALIDATE_EMAIL) on each field of the header row will fail, since the email address appears only in the data rows. So check the first row for an email address (assuming your CSV has email addresses in it): if one validates, the row is data, not a header.

    Second idea: check the first row with is_numeric (http://php.net/manual/en/function.is-numeric.php). A header row most likely contains no numeric fields, while a data row most likely contains at least one.

    If you know you have dates in your columns, then checking the first row for a date would also work.

    Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
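    The same checks can be sketched in Python for readers not using PHP. This is a rough analogue, not a port: EMAIL_RE is a deliberately loose pattern chosen for illustration, and is_numeric and first_row_looks_like_data are hypothetical helper names.

```python
import re

# Deliberately loose pattern; it is not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_numeric(value):
    """Rough stand-in for PHP's is_numeric(), built on float().

    Note that float() also accepts strings such as "nan" and "inf".
    """
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_looks_like_data(fields):
    """True if any field in the first row is a number or an email address."""
    cleaned = [f.strip() for f in fields]
    return any(is_numeric(f) or EMAIL_RE.match(f) is not None for f in cleaned)


header_row = "ID, Name, Email, Date".split(",")
data_row = "1, john, john@john.com, 12 jan 2020".split(",")
header_is_data = first_row_looks_like_data(header_row)
data_is_data = first_row_looks_like_data(data_row)
```

    A True result only says the first row looks like data. A False result is weaker evidence, since an all-text data row passes the same checks.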

  • 2020-12-25 13:30

    In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?" There will always be the potential for dataA to be indistinguishable from dataB.

    That said, I would start with the simple and only add complexity as needed. For example, examine the first five rows: if, for a given column (or columns), the data types in rows 2-5 are all the same but differ from the data type in row 1, there's a good chance that a header row is present (larger samples reduce the possibility of error). This would (sorta) solve #1/#3. Perhaps throw an exception if the rows are all populated but the data is indistinguishable, so the calling program can decide what to do next.

    For #2, simply don't count a row as a row unless and until it yields non-null data. That would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
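    A sketch of that first-five-rows comparison in Python, under some assumptions of mine: cells are classified only as number or text, empty cells are ignored, and the names kind, detect_header and AmbiguousHeaderError are hypothetical.

```python
class AmbiguousHeaderError(Exception):
    """Raised when the sampled rows give no evidence of a header."""


def kind(cell):
    """Classify a cell as 'empty', 'number', or 'text'."""
    cell = cell.strip()
    if not cell:
        return "empty"
    try:
        float(cell)
        return "number"
    except ValueError:
        return "text"


def detect_header(rows, sample=5):
    """Return True if row 1 differs in type from the rows sampled after it.

    Raises AmbiguousHeaderError when no column separates row 1 from the rest.
    """
    # Do not count a row until it carries some non-empty data.
    rows = [r for r in rows if any(c.strip() for c in r)]
    if len(rows) < 2:
        raise AmbiguousHeaderError("need at least two populated rows")
    first, rest = rows[0], rows[1:sample]
    for col, cell in enumerate(first):
        if kind(cell) == "empty":
            continue
        kinds = {kind(r[col]) for r in rest if col < len(r)} - {"empty"}
        # The sampled rows agree on one type that row 1 does not share.
        if len(kinds) == 1 and kind(cell) not in kinds:
            return True
    raise AmbiguousHeaderError("first row is indistinguishable from the rest")
```

    With a text row above numeric rows, detect_header returns True; with a single column of names it raises, leaving the decision to the caller. A column reading M, 3, 5, 7 also returns True, which is exactly the kind of false positive such a heuristic cannot rule out.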

  • 2020-12-25 13:31

    As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however: for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:

    • The first row has columns that are not strings or are empty
    • The first row's columns are not all unique
    • The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)
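
    The three heuristics above can be sketched as one Python function that returns the reasons it found. The function name, the reason strings, and the DATE_LIKE pattern (digits separated by -, / or .) are my own assumptions about what "common data formats" might mean.

```python
import re

# Matches shapes such as 12-25-20 or 2020/12/25; an illustrative pattern only.
DATE_LIKE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")


def _is_number(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False


def reasons_first_row_is_not_header(first_row):
    """List the heuristics suggesting that the first row is data."""
    cells = [c.strip() for c in first_row]
    reasons = []
    if any(c == "" for c in cells):
        reasons.append("empty cell")
    if any(_is_number(c) for c in cells):
        reasons.append("numeric cell")
    if len(set(cells)) != len(cells):
        reasons.append("duplicate cells")
    if any(DATE_LIKE.match(c) for c in cells):
        reasons.append("date-like cell")
    return reasons
```

    An empty list means none of the heuristics fired, which is not proof of a header; returning the reasons rather than a bare boolean lets the caller weigh them.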
  • 2020-12-25 13:43

    This article provides some good guidance:

    http://penndsg.com/blog/detect-headers/

    Basically, you do statistical analysis on the columns, based on whether the first row contains a string and the rest of the rows numbers, or something like that.
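
    Python's standard library ships a heuristic in a similar spirit: csv.Sniffer().has_header() compares the first row against the numeric or string-length pattern of the rows that follow and votes column by column. It is not the method from the linked article, and the sample data below is made up for illustration.

```python
import csv

# Two small made-up samples: one with a header row, one without.
with_header = "name,age\nJim,31\nTom,45\nBill,27\n"
without_header = "Jim,31\nTom,45\nBill,27\nAnn,52\n"

sniffer = csv.Sniffer()
first_guess = sniffer.has_header(with_header)
second_guess = sniffer.has_header(without_header)
print(first_guess, second_guess)
```

    Like any such heuristic it can guess wrong, for instance on all-text files, so it is best treated as a default that the user can override.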

  • 2020-12-25 13:51

    It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).

    But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.

    FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing (Is this a header? Which columns are important?). So no automation, but it lets me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.
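
    A sketch of that ask-once workflow in Python. It is not the script described above: the JSON answers file, the ask_has_header name, and the injectable ask parameter are all assumptions made for illustration.

```python
import json
from pathlib import Path


def ask_has_header(data_path, answers_file, ask=input):
    """Ask whether data_path starts with a header row, once per file.

    Answers are cached in a JSON file keyed by the data file's path.
    """
    store = Path(answers_file)
    answers = json.loads(store.read_text()) if store.exists() else {}
    key = str(data_path)
    if key not in answers:
        reply = ask(f"Is the first row of {key} a header? [y/n] ")
        answers[key] = reply.strip().lower().startswith("y")
        # Persist immediately so the question is never repeated for this file.
        store.write_text(json.dumps(answers, indent=2))
    return answers[key]
```

    Passing ask as a parameter keeps the prompt replaceable, so the function can be exercised without a terminal.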
