Ignoring comma in field of CSV file with awk

星月不相逢 2021-01-14 05:32

I'm trying to get a number from the second field of the last row of a CSV file. So far, I have this:

awk -F"," 'END {print $2}' /file/path/fileName.csv
         


        
2 Answers
  • 2021-01-14 06:16

    I think your requirement is a perfect use case for FPAT in GNU Awk.

    Quoting as-is from the GNU Awk documentation:

    Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

    The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.

    In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

    FPAT = "([^,]+)|(\"[^\"]+\")"
    

    Applying that to your input file:

    awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file
    "Company Name, LLC"
    
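To make the two alternatives in that pattern concrete, here is a rough C sketch that applies the same rule by hand: at each position it tries "a run of non-commas" and "a double-quoted string" and keeps the longer match. This is not how gawk is implemented; it assumes leftmost-longest matching, and `fpat_field`, `match_unquoted`, and `match_quoted` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching [^,]+ (0 if none). */
static size_t match_unquoted(const char *s) {
    return strcspn(s, ",");
}

/* Length of the longest prefix of s matching "[^"]+" (0 if none). */
static size_t match_quoted(const char *s) {
    size_t inner;
    if (s[0] != '"') return 0;
    inner = strcspn(s + 1, "\"");          /* characters up to the next quote */
    if (inner == 0 || s[1 + inner] != '"') return 0;
    return inner + 2;                      /* include both quote characters */
}

/* Hypothetical helper: copy the n-th field (1-based) of record into out,
 * where a field is whatever the pattern ([^,]+)|("[^"]+") matches.
 * Returns 1 on success, 0 if there are fewer than n fields or out is
 * too small. Quotes are kept, as in the output shown above. */
int fpat_field(const char *record, int n, char *out, size_t outsz) {
    const char *p = record;

    assert(record != NULL && out != NULL && n >= 1);
    while (*p != '\0') {
        size_t u = match_unquoted(p);
        size_t q = match_quoted(p);
        size_t len = (q > u) ? q : u;      /* keep the longer alternative */

        if (len == 0) {                    /* a comma: skip it as a separator */
            p++;
            continue;
        }
        if (--n == 0) {
            if (len + 1 > outsz) return 0;
            memcpy(out, p, len);
            out[len] = '\0';
            return 1;
        }
        p += len;
    }
    return 0;
}
```

Two consequences of the `+` quantifiers show up in this sketch: an empty field between two commas matches neither alternative, so it is not counted as a field, and doubled quotes inside a quoted field are not handled.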
  • 2021-01-14 06:28

    There is no general answer to this question, since regular expressions aren't powerful enough (in the general case) to parse CSV. My solution is a C program that preprocesses the input with a finite state machine; its output can then be fed to Awk:

    /* NAME
     *
     *     csv -- convert comma-separated values file to character-delimited
     *
     *
     * SYNOPSIS
     *
     *     csv [-Cc] [-Fc] [filename ...]
     *
     *
     * DESCRIPTION
     *
     *     Csv reads from standard input or from one or more files named on
     *     the command line a sequence of records in comma-separated values
     *     format and writes on standard output the same records in character-
     *     delimited format.  Csv returns 0 on success, 1 for option errors,
     *     and 2 if any file couldn't be opened.
     *
     *     The comma-separated values format has developed over time as a
     *     set of conventions that has never been formally defined, and some
     *     implementations are in conflict about some of the details.  In
     *     general, the comma-separated values format is used by databases,
     *     spreadsheets, and other programs that need to write data consisting
     *     of records containing fields.  The data is written as ascii text,
     *     with records terminated by newlines and fields containing zero or
     *     more characters separated by commas.  Leading and trailing space in
     *     unquoted fields is preserved.  Fields may be surrounded by double-
     *     quote characters (ascii \042); such fields may contain newlines,
     *     literal commas (ascii \054), and double-quote characters
     *     represented as two successive double-quotes.  The examples shown
     *     below clarify many irregular situations that may arise.
     *
     *     The field separator is normally a comma, but can be changed to an
     *     arbitrary character c with the command line option -Cc.  This is
     *     useful in those European countries that use a comma instead of a
     *     decimal point, where the field separator is normally changed to a
     *     semicolon.
     *
     *     Character-delimited format has records terminated by newlines and
     *     fields separated by a single character, which is \034 by default
     *     but may be changed with the -Fc option on the command line.
     *
     *
     * EXAMPLE
     *
     *     Each record below has five fields.  For readability, the three-
     *     character sequence TAB represents a single tab character (ascii
     *     \011).
     *
     *         $ cat testdata.csv
     *         1,abc,def ghi,jkl,unquoted character strings
     *         2,"abc","def ghi","jkl",quoted character strings
     *         3,123,456,789,numbers
     *         4,   abc,def   ,   ghi   ,strings with whitespace
     *         5,   "abc","def"   ,   "ghi"   ,quoted strings with whitespace
     *         6,   123,456   ,   789   ,numbers with whitespace
     *         7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
     *         8,   -123,   +456,   1E3,more numbers with whitespace
     *         9,123 456,123"456,  123 456   ,strange numbers
     *         10,abc",de"f,g"hi,embedded quotes
     *         11,"abc""","de""f","g""hi",quoted embedded quotes
     *         12,"","" "",""x"",doubled quotes
     *         13,"abc"def,abc"def","abc" "def",strange quotes
     *         14,,"",   ,empty fields
     *         15,abc,"def
     *         ghi",jkl,embedded newline
     *         16,abc,"def",789,multiple types of fields
     *
     *         $ csv -F'|' testdata.csv
     *         1|abc|def ghi|jkl|unquoted character strings
     *         2|abc|def ghi|jkl|quoted character strings
     *         3|123|456|789|numbers
     *         4|   abc|def   |   ghi   |strings with whitespace
     *         5|   "abc"|def   |   "ghi"   |quoted strings with whitespace
     *         6|   123|456   |   789   |numbers with whitespace
     *         7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
     *         8|   -123|   +456|   1E3|more numbers with whitespace
     *         9|123 456|123"456|  123 456   |strange numbers
     *         10|abc"|de"f|g"hi|embedded quotes
     *         11|abc"|de"f|g"hi|quoted embedded quotes
     *         12|| ""|x""|doubled quotes
     *         13|abcdef|abc"def"|abc "def"|strange quotes
     *         14|||   |empty fields
     *         15|abc|def
     *         ghi|jkl|embedded newline
     *         16|abc|def|789|multiple types of fields
     *
     *     It is particularly easy to pipe the output from csv into any of
     *     the unix tools that accept character-delimited fielded text data
     *     files, such as sort, join, or cut.  For example:
     *
     *         csv datafile.csv | awk -F'\034' -f program.awk
     *
     *
     * BUGS
     *
     *     On DOS, Windows, and OS/2 systems, processing of each file stops
     *     at the first appearance of the ascii \032 (control-Z) end of file
     *     character.
     *
     *     Because newlines embedded in quoted fields are treated literally,
     *     a missing closing quote can suck up all remaining input.
     *
     *
     * LICENSE
     *
     *     This program was written by Philip L. Bewig of Saint Louis,
     *     Missouri, United States of America on February 28, 2002 and
     *     placed in the public domain.
     */
    
    #include <stdio.h>
    #include <stdlib.h>  /* malloc, exit */
    #include <string.h>  /* strlen, strcpy */
    
    /* dofile -- convert one file from comma-separated to delimited */
    void dofile(char ofs, char fs, FILE *f) {
        int c; /* current input character */
    
        START:
            c = fgetc(f);
            if (c == EOF)  {                     return; }
            if (c == '\r') {                     goto CARRIAGE_RETURN; }
            if (c == '\n') {                     goto LINE_FEED; }
            if (c == '\"') {                     goto QUOTED_FIELD; }
            if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
            /* default */  { putchar(c);         goto UNQUOTED_FIELD; }
    
        NOT_FIELD:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\r') {                     goto CARRIAGE_RETURN; }
            if (c == '\n') {                     goto LINE_FEED; }
            if (c == '\"') {                     goto QUOTED_FIELD; }
            if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
            /* default */  { putchar(c);         goto UNQUOTED_FIELD; }
    
        QUOTED_FIELD:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\"') {                     goto MAY_BE_DOUBLED_QUOTES; }
            /* default */  { putchar(c);         goto QUOTED_FIELD; }
    
        MAY_BE_DOUBLED_QUOTES:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\r') {                     goto CARRIAGE_RETURN; }
            if (c == '\n') {                     goto LINE_FEED; }
            if (c == '\"') { putchar('\"');      goto QUOTED_FIELD; }
            if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
            /* default */  { putchar(c);         goto UNQUOTED_FIELD; }
    
        UNQUOTED_FIELD:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\r') {                     goto CARRIAGE_RETURN; }
            if (c == '\n') {                     goto LINE_FEED; }
            if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
            /* default */  { putchar(c);         goto UNQUOTED_FIELD; }
    
        CARRIAGE_RETURN:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\r') { putchar('\n');      goto CARRIAGE_RETURN; }
            if (c == '\n') { putchar('\n');      goto START; }
            if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
            if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
            /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }
    
        LINE_FEED:
            c = fgetc(f);
            if (c == EOF)  { putchar('\n');      return; }
            if (c == '\r') { putchar('\n');      goto START; }
            if (c == '\n') { putchar('\n');      goto LINE_FEED; }
            if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
            if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
            /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }
    }
    
    /* main -- process command line, call appropriate conversion */
    int main(int argc, char *argv[]) {
        char ofs = '\034'; /* output field separator */
        char fs = ',';     /* input field separator */
        int  status = 0;   /* error status for return to operating system */
        char *progname;    /* name of program for error messages */
    
        FILE *f;
        int i;
    
        progname = (char *) malloc(strlen(argv[0])+1);
        strcpy(progname, argv[0]);
    
        while (argc > 1 && argv[1][0] == '-') {
            switch (argv[1][1]) {
                case 'c':
                case 'C':
                    fs = argv[1][2];
                    break;
                case 'f':
                case 'F':
                    ofs = argv[1][2];
                    break;
                default:
                    fprintf(stderr, "%s: unknown argument %s\n",
                        progname, argv[1]);
                    fprintf(stderr,
                       "usage: %s [-Cc] [-Fc] [filename ...]\n",
                        progname);
                    exit(1);
            }
            argc--;
            argv++;
        }
    
        if (argc == 1)
            dofile(ofs, fs, stdin);
        else
            for (i = 1; i < argc; i++)
                if ((f = fopen(argv[i], "r")) == NULL) {
                    fprintf(stderr, "%s: can't open %s\n",
                        progname, argv[i]);
                    status = 2;
                } else {
                    dofile(ofs, fs, f);
                    fclose(f);
                }
    
        exit(status);
    }
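If only one field of one record is needed, as in the question, the same finite-state idea can be condensed into a single function. The sketch below is an illustration, not part of the program above: `csv_field` and the state names are hypothetical, the separator is fixed to a comma, and the caller is assumed to pass one complete record without its trailing newline.

```c
#include <assert.h>
#include <stddef.h>

enum csv_state { FIELD_START, UNQUOTED, QUOTED, QUOTE_SEEN };

/* Hypothetical helper: copy field n (1-based) of one CSV record into out,
 * dropping the surrounding quotes and turning "" inside a quoted field
 * into a single ". Returns 1 on success, 0 if the record has fewer than
 * n fields or out is too small. */
int csv_field(const char *record, int n, char *out, size_t outsz) {
    enum csv_state st = FIELD_START;
    int field = 1;            /* number of the field being scanned */
    size_t len = 0;           /* characters copied to out so far */
    const char *p;

    assert(record != NULL && out != NULL && n >= 1 && outsz > 0);
    for (p = record; *p != '\0' && field <= n; p++) {
        char c = *p;
        int emit = 0;         /* 1 if c is field data */

        switch (st) {
        case FIELD_START:
            if (c == '"')      { st = QUOTED; }
            else if (c == ',') { field++; }
            else               { st = UNQUOTED; emit = 1; }
            break;
        case UNQUOTED:        /* quotes are literal here */
            if (c == ',')      { st = FIELD_START; field++; }
            else               { emit = 1; }
            break;
        case QUOTED:          /* commas are data here */
            if (c == '"')      { st = QUOTE_SEEN; }
            else               { emit = 1; }
            break;
        case QUOTE_SEEN:      /* closing quote, or first half of "" */
            if (c == '"')      { st = QUOTED; emit = 1; }
            else if (c == ',') { st = FIELD_START; field++; }
            else               { st = UNQUOTED; emit = 1; }
            break;
        }
        if (emit && field == n) {
            if (len + 1 >= outsz) return 0;   /* keep room for '\0' */
            out[len++] = c;
        }
    }
    out[len] = '\0';
    return field >= n;
}
```

To get the second field of the last row, a caller could read the file record by record, keep the most recent one, and call `csv_field(last, 2, buf, sizeof buf)` once the input is exhausted.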
    