Ignoring comma in field of CSV file with awk

I'm trying to get a number from the second field of the last row of a CSV file. So far, I have this:

awk -F"," 'END {print $2}' /file/path/fileName.csv

2 answers

南旧, 2021-01-14 06:16

I think your requirement is a perfect use case for FPAT in GNU Awk.

Quoting as-is from the GNU Awk manual:

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

The most notorious such case is so-called comma-separated values (CSV) data.  If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.

In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"


Using that on your input file:

awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file
"Company Name, LLC"

野性不改, 2021-01-14 06:28

There is no general answer to this question, since regular expressions aren't powerful enough (in the general case) to parse CSV. My solution is a C program that preprocesses the input using a finite state machine; its output can then be fed to Awk:

/* NAME
 *
 *     csv -- convert comma-separated values file to character-delimited
 *
 *
 * SYNOPSIS
 *
 *     csv [-Cc] [-Fc] [filename ...]
 *
 *
 * DESCRIPTION
 *
 *     Csv reads from standard input or from one or more files named on
 *     the command line a sequence of records in comma-separated values
 *     format and writes on standard output the same records in character-
 *     delimited format.  Csv returns 0 on success, 1 for option errors,
 *     and 2 if any file couldn't be opened.
 *
 *     The comma-separated values format has developed over time as a
 *     set of conventions that has never been formally defined, and some
 *     implementations are in conflict about some of the details.  In
 *     general, the comma-separated values format is used by databases,
 *     spreadsheets, and other programs that need to write data consisting
 *     of records containing fields.  The data is written as ascii text,
 *     with records terminated by newlines and fields containing zero or
 *     more characters separated by commas.  Leading and trailing space in
 *     unquoted fields is preserved.  Fields may be surrounded by double-
 *     quote characters (ascii \042); such fields may contain newlines,
 *     literal commas (ascii \054), and double-quote characters
 *     represented as two successive double-quotes.  The examples shown
 *     below clarify many irregular situations that may arise.
 *
 *     The field separator is normally a comma, but can be changed to an
 *     arbitrary character c with the command line option -Cc.  This is
 *     useful in those european countries that use a comma instead of a
 *     decimal point, where the field separator is normally changed to a
 *     semicolon.
 *
 *     Character-delimited format has records terminated by newlines and
 *     fields separated by a single character, which is \034 by default
 *     but may be changed with the -Fc option on the command line.
 *
 *
 * EXAMPLE
 *
 *     Each record below has five fields.  For readability, the three-
 *     character sequence TAB represents a single tab character (ascii
 *     \011).
 *
 *         $ cat testdata.csv
 *         1,abc,def ghi,jkl,unquoted character strings
 *         2,"abc","def ghi","jkl",quoted character strings
 *         3,123,456,789,numbers
 *         4,   abc,def   ,   ghi   ,strings with whitespace
 *         5,   "abc","def"   ,   "ghi"   ,quoted strings with whitespace
 *         6,   123,456   ,   789   ,numbers with whitespace
 *         7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
 *         8,   -123,   +456,   1E3,more numbers with whitespace
 *         9,123 456,123"456,  123 456   ,strange numbers
 *         10,abc",de"f,g"hi,embedded quotes
 *         11,"abc""","de""f","g""hi",quoted embedded quotes
 *         12,"","" "",""x"",doubled quotes
 *         13,"abc"def,abc"def","abc" "def",strange quotes
 *         14,,"",   ,empty fields
 *         15,abc,"def
 *         ghi",jkl,embedded newline
 *         16,abc,"def",789,multiple types of fields
 *
 *         $ csv -F'|' testdata.csv
 *         1|abc|def ghi|jkl|unquoted character strings
 *         2|abc|def ghi|jkl|quoted character strings
 *         3|123|456|789|numbers
 *         4|   abc|def   |   ghi   |strings with whitespace
 *         5|   "abc"|def   |   "ghi"   |quoted strings with whitespace
 *         6|   123|456   |   789   |numbers with whitespace
 *         7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
 *         8|   -123|   +456|   1E3|more numbers with whitespace
 *         9|123 456|123"456|  123 456   |strange numbers
 *         10|abc"|de"f|g"hi|embedded quotes
 *         11|abc"|de"f|g"hi|quoted embedded quotes
 *         12|| ""|x""|doubled quotes
 *         13|abcdef|abc"def"|abc "def"|strange quotes
 *         14|||   |empty fields
 *         15|abc|def
 *         ghi|jkl|embedded newline
 *         16|abc|def|789|multiple types of fields
 *
 *     It is particularly easy to pipe the output from csv into any of
 *     the unix tools that accept character-delimited fielded text data
 *     files, such as sort, join, or cut.  For example:
 *
 *         csv datafile.csv | awk -F'\034' -f program.awk
 *
 *
 * BUGS
 *
 *     On DOS, Windows, and OS/2 systems, processing of each file stops
 *     at the first appearance of the ascii \032 (control-Z) end of file
 *     character.
 *
 *     Because newlines embedded in quoted fields are treated literally,
 *     a missing closing quote can suck up all remaining input.
 *
 *
 * LICENSE
 *
 *     This program was written by Philip L. Bewig of Saint Louis,
 *     Missouri, United States of America on February 28, 2002 and
 *     placed in the public domain.
 */

#include <stdio.h>
#include <stdlib.h>  /* malloc, exit */
#include <string.h>  /* strlen, strcpy */

/* dofile -- convert one file from comma-separated to delimited */
void dofile(char ofs, char fs, FILE *f) {
    int c; /* current input character */

    START:
        c = fgetc(f);
        if (c == EOF)  {                     return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') {                     goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    NOT_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') {                     goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    QUOTED_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\"') {                     goto MAY_BE_DOUBLED_QUOTES; }
        /* default */  { putchar(c);         goto QUOTED_FIELD; }

    MAY_BE_DOUBLED_QUOTES:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') { putchar('\"');      goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    UNQUOTED_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    CARRIAGE_RETURN:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') { putchar('\n');      goto CARRIAGE_RETURN; }
        if (c == '\n') { putchar('\n');      goto START; }
        if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
        if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
        /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }

    LINE_FEED:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') { putchar('\n');      goto START; }
        if (c == '\n') { putchar('\n');      goto LINE_FEED; }
        if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
        if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
        /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }
}

/* main -- process command line, call appropriate conversion */
int main(int argc, char *argv[]) {
    char ofs = '\034'; /* output field separator */
    char fs = ',';     /* input field separator */
    int  status = 0;   /* error status for return to operating system */
    char *progname;    /* name of program for error messages */

    FILE *f;
    int i;

    progname = (char *) malloc(strlen(argv[0])+1);
    strcpy(progname, argv[0]);

    while (argc > 1 && argv[1][0] == '-') {
        switch (argv[1][1]) {
            case 'c':
            case 'C':
                fs = argv[1][2];
                break;
            case 'f':
            case 'F':
                ofs = argv[1][2];
                break;
            default:
                fprintf(stderr, "%s: unknown argument %s\n",
                    progname, argv[1]);
                fprintf(stderr,
                   "usage: %s [-Cc] [-Fc] [filename ...]\n",
                    progname);
                exit(1);
        }
        argc--;
        argv++;
    }

    if (argc == 1)
        dofile(ofs, fs, stdin);
    else
        for (i = 1; i < argc; i++)
            if ((f = fopen(argv[i], "r")) == NULL) {
                fprintf(stderr, "%s: can't open %s\n",
                    progname, argv[i]);
                status = 2;
            } else {
                dofile(ofs, fs, f);
                fclose(f);
            }

    exit(status);
}
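
Applied to the original question, the converted output can be split on the \034 delimiter, so quoted commas no longer matter. The sketch below is only an illustration under stated assumptions: it assumes the program above has been compiled to a binary named `csv`, the function name `second_of_last` is hypothetical, and the two records are made-up data that simulate what `csv` would emit, so the example runs without the binary.

```shell
# Print field 2 of the last record of \034-delimited input.
second_of_last() {
    awk -F'\034' '{ last = $2 } END { print last }' "$@"
}

# Simulate two records of converted output (hypothetical data).
# In real use this would be: csv /file/path/fileName.csv | second_of_last
sep=$(printf '\034')
printf '%s\n' "Other Co${sep}7" "Company Name, LLC${sep}42" | second_of_last
```

This should print 42. Saving the field in a variable on each record, rather than reading `$2` inside END, avoids relying on the awk implementation keeping the last record available in the END rule.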