Extract specific columns from delimited file using Awk

-上瘾入骨i 2020-11-28 05:44

Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F \

8 answers
  • 2020-11-28 06:13

    As mentioned by @Tom, the cut and awk approaches don't work for CSVs with quoted strings. An alternative is a Python module that provides the command-line tool csvfilter. It works like cut, but properly handles CSV column quoting:

    csvfilter -f 1,3,5 in.csv > out.csv
    

    If you have Python (and you should), you can install it like this:

    pip install csvfilter
    

    Note that column indexing in csvfilter starts at 0 (unlike awk, where fields start at $1), so -f 1,3,5 above selects the second, fourth and sixth columns. More info at https://github.com/codeinthehole/csvfilter/
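
To see why plain comma splitting breaks on quoted fields, here is a small sketch. The sample data and the helper name `count_fields` are made up for illustration; only standard awk and cut are used.

```shell
# Hypothetical sample: the name field contains a comma inside quotes,
# so the data row has 3 logical columns.
sample='id,name,city
1,"Doe, Jane",Paris'

# Naive comma splitting: print the number of fields awk sees per line.
count_fields() {
    awk -F, '{ print NF }'
}

# The header row yields 3, but the data row yields 4,
# because the quoted comma is treated as a delimiter.
printf '%s\n' "$sample" | count_fields

# cut has the same problem: on the data row, field 3 is ' Jane"', not 'Paris'.
printf '%s\n' "$sample" | cut -d, -f3
```

A CSV-aware tool treats the quoted text as a single field, which is why it is the safer choice once quotes can appear in the data.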

  • 2020-11-28 06:16

    Others have answered your earlier question. For this:

    As an addendum, is there any way to extract directly with the header names rather than with column numbers?

    I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.

    # Run this on the header row. Fields are 1-based and NF is the field count.
    for (i = 1; i <= NF; i++) {
        hash[$i] = i;
    }
    

    Then later on, use it:

    j = hash["header1"];
    print $j;
    
  • 2020-11-28 06:20

    Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:

    name,sex,house_nr,height,shoe_size
    arthur,m,42,181,11.5
    berta,f,101,163,8.5
    chris,m,1333,175,10
    don,m,77,185,12.5
    elisa,f,204,166,7
    

    Then tblmap -k name,height test.csv produces

    name,height
    arthur,181
    berta,163
    chris,175
    don,185
    elisa,166
    
  • 2020-11-28 06:25

    I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:

    awk -F, -v OFS=, '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv
    

    Something else to consider, which is faster and more concise:

    cut -d "," -f1-10,20-25,30,33 infile.csv > outfile.csv
    

    As to the second part of your question, I would probably write a Perl script that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure about doing it in a one-liner, although I am sure it can be done.
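
The for-loop approach mentioned at the start of this answer can be sketched in awk by expanding a cut-style range list once in a BEGIN block. This is only an illustration that assumes plain comma-delimited input with no quoted fields; `cut_ranges`, `spec`, and `cols` are hypothetical names.

```shell
# Print the fields named by a cut-style list such as "1-10,20-25,30,33".
# Reads from stdin, writes to stdout.
cut_ranges() {
    awk -F, -v OFS=, -v spec="$1" '
        BEGIN {
            # Expand each item ("7" or "20-25") into individual field numbers.
            m = split(spec, parts, ",")
            for (p = 1; p <= m; p++) {
                lo = hi = parts[p]
                if (split(parts[p], r, "-") == 2) { lo = r[1]; hi = r[2] }
                for (f = lo + 0; f <= hi + 0; f++) cols[++n] = f
            }
        }
        {
            # Print the selected fields, comma-separated, one record per line.
            for (k = 1; k <= n; k++)
                printf "%s%s", $(cols[k]), (k < n ? OFS : ORS)
        }
    '
}

# Example on made-up input: keep fields 1-2, 4 and 6.
printf 'a,b,c,d,e,f\n' | cut_ranges 1-2,4,6
```

For the columns in this question the call would be `cut_ranges 1-10,20-25,30,33 < infile.csv > outfile.csv`. Unlike cut, this prints fields in the order listed rather than in file order.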

  • 2020-11-28 06:25

    This doesn't use awk, but the simplest way I found to get this done was csvtool. I had other use cases for csvtool as well, and it handles quotes and delimiters correctly when they appear within the column data itself.

    csvtool format '%(2)\n' input.csv
    csvtool format '%(2),%(3),%(4)\n' input.csv
    

    Replacing 2 with the column number will effectively extract the column data you are looking for.

  • 2020-11-28 06:34

    If Perl is an option:

    perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'

    -a autosplits line into @F fields array. Indices start at 0 (not 1 as in awk)
    -F, field separator is ,

    If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

    perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'

    I provided more explanation within my answer here: parse csv file using gawk
