awk: separate rows into “$2 are the same and max - min value <= 1” and “$2 are the same and max - min value > 1”

礼貌的吻别 2021-01-28 16:07

If we have an input file: input.csv

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_         


        
1 Answer
  • 2021-01-28 16:38

    Take a look at this, which is an example of processing each group as its id changes:

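    #!/usr/bin/awk -f
    
    # Comma-separated input; f1 and f2 are the two output file names.
    BEGIN {FS=","; f1="a"; f2="b"}
    
    # Copy the header line to both output files.
    FNR==1 { print $0 > f1; print $0 > f2; next }
    
    # The $2 id changed: write out the group collected so far.
    $2!=last_id && FNR > 2 { handleBlock() }
    
    # Buffer the whole line in a, its $17 value in m, and remember the id.
    { a[++cnt]=$0; m[cnt]=$17; last_id=$2 }
    
    # Write out the last group at the end of the input.
    END { handleBlock() }
    
    # m[1] is taken as the group's max and m[cnt] as its min.
    function handleBlock() {
      if( m[1]-m[cnt]<=1 ) fname = f1
      else fname = f2
      for( i=1;i<=cnt;i++ ) { print a[i] > fname }
      cnt=0
    }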
    

    It's an executable awk file. If you put it into a file called awko and run chmod +x awko, it can be run like ./awko data (or awko data if it is on your PATH) for an input file called "data".
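
    As a rough end-to-end check, the steps above can be tried on a tiny made-up file. Everything about the data here is an assumption for illustration: the row helper, the filler columns, and the ids 100 and 200 are hypothetical, and only $2 and $17 matter. The run uses awk -f so that it does not depend on awk living at /usr/bin/awk; after chmod +x awko, ./awko data should be equivalent.

```shell
# Save the script from the answer, unchanged, as ./awko.
cat > awko <<'EOF'
#!/usr/bin/awk -f
BEGIN {FS=","; f1="a"; f2="b"}
FNR==1 { print $0 > f1; print $0 > f2; next }
$2!=last_id && FNR > 2 { handleBlock() }
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }
END { handleBlock() }
function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}
EOF
chmod +x awko

# Hypothetical helper: build a 17-column row with the id in $2 and the value in $17.
row() { printf 'c,%s,x,x,x,x,x,x,x,x,x,x,x,x,x,x,%s\n' "$1" "$2"; }
{
  row id value                 # header line
  row 100 5.9; row 100 5.2     # max first, min last: 0.7 <= 1, goes to file a
  row 200 7.0; row 200 4.0     # 3.0 > 1, goes to file b
} > data

awk -f awko data               # same as ./awko data when the shebang path is valid

# Show only the id and value columns of each output file.
cut -d, -f2,17 a
cut -d, -f2,17 b
```

    Both output files start with the header line; a then holds the two rows for id 100 and b the two rows for id 200.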

    The script I wrote for the other question assumed that the order of the input rows was unknown: the $2 fields could appear in any order, and only the min and max values mattered. In this question, the OP would like to send all rows that share a $2 value to one file or the other, based on that group's min/max values.

    The input file for this question has the following properties which this script is dependent on:

    • The header is on the first line
    • The $2 fields are grouped
    • The max value is the first element of the group
    • The min value is the last value of the group
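
    If the last two properties cannot be guaranteed, one option is to track the min and max of each group explicitly while buffering it. This is only a sketch under assumptions that are not in the question: a made-up 3-column sample, the value column passed in as col (the real file would presumably use col=17), and hypothetical output names within1.csv and over1.csv. It still needs the $2 fields to be grouped.

```shell
# Hypothetical sample: id in $2, value in $3; values are NOT ordered within a group.
cat > sample.csv <<'EOF'
cpdID,cpd_number,value
x1,100,5.2
x2,100,5.9
x3,100,5.5
y1,200,4.0
y2,200,7.0
y3,200,4.5
EOF

awk -F, -v col=3 -v f1=within1.csv -v f2=over1.csv '
  # Write the buffered group to f1 if max - min <= 1, otherwise to f2.
  function flush_group(   i, out) {
    if (cnt == 0) return
    out = (max - min <= 1) ? f1 : f2
    for (i = 1; i <= cnt; i++) print rows[i] > out
    cnt = 0
  }
  FNR == 1 { print $0 > f1; print $0 > f2; next }   # header goes to both files
  $2 != last_id { flush_group() }                   # id changed: flush previous group
  {
    v = $col + 0                                    # force a numeric comparison
    if (cnt == 0 || v > max) max = v
    if (cnt == 0 || v < min) min = v
    rows[++cnt] = $0
    last_id = $2
  }
  END { flush_group() }
' sample.csv
```

    With this sample, the 100 group (spread 0.7) lands in within1.csv and the 200 group (spread 3.0) lands in over1.csv, even though the first and last values of the 200 group differ by only 0.5.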

    Where there's a resource list that's sorted by the resource id, one common algorithm for minimally loading the data is to only load it when the resource id changes. The same can be done for processing grouped entries here. Take an example like:

    a
    a
    a
    b <- this is a good place to process all the prior "a" entries
    b
    c <- process "b" entries here
    c
    EOF <- the end of the file.  process the last group ( the "c" entries here )
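
    The same idea can be sketched in a few lines of awk. This is a minimal illustration rather than part of the script: the file names ids.txt and counts.txt are made up, and it only counts the entries in each group.

```shell
# Hypothetical demo input: the grouped ids from the example above.
printf '%s\n' a a a b b c c > ids.txt

# Process the previous group whenever the id changes, and once more at EOF.
awk '
  NR > 1 && $1 != last { print last, n; n = 0 }   # id changed: report prior group
  { n++; last = $1 }                              # count the current entry
  END { if (n) print last, n }                    # the last group has no following change
' ids.txt > counts.txt

cat counts.txt
```

    This prints "a 3", "b 2" and "c 2", one per line. Without the END rule the "c" group would never be reported.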
    

    With that in mind, here's a breakdown of the script:

    • Set the FS and some output file names in the BEGIN block ( "a" and "b" for my testing )
    • The first line is the header - put it in each file, f1 and f2.
    • If $2 != last_id ( and we are past the first data row, FNR > 2 ), call the handleBlock() function to process the group collected so far.
    • Store the whole line in array a, $17 in array m and set last_id=$2 ( the array names are horrible ).
    • The cnt variable indicates how many entries are in each group ( what I called a block )
    • handleBlock() will only get called when the $2 id changes or at the end of the file to catch the last group in the END block.
    • handleBlock() tests the OP's condition using m ( max is m[1] and min is m[cnt] ) to determine the output file name and then prints all elements from a to the chosen filename.