If we have an input file: input.csv
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_
Take a look at this, which is an example of processing each group as its id changes:
#!/usr/bin/awk -f
BEGIN { FS=","; f1="a"; f2="b" }
FNR==1 { print $0 > f1; print $0 > f2; next }    # copy the header to both files
$2!=last_id && FNR > 2 { handleBlock() }         # id changed: process the previous group
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }          # buffer the row and its $17 value
END { handleBlock() }                            # process the last group
function handleBlock() {
    if( m[1]-m[cnt]<=1 ) fname = f1
    else fname = f2
    for( i=1;i<=cnt;i++ ) { print a[i] > fname }
    cnt=0
}
It's an executable awk file. When put into a file called awko
and made executable with chmod +x awko,
it can be run like ./awko data
for an input file called "data".
The script I wrote for the other question assumed that the order of the rows in the input file was unknown: the $2
fields could be in any order and only the min and max values mattered. In this question, the OP would like to send all rows related to a $2
field to one file or another based on the min/max values.
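For contrast, when the rows for an id can appear anywhere in the file, the min and max have to be tracked per id across the whole input. The following is only a sketch of that idea, not the script from the other question. It assumes a hypothetical two-column layout (id in $1, value in $2) instead of the real $2/$17 columns, and the sample rows are made up.

```shell
# Hypothetical input: "id,value" rows in no particular order.
out=$(printf '7,5\n9,2\n7,1\n9,8\n7,3\n' | awk -F, '
    # First time an id is seen: its value is both the min and the max.
    !($1 in min) { min[$1] = max[$1] = $2 + 0; next }
    # Later rows for the same id only widen the range.
    $2 + 0 < min[$1] { min[$1] = $2 + 0 }
    $2 + 0 > max[$1] { max[$1] = $2 + 0 }
    # Array traversal order is unspecified in awk, so sort the report.
    END { for (id in min) print id, min[id], max[id] }
' | sort)
echo "$out"
```

This needs the whole file to be read before any decision can be made, which is what the grouped input in this question avoids.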
The input file for this question has the following properties, which this script depends on:

- the $2 fields are grouped

Where there's a resource list that's sorted by the resource id, one common algorithm for minimally loading the data is to only load it when the resource id changes. The same can be done for processing grouped entries here. Take an example like:
a
a
a
b <- this is a good place to process all the prior "a" entries
b
c <- process "b" entries here
c
EOF <- the end of the file. process the last group ( the "c" entries here )
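The boundary detection in the example above can be sketched on its own. This is a minimal illustration rather than part of the script: the printf input mirrors the a/b/c lines, the key is taken from $1 rather than $2, and the emit() function name is made up for the example.

```shell
# Hypothetical grouped input: one key per line, matching the example above.
out=$(printf 'a\na\na\nb\nb\nc\nc\n' | awk '
    # Key changed (and this is not the first line): the previous group is complete.
    NR > 1 && $1 != last { emit() }
    # Count the current line and remember its key.
    { n++; last = $1 }
    # End of input: the final group has no following key to trigger it.
    END { emit() }
    function emit() { print last, n; n = 0 }
')
echo "$out"
```

Each group is reported exactly once, either when the next key shows up or at the end of the input, so only one group has to be held in memory at a time.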
With that in mind, here's a breakdown of the script:
- Set FS and some output file names in the BEGIN block ( "a" and "b" for my testing ).
- Print the header line to f1 and f2.
- When $2 != last_id, a group has ended; call the handleBlock() function to process it.
- Store each row in array a, $17 in array m, and set last_id=$2 ( the array names are horrible ).
- The cnt variable indicates how many entries are in each group ( what I called a block ).
- handleBlock() will only get called when the $2 id changes or at the end of the file, to catch the last group in the END block.
- handleBlock() tests the OP's condition using m ( max is m[1] and min is m[cnt] ) to determine the output file name and then prints all elements from a to the chosen filename.
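To see the whole flow on something small, here is a reduced sketch of the same logic. It is run against made-up data and assumes a hypothetical layout with the id in $1 and the value in $2 (the real script uses $2 and $17), values in descending order within each id (which is what treating m[1] as the max and m[cnt] as the min implies), and invented file names toy.csv, close.csv and far.csv.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'id,val\n1,9\n1,8.5\n2,7\n2,3\n3,4\n' > toy.csv

awk -F, -v f1=close.csv -v f2=far.csv '
    FNR == 1 { print > f1; print > f2; next }      # copy the header to both outputs
    FNR > 2 && $1 != last_id { handleBlock() }     # id changed: previous group is complete
    { row[++cnt] = $0; val[cnt] = $2; last_id = $1 }
    END { handleBlock() }                          # final group
    function handleBlock(   i, fname) {
        # val[1] is the group max and val[cnt] the min, given the descending order
        fname = (val[1] - val[cnt] <= 1) ? f1 : f2
        for (i = 1; i <= cnt; i++) print row[i] > fname
        cnt = 0
    }
' toy.csv

close_rows=$(cat close.csv)
far_rows=$(cat far.csv)
```

With this input, ids 1 (spread 0.5) and 3 (a single row, spread 0) end up in close.csv, while id 2 (spread 4) goes to far.csv; both files start with the header line.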