Split CSV to Multiple Files Containing a Set Number of Unique Field Values

Submitted by 蓝咒 on 2019-12-13 16:30:26

Question


As a beginner in awk, I am able to split the data by unique first-column value with

awk -F, '{print >> ($1".csv"); close($1".csv")}' myfile.csv
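For reference, a small demonstration of that per-value split (a sketch on a three-line input; note that close() must be given the same name the file was opened under, i.e. $1".csv" rather than $1):

```shell
#!/bin/sh
# Sketch: the question's per-value split on a tiny input.
# close() gets the same string the file was opened under.
rm -f 111.csv 222.csv
cat > myfile.csv <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
EOF

awk -F, '{print >> ($1".csv"); close($1".csv")}' myfile.csv
```

This produces 111.csv with two lines and 222.csv with one line.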

But I would like to split a large CSV file based on an additional condition: the number of unique values in a specific column.

Specifically, with input

111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

I would like the output files to be

111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

and

444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

each of which contains three (in this case) unique values in the first column: 111, 222, 333 and 444, 555, 666, respectively. Any help would be appreciated.


Answer 1:


This will do the trick and I find it pretty readable and easy to understand:

awk -F',' 'BEGIN { count=0; filename=1 }
            x[$1]++==0 { count++ }
            count==4 { count=1; filename++ }
            { print >> (filename".csv"); close(filename".csv") }' file

We start with our count at 0 and our filename at 1. We then count each unique value we get from the first column, and whenever it's the 4th one, we reset our count to 1 and move on to the next filename.
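The same idea works with the group size as a parameter instead of the hard-coded 4 (a sketch; the variable name n and the !seen[$1]++ spelling of the uniqueness test are my choices, everything else follows the answer):

```shell
#!/bin/sh
# Sketch: parameterize the group size with -v n.
rm -f 1.csv 2.csv
cat > input.csv <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
EOF

awk -F',' -v n=3 'BEGIN { count = 0; filename = 1 }
    !seen[$1]++ { count++ }                  # first time we meet this key
    count > n   { count = 1; filename++ }    # (n+1)th unique key: next file
    { print >> (filename".csv"); close(filename".csv") }' input.csv
```

With n=3 on the question's input this yields 1.csv with five lines (keys 111, 222, 333) and 2.csv with four lines (keys 444, 555, 666).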

Here's some sample data I used, which is just yours with some additional lines.

~$ cat test.txt
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
101,0,0,0
102,0,0,0

And running the awk like so:

~$ awk -F',' 'BEGIN { count=0; filename=1 }
            x[$1]++==0 { count++ }
            count==4 { count=1; filename++ }
            { print >> (filename".csv"); close(filename".csv") }' test.txt

We see the following output files and content:

~$ cat 1.csv
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

~$ cat 2.csv
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

~$ cat 3.csv
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1

~$ cat 4.csv
101,0,0,0
102,0,0,0



Answer 2:


This one-liner would help:

awk -F, -v u=3 -v i=1 '{a[$1];
   if (length(a)>u){close(i".csv");++i;delete a;a[$1]}print > (i".csv")}' file

Change the u=3 value to x to get x unique values per file.

If you run this line with your input file, you should get 1.csv and 2.csv.
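One caveat: length() applied to an array is a gawk extension, not guaranteed in every awk. A variant that tracks the unique-key count explicitly behaves the same (a sketch; the counter c is my addition, the rest mirrors the one-liner), shown here with u=2 on the first seven input lines:

```shell
#!/bin/sh
# Sketch: same rollover logic without length() on an array;
# c counts unique keys seen in the current group.
rm -f 1.csv 2.csv
cat > file <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
EOF

awk -F, -v u=2 -v i=1 '!($1 in a){a[$1]; c++}
   c>u {close(i".csv"); ++i; delete a; a[$1]; c=1}
   {print > (i".csv")}' file
```

With u=2, keys 111 and 222 land in 1.csv (three lines) and keys 333 and 444 in 2.csv (four lines).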

Edit (add some test output):

kent$  ll
total 4.0K
drwxr-xr-x  2 kent kent  60 Mar 25 18:19 ./
drwxrwxrwt 19 root root 580 Mar 25 18:18 ../
-rw-r--r--  1 kent kent  90 Mar 25 17:57 f

kent$  cat f
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

kent$  awk -F, -v u=3 -v i=1 '{fn=i".csv";a[$1];if (length(a)>u){close(fn);++i;delete a;a[$1]}print>fn}' f  

kent$  head *.csv
==> 1.csv <==
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

==> 2.csv <==
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0


Source: https://stackoverflow.com/questions/29261265/split-csv-to-multiple-files-containing-a-set-number-of-unique-field-values
