Question
As a beginner with awk,
I am able to split the data by unique value with
awk -F, '{print >> ($1".csv"); close($1".csv")}' myfile.csv
but I would like to split a large CSV file based on an additional condition: the number of unique values seen in a specific column.
Specifically, with input
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
I would like the output files to be
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
and
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
each of which contains three (in this case) unique values in the first column: 111, 222, 333
and 444, 555, 666
respectively.
Any help would be appreciated.
Answer 1:
This will do the trick and I find it pretty readable and easy to understand:
awk -F',' 'BEGIN { count=0; filename=1 }
x[$1]++==0 { count++ }
count==4 { count=1; filename++ }
{ print >> (filename ".csv"); close(filename ".csv") }' file
We start with our count at 0 and our filename at 1. We then count each unique value we encounter in the first column, and whenever it's the fourth one, we reset the count and move on to the next filename.
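If you prefer not to hard-code the threshold, the group size can be passed in with -v. This is just a sketch of the same idea; the n variable is my own addition, not part of the original answer:
awk -F',' -v n=3 'BEGIN { count=0; filename=1 }
x[$1]++==0 { count++ }                      # first time this key is seen in the input
count==n+1 { count=1; filename++ }          # the (n+1)th unique key starts a new file
{ print >> (filename ".csv"); close(filename ".csv") }' test.txt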
Here's some sample data I used, which is just yours with some additional lines.
~$ cat test.txt
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
101,0,0,0
102,0,0,0
And running the awk like so:
~$ awk -F',' 'BEGIN { count=0; filename=1 }
x[$1]++==0 { count++ }
count==4 { count=1; filename++ }
{ print >> (filename ".csv"); close(filename ".csv") }' test.txt
We see the following output files and content:
~$ cat 1.csv
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
~$ cat 2.csv
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
~$ cat 3.csv
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
~$ cat 4.csv
101,0,0,0
102,0,0,0
Answer 2:
This one-liner would help:
awk -F, -v u=3 -v i=1 '{a[$1];
if (length(a)>u) {close(i".csv"); ++i; delete a; a[$1]} print > (i".csv")}' file
Change u=3 to the number of unique values you want per file.
If you run this line with your input file, you should get 1.csv and 2.csv.
Edit (add some test output):
kent$ ll
total 4.0K
drwxr-xr-x 2 kent kent 60 Mar 25 18:19 ./
drwxrwxrwt 19 root root 580 Mar 25 18:18 ../
-rw-r--r-- 1 kent kent 90 Mar 25 17:57 f
kent$ cat f
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
kent$ awk -F, -v u=3 -v i=1 '{a[$1]; if (length(a)>u){close(fn);++i;delete a;a[$1]} fn=i".csv"; print>fn}' f
kent$ head *.csv
==> 1.csv <==
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
==> 2.csv <==
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
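Note that length(a) on an array is, as far as I know, a gawk extension. If your awk does not support it, a counter-based variant of the same idea (my own sketch, not from the original answer) would be:
awk -F, -v u=3 -v i=1 '
  !($1 in a) { a[$1]; n++ }                                # count unique keys in the current chunk
  n>u        { close(i".csv"); ++i; delete a; a[$1]; n=1 } # the (u+1)th key starts a new file
             { print > (i ".csv") }' f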
Source: https://stackoverflow.com/questions/29261265/split-csv-to-multiple-files-containing-a-set-number-of-unique-field-values