Split CSV to Multiple Files Containing a Set Number of Unique Field Values

Submitted by 蓝咒 on 2019-12-13 16:30:26

Question


As a beginner in awk, I am able to split the data by unique first-column value with

awk -F, '{print >> ($1".csv"); close($1".csv")}' myfile.csv
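For reference, a small demonstration of that per-value split (a sketch on a three-line input; note that close() must be given the same name the file was opened under, i.e. $1".csv" rather than $1):

```shell
#!/bin/sh
# Sketch: the question's per-value split on a tiny input.
# close() gets the same string the file was opened under.
rm -f 111.csv 222.csv
cat > myfile.csv <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
EOF

awk -F, '{print >> ($1".csv"); close($1".csv")}' myfile.csv
```

This produces 111.csv with two lines and 222.csv with one line.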

But I would like to split a large CSV file based on an additional condition: the number of unique values in a specific column.

Specifically, with input

111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

I would like the output files to be

111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

and

444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

each of which contains three (in this case) unique values in the first column: 111, 222, 333 and 444, 555, 666, respectively. Any help would be appreciated.


Answer 1:


This will do the trick and I find it pretty readable and easy to understand:

awk -F',' 'BEGIN { count=0; filename=1 }
            x[$1]++==0 { count++ }
            count==4 { count=1; filename++ }
            { print >> (filename".csv"); close(filename".csv") }' file

We start with our count at 0 and our filename at 1. We then count each unique value we get from the first column, and whenever it's the 4th one, we reset our count to 1 and move on to the next filename.
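The same idea works with the group size as a parameter instead of the hard-coded 4 (a sketch; the variable name n and the !seen[$1]++ spelling of the uniqueness test are my choices, everything else follows the answer):

```shell
#!/bin/sh
# Sketch: parameterize the group size with -v n.
rm -f 1.csv 2.csv
cat > input.csv <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
EOF

awk -F',' -v n=3 'BEGIN { count = 0; filename = 1 }
    !seen[$1]++ { count++ }                  # first time we meet this key
    count > n   { count = 1; filename++ }    # (n+1)th unique key: next file
    { print >> (filename".csv"); close(filename".csv") }' input.csv
```

With n=3 on the question's input this yields 1.csv with five lines (keys 111, 222, 333) and 2.csv with four lines (keys 444, 555, 666).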

Here's some sample data I used, which is just yours with some additional lines.

~$ cat test.txt
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1
101,0,0,0
102,0,0,0

And running the awk like so:

~$ awk -F',' 'BEGIN { count=0; filename=1 }
            x[$1]++==0 { count++ }
            count==4 { count=1; filename++ }
            { print >> (filename".csv"); close(filename".csv") }' test.txt

We see the following output files and content:

~$ cat 1.csv
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

~$ cat 2.csv
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

~$ cat 3.csv
777,1,1,1
777,1,0,1
777,1,1,0
777,1,1,1
888,1,0,1
888,1,1,1
999,1,1,1
999,0,0,0
999,0,0,1

~$ cat 4.csv
101,0,0,0
102,0,0,0



Answer 2:


This one-liner would help:

awk -F, -v u=3 -v i=1 '{a[$1];
   if (length(a)>u){close(i".csv");++i;delete a;a[$1]}print > (i".csv")}' file

Change the u=3 value to x to get x unique values per file.

If you run this line with your input file, you should get 1.csv and 2.csv.
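One caveat: length() applied to an array is a gawk extension, not guaranteed in every awk. A variant that tracks the unique-key count explicitly behaves the same (a sketch; the counter c is my addition, the rest mirrors the one-liner), shown here with u=2 on the first seven input lines:

```shell
#!/bin/sh
# Sketch: same rollover logic without length() on an array;
# c counts unique keys seen in the current group.
rm -f 1.csv 2.csv
cat > file <<'EOF'
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
EOF

awk -F, -v u=2 -v i=1 '!($1 in a){a[$1]; c++}
   c>u {close(i".csv"); ++i; delete a; a[$1]; c=1}
   {print > (i".csv")}' file
```

With u=2, keys 111 and 222 land in 1.csv (three lines) and keys 333 and 444 in 2.csv (four lines).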

Edit (add some test output):

kent$  ll
total 4.0K
drwxr-xr-x  2 kent kent  60 Mar 25 18:19 ./
drwxrwxrwt 19 root root 580 Mar 25 18:18 ../
-rw-r--r--  1 kent kent  90 Mar 25 17:57 f

kent$  cat f
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0

kent$  awk -F, -v u=3 -v i=1 '{fn=i".csv";a[$1];if (length(a)>u){close(fn);++i;delete a;a[$1]}print>fn}' f  

kent$  head *.csv
==> 1.csv <==
111,1,0,1
111,1,1,1
222,1,1,1
333,1,0,0
333,1,1,1

==> 2.csv <==
444,1,1,1
444,0,0,0
555,1,1,1
666,1,0,0


Source: https://stackoverflow.com/questions/29261265/split-csv-to-multiple-files-containing-a-set-number-of-unique-field-values
