Question
I have a huge data set, around 15-20 GB, in a tab-delimited file. While I could do this in Python or SQL, it would be easier and simpler to do it in a shell script and avoid moving the files around.
For example, take a pipe-delimited input file:
----------------------------------------
Col1 | Col2 | Col3 | Col4 | Col5 | Col6
----------------------------------------
A    | H1   | 123  | abcd | a1   | b1
B    | H1   | 124  | abcd | a2   | b1
C    | H2   | 127  | abd  | a3   | b1
D    | H1   | 128  | acd  | a4   | b1
----------------------------------------
The SQL query would look like:
SELECT Col1, Col4, Col5, Col6 FROM <table> WHERE Col2 = 'H1'
Output:
--------------------------
Col1 | Col4 | Col5 | Col6
--------------------------
A    | abcd | a1   | b1
B    | abcd | a2   | b1
D    | acd  | a4   | b1
--------------------------
Then I need to take only Col4 from this result, do some string parsing on it, and produce OutputFile1 below:
--------------------------------
Col1 | Col4 | Col5 | Col6 | New1
--------------------------------
A    | abcd | a1   | b1   | a,b,c,d
B    | abcd | a2   | b1   | a,b,c,d
D    | acd  | a4   | b1   | a,c,d
--------------------------------
Col4 is actually a URL, and I need to parse its params. See the question "How to parse URL params in shell script".
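For illustration only, assuming the params are whatever follows the '?' and should be comma-joined (the sample URL is made up, and the exact rule is in that question), this is the kind of parsing I mean:
echo 'http://example.com/page?a=1&b=2&c=3' |
awk '{ sub(/^[^?]*\?/, "")   # drop everything up to and including the "?"
       gsub(/&/, ",")        # turn the "&" separators into commas
       print }'              # prints: a=1,b=2,c=3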
I also have another file, File2:
--------------
ColA | ColB
--------------
A    | abcd
B    | abcd
D    | qst
--------------
I need to generate a similar parsed output for ColB.
OutputFile2:
--------------------
ColA | ColB | New1
--------------------
A    | abcd | a,b,c,d
B    | abcd | a,b,c,d
D    | qst  | q,s,t
--------------------
The SQL query to merge OutputFile1 and OutputFile2 would do an inner join on OutputFile1.Col1 = OutputFile2.ColA AND OutputFile1.New1 = OutputFile2.New1.
Final Output:
--------------------------------
Col1 | Col4 | Col5 | Col6 | New1
--------------------------------
A    | abcd | a1   | b1   | a,b,c,d
B    | abcd | a2   | b1   | a,b,c,d
--------------------------------
Please share suggestions on how to implement this; the major constraint is the size of the file.
Thanks
Answer 1:
There's a very simple database management program named "unity" available for UNIX at http://open-innovation.alcatel-lucent.com/projects/unity/. In unity you have 2 main files:
- a data file named whatever you like, e.g. "foo", and
- a descriptor file with the same base name as the data file but prefixed with "D" for Descriptor, e.g. "Dfoo"
These are both simple text files that you can edit with whatever editor you like (or you can use its own database-aware editor, named uedit).
Dfoo has one row for each column in foo, describing attributes of the data that appears in that column of foo and its separator from the next column.
foo would have the data.
It's been a while since I used unity in the raw (I have scripts that use it behind the scenes) but for the first table you show above:
----------------------------------------
Col1 | Col2 | Col3 | Col4 | Col5 | Col6
----------------------------------------
A    | H1   | 123  | abcd | a1   | b1
B    | H1   | 124  | abcd | a2   | b1
C    | H2   | 127  | abd  | a3   | b1
D    | H1   | 128  | acd  | a4   | b1
----------------------------------------
the Descriptor file (Dfoo) would be something like:
Col1 | 5c
Col2 | 6c
Col3 | 6c
Col4 | 6c
Col5 | 6c
Col6 \n 6c
and the data file (foo) would be:
A|H1|123|abcd|a1|b1
B|H1|124|abcd|a2|b1
C|H2|127|abd|a3|b1
D|H1|128|acd|a4|b1
You can then run unity commands like:
uprint -d- foo
to print the table with rows separated by lines of underscores and cells of the width specified in your descriptor file (e.g. 6c = 6 characters Centered while 6r = 6 characters Right-justified).
uselect Col2 from foo where Col3 leq abd
to select the values from column Col2 where the corresponding value in Col3 is Lexically EQual to the string "abd".
There are unity commands to let you do joins, merges, inserts, deletes, etc. - basically whatever you'd expect to be able to do with a relational database but it's all just based on simple text files.
In unity you can specify a different separator between each column, but if all of the separators are the same (except the final one, which will be '\n') then you can also run awk scripts on the file just by using awk -F with the separator.
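For example, the question's WHERE filter could be run straight on the data file foo shown above (this is plain awk, not a unity command):
awk -F'|' -v OFS='|' '$2 == "H1" { print $1, $4, $5, $6 }' foo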
A couple of other toolsets you could look at, which might be easier to install but probably don't have as much functionality as unity (which has been around since the 1970s!), are recutils (from GNU) and csvDB, so your full homework/research list is:
- unity: http://open-innovation.alcatel-lucent.com/projects/unity
- recutils: http://www.gnu.org/software/recutils
- csvDB: http://freecode.com/projects/csvdb
Note that recutils has rec2csv and csv2rec tools for converting between the recutils and CSV formats.
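A quick sketch of that conversion (the file names are just placeholders):
rec2csv contacts.rec > contacts.csv    # recutils format -> CSV
csv2rec contacts.csv > contacts.rec    # CSV -> recutils format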
Answer 2:
For a pipe-delimited file ($2=="H1" implements the WHERE clause, and the loop builds the comma-separated New1 column from the characters of Col4):
awk '$2=="H1"{y="";x=$4;for(i=1;i<=length($4);i++)y=y?y","substr(x,i,1):substr(x,i,1);print $1,$4,$5,$6,y;}' FS="|" OFS="|" file
For a tab-delimited file, drop the FS assignment (awk's default whitespace splitting handles tabs):
awk '$2=="H1"{y="";x=$4;for(i=1;i<=length($4);i++)y=y?y","substr(x,i,1):substr(x,i,1);print $1,$4,$5,$6,y;}' OFS="\t" file
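Neither one-liner above handles File2 or the final join. Assuming the two intermediate results are saved pipe-delimited, without header rows, as OutputFile1 (Col1|Col4|Col5|Col6|New1) and OutputFile2 (ColA|ColB|New1), with the names taken from the question, a sketch of the inner join is:
awk -F'|' 'NR == FNR { seen[$1 "|" $3]; next }  # 1st file: index OutputFile2 on ColA + New1
           ($1 "|" $5) in seen                  # 2nd file: print OutputFile1 rows whose Col1 + New1 match
          ' OutputFile2 OutputFile1 > FinalOutput
This keeps only OutputFile2 (presumably the smaller file) in memory, so the 15-20 GB OutputFile1 is streamed line by line.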
Source: https://stackoverflow.com/questions/15763436/how-to-use-awk-shell-scripting-to-do-sql-where-clause-and-sql-join-like-filterin