Question
I have a large .csv file to process, and my elements are arranged randomly like this:
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
where the fields LOCAL, REMOTE, MLOCAL or MREMOTE are laid out like this:
- when they appear as a pair (e.g. LOCAL/REMOTE): if the 3rd field is MLOCAL and the 4th field is MREMOTE, then the 5th and 7th fields are the value and date of MLOCAL, and the 6th and 8th fields are the value and date of MREMOTE
- when they appear alone (only LOCAL or only REMOTE), the 4th and 5th fields are the value and date of field 3.
Now, I have split these rows using:
nawk 'BEGIN{
    while (getline < "'"$filedata"'")
    split($0,ft,",");
    name=ft[1];
    ID=ft[2]
    ?=ft[3]
    ?=ft[4]
    ....................
but because I can't find a pattern for the 3rd and 4th fields, I'm stuck: I don't know how to assign variable names to each of the array elements so that I can use them for further processing.
I tried a "case" statement, but it doesn't work in awk or nawk (it only works as expected in gawk). I also tried this:
if ( ft[3] == "MLOCAL" && ft[4]!= "MREMOTE" )
{
    MLOCAL=ft[3];
    MLOCAL_qty=ft[4];
    MLOCAL_TIMESTAMP=ft[5];
}
else if ( ft[3] == MLOCAL && ft[4] == MREMOTE )
{
    MLOCAL=ft[3];
    MREMOTE=ft[4];
    MOCAL_qty=ft[5];
    MREMOTE_qty=ft[6];
    MOCAL_TIMESTAMP=ft[7];
    MREMOTE_TIMESTAMP=ft[8];
}
else if ( ft[3] == MREMOTE && ft[4] != MOCAL )
{
    MREMOTE=ft[3];
    MREMOTE_qty=ft[4];
    MREMOTE_TIMESTAMP=ft[5];
..........................................
but that doesn't work either.
If you have any idea how to handle this, I would be grateful for a hint on how to find a pattern that covers all of the cases above.
EDIT
I don't know how to thank you for all this help. What I have to do is actually more complex than what I wrote above; I'll try to describe it as simply as I can so as not to confuse you. My output should look like this:
NAME,UNIQUE_ID,VOLUME_ALLOCATED,MLOCAL_VALUE,MLOCAL_TIMESTAMP,MLOCAL_LIMIT,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_LIMIT,MREMOTE_VALUE,MREMOTE_TIMESTAMP,REMOTE_VALUE,REMOTE_TIMESTAMP
(where MLOCAL_LIMIT and LOCAL_LIMIT are the difference between VOLUME_ALLOCATED and MLOCAL_VALUE or LOCAL_VALUE, respectively)
So, in my output file, the field positions should be: 4th field = MLOCAL_VALUE, 5th field = MLOCAL_TIMESTAMP, 7th field = LOCAL_VALUE, 8th field = LOCAL_TIMESTAMP, 10th field = MREMOTE_VALUE, 11th field = MREMOTE_TIMESTAMP, 12th field = REMOTE_VALUE, 13th field = REMOTE_TIMESTAMP.
For example, for the following input:
name,ID,VOLUME_ALLOCATED,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
name,ID,VOLUME_ALLOCATED,REMOTE,234455,19/12/2012
I should process these lines, and the output should be this:
name,ID,VOLUME_ALLOCATED,33222,22/10/2012,MLOCAL_LIMIT,,,,56,18/10/2012,,
The 7th, 8th, 9th, 12th and 13th fields are empty because there is no info for LOCAL_VALUE, LOCAL_TIMESTAMP, LOCAL_LIMIT, REMOTE_VALUE and REMOTE_TIMESTAMP.
OR
name,ID,VOLUME_ALLOCATED,,,,,,,,,234455,19/12/2012
The 4th, 5th, 6th, 7th, 8th, 9th, 10th and 11th fields should be empty because there is no info for MLOCAL_VALUE, MLOCAL_TIMESTAMP, MLOCAL_LIMIT, LOCAL_VALUE, LOCAL_TIMESTAMP, LOCAL_LIMIT, MREMOTE_VALUE and MREMOTE_TIMESTAMP.
VOLUME_ALLOCATED is retrieved from another csv file (called "info.csv") based on the ID field, which is processed earlier in the script, like this:
info.csv
VOLUME_ALLOCATED,ID,CLIENT
5242881,64,subscriber
567743,24,visitor
data.csv
NAME,64,MLOCAL,341993,23/10/2012
NAME,24,LOCAL$REMOTE,2347$4324,19/12/2012$18/12/2012
Now, my code is this:
#! /usr/bin/bash
input="info.csv"
filedata="data.csv"
outfile="out"
nawk 'BEGIN{
    while (getline < "'"$input"'")
    {
        split($0,ft,",");
        volume=ft[1];
        id=ft[2];
        client=ft[3];
        key=id;
        volumeArr[key]=volume;
        clientArr[key]=client;
    }
    close("'"$input"'");
    while (getline < "'"$filedata"'")
    {
        gsub(/\$/,","); # substitute the $ separator with comma
        split($0,ft,",");
        volume=volumeArr[id]; # Get the volume from the volumeArr, using "id" as key
        segment=clientArr[id]; # Get the client mode from the clientArr, using "id" as key
        NAME=ft[1];
        id=ft[2];
Here I'm stuck: I can't find the right way to set the rest of the fields, since I don't know how to handle the 3rd and 4th fields.
? =ft[3];
? =ft[4];
Sorry if this is confusing, but this is my current situation. Thanks.
Answer 1:
You didn't provide the expected output from your sample input but here's a start to show how to get the values for the 2 different formats of input line:
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
    delete value # or use split("",value) if your awk can't delete arrays
    if ($4 ~ /LOCAL|REMOTE/) {
        value[$3] = $5
        date[$3] = $7
        value[$4] = $6
        date[$4] = $8
    }
    else {
        value[$3] = $4
        date[$3] = $5
    }
    print
    for (type in value) {
        printf "%15s%15s%15s\n", type, value[type], date[type]
    }
}
$ awk -f tst.awk file
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
        MREMOTE             56     18/10/2012
         MLOCAL          33222     22/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
        MREMOTE          33222     22/10/2012
         MLOCAL             56     18/10/2012
xxxxxx,xx,MLOCAL,*341993,22/10/2012*
         MLOCAL        *341993    22/10/2012*
xxxxxx,xx,MREMOTE,9356828,08/10/2012
        MREMOTE        9356828     08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
         REMOTE          15253     22/10/2012
          LOCAL          19316     22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
         REMOTE        1865871     22/10/2012
          LOCAL         383666     22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
         REMOTE     1180306134     19/10/2012
and if you post the expected output we could help you more.
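Going by the layout described in the question's EDIT, the same value[]/date[] idea could be extended to print the 13 columns. This is only a sketch under assumptions, not something verified against the real data: the file layouts are taken from the info.csv and data.csv samples in the question, each limit is assumed to be VOLUME_ALLOCATED minus the matching value, and the names build_rows, cell and volume are made up for illustration.

```shell
#!/bin/sh
# build_rows INFO DATA: print one 13-column line per DATA record.
# (build_rows is a hypothetical helper name, not part of the question.)
build_rows() {
    awk '
    BEGIN { FS = OFS = "," }

    # Return arr[k], or an empty string when k is not a key of arr.
    function cell(arr, k) { return (k in arr) ? arr[k] : "" }

    # First file (VOLUME_ALLOCATED,ID,CLIENT): remember the volume for each ID.
    NR == FNR { volume[$2] = $1; next }

    # Second file (NAME,ID,types,values,dates).
    {
        gsub(/\$/, ",")                    # "$" separates the two halves of a pair
        split("", value); split("", date)  # portable way to empty both arrays

        if ($4 ~ /^M?(LOCAL|REMOTE)$/) {   # pair: two types, two values, two dates
            value[$3] = $5; date[$3] = $7
            value[$4] = $6; date[$4] = $8
        } else {                           # single: one type, one value, one date
            value[$3] = $4; date[$3] = $5
        }

        vol = cell(volume, $2)             # look up by the ID of the current line
        # Assumption: limit = allocated volume minus the value.
        mlimit = (("MLOCAL" in value) && vol != "") ? vol - value["MLOCAL"] : ""
        llimit = (("LOCAL"  in value) && vol != "") ? vol - value["LOCAL"]  : ""

        print $1, $2, vol,
              cell(value, "MLOCAL"),  cell(date, "MLOCAL"),  mlimit,
              cell(value, "LOCAL"),   cell(date, "LOCAL"),   llimit,
              cell(value, "MREMOTE"), cell(date, "MREMOTE"),
              cell(value, "REMOTE"),  cell(date, "REMOTE")
    }' "$1" "$2"
}
```

Called as build_rows info.csv data.csv > out, the two data.csv sample lines from the question should come out as:
NAME,64,5242881,341993,23/10/2012,4900888,,,,,,,
NAME,24,567743,,,,2347,19/12/2012,565396,,,4324,18/12/2012
Note that the ID is taken from the current line before the volume lookup; looking it up first would reuse the ID left over from the previous line.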
Source: https://stackoverflow.com/questions/14391738/awk-set-elements-in-array