Question
I have two files:

candidates.csv:
id,value
1,123
4,1
2,5
50,5

blacklist.csv:
1
2
5
3
10
I'd like to remove all rows from candidates.csv in which the first column (id) has a value contained in blacklist.csv. id is always numeric. In this case I'd like my output to look like this:
id,value
4,1
50,5
So far, my script for identifying the duplicate lines looks like this:
cat candidates.csv | cut -d \, -f 1 | grep -f blacklist.csv -w
This gives me the output
1
2
Now I somehow need to pipe this information back into sed/awk/gawk/... to delete the duplicates, but I don't know how. Any ideas how I can continue from here? Or is there a better solution altogether? My only restriction is that it has to run in bash.
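For illustration, one way the matched ids from the pipeline above could be fed back in (a sketch only, assuming bash so that process substitution is available): turn each matched id into an anchored pattern ^id, and let grep -v drop the corresponding rows.

grep -vf <(cut -d, -f1 candidates.csv | grep -wf blacklist.csv | sed 's/.*/^&,/') candidates.csv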
Answer 1:
What about the following:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action a series of commands. Here, the two condition-action pairs read:

(NR==FNR){a[$1];next}
If the total record count NR equals the record count of the current file FNR (i.e. if we are reading the first file, blacklist.csv), store the first field as a key of array a and skip to the next record (do not execute anything else for this line).

!($1 in a)
If the first field is not a key of array a, perform the default action, which is to print the line. This only takes effect for the second file (candidates.csv): while reading the first file the next statement skips this check, and while reading the second file the first pair's condition NR==FNR no longer holds, so execution falls through to it.
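For reference, the same one-liner written out as the two pattern-action pairs described above (a sketch with explanatory comments; the behaviour is identical):

awk -F, '
  NR == FNR { a[$1]; next }   # first file (blacklist.csv): remember each id as a key of a
  !($1 in a)                  # second file (candidates.csv): print lines whose id was not remembered
' blacklist.csv candidates.csv

Run on the sample files above, this prints the header plus 4,1 and 50,5, since "id", "4" and "50" never occur as keys of a.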
Answer 2:
If you are not too concerned about the order of the lines in your candidates.csv file, you could use the following:
join -v 1 -t, <(sort -t, candidates.csv) <(sort blacklist.csv)
The -v 1 requests all the lines from the first file (the sorted candidates.csv) which do not match on the first field with the second file (the blacklist.csv). The -t, just sets the comma as the separator.
If you're concerned about the header line in the candidates.csv file, you could strip it before the sorting or change the order.
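One possible way to handle the header, for example (a sketch, assuming bash for the process substitution and using tail -n +2 to skip the header line; the sort keys may need adjusting for other data):

head -n 1 candidates.csv
join -v 1 -t, <(tail -n +2 candidates.csv | sort -t, -k1,1) <(sort blacklist.csv)

The first command reprints the header, the second joins only the data lines, so the sample input again yields id,value followed by 4,1 and 50,5.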
Answer 3:
You can use sed and grep together to get the output:
$ sed -E 's/[0-9]+/&,/g' blacklist.csv > filter.csv
$ grep -Fvf filter.csv candidates.csv
id,value
4,1
50,5
The sed command appends a , to each id and writes the result to filter.csv. -E makes sed interpret the pattern as an extended regular expression on MacOSX/FreeBSD, the same as -r in GNU sed.
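With the sample blacklist.csv above, filter.csv then contains one pattern per line:

1,
2,
5,
3,
10,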
grep uses the -f option to read the patterns from that file and -v to invert the match, i.e. to remove the matching lines. -F treats the patterns as fixed strings rather than regular expressions.
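If you prefer to avoid the temporary filter.csv, the two steps can also be combined with process substitution (a sketch, assuming bash):

$ grep -Fvf <(sed -E 's/[0-9]+/&,/g' blacklist.csv) candidates.csv

Note that the fixed-string patterns match anywhere on the line, not only in the id column, so this relies on the blacklisted ids followed by a comma not occurring elsewhere in a row.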
Source: https://stackoverflow.com/questions/52649404/how-to-delete-rows-from-a-csv-file-based-on-a-list-values-from-another-file