Question
I have a large (2 GB) comma-separated text file containing sensor data. Sometimes the sensors are off and there is no data. I want to delete a row if it contains more than a specified number of No Data, Off, or any other non-numeric values; the header is excluded. I am only interested in counting from the 3rd column onwards. For example, my data looks like:
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1871XYZR/KB.RAT,Data from process value,Off,No Data, No Data
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
1871XKZR/KB.VAT,Data from process value,No Data,Off,2
Here the first row is the header and I want to keep it as is. But I want to delete those rows that have 2 or more No Data, Off, or any other non-numeric fields in any column from the 3rd column onwards. In other words, rows having 4 or more text fields out of five. In the example above, the 3rd and 6th rows have 2 or more No Data or Off fields and I want to delete them. Therefore, my preferred output would be
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
I can do this for a specific case with a loop:
awk -F, '{ non_numeric=0;
  for(i=1;i<=NF;i++){
    if($i ~ /No Data|Off/ ) non_numeric++
  }
  if(non_numeric<2) print $0
}' testfile.txt
Here, I am considering only No Data and Off. How can I count all non-numeric strings? If I change the if statement to
if($i ~ /[^0-9]/ ) non_numeric++
it does not work and gives no output. Also, since I am using a loop, I reckon it is going to be slow. Can we speed this up somehow? Any command-line solution is OK.
Answer 1:
You could do this with grep:
grep -vP '((?<=,|^)(No Data|Off)(?=,|$).*){2,}' input
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
Explanation: (No Data|Off) matches either No Data or Off. We surround it with (?<=,|^) and (?=,|$); these are a zero-width lookbehind and lookahead that match a comma or the beginning (or the end) of the string. This ensures that we match whole fields only. Since we want to match a field multiple times, we put everything inside a quantified group (...){2,} and add a .* to account for the text between the fields.
Answer 2:
awk -F, '
{ nonnum = 0;
  for (i = 3; i <= NF; i++) {
    if ($i ~ /[^.0-9]/) {
      nonnum++;
      if(nonnum >= 2) { next; }
    }
  }
} 1' infile > outfile
The 1 at the end prints the line unless the loop executed next, which skips the remaining patterns for the current line.
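As a rough sketch of the same loop with two adjustments: the header is printed unconditionally (its date fields contain / and :, so a pattern such as /[^.0-9]/ would count them as non-numeric), and each field is tested against a whole-field numeric pattern after trimming blanks. The function name filter_rows and its arguments are made up for this illustration, and the definition of "numeric" used here (optional minus sign, digits, optional decimal point, no exponents) is an assumption. Note also that the loop in the question starts at field 1, so Tag and Description alone already account for two non-numeric fields, which is why the /[^0-9]/ attempt printed nothing.

```shell
# filter_rows MAX FILE  (hypothetical helper, not part of the original answer)
# Keeps the header plus every row that has fewer than MAX non-numeric
# fields from column 3 onwards.
filter_rows() {
  awk -F, -v max="$1" '
    NR == 1 { print; next }                  # always keep the header
    {
      nonnum = 0
      for (i = 3; i <= NF; i++) {
        field = $i
        gsub(/^[ \t]+|[ \t]+$/, "", field)   # trim blanks around the field
        # assumed numeric format: optional minus, digits, optional decimal part
        if (field !~ /^-?([0-9]+\.?[0-9]*|\.[0-9]+)$/ && ++nonnum >= max) next
      }
      print                                  # fewer than max non-numeric fields
    }' "$2"
}
```

It would be called as filter_rows 2 infile > outfile. Leaving the loop early with next means a row is abandoned as soon as the threshold is reached, which should help on a 2 GB file.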
Answer 3:
With GNU awk you can use this goody:
awk 'NF<2' FPAT='No Data' file
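FPAT specifies a pattern that describes what a field is in a line of text. It is a GNU extension. Setting it to the static string No Data makes every occurrence of No Data a field, so NF<2 keeps only the lines with fewer than two occurrences. Note that this counts No Data only; counting Off as well would need an alternation such as FPAT='No Data|Off'.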
Answer 4:
With static strings:
$ awk '(a=$0) && gsub(/No Data|Off/,"",a)<2' file
I.e., copy the current record $0 to a temp variable a, count the number of occurrences of Off and No Data using gsub, and print if the count is less than 2. Output:
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
If you want to match all non-numeric strings, use:
awk 'NR==1 || (a=$0) && gsub(/,[^\.,0-9]+/,"",a)<3' file
It outputs the first record (NR==1) and records with fewer than three non-numeric values. The threshold is three rather than two because the Description field (for example ,Data from process value) always counts as one.
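To see why the gsub trick works as a counter, here is a small sketch that prints only the return value of gsub for a single line. The helper name count_markers is invented for this illustration; the pattern is the static one from the first command above.

```shell
# count_markers LINE  (hypothetical helper used only for this illustration)
# Prints how many times "No Data" or "Off" occurs in LINE.
count_markers() {
  printf '%s\n' "$1" |
    awk '{ a = $0; print gsub(/No Data|Off/, "", a) }'   # gsub returns the number of substitutions made
}

count_markers '1871XKZR/KB.VAT,Data from process value,No Data,Off,2'    # prints 2
count_markers '1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80'   # prints 0
```

Because gsub works on the copy a, the original record is printed unchanged. One caveat: the pattern is not anchored to field boundaries, so a description containing a word such as Offline would also be counted.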
Answer 5:
$ perl -F, -ane 'print if $. == 1 || (grep {!/\d/} @F[2..$#F]) < 2' ip.txt
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
-F, splits each input line on commas.
$. == 1 is true when the line number is 1, i.e. it prints the header.
(grep {!/\d/} @F[2..$#F]) < 2 prints the line if the number of non-numeric fields in columns 3 to the end is less than two. The condition simply checks that no digit is present in a field.
The columns to check and the threshold can easily be changed depending on the requirement. For example, @F[3..$#F] checks from the 4th column onwards, and < 3 checks that the number of non-numeric fields is less than three.
Answer 6:
lazy way: print iff fields 3-5 contain at least one number character:
awk -F, '$3$4$5 ~ "[0-9]"' data.csv
lazier way (works for your sample data): print iff row contains a comma followed by a number character:
grep ',[0-9]' data.csv
Answer 7:
This might work for you (GNU sed):
sed -r '/(.*No Data|.*Off){2}/d' file
Use alternation to delete lines with 2 or more of the designated strings.
Source: https://stackoverflow.com/questions/39501771/delete-the-row-if-it-contains-more-than-specific-number-of-non-numeric-values