Delete a row if it contains more than a specified number of non-numeric values

Question

I have a large (2 GB) comma-separated text file containing sensor data. Sometimes the sensors are off and there is no data. I want to delete every row, excluding the header, that contains more than a specified number of No Data, Off, or other non-numeric values. I am only interested in counting from the 3rd column onwards. For example, my data looks like this:

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1871XYZR/KB.RAT,Data from process value,Off,No Data, No Data
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80
1871XKZR/KB.VAT,Data from process value,No Data,Off,2

Here the first row is the header and I want to keep it as is. But I want to delete those rows that have 2 or more No Data, Off, or other non-numeric fields from the 3rd column onwards. In other words, rows having 4 or more text fields out of five. In the example above, the 3rd and 6th lines have 2 or more No Data or Off fields, and I want to delete them. Therefore, my preferred output would be

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80

I can do this for a specific case with a loop:

awk -F, '{ non_numeric=0;
  for(i=1;i<=NF;i++){
    if($i ~ /No Data|Off/ ) non_numeric++
  }
  if(non_numeric<2) print $0
}' testfile.txt

Here, I am considering only No Data and Off. How can I count all non-numeric strings? If I change the if statement to

if($i ~ /[^0-9]/ ) non_numeric++

it does not work and gives no output. Also, since I am using a loop, I reckon it is going to be slow. Can we speed this up somehow? Any command-line solution is OK.

Answer 1:

You could do this with grep:

grep -vP '((?<=,|^)(No Data|Off)(?=,|$).*){2,}' input

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80

Explanation: (No Data|Off) matches either No Data or Off. We surround it with (?<=,|^) and (?=,|$); these are a zero-width lookbehind and lookahead that match a , or the beginning (respectively the end) of the line. This ensures that we match whole fields only. Since we want to match a field at least twice, we put everything inside a quantified group (...){2,} and add a .* to account for the text between the fields. The -v option then inverts the match, so only lines without two such fields are printed.

Answer 2:

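awk -F, '
    NR == 1 { print; next }     # keep the header unconditionally
    {   nonnum = 0;
        for (i = 3; i <= NF; i++) {
            if ($i ~ /[^.0-9]/) {
                nonnum++;
                if (nonnum >= 2) { next; }
            }
        }
    } 1' infile > outfile

The 1 at the end is an always-true pattern with the default action, so it prints the line unless the loop has already executed next, which skips the remaining rules for the current line. The NR == 1 rule prints the header first, because its date fields contain / and : and would otherwise be counted as non-numeric.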
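If the threshold and the starting column need to vary, the same early-exit loop can be wrapped in a small shell function. This is only a sketch: the function name filter_rows and the variables max_bad and start are made up for illustration, and the numeric test assumes values look like an optional sign, digits, and an optional fraction (so .5 or 1e3 would count as non-numeric).

```shell
# filter_rows MAX_BAD START_COL < input   (hypothetical helper, not from the answer)
# Keeps the header and drops every row whose number of non-numeric fields,
# counted from column START_COL onwards, reaches MAX_BAD.
filter_rows() {
    awk -F, -v max_bad="$1" -v start="$2" '
        NR == 1 { print; next }                       # always keep the header
        {
            bad = 0
            for (i = start; i <= NF; i++) {
                field = $i
                gsub(/^[ \t]+|[ \t]+$/, "", field)    # trim surrounding blanks
                # assumed numeric format: optional sign, digits, optional fraction
                if (field !~ /^[-+]?[0-9]+(\.[0-9]+)?$/ && ++bad >= max_bad)
                    next                              # stop scanning this row early
            }
            print
        }'
}
```

For the data in the question it would be called as filter_rows 2 3 < infile > outfile. Anchoring the pattern with ^ and $ is stricter than /[^.0-9]/, which would accept strings such as 1.2.3 as numeric.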

Answer 3:

With GNU awk you can use this goody:

awk 'NF<2' FPAT='No Data' file

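FPAT is a GNU extension that specifies a pattern describing what a field looks like, rather than what separates fields. Setting it to the static string No Data makes NF the number of No Data occurrences in the line, so NF<2 keeps the lines with fewer than two of them. Note that this counts only No Data; to count Off as well, set FPAT='No Data|Off'.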
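On an awk without FPAT, a similar count can be had by swapping in split(), which is a different technique from the one above: splitting a line on the marker pattern yields one more piece than there are matches. A rough sketch, with the function name drop_rows_with_markers and the variable limit invented for the example:

```shell
# drop_rows_with_markers LIMIT < input   (hypothetical helper using split(), not FPAT)
# Keeps the header and every row with fewer than LIMIT occurrences of the markers.
drop_rows_with_markers() {
    awk -v limit="$1" '
        NR == 1 { print; next }                 # keep the header
        {
            # N matches of the separator pattern produce N + 1 pieces
            hits = split($0, pieces, /No Data|Off/) - 1
            if (hits < limit) print
        }'
}
```

Like the FPAT version, this matches substrings anywhere in the line, so a description containing a word such as Offset would also be counted.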

Answer 4:

With static strings:

$ awk '(a=$0) && gsub(/No Data|Off/,"",a)<2' file

I.e., copy the current record $0 to a temporary variable a, count the occurrences of Off and No Data using the return value of gsub, and print the record if the count is less than 2. Output:

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80

If you want to match all non-numeric strings, use:

awk 'NR==1 || (a=$0) && gsub(/,[^\.,0-9]+/,"",a)<3' file

It outputs the first record (NR==1) and any record with fewer than three non-numeric values. The threshold is three rather than two because the description field (,Data from process value) always counts as one.
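To see what gsub is counting, it may help to print the count next to each line. A small sketch, where the function name count_non_numeric is made up for illustration:

```shell
# count_non_numeric < input   (hypothetical helper for inspecting the counts)
# Prints "<count> <line>" for every input line.
count_non_numeric() {
    awk '{
        copy = $0                            # work on a copy so $0 stays intact
        n = gsub(/,[^.,0-9]+/, "", copy)     # gsub returns the number of substitutions
        print n, $0
    }'
}
```

The pattern only looks at what immediately follows each comma, so a field with a leading space (such as ", 5") is counted as non-numeric, while a field that merely starts with a digit is not.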

Answer 5:

$ perl -F, -ane 'print if $. == 1 || (grep {!/\d/} @F[2..$#F]) < 2' ip.txt 
Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00, 2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,2.1,Off,2.7
1962XYMK/KB.GAT,Data from Process Value,No Data,5,3
1867XYST/KB.FAT,Data from process value,1.05,5.87,7.80

-F, splits each input line on , (with -a, the fields go into @F)
$. == 1 is true for line number 1, i.e. it prints the header
(grep {!/\d/} @F[2..$#F]) < 2 prints the line if the number of non-numeric fields from column 3 to the end is less than two. The test simply treats a field as non-numeric when it contains no digit

The columns to check and the threshold can easily be changed as required. For example, @F[3..$#F] checks from the 4th column onwards, and < 3 keeps rows with fewer than three non-numeric fields.