how to trim a file - remove the columns with the same value

天涯浪人 2021-01-05 09:15

I would like your help with trimming a file by removing every column in which all rows have the same value.

# the file I have (tab-delimited, millions of columns)
jack 1 5 9
john 3 5 0
lisa 4 5 7

# the output I want (drop any column whose value is identical in every row)
jack 1 9
john 3 0
lisa 4 7

8 Answers
  • 2021-01-05 09:41

    If you know in advance which column to strip out, then cut will be helpful. Since the file is tab-delimited and tab is cut's default delimiter, no -d option is needed:

    cut --complement -f 3 filename
    
  • 2021-01-05 09:48

    As far as I can tell, you'll need to make this a multi-pass program to meet your needs without blowing through memory. For starters, load a single line of the file into an array.

    # first pass: read the first line and assume every column is constant until a row disagrees
    open my $fh, '<', 'datafile.txt' or die "$!";
    my @first_line = split /\s+/, scalar <$fh>;
    my @mask = (1) x @first_line;   # $mask[$i] stays true while column $i is still constant

    Then read the remaining lines sequentially, clearing the mask entry for any column whose value differs from the first line:

    while (my $line = <$fh>) {
        my @next_line = split /\s+/, $line;
        # a column that differs from the first line anywhere is not constant, so clear its mask entry
        for my $i (0 .. $#first_line) {
            $mask[$i] = 0 if !defined $next_line[$i] || $next_line[$i] ne $first_line[$i];
        }
    }

    When you get to the bottom of the file, go back to the top and use the mask to determine which columns to print.
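
    A minimal sketch of that second pass (untested; it assumes the @mask, @first_line and $fh from the snippets above, with a true mask entry meaning the column never varied):

    # second pass: rewind and reprint the file without the constant columns
    seek $fh, 0, 0;
    while (my $line = <$fh>) {
        my @cols = split /\s+/, $line;
        my @keep = grep { !$mask[$_] } 0 .. $#cols;   # keep only the columns that varied
        print join("\t", @cols[@keep]), "\n";
    }
    close $fh;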

  • 2021-01-05 09:52

    You can select the columns to cut out like this:

    # using bash/awk
    # I used 1000000 here because you wrote "millions of columns"; adjust it to your actual column count
    for cols in `seq 2 1000000` ; do
        cut -d DELIMITER -f $cols FILE | awk -v c=$cols '{s+=$0} END {if (s/NR==$0) {printf("%i,",c)}}'
    done | sed 's/,$//' > tmplist
    cut --complement -d DELIMITER -f `cat tmplist` FILE
    

    But it can be REALLY slow, because it's not optimized and it reads the file once per column... so be careful with huge files. Also note the average test is only a heuristic: it assumes numeric data, and a non-constant column whose mean happens to equal the last row's value would be flagged as well.

    Or you can read the whole file once with awk to select the droppable columns, then use cut:

    cut --complement -d DELIMITER -f `awk '{for (i=1;i<=NF;i++) {sums[i]+=$i}} END {for (i=1;i<=NF; i++) {if (sums[i]/NR==$i) {printf("%i,",i)}}}' FILE | sed 's/,$//'` FILE
    

    HTH

  • 2021-01-05 09:53

    Not fully tested, but this seems to work for the provided test set. Note that it destroys the original file...

    #!/bin/bash
    
    # a column is "constant" only if a single value accounts for every row
    rows=$(wc -l < input)
    
    # change 4 below to match the number of columns
    for i in {2..4}; do
        cut -f $i input | sort | uniq -c > tmp
        while read a b; do
            if [ "$a" -eq "$rows" ]; then
                awk -v field=$i '{$field="_"; print}' input > tmp2
                mv tmp2 input
            fi
        done < tmp
    done
    
    $ cat input
    jack    1   5   9
    john    3   5   0
    lisa    4   5   7
    
    $ ./cnt.sh 
    
    $ cat input
    jack 1 _ 9
    john 3 _ 0
    lisa 4 _ 7
    

    Using _ to make the output clearer...
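
    If you want the placeholder columns removed entirely rather than just blanked, one possible final step is a perl one-liner (a sketch, not part of the script above; it assumes no real data value is a lone "_"):

    # drop every whitespace-separated field that is exactly "_"
    perl -lane 'print join "\t", grep { $_ ne "_" } @F' input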

  • 2021-01-05 09:57

    The main problem here is that you said "millions of columns" but did not specify how many rows. To decide whether a column is constant, each of its values has to be checked against its counterpart in every other row, so with R rows and C columns you are looking at on the order of R x C checks.

    Granted, you can drop a column from consideration as soon as it is proven to vary, but every remaining candidate still has to be checked down to the last row. So... much processing.

    We can build a "seed" hash from the first two lines:

    use strict;
    use warnings;
    
    open my $fh, '<', "inputfile.txt" or die;
    my %matches;                 # column index => value, for columns that may be constant
    my $line     = <$fh>;
    my $nextline = <$fh>;
    my $i = 0;
    # peel the numeric fields off both lines, left to right, and record every
    # column where the first two rows agree
    while ($line =~ s/\t(\d+)//) {
        my $num1 = $1;
        if ($nextline =~ s/\t(\d+)//) {
           if ($1 == $num1) { $matches{$i} = $num1 }
        } else {
           die "Mismatched line at line $.";
        }
        $i++;
    }
    

    Then, with this "seed" hash, you can read the rest of the lines and drop a column from the hash as soon as some row disagrees with its stored value, such as:

    while ($line = <$fh>) {
        my $i = 0;
        while ($line =~ s/\t(\d+)//) {
            # a single mismatch proves the column is not constant; delete it
            # (rather than undef it) so keys %matches ends up listing only the
            # columns that never varied
            delete $matches{$i} if defined $matches{$i} && $matches{$i} != $1;
            $i++;
        }
    }
    

    One could imagine a solution that also strips out, as you go, the columns already proven to vary so that later rows have less to scan, but to do that you need to split the row into an array (or build a regex), and I am not sure that would be any faster than simply passing through the string (sketched below).
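
    A rough sketch of that idea, as an alternative to the loop above (untested; it assumes the same $fh and %matches, and that the name column sits at position 0 of the split row):

    while (my $row = <$fh>) {
        chomp $row;
        my @cols = split /\t/, $row;
        # only revisit the columns still believed constant; numeric column $i sits at $cols[$i + 1]
        for my $i (keys %matches) {
            delete $matches{$i} if $cols[$i + 1] != $matches{$i};
        }
        last unless %matches;   # nothing left to disprove, so stop reading early
    }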

    Then, after processing all the rows, %matches holds only the columns whose value never varied, so you can re-open the file and do your print:

    open my $fh, '<', "inputfile.txt" or die;
    open my $outfile, '>', "outfile.txt" or die;
    while ($line = <$fh>) {
        my $i = 0;
        if ($line =~ s/^([^\t]+)(?=\t)//) {
            print $outfile $1;                     # the name column is always kept
        } else { warn "Missing header at line $.\n"; }
        while ($line =~ s/(\t\d+)//) {
            # %matches holds the constant columns, so print only those NOT in it
            print $outfile $1 unless defined $matches{$i};
            $i++;
        }
        print $outfile "\n";
    }
    

    This is a rather heavy operation, and the code is untested. It should give you a hint towards a solution, but it will probably take a while to process the whole file. I suggest running some tests to see whether it works with your data, and tweaking it from there.

    If you only have a few matching columns, it is much easier to simply splice them out of each line, but I hesitate to use split on such long lines. Something like:

    while ($line = <$fh>) {
        my @line = split /\t/, $line;
        for my $key (sort { $b <=> $a } keys %matches) {
            splice @line, $key + 1, 1;
        }
        $line = join ("\t", @line);
        $line =~ s/\n*$/\n/; # awkward way to make sure to get a single newline
        print $outfile $line;
    }
    

    Note that we have to sort the keys in descending numerical order so that we splice from the end of the array; otherwise each removal would shift the positions of the columns that come after it.

    Anyway, it might be one way to go. It's a rather large operation, though. I'd keep backups. ;)

  • 2021-01-05 10:00

    Here's a quick perl script to figure out which columns can be cut.

    open FH, "file" or die $!;
    my @baseline = split /\t/,<FH>;         #snag the first row
    my @linemap = 0..$#baseline;            #list all equivalent columns (all of them)
    
    while(<FH>) {                           #loop over the file
        my @line = split /\t/;
        @linemap = grep {$baseline[$_] eq $line[$_]}  @linemap; #filter out any that aren't equal
    }
    print join " ", @linemap;
    print "\n";
    

    You can use many of the above recommendations to actually remove the columns. My favorite would probably be the cut implementation, partly because the above perl script can be modified to give you the precise command (or even run it for you).

    @linemap = map {$_+1} @linemap;                   #Cut is 1-index based
    print "cut --complement -f ".join(",",@linemap)." file\n";
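
    And if you want the script to actually run that command rather than just print it, a small sketch (same @linemap as above):

    # hypothetical follow-up: invoke cut directly instead of printing the command
    my $cols = join ",", @linemap;
    system("cut", "--complement", "-f", $cols, "file") == 0
        or die "cut failed: $?";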
    