Count subsequences in hundreds of GB of data

Posted by 你说的曾经没有我的故事 on 2020-01-10 20:11:28

Question


I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.

To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb

Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:

abcde => 1

Next line, it moves one character to the right and does the same:

bcdef => 1

It then continues for the rest of the string and prints the 5 most frequent sequences:

open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'

my $seq = <$in>; # read whole file into string
my $len = length($seq);

my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
     my $kmer = substr($seq, $i, $seq_length);
     $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}

ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1

However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.

I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.

My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.

  • Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here.
  • Is substr the most memory efficient tool for this task?

Answer 1:


From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once

This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory

The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer

After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file

I've also used constants for the K-mer size and the chunk size. They are constant after all!

The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort

use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';

open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {

    $length += $size;

    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
         my $kmer = substr $chunk, $offset, SEQ_LENGTH;
         ++$data{$kmer};
    }

    $chunk = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];

output

ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1

Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.

The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
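The carry-over idea can be checked on the sample string from the question. The following is a minimal sketch, not the answer's own script: the subroutine names are made up for illustration, and it uses read rather than sysread (both take the same handle, buffer, length, offset arguments) so that it can run against an in-memory filehandle. It counts k-mers chunk by chunk, keeping the last k-1 characters between reads, and compares the result with a count over the whole string.

```perl
use strict;
use warnings;

# Count every k-mer behind $fh, reading $chunk_size bytes at a time.
# (Hypothetical helper name; read() stands in for sysread() here.)
sub count_kmers_chunked {
    my ($fh, $k, $chunk_size) = @_;
    my %count;
    my $buf = '';
    # The 4th argument makes read() append after the carried-over tail.
    while ( my $got = read($fh, $buf, $chunk_size, length $buf) ) {
        $count{ substr($buf, $_, $k) }++ for 0 .. length($buf) - $k;
        # Keep only the last k-1 characters so that windows spanning
        # two chunks are seen exactly once on the next pass.
        my $keep = $k - 1;
        $buf = substr($buf, length($buf) - $keep) if length($buf) > $keep;
    }
    return \%count;
}

# Reference implementation: the whole string is in memory.
sub count_kmers_whole {
    my ($str, $k) = @_;
    my %count;
    $count{ substr($str, $_, $k) }++ for 0 .. length($str) - $k;
    return \%count;
}

my $data = 'abcdefabcgbacbdebdbbcaebfebfebfeb';
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $chunked = count_kmers_chunked($fh, 5, 7);   # chunk size 7, as in the answer
my $whole   = count_kmers_whole($data, 5);

my $same = keys(%$chunked) == keys(%$whole)
        && !grep { ($chunked->{$_} // 0) != $whole->{$_} } keys %$whole;
print $same ? "counts match\n" : "counts differ\n";
```

With a real file the only change would be opening the file instead of a scalar reference; the buffer never grows beyond the chunk size plus k-1 characters.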




Answer 2:


Even if you don't read the entire file into memory before processing it, you could still run out of memory.

A 10 GiB file contains almost 11E9 sequences.

If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.

If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 = 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.

In that case, I suggest doing the following:

  1. Create a file that contains all the sequences of the original file.

    The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.

    perl -e'
       use strict;
       use warnings qw( all );
    
       use constant SEQ_LENGTH => 20;
       use constant CHUNK_SIZE => 1024 * 1024;
    
       my $buf = "";
       while (1) {
          my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
          die($!) if !defined($size);
          last if !$size;
    
          for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
             print(substr($buf, $offset, SEQ_LENGTH), "\n");
          }
    
          substr($buf, 0, -(SEQ_LENGTH-1), "");
       }
    ' <in.txt >sequences.txt
    
  2. Sort the sequences.

    sort sequences.txt >sorted_sequences.txt
    
  3. Count the number of instances of each sequence, and store the count along with the sequence in another file.

    perl -e'
       use strict;
       use warnings qw( all );
    
       my $last = "";           
       my $count;
       while (<>) {
          chomp;
          if ($_ eq $last) {
             ++$count;
          } else {
             print("$count $last\n") if $count;
             $last = $_;
             $count = 1;
          }
       }
       print("$count $last\n") if $count;   # flush the final group
    ' sorted_sequences.txt >counted_sequences.txt
    
  4. Sort the sequences by count.

    sort -rns counted_sequences.txt >sorted_counted_sequences.txt
    
  5. Extract the results.

    perl -e'
       use strict;
       use warnings qw( all );
    
       my $last_count;
       while (<>) {
          my ($count, $seq) = split;
          last if $. > 5 && $count != $last_count;
          print("$seq $count\n");
          $last_count = $count;
       }
    ' sorted_counted_sequences.txt
    

    This also prints ties for 5th place.
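Step 3 above is essentially run-length counting over sorted input. As a rough illustration on a tiny in-memory list (the subroutine name count_runs and the sample data are assumptions made for this sketch, not part of the pipeline), note that the last run has to be flushed after the loop ends:

```perl
use strict;
use warnings;

# Collapse a sorted list into [item, count] pairs.
# (Hypothetical helper; the pipeline above does the same on a file.)
sub count_runs {
    my @sorted = @_;
    my @runs;
    my ($last, $count);
    for my $item (@sorted) {
        if (defined $last && $item eq $last) {
            $count++;
        }
        else {
            push @runs, [ $last, $count ] if defined $last;
            ($last, $count) = ($item, 1);
        }
    }
    push @runs, [ $last, $count ] if defined $last;   # flush the final run
    return @runs;
}

my @kmers  = qw(bfebf ebfeb febfe ebfeb bfebf ebfeb);
my @sorted = sort @kmers;                 # stands in for step 2
my @runs   = count_runs(@sorted);         # stands in for step 3

# Stands in for step 4: highest count first, ties broken alphabetically.
for my $run (sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @runs) {
    printf "%d %s\n", $run->[1], $run->[0];
}
```

Because only one run is held in memory at a time, the same loop works on sorted input of any size.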

This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.


  1. sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.

  2. Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log_256(5^20)) = 6 bytes then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
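The compression in note 2 could look roughly like the sketch below. It is only one possible encoding under stated assumptions: the five-symbol alphabet A, C, G, N, T, the subroutine names, and a Perl built with 64-bit integer support (needed for the Q> pack template) are all choices made for this example. Each character is treated as a base-5 digit, the resulting number is packed into 6 bytes, and those bytes are base64-encoded into 8 characters.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Assumed alphabet: five symbols, so each character is one base-5 digit.
my @ALPHABET = qw(A C G N T);
my %DIGIT;
@DIGIT{@ALPHABET} = 0 .. $#ALPHABET;

# 20 characters -> base-5 number (< 5**20 < 2**48) -> 6 bytes -> 8 base64 chars.
sub encode_kmer {
    my ($kmer) = @_;
    my $n = 0;
    $n = $n * 5 + $DIGIT{$_} for split //, $kmer;
    my $bytes = substr(pack('Q>', $n), 2);   # drop the two leading zero bytes
    return encode_base64($bytes, '');        # '' suppresses the trailing newline
}

# Reverse the encoding; $len is the original sequence length.
sub decode_kmer {
    my ($text, $len) = @_;
    my $n = unpack('Q>', "\0\0" . decode_base64($text));
    my $kmer = '';
    for (1 .. $len) {
        my $digit = $n % 5;
        $kmer = $ALPHABET[$digit] . $kmer;
        $n = ($n - $digit) / 5;
    }
    return $kmer;
}

my $kmer  = 'NCCANGCTNGGNCGNNANNA';
my $short = encode_kmer($kmer);
printf "%s -> %s (%d chars) -> %s\n",
    $kmer, $short, length($short), decode_kmer($short, length($kmer));
```

Note that base64 text does not sort in the same order as the original sequences, which does not matter here because the pipeline only needs equal sequences to end up next to each other.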


Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.




Answer 3:


Generally speaking, Perl is really slow at character-by-character processing like the solutions posted above. It's much faster at something like regular expressions, since your overhead is essentially determined by how many operators you execute.

So if you can turn this into a regex-based solution that's much better.

Here's an attempt to do that:

$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
  3 ebfeb
  2 febfe
  2 bfebf
  1 gbacb
  1 fabcg

I.e. we have our string in $str, and we pass over it 5 times, generating sequences of 5 characters on each pass. After the first pass we start chopping a character off the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but perl cheats for this special case and just sets the index of the string to 1+ what it was before.

I haven't benchmarked this, but I bet something like this is a much more viable way to do this than the algorithms above. You could of course also do the uniq counting in perl by incrementing a hash (doing so with the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.

A slightly altered implementation that does this all in perl:

$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca

Pretty formatting for the code behind that:

my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
    substr($str, 0, 1) = "" if $pos;
    $occur{$_}++ for $str =~ m/(.{5})/gs;
}

for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
    say "$occur{$k} $k";
}
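The same multi-pass idea can be wrapped in a subroutine that takes the k-mer length as a parameter. This is a sketch under my own naming (count_kmers_regex is not from the answer), and like the one-liners above it has not been benchmarked: it makes k passes, dropping one leading character before each pass after the first.

```perl
use strict;
use warnings;

# Count k-mers by making $k passes of non-overlapping regex matches.
# (Hypothetical helper name; the technique is the one described above.)
sub count_kmers_regex {
    my ($str, $k) = @_;          # $str is a private copy, safe to modify
    my %count;
    for my $pass (0 .. $k - 1) {
        substr($str, 0, 1) = '' if $pass;   # shift the window grid by one
        last if length($str) < $k;          # nothing left to match
        $count{$_}++ for $str =~ /(.{$k})/gs;
    }
    return \%count;
}

my $counts = count_kmers_regex('abcdefabcgbacbdebdbbcaebfebfebfeb', 5);

# Highest count first, ties broken alphabetically; show the top three.
my @top = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
print "$counts->{$_} $_\n" for @top[0 .. 2];
```

Adding the alphabetical tie-break to the sort makes the order of equal counts reproducible, which the plain count-only sort does not guarantee.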



Answer 4:


The most straightforward approach is to use the substr() function:

% time perl -e '$/ = \1048576; 
           while ($s = <>) { for $i (0..length $s) { 
             $hash{ substr($s, $i, 5) }++ } }  
           foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
             print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data  
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
      269.79 real       268.92 user         0.66 sys    

The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNot, @borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.

As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.


Using Stream::Reader

Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.

use Stream::Reader; 
use strict;
use warnings;

open( my $handler, "<", shift ) or die "Unable to open input file: $!";
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } ); 

my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) { 
    foreach my $i (0..length $string) { 
       $hash{ substr($string, $i, 5) }++ 
    } 
} 

my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) { 
       print "$k $hash{$k}\n"; 
       $it++; last if $it == 5;
}

On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.

NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210

The Stream::Reader based script was significantly faster:

time perl sequence_search_stream-reader.pl nucleotide.data   
252.12s
time perl sequence_search_borodin.pl nucleotide.data     
350.57s

The file nucleotide.data was 1Gb in size, consisting of a single string of approximately 1 billion characters:

% wc nucleotide.data
       0       0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA

I used this command to create the file:

perl -MString::Random=random_regex -e '
 open (my $fh, ">>", "nucleotide.data");
 for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'

Lists and Strings

Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.

I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:

use List::Gen 'slide';

$/ = \1048575; # Read a million character/bytes at a time.
my %hash;

while (my $seq = <>) {
  chomp $seq;
  foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
    next unless length $kmer == 5;
    $hash{$kmer}++;
  }
}

foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
  print "$k $hash{$k}\n";
  $it++; last if $it == 5;
}

I'm not sure this is "typical perl" (TIMTOWTDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).

Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
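One way to keep the hash itself out of RAM is to tie it to an on-disk DBM file. The sketch below is only an illustration of that idea, not a tuned solution: it swaps in SDBM_File (which ships with Perl) for the "serialize when it grows too large" scheme suggested above, the subroutine name and file name are made up for the example, and every increment becomes a disk-backed lookup, so it will be much slower than an in-memory hash.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

# Count k-mers into a hash stored on disk under $db_path
# (SDBM_File creates $db_path.pag and $db_path.dir).
# Hypothetical helper name, for illustration only.
sub count_kmers_on_disk {
    my ($str, $k, $db_path) = @_;
    tie my %count, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $db_path: $!";
    $count{ substr($str, $_, $k) }++ for 0 .. length($str) - $k;
    my %copy = %count;   # fine for this tiny demo; avoid for huge data
    untie %count;
    return \%copy;
}

my $dir    = tempdir( CLEANUP => 1 );
my $counts = count_kmers_on_disk( 'abcdefabcgbacbdebdbbcaebfebfebfeb', 5, "$dir/kmers" );
print "ebfeb $counts->{ebfeb}\n";
```

For very large key sets the sort-based pipeline in Answer 2 is likely to scale better, since it avoids random access to disk.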


Postscript and Results

The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:

1 million character nucleotide string

sequence_search_stream-reader.pl :     0.26s
sequence_search_borodin.pl       :     0.39s
sequence_search_listgen.pl       :     2.04s

83 million character nucleotide string

With the data in file xaa:

wc xaa
       0       1 83886080 xaa

% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
       21.33 real        20.95 user         0.35 sys

% time perl sequence_search_borodin.pl xaa     
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
       28.13 real        28.08 user         0.03 sys

% time perl sequence_search_listgen.pl xaa 
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
      157.54 real       156.93 user         0.45 sys      

1 billion character nucleotide string

In a larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command line at the beginning of this post. The larger file meant a number of chunk boundaries and a discrepancy in the count.

sequence_search_stream-reader.pl :   252.12s
sequence_search_borodin.pl       :   350.57s
sequence_search_listgen.pl       :  1928.34s

The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.

I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.


1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would be missing roughly 4/10000 of 1% of the final count.

echo "scale=10; 100 *  (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697
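The same estimate can be written as a small Perl function so that other file sizes, chunk sizes and k-mer lengths can be tried. This is just the arithmetic from the bc command above, with a made-up subroutine name, and it keeps the footnote's simplifications: the file size is an exact multiple of the chunk size and every chunk, including the last, is counted as losing k-1 windows.

```perl
use strict;
use warnings;

# Percentage of k-mer windows lost when chunks are processed independently.
# (Hypothetical helper; mirrors the bc expression in the footnote.)
sub lost_percent {
    my ($file_size, $chunk_size, $k) = @_;
    my $chunks = $file_size / $chunk_size;   # assumes exact division
    my $lost   = $chunks * ($k - 1);         # k-1 windows lost per chunk
    return 100 * $lost / $file_size;
}

# 1 TiB file, 1 MiB chunks, 5-character sequences, as in the footnote.
printf "%.10f\n", lost_percent( 1024**4, 1024**2, 5 );
```

The file size cancels out of the expression, so under these assumptions the lost share depends only on the chunk size and k.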


Source: https://stackoverflow.com/questions/36201884/count-subsequences-in-hundreds-of-gb-of-data
