Most efficient way to store a big DNA sequence?

滥情空心 2021-02-04 11:58

I want to bundle a very large DNA sequence (about 3,000,000,000 base pairs) with an iOS app. Each base pair takes one of four values: A, C, T, or G.

7 answers
  •  盖世英雄少女心
    2021-02-04 12:39

    I think you'll have to use two bits per base pair, plus implement compression as described in this paper.

    "DNA sequences... are not random; they contain repeating sections, palindromes, and other features that could be represented by fewer bits than is required to spell out the complete sequence in binary...

    With the proposed algorithm, sequence will be compressed by 75% irrespective of the number of repeated or non-repeated patterns within the sequence."

    DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management July-December 2010, Volume 2, No. 2, pp. 383-386.
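
    As a rough illustration of the two-bits-per-base idea (not the hash-based algorithm from the paper), here is a minimal Python sketch that packs four bases into each byte. The code table, the function names, and the choice to put the first base in the high bits are all assumptions made for the example. It also assumes the input contains only A, C, G, and T. Going from one 8-bit character per base to 2 bits per base is a 75% size reduction by itself, before any further compression.

```python
# Sketch only: pack a DNA string at 2 bits per base (4 bases per byte).
# The code table and function names are illustrative assumptions.

CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # position in this string equals the 2-bit code


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)           # 6, 4, 2, 0 within each byte
        out[i // 4] |= CODES[base] << shift
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    chars = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        chars.append(BASES[(data[i // 4] >> shift) & 0b11])
    return "".join(chars)


def packed_size_bytes(n_bases: int) -> int:
    """Number of bytes needed for n_bases at 2 bits per base."""
    return (n_bases + 3) // 4


if __name__ == "__main__":
    packed = pack("GATTACA")
    print(packed.hex())                       # 8f10
    print(unpack(packed, 7))                  # GATTACA
    print(packed_size_bytes(3_000_000_000))   # 750000000
```

    The sequence length has to be stored alongside the packed bytes, because the last byte may be padded with zero bits. At this density, 3,000,000,000 base pairs take 750,000,000 bytes, compared with about 3 GB at one byte per base.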

    Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:

    http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/

    Edit: See also this question on BioStar.
