I want to pack a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have a value A
, C
, T
or G>
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain repeating sections, palindromes, and other features that could be represented by fewer bits than is required to spell out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75% irrespective of the number of repeated or non-repeated patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.