Simple dictionary in C++

前端未结

关注

 8  1111

Moving some code from Python to C++.

BASEPAIRS = { \"T\": \"A\", \"A\": \"T\", \"G\": \"C\", \"C\": \"G\" }

Thinking maps might be overkill? W

相关标签:

8条回答

囚心锁ツ

2021-01-30 16:53
While using a std::map is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong-typing:
```
// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class". 
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
    T, // Compliment of A
    A, // Compliment of T
    G, // Compliment of C
    C, // Compliment of G
};
```
Usage becomes simple:
```
int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}
```
If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int) casts all the time:
```
BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}
```
It's clean, it's simple, and its efficient.

Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.
0 讨论(0)
发布评论:

提交评论
- 加载中...
忘了有多久

2021-01-30 16:56
If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:
```
char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }
```
It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.
0 讨论(0)
发布评论:

提交评论
- 加载中...

傲寒

2021-01-30 16:56

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
static const Base base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

const Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: abort ();
    }
}

int
main (int argc, char *args) 
{
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
};

base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.

0 讨论(0)

自闭症患者

2021-01-30 16:57
BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:
```
static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]
```
;-)
0 讨论(0)
发布评论:

提交评论
- 加载中...
一整个雨季

2021-01-30 16:59
You can use the following syntax:
```
#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

予麋鹿

2021-01-30 16:59

A table out of char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */

0 讨论(0)

1 2 下一页