Simple dictionary in C++

前端 未结 8 1111

Moving some code from Python to C++.

BASEPAIRS = { \"T\": \"A\", \"A\": \"T\", \"G\": \"C\", \"C\": \"G\" }

Thinking maps might be overkill? W

相关标签:
8条回答
  • 2021-01-30 16:53

    While using a std::map is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong-typing:

    // First, we define base-pairs. Because regular enums
    // Pollute the global namespace, I'm using "enum class". 
    enum class BasePair {
        A,
        T,
        C,
        G
    };
    
    // Let's cut out the nonsense and make this easy:
    // A is 0, T is 1, C is 2, G is 3.
    // These are indices into our table
    // Now, everything can be so much easier
    BasePair Complimentary[4] = {
        T, // Compliment of A
        A, // Compliment of T
        G, // Compliment of C
        C, // Compliment of G
    };
    

    Usage becomes simple:

    int main (int argc, char* argv[] ) {
        BasePair bp = BasePair::A;
        BasePair complimentbp = Complimentary[(int)bp];
    }
    

    If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int) casts all the time:

    BasePair Compliment ( BasePair bp ) {
        return Complimentary[(int)bp]; // Move the pain here
    }
    
    // Define a conversion table somewhere in your program
    char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
    char ToCharacter ( BasePair bp ) {
        return BasePairToChar[ (int)bp ];
    }
    

    It's clean, it's simple, and its efficient.

    Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.

    0 讨论(0)
  • 2021-01-30 16:56

    If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

    char map(const char in)
    { return ((in & 2) ? '\x8a' - in : '\x95' - in); }
    

    It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

    0 讨论(0)
  • 2021-01-30 16:56

    This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

    #include <iostream>
    
    enum Base_enum { A, C, T, G };
    typedef enum Base_enum Base;
    static const Base pair[4] = { T, G, A, C };
    static const char name[4] = { 'A', 'C', 'T', 'G' };
    static const Base base[85] = 
      { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
        -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
        -1, -1, -1, -1,  T };
    
    const Base
    base2 (const char b)
    {
      switch (b)
        {
        case 'A': return A;
        case 'C': return C;
        case 'T': return T;
        case 'G': return G;
        default: abort ();
        }
    }
    
    int
    main (int argc, char *args) 
    {
      for (Base b = A; b <= G; b++)
        {
          std::cout << name[b] << ":" 
                    << name[pair[b]] << std::endl;
        }
      for (Base b = A; b <= G; b++)
        {
          std::cout << name[base[name[b]]] << ":" 
                    << name[pair[base[name[b]]]] << std::endl;
        }
      for (Base b = A; b <= G; b++)
        {
          std::cout << name[base2(name[b])] << ":" 
                    << name[pair[base2(name[b])]] << std::endl;
        }
    };
    

    base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.

    0 讨论(0)
  • 2021-01-30 16:57

    BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

    Maybe:

    static const char basepairs[] = "ATAGCG";
    // lookup:
    if (const char* p = strchr(basepairs, c))
        // use p[1]
    

    ;-)

    0 讨论(0)
  • 2021-01-30 16:59

    You can use the following syntax:

    #include <map>
    
    std::map<char, char> my_map = {
        { 'A', '1' },
        { 'B', '2' },
        { 'C', '3' }
    };
    
    0 讨论(0)
  • 2021-01-30 16:59

    A table out of char array:

    char map[256] = { 0 };
    map['T'] = 'A'; 
    map['A'] = 'T';
    map['C'] = 'G';
    map['G'] = 'C';
    /* .... */
    
    0 讨论(0)
提交回复
热议问题