`std::string` allocations are my current bottleneck - how can I optimize with a custom allocator?

半阙折子戏 2021-02-04 06:44

I'm writing a C++14 JSON library as an exercise and to use it in my personal projects.

By using callgrind I've discovered that the current bottleneck

6 answers
  • 2021-02-04 06:54

    By default, std::string allocates memory as needed from the same heap as anything that you allocate with malloc or new. To get a performance gain from providing your own custom allocator, you will need to manage your own "chunk" of memory in such a way that your allocator can hand out the amounts of memory that your strings ask for faster than malloc does. Your memory manager will make relatively few calls to malloc (or new, depending on your approach) under the hood, requesting "large" amounts of memory at once, and then deal out sections of these memory blocks through the custom allocator. To actually achieve better performance than malloc, your memory manager will usually have to be tuned to the known allocation patterns of your use cases.

    This kind of thing often comes down to the age-old trade-off of memory use versus execution speed. For example: if you have a known upper bound on your string sizes in practice, you can pull tricks with over-allocating to always accommodate the largest case. While this is wasteful of your memory resources, it can alleviate the performance overhead that more generalized allocation runs into with memory fragmentation, and it makes any calls to realloc essentially constant time for your purposes.

    @sehe is exactly right. There are many ways.

    EDIT:

    To finally address your second question: strings using different allocators can work together, but this is only fully transparent when they share the same allocator type.

    For example:

    #include <iostream>
    #include <memory>
    #include <string>

    // Derives from std::allocator<char> without changing any behavior.
    class myalloc : public std::allocator<char> {};
    myalloc customAllocator;

    int main()
    {
      // customAllocator is converted (sliced) to std::allocator<char> here.
      std::string mystring(customAllocator);
      std::string regularString = "test string";
      mystring = regularString;
      std::cout << mystring;

      return 0;
    }


    This is a fairly silly example and, of course, uses the same workhorse code under the hood. It compiles because myalloc derives from std::allocator<char>: the std::string constructor takes a const std::allocator<char>&, so the custom object is converted to the default allocator and both strings are plain std::string. That is why the assignment works. Note that the allocator type is a template parameter of std::basic_string, so a string that uses a genuinely different allocator class is a different type from std::string. Two such strings can still exchange contents through their character data (for example via data() and size()), but not through direct assignment. Implementing a useful allocator that supplies the full interface required by the STL without just disguising the default std::allocator is not as trivial. This seems to be a decent write up covering the concepts involved. The STL still does fun things with templates (such as rebind and traits) to homogenize allocator interfaces and tracking.
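    To make the point about allocator types concrete, here is a minimal sketch of an allocator that is a genuinely separate type. CountingAllocator, CountedString, toCounted and toStd are hypothetical names invented for this illustration; the allocator simply forwards to operator new/delete and counts calls, so it shows the plumbing rather than any speed-up.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// Hypothetical minimal allocator: forwards to operator new/delete and counts calls.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++allocations();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }

    // Number of allocate() calls made through CountingAllocator<T>.
    static std::size_t& allocations() {
        static std::size_t count = 0;
        return count;
    }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

static_assert(!std::is_same<CountedString, std::string>::value,
              "the allocator is part of the string's type");

// Contents cross the type boundary through the character range.
inline CountedString toCounted(const std::string& s) {
    return CountedString(s.data(), s.size());
}
inline std::string toStd(const CountedString& s) {
    return std::string(s.data(), s.size());
}
```

    Converting through data() and size() copies the characters, so in a parser it is usually better to pick one string type and use it throughout.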

  • 2021-02-04 06:54

    I think you'd be best served by reading up on the EASTL

    It has a section on allocators and you might find fixed_string useful.

  • 2021-02-04 07:02

    What often helps is the creation of a GlobalStringTable.

    See if you can find portions of the old NiMain library from the now defunct NetImmerse software stack. It contains an example implementation.

    Lifetime

    What is important to note is that this string table needs to be accessible between different DLL spaces, and that it is not a static object. R. Martinho Fernandes already warned that the object needs to be created when the application or DLL thread is created / attached, and disposed of when the thread is destroyed or the DLL is detached, and preferably before any string object is actually used. This sounds easier than it actually is.

    Memory allocation

    Once you have a single point of access that exports correctly, you can have it allocate a memory buffer up-front. If the memory is not enough, you have to resize it and move the existing strings over. Strings essentially become handles to regions of memory in this buffer.
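    A minimal sketch of the "strings become handles" idea, assuming a single growable buffer. StringTable and StringHandle are hypothetical names, not taken from NiMain; handles are stored as offsets so that they stay valid when the buffer is resized and its contents are moved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handle: an offset and a length into the table's buffer.
struct StringHandle {
    std::size_t offset;
    std::size_t length;
};

class StringTable {
public:
    // Allocates the buffer up-front.
    explicit StringTable(std::size_t initialCapacity) { buffer_.reserve(initialCapacity); }

    // Appends the characters; the vector grows and moves old contents if needed.
    StringHandle add(const char* text, std::size_t length) {
        const StringHandle handle{buffer_.size(), length};
        buffer_.insert(buffer_.end(), text, text + length);
        return handle;
    }

    // Pointer to the characters (not null-terminated); valid only until the next add().
    const char* data(StringHandle handle) const { return buffer_.data() + handle.offset; }

    // Copies the characters out into an independent std::string.
    std::string get(StringHandle handle) const {
        return std::string(data(handle), handle.length);
    }

private:
    std::vector<char> buffer_;
};
```

    Offsets survive a resize, raw pointers into the buffer do not, which is why the handle does not store a char*.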

    Placement new

    Something that often works well is placement new, where you specify where in memory your new string object is to be constructed. Instead of allocating, the operator simply takes the memory location that is passed in as an argument and returns it, and the object is constructed at that location. You can also keep track of the allocation, the actual size of the string, etc. in the GlobalStringTable object.
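    A small sketch of placement new itself, using a local buffer in place of the table's memory. placementDemo is a hypothetical function written for this illustration. Note that only the string object is placed in the buffer; the characters of a long std::string would still come from its allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Constructs a std::string inside caller-provided storage, then destroys it manually.
inline std::size_t placementDemo() {
    using String = std::string;
    alignas(String) unsigned char storage[sizeof(String)];

    // Placement new: no allocation for the object itself, only construction at 'storage'.
    String* s = new (storage) String("hi");
    assert(static_cast<void*>(s) == static_cast<void*>(storage));
    const std::size_t length = s->size();

    // Objects created with placement new need an explicit destructor call.
    s->~String();
    return length;
}
```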

    SOA

    Handling the actual memory scheduling is something that is up to you, but there are many possible ways to approach this. Often, the allocated space is partitioned in several regions so that you have several blocks per possible string size. A block for strings <= 4 bytes, one for <= 8 bytes, and so on. This is called a Small Object Allocator, and can be implemented for any type and buffer.

    If you expect many string operations where small strings are appended to repeatedly, you may change your strategy and allocate larger buffers from the start, so that the number of memmove operations is reduced. Or you can opt for a different approach and use string streams for those.
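    The size partitioning described in this section can be sketched as a function that maps a request to its block. The power-of-two progression and the 256-byte cutoff (kLargestClass) are assumptions chosen for the illustration, not values from any particular library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cutoff; larger requests would go to the general-purpose heap.
constexpr std::size_t kLargestClass = 256;

// Returns the size class (4, 8, 16, ...) for a small request, or 0 if it is too large.
inline std::size_t sizeClass(std::size_t bytes) {
    if (bytes > kLargestClass) return 0;
    std::size_t cls = 4;
    while (cls < bytes) cls *= 2;
    return cls;
}
```

    Each size class would then own its own block of equally sized slots, so a request never has to search for a fitting gap.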

    String operations

    It is not a bad idea to derive from std::basic_string, so that most of the operations still work but the internal storage is actually in the GlobalStringTable, and you can keep using the same STL conventions. This way, you also make sure that all the allocations are within a single DLL, so that there can be no heap corruption from passing different kinds of strings between different libraries, since all the allocation operations are essentially in your DLL (and are rerouted to the GlobalStringTable object).

  • 2021-02-04 07:04

    Custom allocators can help because most malloc()/new implementations are designed for maximum flexibility, thread-safety and bullet-proof workings. For instance, they must gracefully handle the case that one thread keeps allocating memory, sending the pointers to another thread that deallocates them. Things like these are difficult to handle in a performant way and drive the cost of malloc() calls.

    However, if you know that some things cannot happen in your application (like one thread deallocating stuff another thread allocated, etc.), you can optimize your allocator further than the standard implementation. This can yield significant results, especially when you don't need thread safety.

    Also, the standard implementation is not necessarily well optimized: implementing void* operator new(size_t size) and void operator delete(void* pointer) by simply calling through to malloc() and free() saves about 100 CPU cycles on average on my machine, which suggests that the default implementation is suboptimal.
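    A minimal sketch of that replacement, under the assumption that only the plain (non-aligned, throwing) forms matter. It does not consult the new-handler on failure, and whether it is faster than the default depends entirely on the standard library in use, so measure before relying on it.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Replaces the global allocation function: forward straight to malloc.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;  // malloc(0) may return nullptr
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}

// Replaces the global deallocation functions: forward straight to free.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }  // C++14 sized form
```

    The default array forms (new[] and delete[]) call these functions, so they are covered as well.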

  • 2021-02-04 07:09

    The best way to avoid a memory allocation is: don't do it!
    BUT if I remember JSON correctly, all the readStr values get used either as keys or as identifiers, so you will have to allocate them eventually. std::string's move semantics should ensure that the allocated arrays are not copied around but reused until their final use. The default NRVO/RVO/move should reduce any copying of the data, if not of the string header itself.

    Method 1:
    Pass result as a reference from the caller, which has reserved SomeReasonableLargeValue chars, then clear it at the start of readStr. This is only usable if the caller can actually reuse the string.
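    A sketch of Method 1 under simplifying assumptions: readStrInto is a hypothetical stand-in for readStr that works on a character range, and its escape handling just keeps the character after the backslash. The point is only that the caller owns the buffer, so repeated calls can reuse its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for readStr: unescapes [begin, end) into a caller-owned string.
inline void readStrInto(const char* begin, const char* end, std::string& result) {
    result.clear();  // drops the contents; mainstream implementations keep the capacity
    for (const char* p = begin; p != end; ++p) {
        // Simplified escape handling: skip the backslash, keep the next character as-is.
        if (*p == '\\' && p + 1 != end) ++p;
        result.push_back(*p);
    }
}
```

    The caller would create one std::string, call reserve() on it once, and pass it to every call. The standard does not strictly promise that clear() keeps the capacity, so treat that as an implementation detail worth checking on your platform.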

    Method 2:
    Use the stack.

    // Reserve memory for the string (BOTTLENECK)
    if (end - idx < SomeReasonableValue) { // 32?
      // Stack buffer. Feel free to use std::array if you want bounds checking,
      // but the preceding "if" should ensure it's not a problem.
      char result[SomeReasonableValue] = {0};
      int ridx = 0;
    
      for(; idx < end; ++idx) {
        // Not an escape sequence
        if(!isC('\\')) { result[ridx++] = getC(); continue; }
        // Escape sequence: skip '\'
        ++idx;
        // Convert escape sequence
        result[ridx++] = getEscapeSequence(getC());
      }
    
      // Skip closing '"'
      ++idx;
      result[ridx] = 0; // 0-terminated.
      // Optional assert here to ensure nothing went wrong.
      return result; // The bottleneck might now move here as the data is copied to the receiving string.
    }
    // Fallback code only if the string is long.
    // Your original code here
    

    Method 3:
    If your string implementation reserves some space by default (for example to fill a 32- or 64-byte boundary), you might want to take advantage of that. Construct result like this instead, in case the constructor can optimize it:

    Str result(end - idx, 0);
    

    Method 4:
    Most systems already have an optimized allocator that prefers specific block sizes: 16, 32, 64, etc.

    std::size_t siz = ((end - idx) & ~0xf) + 16; // Round up, if the allocator already works in chunks of 16 bytes.
    Str result(siz, 0);
    

    Method 5:
    Use either Google's or Facebook's allocator as a global new/delete replacement.

  • 2021-02-04 07:12

    To understand how a custom allocator can help you, you need to understand what malloc and the heap do and why they are quite slow in comparison to the stack.

    The Stack

    The stack is a large block of memory allocated for your current scope. You can think of it as this

    ([] means a byte of memory)

    [P][][][][][][][][][][][][][][][]

    (P is a pointer that points to a specific byte of memory; in this case it's pointing at the first byte.)

    So the stack is a block with only 1 pointer. When you allocate memory, it performs pointer arithmetic on P, which takes constant time. So declaring int i = 0; would mean this:

    P + sizeof(int).

    [i][i][i][i][P][][][][][][][][][][][], (i in [] is a block of memory occupied by an integer)

    This is blazing fast and as soon as you go out of scope, the entire chunk of memory is emptied simply by moving P back to the first position.

    The Heap

    The heap allocates memory from a pool of bytes managed by the C++ runtime. When you call malloc, the heap finds a length of contiguous memory that fits your malloc requirements, marks it as used so nothing else can use it, and returns that to you as a void*.

    So, a theoretical heap with little optimization, when asked for malloc(sizeof(int)), would do this.

    Heap chunk

    At first : [][][][][][][][][][][][][][][][][][][][][][][][][]

    Allocate 4 bytes (sizeof(int)): a pointer goes through every byte of memory, finds a run of the correct length, and returns a pointer to it.

    After : [i][i][i][i][][][][][][][][][][][][][][][][][][][][][]

    This is not an accurate representation of the heap, but from this you can already see numerous reasons for being slow relative to the stack.

    1. The heap is required to keep track of all already allocated memory and their respective lengths. In our test case above, the heap was already empty and did not require much, but in worst case scenarios, the heap will be populated with multiple objects with gaps in between (heap fragmentation), and this will be much slower.

    2. The heap is required to cycle through all the bytes to find a run that fits your length.

    3. The heap can suffer from fragmentation since it will never completely clean itself unless you specify it. So if you allocated an int, a char, and another int, your heap would look like this

    [i][i][i][i][c][i2][i2][i2][i2]

    (i stands for bytes occupied by an int and c stands for bytes occupied by a char.) When you de-allocate the char, it will look like this.

    [i][i][i][i][empty][i2][i2][i2][i2]

    So when you want to allocate another object into the heap,

    [i][i][i][i][empty][i2][i2][i2][i2][i3][i3][i3][i3]

    Unless a later object is exactly the size of 1 char, the freed byte cannot be reused, so the usable heap size is reduced by 1 byte. In more complex programs with millions of allocations and deallocations, the fragmentation issue can become severe and the program may become unstable.

    4. The heap has to worry about cases like thread safety (someone else said this already).

    Custom Heap/Allocator

    So, a custom allocator usually needs to address these problems while providing the benefits of the heap, such as personalized memory management and object permanence.

    These are usually accomplished with specialized allocators. If you know you don't need to worry about thread safety, or you know exactly how long your strings will be, or you have a predictable usage pattern, you can make your allocator faster than malloc and new by quite a lot.

    For example, if your program requires a lot of allocations as fast as possible without lots of deallocations, you could implement a stack allocator, in which you allocate a huge chunk of memory from the heap at startup,

    e.g.

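    // Super simple example: no alignment handling and no bounds checking.
    #include <cstddef>

    struct StackAllocator {
         char* stack;    // start of the buffer
         char* pointer;  // next free byte
         explicit StackAllocator(std::size_t expectedSize) : stack(new char[expectedSize]), pointer(stack) {}
         ~StackAllocator() { delete[] stack; }
         StackAllocator(const StackAllocator&) = delete;
         StackAllocator& operator=(const StackAllocator&) = delete;
         char* allocate(std::size_t size) { char* returnedPointer = pointer; pointer += size; return returnedPointer; }
         void empty() { pointer = stack; }  // releases every allocation at once
    };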
    

    Get expected size, get a chunk of memory from the heap.

    Assign a pointer to the beginning.

    [P][][][][][][][][][] ..... [].

    Then have one pointer that moves for each allocation. When you no longer need the memory, you simply move the pointer back to the beginning of your buffer. This gives you O(1) allocations and deallocations as well as object permanence, at the cost of flexible deallocation and a large initial memory requirement.
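    To show how such a pointer-bumping scheme could be plugged into a string, here is a sketch of an allocator adapter for std::basic_string. Arena, ArenaAllocator and ArenaString are hypothetical names; the sketch assumes the caller supplies a suitably aligned buffer that outlives every string using it, and it is not thread safe.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical fixed-size arena: hands out bytes by bumping an offset.
class Arena {
public:
    Arena(char* buffer, std::size_t size) : buffer_(buffer), size_(size), used_(0) {}

    void* allocate(std::size_t bytes, std::size_t alignment) {
        // Round the offset up to the requested alignment (assumes an aligned buffer).
        const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
        if (start > size_ || bytes > size_ - start) throw std::bad_alloc();
        used_ = start + bytes;
        return buffer_ + start;
    }
    void reset() { used_ = 0; }  // releases everything at once
    std::size_t used() const { return used_; }

private:
    char* buffer_;
    std::size_t size_;
    std::size_t used_;
};

// Minimal allocator interface on top of the arena.
template <class T>
struct ArenaAllocator {
    using value_type = T;

    explicit ArenaAllocator(Arena& arena) noexcept : arena_(&arena) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena_(other.arena_) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena_->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // individual frees are no-ops

    Arena* arena_;
};

// Two allocators are interchangeable only if they share the same arena.
template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return a.arena_ == b.arena_; }
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return a.arena_ != b.arena_; }

using ArenaString = std::basic_string<char, std::char_traits<char>, ArenaAllocator<char>>;
```

    A string is then created with something like ArenaString s(100, 'x', ArenaAllocator<char>(arena)). Because deallocate does nothing, a string that keeps growing leaves its old buffers behind in the arena until reset() is called, which fits a parse-then-discard pattern better than long-lived strings.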

    For strings, you could try a chunk allocator. For every allocation, the allocator gives a set chunk of memory.
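    One way to sketch such a chunk allocator is a pool of equally sized chunks recycled through a free list. ChunkPool is a hypothetical name, and the fixed capacity and nullptr-on-exhaustion behavior are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical pool of equally sized chunks, recycled through an intrusive free list.
class ChunkPool {
public:
    ChunkPool(std::size_t chunkSize, std::size_t chunkCount)
        : chunkSize_(roundUp(chunkSize < sizeof(Node) ? sizeof(Node) : chunkSize, alignof(Node))),
          storage_(chunkSize_ * chunkCount),
          freeList_(nullptr) {
        // Thread every chunk onto the free list.
        for (std::size_t i = 0; i < chunkCount; ++i)
            release(storage_.data() + i * chunkSize_);
    }

    // O(1): pop the first free chunk; returns nullptr when the pool is exhausted.
    void* acquire() {
        if (!freeList_) return nullptr;
        Node* node = freeList_;
        freeList_ = node->next;
        return node;
    }

    // O(1): push the chunk back onto the free list.
    void release(void* chunk) {
        freeList_ = new (chunk) Node{freeList_};
    }

    std::size_t chunkSize() const { return chunkSize_; }

private:
    struct Node { Node* next; };

    static std::size_t roundUp(std::size_t n, std::size_t a) { return (n + a - 1) / a * a; }

    std::size_t chunkSize_;
    std::vector<char> storage_;
    Node* freeList_;
};
```

    Every request costs the same regardless of the string's length, and a freed chunk always fits the next request, so this layout cannot fragment. The price is wasted space for strings much shorter than the chunk size.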

    Compatibility

    Compatibility with other strings is almost guaranteed. As long as you are allocating a contiguous chunk of memory and preventing anything else from using that block of memory, it will work.
