Smart pointer wrapping penalty. Memoization with std::map

无人及你 2021-01-05 17:21

I am currently in the middle of a project where performance is of vital importance. Following are some of the questions I had regarding this issue.

Question1

4 Answers
  • 2021-01-05 17:59

    Answer to Q#1

    If the regular pointers are faster and I already have shared pointers what options do I have in order to call a method that the shared pointer points to?

    operator-> in boost::shared_ptr contains an assertion:

    typename boost::detail::sp_member_access< T >::type operator-> () const 
    {
        BOOST_ASSERT( px != 0 );
        return px;
    }
    

    So, first of all, make sure that NDEBUG is defined (release builds usually define it automatically):

    #define NDEBUG
    

    I compared the generated assembly for dereferencing a boost::shared_ptr and a raw pointer:

    template<int tag,typename T>
    NOINLINE void test(const T &p)
    {
        volatile auto anti_opti=0;
        ASM_MARKER<tag+0>();
        anti_opti = p->data;
        anti_opti = p->data;
        ASM_MARKER<tag+1>();
        (void)anti_opti;
    }
    

    test<1000>(new Foo);
    

    ASM code of test when T is Foo* (don't be scared, there is a diff below):

    _Z4testILi1000EP3FooEvRKT0_:
    .LFB4088:
    .cfi_startproc
    pushq %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movq %rdi, %rbx
    subq $16, %rsp
    .cfi_def_cfa_offset 32
    movl $0, 12(%rsp)
    call _Z10ASM_MARKERILi1000EEvv
    movq (%rbx), %rax
    movl (%rax), %eax
    movl %eax, 12(%rsp)
    movl %eax, 12(%rsp)
    call _Z10ASM_MARKERILi1001EEvv
    movl 12(%rsp), %eax
    addq $16, %rsp
    .cfi_def_cfa_offset 16
    popq %rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
    

    test<2000>(boost::make_shared<Foo>());
    

    ASM code of test when T is boost::shared_ptr<Foo>:

    _Z4testILi2000EN5boost10shared_ptrI3FooEEEvRKT0_:
    .LFB4090:
    .cfi_startproc
    pushq %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movq %rdi, %rbx
    subq $16, %rsp
    .cfi_def_cfa_offset 32
    movl $0, 12(%rsp)
    call _Z10ASM_MARKERILi2000EEvv
    movq (%rbx), %rax
    movl (%rax), %eax
    movl %eax, 12(%rsp)
    movl %eax, 12(%rsp)
    call _Z10ASM_MARKERILi2001EEvv
    movl 12(%rsp), %eax
    addq $16, %rsp
    .cfi_def_cfa_offset 16
    popq %rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
    

    Here is the output of the diff -U 0 foo_p.asm shared_ptr_foo_p.asm command:

    --- foo_p.asm   Fri Apr 12 10:38:05 2013
    +++ shared_ptr_foo_p.asm        Fri Apr 12 10:37:52 2013
    @@ -1,2 +1,2 @@
    -_Z4testILi1000EP3FooEvRKT0_:
    -.LFB4088:
    +_Z4testILi2000EN5boost10shared_ptrI3FooEEEvRKT0_:
    +.LFB4090:
    @@ -11 +11 @@
    -call _Z10ASM_MARKERILi1000EEvv
    +call _Z10ASM_MARKERILi2000EEvv
    @@ -16 +16 @@
    -call _Z10ASM_MARKERILi1001EEvv
    +call _Z10ASM_MARKERILi2001EEvv
    

    As you can see, the only differences are the function signature and the value of the tag non-type template argument; the rest of the code is IDENTICAL.


    In general, though, shared_ptr is costly to copy and destroy: its reference counting is synchronized between threads (usually via atomic operations). If you use boost::intrusive_ptr instead, you can implement your own increment/decrement without thread synchronization, which speeds up reference counting.

    If you can afford to use unique_ptr or move semantics (via Boost.Move or C++11), there is no reference counting at all, which is faster still.
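
    A minimal sketch of the unique_ptr route, assuming C++11 std::unique_ptr rather than Boost.Move. Foo mirrors the struct used in the demo; make_foo and consume are hypothetical names chosen for illustration. Ownership is handed over by moving the pointer, so there is no counter to update:

    ```cpp
    #include <memory>
    #include <utility>

    struct Foo
    {
        int data;
    };

    // Hypothetical factory: the caller becomes the sole owner of the Foo.
    std::unique_ptr<Foo> make_foo(int value)
    {
        std::unique_ptr<Foo> p(new Foo);
        p->data = value;
        return p; // ownership moves out, no reference count involved
    }

    // Hypothetical sink: takes ownership; the Foo is destroyed when it returns.
    int consume(std::unique_ptr<Foo> p)
    {
        return p->data;
    }
    ```

    A caller would write consume(std::move(p)); afterwards p is empty, which makes the transfer of ownership explicit at the call site.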

    Complete program used to produce the ASM output above:

    #define NDEBUG
    
    #include <boost/make_shared.hpp>
    #include <boost/shared_ptr.hpp>
    
    #define NOINLINE __attribute__ ((noinline))
    
    template<int>
    NOINLINE void ASM_MARKER()
    {
        volatile auto anti_opti = 11;
        (void)anti_opti;
    }
    
    struct Foo
    {
        int data;
    };
    
    template<int tag,typename T>
    NOINLINE void test(const T &p)
    {
        volatile auto anti_opti=0;
        ASM_MARKER<tag+0>();
        anti_opti = p->data;
        anti_opti = p->data;
        ASM_MARKER<tag+1>();
        (void)anti_opti;
    }
    
    int main()
    {
        {
            auto p = new Foo;
            test<1000>(p);
            delete p;
        }
        {
            test<2000>(boost::make_shared<Foo>());
        }
    }
    

    Answer to Q#2

    I have an instance method(s) that is rapidly called that creates a std::vector on the stack every time.

    Generally, it is a good idea to reuse the vector's capacity in order to avoid costly reallocations. For instance, it is better to replace:

    {
        for(/*...*/)
        {
            std::vector<value> temp;
            // do work on temp
        }
    }
    

    with:

    {
        std::vector<value> temp;
        for(/*...*/)
        {
            // do work on temp
            temp.clear();
        }
    }
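
    To see the effect, here is a small sketch that counts how often the buffer grows. count_growths is a hypothetical helper, not something from the question; it relies on clear() leaving the vector's capacity unchanged:

    ```cpp
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical worker: fills a scratch vector `rounds` times and reports
    // how often its capacity changed. `temp` lives outside the loop and
    // clear() keeps the allocated capacity, so growth happens on the first
    // pass only.
    std::size_t count_growths(std::size_t rounds, std::size_t items)
    {
        std::vector<std::string> temp;
        std::size_t growths = 0;
        for (std::size_t r = 0; r < rounds; ++r)
        {
            for (std::size_t i = 0; i < items; ++i)
            {
                const std::size_t before = temp.capacity();
                temp.push_back("item");
                if (temp.capacity() != before)
                    ++growths;
            }
            temp.clear(); // size becomes 0, capacity is kept
        }
        return growths;
    }
    ```

    With this layout, ten rounds cause as many reallocations as one round; declaring temp inside the loop would repeat them every round.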
    

    But judging by the type std::map<std::string,std::vector<std::string>*>, it looks like you are trying to perform some kind of memoization.

    As already suggested, instead of std::map, which has O(log N) lookup/insert, you may try boost::unordered_map/std::unordered_map, which has O(1) average and O(N) worst-case complexity for lookup/insert, and better locality/compactness (cache-friendly).
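
    A minimal memoization sketch along those lines, assuming C++11 std::unordered_map. WordCache and split_words are hypothetical names, and the cache stores the vectors by value instead of the raw pointers used in the question's map type:

    ```cpp
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical expensive computation whose result we want to cache.
    std::vector<std::string> split_words(const std::string &text)
    {
        std::vector<std::string> words;
        std::string current;
        for (char c : text)
        {
            if (c == ' ')
            {
                if (!current.empty())
                    words.push_back(current);
                current.clear();
            }
            else
            {
                current += c;
            }
        }
        if (!current.empty())
            words.push_back(current);
        return words;
    }

    class WordCache
    {
    public:
        // Returns the cached vector for `text`, computing it on first request.
        const std::vector<std::string> &get(const std::string &text)
        {
            auto it = cache_.find(text);
            if (it == cache_.end())
            {
                ++misses_;
                it = cache_.emplace(text, split_words(text)).first;
            }
            return it->second;
        }

        int misses() const { return misses_; }

    private:
        std::unordered_map<std::string, std::vector<std::string>> cache_;
        int misses_ = 0;
    };
    ```

    References to the stored vectors stay valid when the table rehashes, so returning a const reference avoids copying the cached data on every hit.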

    Also, consider trying Boost.Flyweight:

    Flyweights are small-sized handle classes granting constant access to shared common data, thus allowing for the management of large amounts of entities within reasonable memory limits. Boost.Flyweight makes it easy to use this common programming idiom by providing the class template flyweight, which acts as a drop-in replacement for const T.

  • 2021-01-05 18:01

    For Question 1:

    The major performance gains come from the architecture and the algorithms used; low-level concerns matter too, but only once the high-level design is sound. Coming to your question: a regular pointer is faster than shared_ptr. However, the extra work you take on by not using shared_ptr is also greater, which raises the cost of maintaining the code in the long run. Redundant object creation and destruction must be avoided in performance-critical applications. Here shared_ptr plays an important role: it lets you share common objects across threads and reduces the overhead of releasing the resources. Yes, a shared pointer takes more time than a regular pointer because of the reference count, the allocations (object, counter, deleter) and so on. You can make shared_ptr faster by preventing unnecessary copies: pass it by reference (shared_ptr const&). Moreover, if you don't need to share resources across threads, don't use shared_ptr; a regular pointer will give better performance in that case.
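
    The difference between passing by value and by const reference can be observed through use_count(). This is a small sketch that assumes std::shared_ptr (boost::shared_ptr behaves the same way here); owners_by_value and owners_by_ref are hypothetical names:

    ```cpp
    #include <memory>

    struct Foo
    {
        int data;
    };

    // By value: the parameter is an additional owner, so the count is
    // incremented on entry and decremented on exit (usually with atomic
    // operations).
    long owners_by_value(std::shared_ptr<Foo> p)
    {
        return p.use_count();
    }

    // By const reference: no new owner is created, the count is untouched.
    long owners_by_ref(const std::shared_ptr<Foo> &p)
    {
        return p.use_count();
    }
    ```

    Called with a pointer that has a single owner, the by-value version sees two owners during the call, while the by-reference version still sees one.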

    Question 2

    If you want to reuse a pool of shared_ptr objects, you would do better to look into the object pool design pattern: http://en.wikipedia.org/wiki/Object_pool_pattern
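
    A rough sketch of what such a pool could look like for the vectors from the question. VectorPool is a hypothetical class, and it hands out std::unique_ptr instead of shared_ptr to keep the example short; it is not thread-safe as written:

    ```cpp
    #include <cstddef>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical pool of reusable string vectors.
    class VectorPool
    {
    public:
        typedef std::vector<std::string> Buffer;

        // Hands out a recycled buffer if one is idle, otherwise a new one.
        std::unique_ptr<Buffer> acquire()
        {
            if (free_.empty())
                return std::unique_ptr<Buffer>(new Buffer);
            std::unique_ptr<Buffer> buf = std::move(free_.back());
            free_.pop_back();
            return buf;
        }

        // Returns a buffer to the pool: contents are dropped, capacity is kept.
        void release(std::unique_ptr<Buffer> buf)
        {
            buf->clear();
            free_.push_back(std::move(buf));
        }

        std::size_t idle() const { return free_.size(); }

    private:
        std::vector<std::unique_ptr<Buffer>> free_;
    };
    ```

    A buffer that has been released is handed out again by the next acquire(), so a hot code path stops allocating once the pool has warmed up.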

  • 2021-01-05 18:09

    Q1: Just look at the implementation:

    T * operator-> () const // never throws
    {
        BOOST_ASSERT(px != 0);
        return px;
    }
    

    Clearly it's returning a member variable and not calculating anything on the fly, so the performance will be as fast as dereferencing a plain pointer (subject to usual quirks of compiler optimisation / performance of an unoptimised build can always be expected to suck - not worth consideration).

    Q2: "is it worth searching a map for a vector address and returning back a valid address over just creating one on the stack like std::vector<std::string> somevector. I would also like an idea on the performance of std::map::find."

    Whether it's worth it depends on the amount of data that would have to be copied in the vector and, to a lesser extent, on the number of nodes in the map, the length of common prefixes in the keys being compared, etc. As always, if you care, benchmark. Generally though, I'd expect the answer to be yes if the vectors contain a significant amount of data (or that data's slow to regenerate). std::map is a balanced binary tree, so in general you can expect lookups in O(log2 N), where N is the current number of elements (i.e. size()).

    You could also use a hash table - that gives O(1), which is going to be faster for huge numbers of elements, but it's impossible to say where the threshold is. Performance still depends on the cost of the hash function you use on your keys, their length (some hash implementations like Microsoft's std::hash only incorporate at most 10 characters spaced along the string being hashed, so there's an upper limit to the time taken but massively more collision potential), hash table collision handling approaches (e.g. displacement lists to search alternative buckets vs. alternative hash functions vs. containers chained from buckets), and the collision proneness itself.
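
    Since the threshold has to be measured, here is a rough timing harness, assuming C++11 <chrono>. time_lookups and make_keys are hypothetical helpers, and the numbers it produces depend on the compiler, the optimisation level and the key distribution:

    ```cpp
    #include <chrono>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Times keys.size() lookups in any container that has find() and end(),
    // returning the elapsed microseconds. `hits` receives the number of keys
    // found, which gives the lookups an observable result.
    template <typename Container>
    long long time_lookups(const Container &c,
                           const std::vector<std::string> &keys,
                           std::size_t &hits)
    {
        hits = 0;
        const auto start = std::chrono::steady_clock::now();
        for (const std::string &k : keys)
        {
            if (c.find(k) != c.end())
                ++hits;
        }
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }

    // Hypothetical key generator: "key0", "key1", ...
    std::vector<std::string> make_keys(std::size_t n)
    {
        std::vector<std::string> keys;
        keys.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
            keys.push_back("key" + std::to_string(i));
        return keys;
    }
    ```

    Fill a std::map and a std::unordered_map with the same keys, call time_lookups on each, and repeat with the key counts and key lengths you actually expect.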

  • 2021-01-05 18:21

    Question 1:

    I use shared pointers in my project extensively, but I wouldn't want to use shared_ptr<T>. It requires a heap object that is allocated separately from T itself (unless the object is created with make_shared, which combines the two allocations), so memory allocation overhead is doubled and memory usage increases by an amount that depends on your runtime library's implementation. intrusive_ptr is more efficient, but there is one key problem that irks me, and that is function calling:

    void Foo(intrusive_ptr<T> x) {...}
    

    Every time you call Foo, the reference count of the parameter x must be incremented with a relatively expensive atomic increment, and then decremented on the way out. But this is redundant, because you can usually assume that the caller already has a reference to x, and that the reference is valid for the duration of the call. There are possible ways that the caller might not already have a reference, but it's not hard to write your code in such a way that the caller's reference is always valid.

    Therefore, I prefer to use my own smart pointer class that is the same as intrusive_ptr except that it converts implicitly to and from T*. Then I always declare my methods to take plain pointers, avoiding unnecessary reference counting:

    void Foo(T* x) {...}
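
    A minimal sketch of such a pointer class, to make the idea concrete. All names here (RefCounted, Ptr, Widget, refs_via_raw, refs_via_smart) are hypothetical and this is not the actual class from my project; the count is a plain non-atomic int for brevity:

    ```cpp
    // Hypothetical base class holding a non-atomic reference count.
    class RefCounted
    {
    public:
        RefCounted() : refs_(0) {}
        virtual ~RefCounted() {}
        void add_ref() { ++refs_; }
        void release() { if (--refs_ == 0) delete this; }
        int refs() const { return refs_; }
    private:
        RefCounted(const RefCounted &);            // non-copyable
        RefCounted &operator=(const RefCounted &);
        int refs_;
    };

    // Hypothetical intrusive pointer, implicitly convertible to and from T*.
    template <typename T>
    class Ptr
    {
    public:
        Ptr(T *p = 0) : p_(p) { if (p_) p_->add_ref(); }
        Ptr(const Ptr &o) : p_(o.p_) { if (p_) p_->add_ref(); }
        ~Ptr() { if (p_) p_->release(); }
        // Copy-and-swap: the by-value parameter releases the old pointee.
        Ptr &operator=(Ptr o) { T *t = p_; p_ = o.p_; o.p_ = t; return *this; }
        T *operator->() const { return p_; }
        operator T *() const { return p_; }
    private:
        T *p_;
    };

    struct Widget : RefCounted
    {
        int value;
        Widget() : value(0) {}
    };

    // Plain pointer parameter: the caller's Ptr keeps the object alive, so
    // the call itself does no reference counting.
    int refs_via_raw(Widget *w) { return w->refs(); }

    // Smart pointer parameter: the copy adds a reference for the call.
    int refs_via_smart(Ptr<Widget> w) { return w->refs(); }
    ```

    Given Ptr<Widget> w(new Widget), refs_via_raw(w) compiles thanks to the implicit conversion and sees a count of 1, whereas refs_via_smart(w) sees 2 while it runs.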
    

    This approach has proven to work well in my project, but to be honest I never actually measured the performance difference it makes.

    Also, prefer to use auto_ptr (C++03) or unique_ptr (C++11) where possible.

    Question 2:

    I don't understand why you are thinking about using a std::map. First of all, hash_map will be faster (as long as it's not the VC++ Dinkumware implementation in VS2008/2010, details in here somewhere), and secondly, if you only need one vector per method, why not use a static variable of type std::vector<std::string>?

    If you have to look up the vector in a hashtable every time the method is called, my guess is that you will save little or no time compared to creating a new vector each time. If you look up the vector in a std::map, it will take even longer.
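
    A sketch of the static-variable approach, assuming the method is neither re-entrant nor called from several threads at once (thread_local would give one vector per thread). count_long_words is a hypothetical stand-in for the method in question:

    ```cpp
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical method body: the scratch vector is a function-local
    // static, so it is constructed once and keeps its buffer between calls.
    std::size_t count_long_words(const std::vector<std::string> &words,
                                 std::size_t min_length)
    {
        static std::vector<std::string> scratch;
        scratch.clear(); // drop the previous call's contents, keep the buffer
        for (const std::string &w : words)
        {
            if (w.size() >= min_length)
                scratch.push_back(w);
        }
        return scratch.size();
    }
    ```

    No lookup is needed at all here: each method simply owns its scratch vector.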
