C++ weak_ptr creation performance

醉话见心 2021-02-10 04:38

I've read that creating or copying a std::shared_ptr involves some overhead (atomic increment of the reference counter, etc.).

But what about creating a std::weak_ptr from a std::shared_ptr?

2 Answers
  •  情深已故
    2021-02-10 05:11

    This is from my days with game engines

    The story goes:

    We need a fast shared pointer implementation, one that doesn't thrash the cache (caches are smarter now btw)

    A normal pointer:

    XXXXXXXXXXXX....
    ^--pointer to data
    

    Our shared pointer:

    iiiiXXXXXXXXXXXXXXXXX...
    ^   ^---pointer stored in shared pointer
    |
    +---the start of the allocation, the allocation is sizeof(unsigned int)+sizeof(T)
    

    The unsigned int* used for the count is at ((unsigned int*)ptr)-1

    that way a "shared pointer" is pointer-sized, and the data it contains is the pointer to the actual data. So (because templates imply inlining, and any compiler would inline an operator returning a data member) it had the same "overhead" for access as a normal pointer.

    Creation of pointers took like 3 more CPU instructions than normal (the read of location -4 is one operation, the add of 1 another, and the write back to location -4 a third).
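    A minimal sketch of the scheme described so far, assuming the simple layout above — all names are invented here, not the engine's actual code. One allocation holds an unsigned count immediately followed by the T, the handle stores only the pointer to the T, and the count lives at ((unsigned*)ptr)-1:

```cpp
#include <new>      // placement new, ::operator new/delete
#include <utility>  // std::forward

template <typename T>
class shared_pointer {
    T* ptr_ = nullptr;  // points at the T; the count sits just before it

    unsigned* count() const { return reinterpret_cast<unsigned*>(ptr_) - 1; }

public:
    template <typename... Args>
    static shared_pointer make(Args&&... args) {
        // One block: the count, then the T. (A real version must pad the
        // count so the T lands on alignof(T) -- the int[2] trick mentioned
        // later in this answer.)
        void* raw = ::operator new(sizeof(unsigned) + sizeof(T));
        unsigned* cnt = static_cast<unsigned*>(raw);
        *cnt = 1;
        shared_pointer p;
        p.ptr_ = new (cnt + 1) T(std::forward<Args>(args)...);  // placement new
        return p;
    }

    shared_pointer() = default;
    shared_pointer(const shared_pointer& o) : ptr_(o.ptr_) {
        if (ptr_) ++*count();  // the extra work: load, add, store at ptr - 4
    }
    shared_pointer& operator=(const shared_pointer&) = delete;  // omitted for brevity
    ~shared_pointer() {
        if (ptr_ && --*count() == 0) {  // destructor just decrements
            ptr_->~T();
            ::operator delete(count());
        }
    }

    T& operator*() const { return *ptr_; }   // same cost as a raw pointer
    T* operator->() const { return ptr_; }
};
```

    The handle really is one pointer wide, and dereferencing is a plain member access, which is the whole point.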

    Now we'd only use weak-pointers when we were debugging (so we'd compile with DEBUG defined (a macro definition)), because then we'd like to see all allocations and what's going on and such. It was useful.

    The weak-pointers must know when what they point to is gone, NOT keep the thing they point to alive (in my case, if the weak pointer kept the allocation alive the engine would never get to recycle or free any memory; it'd basically be a shared pointer anyway).

    So each weak-pointer has a bool, alive or something, and is a friend of shared_pointer

    When debugging our allocation looked like this:

    vvvvvvvviiiiXXXXXXXXXXXXX.....
    ^       ^   ^ the pointer we stored (to the data)
    |       +that pointer -4 bytes = ref counter
    +Initial allocation now
        sizeof(linked_list*)+sizeof(unsigned int)+sizeof(T)
    

    The linked list structure you use depends on what you care about, we wanted to stay as close to sizeof(T) as we could (we managed memory using the buddy algorithm) so we stored a pointer to the weak_pointer and used the xor trick.... good times.
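    The xor trick referenced above is the classic xor-linked list: store prev XOR next in one pointer-sized field, halving the per-node pointer overhead at the cost of needing the previous node's address to traverse. A standalone sketch (nothing here is from the engine's code):

```cpp
#include <cstdint>

// An xor-linked list node: one field encodes both neighbours.
struct node {
    int value;
    std::uintptr_t link;  // addr(prev) ^ addr(next); 0 stands in for null
};

std::uintptr_t addr(node* n) { return reinterpret_cast<std::uintptr_t>(n); }

// Walk forward from `cur`, given the node we arrived from.
node* next(node* prev, node* cur) {
    return reinterpret_cast<node*>(cur->link ^ addr(prev));
}
```

    Traversal works because XOR-ing the stored field with the address you came from cancels it out, leaving the other neighbour.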

    Anyway: the weak pointers to something shared_pointers point to are put in a list, stored somehow in the "v"s above.

    When the reference count hits zero, you go through that list (which is a list of pointers to the actual weak_pointers; they remove themselves when deleted, obviously) and set alive=false (or something) on each weak_pointer.

    The weak_pointers now know that what they point to is no longer there (and so throw when de-referenced).
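    A stripped-down sketch of that debug-build bookkeeping — invented names, and the real layout (the xor-linked list stored in the "v"s in front of the count) is elided in favour of a plain singly linked list:

```cpp
// Each allocation carries a list head of registered weak pointers; when the
// strong count hits zero, the list is walked and every weak pointer flagged dead.
struct weak_pointer;

struct control {
    unsigned count = 1;
    weak_pointer* weak_head = nullptr;  // the list stored in the "v"s above
};

struct weak_pointer {
    control* ctl;
    weak_pointer* next;
    bool alive;

    explicit weak_pointer(control* c)
        : ctl(c), next(c->weak_head), alive(true) {
        c->weak_head = this;  // register ourselves on the allocation's list
    }
    // Unregistration on destruction omitted for brevity.
};

// Called when a strong reference goes away.
void release(control* c) {
    if (--c->count == 0) {
        for (weak_pointer* w = c->weak_head; w; w = w->next)
            w->alive = false;  // dereferencing these now throws
        // ...free the object's memory here...
    }
}
```

    The key property is the one stated above: the weak pointers learn the object is gone without ever keeping it alive.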

    In this example

    There is no overhead: the alignment was 4 bytes on that system. 64-bit systems tend to like 8-byte alignment, so in that case union the ref-counter with an int[2] to pad it out. Remember this involves placement new (nobody downvote because I mentioned it :P) and such; you need to make sure the struct you impose on the allocation matches what you actually allocated and constructed. Compilers can align stuff for themselves (hence int[2], not int,int).

    You can de-reference the shared_pointer with no overhead at all.

    New shared pointers being made do not thrash the cache at all and require 3 extra CPU instructions. They're not very pipeline-able, but the compiler will always inline the getters and setters (well, probably always :P), and there'll be something around the call site that can fill the pipeline.

    The destructor of a shared pointer also does very little (a decrement, that's it), so it's great!

    High performance note

    If you have a situation like:

    void f() {
       shared_pointer ptr;
       g(ptr);   // passed by value: copy increments, destruction decrements
    }
    

    There's no guarantee that the optimiser will dare to elide the add and subtract that come from passing the shared_pointer "by value" to g.

    This is where you'd use a normal reference (which is implemented as a pointer)

    so you'd do g(ptr.extract_reference()); instead - again the compiler will inline the simple getter.

    Now you have a T&; because ptr's scope entirely surrounds the call to g (assuming g has no funny side-effects and so forth), that reference will be valid for the duration of g.

    Deleting through a reference is very ugly, and you probably couldn't do it by accident (we relied on this fact).
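    The trick can be sketched like this — extract_reference is the answer's own invented name, and the stand-in types here are mine:

```cpp
// Minimal stand-in for the engine's handle: the getter hands out a plain
// reference, so no count traffic happens at the call boundary.
template <typename T>
struct shared_pointer {
    T* ptr;
    T& extract_reference() const { return *ptr; }  // trivially inlined getter
};

// g takes a plain reference: no increment on entry, no decrement on exit.
int g(int& value) { return value * 2; }

int f() {
    int payload = 21;                      // stands in for a real allocation
    shared_pointer<int> ptr{&payload};
    // g(ptr) by value would add/sub the count; this avoids it entirely,
    // and ptr outlives the call, so the reference stays valid.
    return g(ptr.extract_reference());
}
```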

    In hindsight

    I should have created a type called "extracted_pointer" or something; it'd be really hard to type that by mistake for a class member.

    The weak/shared pointers used by stdlib++

    http://gcc.gnu.org/onlinedocs/libstdc++/manual/shared_ptr.html

    Not as fast...

    But don't worry about the odd cache miss unless you're making a game engine that isn't running a decent workload > 120fps easily :P Still miles better than Java.

    The stdlib way is nicer. Each object has its own allocation and job. With our shared_pointer it was a true case of "trust me, it works, try not to worry about how" (not that it is hard), because the code looked really messy.

    If you undid the ... whatever they've done to the names of variables in their implementation, it'd be far easier to read. See Boost's implementation, as it says in that document.

    Other than the variable names, the GCC stdlib implementation is lovely. You can read it easily, and it does its job properly (following OO principles), but it is a little slower and MAY thrash the cache on crappy chips these days.

    UBER high performance note

    You may be thinking: why not have XXXX...XXXXiiii (the reference count at the end)? Then you'd get the alignment that's best for the allocator!

    Answer:

    Because computing pointer+sizeof(T) may not be one CPU instruction! (Subtracting 4 or 8 is something a CPU can do easily, simply because it makes sense; it'll be doing that a lot.)
