Are stackless C++20 coroutines a problem?

后端未结

关注

 3  542

Happy的楠姐 2020-12-23 13:23

Based on the following it looks like coroutines in C++20 will be stackless.

https://en.cppreference.com/w/cpp/language/coroutines

I\'m concerned for many re

3条回答

有刺的猬 (楼主)

2020-12-23 13:45
I use stackless coroutines on small, hard-realtime ARM Cortex-M0 targets, with 32kb of RAM, where there's no heap allocator present at all: all memory is statically preallocated. The stackless coroutines are a make-or-break, and stackful coroutines that I had previously used were a pain to get right, and were essentially a hack wholly based on implementation-specific behavior. Going from that mess to standards-compliant, portable C++, was wonderful. I shudder to think that someone might suggest going back.
- Stackless coroutines don't imply heap use: you have full control over how the coroutine frame is allocated (via void * operator new(size_t) member in promise type).
- co_await can be nested just fine, in fact it's a common use case.
- Stackful coroutines have to allocate those stacks somewhere as well, and it's perhaps ironic that they can't use the thread's primary stack for that. These stacks are allocated on the heap, perhaps via a pool allocator that gets a block from heap and then subdivides it.
- Stackless coroutine implementations can elide frame allocation, such that the promise's operator new is not called at all, whereas stackful coroutines always allocate the stack for the coroutine, whether needed or not, because the compiler can't help the coroutine runtime with eliding it (at least not in C/C++).
- The allocations can be elided precisely by using the stack where the compiler can prove that the life of the coroutine doesn't leave the scope of the caller. And that's the only way you can use alloca. So, the compiler already takes care of it for you. How cool is that!
  
  Now, there's no requirement that the compilers actually do this elision, but AFAIK all implementations out there do this, with some sane limits on how complex that "proof" can be - in some cases it's not a decidable problem (IIRC). Plus, it's easy to check whether the compiler did as you expected: if you know that all coroutines with a particular promise type are nested-only (reasonable in small embedded projects but not only!), you can declare operator new in the promise type but not define it, and then the code won't link if the compiler "goofed up".
  
  A pragma could be added to a particular compiler implementation to declare that a particular coroutine frame doesn't escape even if the compiler isn't clever enough to prove it - I didn't check if anyone bothered to write these yet, because my use cases are reasonable enough that the compiler always does the right thing.
  
  Memory allocated with alloca cannot be used after you return from the caller. The use case for alloca, in practice, is to be a slightly more portable way of expressing gcc's variable-size automatic array extension.
In essentially all implementations of stackful coroutines in C-like lanaguages, the one and only supposed "benefit" of stackfull-ness is that the frame is accessed using the usual base-pointer-relative addressing, and push and pop where appropriate, so "plain" C code can run on this made-up stack, with no changes to code generator. No benchmarks support this mode of thinking, though, if you have lots of coroutines active - it's a fine strategy if there's a limited number of them, and you have the memory to waste to start with.

Stack has to be overallocated, decreasing locality of reference: a typical stackful coroutine uses a full page for the stack at the minimum, and the cost of making this page available is not shared with anything else: the single coroutine has to bear it all. That's why it was worthwhile to develop stackless python for multiplayer game servers.

If there's a couple of couroutines only - no problem. If you've got thousands of network requests all handled by stackful coroutines, with a light networking stack that doesn't impose overhead that monopolizes the performance, the performance counters for cache misses will make you cry. As Nicol has stated in the other answer, this becomes somewhat less relevant the more layers there are between the coroutine and whatever asynchronous operation it's handling.

It has been long since any 32+-bit CPU had performance benefits inherent to memory access via any particular addressing mode. What matters is cache-friendly access patterns and leveraging prefetch, branch prediction and speculative execution. Paged memory and its backing store are just two further levels of cache (L4 and L5 on desktop CPUs).
1. Why would C++ choose to use stackless coroutines? Because they perform better, and no worse. On the performance side, there can be only benefits to them. So it's a no-brainer, performance-wise, to just use them.
2. Can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation. No. It'd be a solution to a nonexistent problem. Stackful coroutines don't actually allocate on the existing stack: they create new stacks, and those are allocated on the heap by default, just as C++ coroutine frames would be (by default).
3. Are my assumptions about the c++ coroutines wrong, why? See above.
4. More verbose code because of the need for custom allocators and memory pooling. If you want stackful coroutines to perform well, you'll be doing the same thing to manage the memory areas for the stacks, and it turns out that it's even harder. You need to minimize memory waste, and thus you need to minimally overallocate the stack for the 99.9% use case, and deal somehow with coroutines that exhaust this stack.
  
  One way I have dealt with it in C++ was by doing stack checks in branch points where code analysis indicates more stack may be needed, then if the stack would overflow, an exception was thrown, the coroutine's work undone (the design of the system had to support it!), and then the work restarted with more stack. It's an easy way to quickly lose benefits of tightly packed stack-fuls. Oh, and I had to provide my own __cxa_allocate_exception for that to work. Fun, eh?
One more anecdote: I'm playing with using coroutines inside Windows kernel-mode drivers, and there the stacklessness does matter - to the extent that if the hardware allows, you can allocate the packet buffer and the coroutine's frame together, and these pages are pinned when they are submitted to the network hardware for execution. When the interrupt handler resumes the coroutine, the page is there, and if the network card allows, it could even prefetch it for you so it'll be in the cache. So that works well - it's just one use case, but since you wanted embedded - I've got embedded :).

It's perhaps not common to think of drivers on desktop platforms as "embedded" code, but I see lots of similarities, and an embedded mindset is needed. Last thing you want is kernel code that allocates too much, especially if it would add per-thread overhead. A typical desktop PC has a few thousand threads present, and a lot of them are there to handle I/O. Now imagine a diskless system that uses iSCSI storage. On such a system, anything I/O bound that's not bound to USB or GPU will be bound to the network hardware and the networking stack.

Finally: Trust benchmarks, not me, and read Nicol's answer too!. My perspective is shaped by my use cases - I can generalize, but I claim no first-hand experience with coroutines in "generalist" code where performance is of less concern. Heap allocations for stackless coroutines are very often hardly noticeable in performance traces. In general-purpose application code, it's rarely going to be a problem. It does get "interesting" in library code, and some patterns have to be developed to allow the library user to customize this behavior. These patterns will be found and popularized as more libraries use C++ coroutines.
0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...