I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble the message in a new thread for every
You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.
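A minimal sketch of the single-thread-plus-signal approach, assuming a C++11 toolchain. The `SerialSender` class, its `post`/`stop`/`sent` methods, and the `<...>` framing are hypothetical names invented for illustration; a real implementation would write to the serial port where the sketch appends to a vector.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical long-lived sender: one thread is created up front and reused
// for every message instead of spawning a new thread per message.
class SerialSender {
public:
    SerialSender() : worker_([this] { run(); }) {}
    ~SerialSender() { stop(); }

    // Called by producers: enqueue the message, then signal the worker.
    void post(std::string message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(message));
        }
        ready_.notify_one();
    }

    // Ask the worker to finish the queued messages and exit, then wait for it.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

    // Messages processed so far, in order (a stand-in for the serial port).
    std::vector<std::string> sent() {
        std::lock_guard<std::mutex> lock(mutex_);
        return sent_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Sleep until a message is available or shutdown is requested.
            ready_.wait(lock, [this] { return done_ || !pending_.empty(); });
            if (pending_.empty()) return;  // only reachable once done_ is set
            std::string message = std::move(pending_.front());
            pending_.pop();
            lock.unlock();
            // Package/assemble the message without holding the lock.
            std::string framed = "<" + message + ">";
            lock.lock();
            sent_.push_back(std::move(framed));
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> pending_;
    std::vector<std::string> sent_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

A caller would construct one `SerialSender` at startup, call `post()` whenever a message is ready, and call `stop()` (or let the destructor run) at shutdown. A pool would follow the same pattern with several workers waiting on the same condition variable.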
In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive. Somewhere on the order of tens of microseconds, to be specific. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
For comparison, take a look at OS X:
Kernel data structures: approximately 1 KB
Stack space: 512 KB (secondary threads), 8 MB (OS X main thread), 1 MB (iOS main thread)
Creation time: approximately 90 microseconds
I would guess that POSIX thread creation is in the same ballpark.
On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.
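For anyone who wants to repeat that kind of casual comparison, here is a rough sketch for a POSIX system. The helper names `average_ns`, `ns_per_thread`, and `ns_per_open` are made up for this example, and it times create-plus-join and open-plus-close pairs (so that threads and descriptors do not pile up), which is an assumption and not necessarily what was measured above.

```cpp
#include <chrono>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

// Thread body that does nothing, so only creation/teardown cost is measured.
static void* noop(void*) { return nullptr; }

// Average wall-clock nanoseconds per call of op().
template <typename Op>
double average_ns(int iterations, Op op) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) op();
    const std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

// Cost of one pthread_create followed by pthread_join.
double ns_per_thread(int iterations) {
    return average_ns(iterations, [] {
        pthread_t thread;
        if (pthread_create(&thread, nullptr, noop, nullptr) == 0)
            pthread_join(thread, nullptr);
    });
}

// Cost of one open("/dev/null") followed by close.
double ns_per_open(int iterations) {
    return average_ns(iterations, [] {
        const int fd = open("/dev/null", O_RDWR);
        if (fd >= 0) close(fd);
    });
}
```

Printing `ns_per_thread(100000) / ns_per_open(100000)` gives a ratio to compare against the "about twice" figure; the result will vary with the kernel, libc, and machine load.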
To resurrect this old thread, I just did some simple test code:
#include <thread>

int main(int argc, char** argv)
{
    for (volatile int i = 0; i < 500000; i++)
        std::thread([](){}).detach();
    return 0;
}
I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18), heavily loaded (doing a database rebuild), slow laptop (Intel Core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread on this machine (roughly 5.6 s / 500,000 threads ≈ 11 microseconds), probably much less on yours.
That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Now, of course there are various additional costs that threads can incur, involving passed/captured arguments (although function calls themselves can impose some), cache slowdowns between cores (if multiple threads on different cores are battling over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and it could provide benefits, depending), despite your having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.
Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".
Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads, since one of the biggest slowdowns with threads is synchronizing cached memory used by multiple threads at the same time, and thread pools, by their very nature of having to look for and process updates from a different thread, have to do this. So either your primary thread or your child processing thread can get stuck waiting if the processor isn't sure whether the other thread has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling task once (when it's launched) and then they never interfere with each other again.
I used the above "terrible" design in a VOIP app I made. It worked very well ... absolutely no latency or missed/dropped packets for locally connected computers. Each time a data packet arrived, a thread was created and handed that data to process and send on to the output devices. Of course the packets were large, so it caused no bottleneck. Meanwhile the main thread could loop back to wait for and receive another incoming packet.
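A rough sketch of that thread-per-packet shape, not the actual VOIP code. `Packet`, `process_packet`, and `handle_packets` are hypothetical names, the byte sum is a placeholder for real decoding/output work, and the threads are joined at the end only so the sketch can return results (a long-running receiver might detach them instead).

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

using Packet = std::vector<unsigned char>;

// Hypothetical per-packet work: a byte sum stands in for decode/playback.
unsigned long process_packet(const Packet& packet) {
    return std::accumulate(packet.begin(), packet.end(), 0UL);
}

// Launch one thread per packet; the loop moves on to the next packet
// immediately instead of waiting for the processing to finish.
std::vector<unsigned long> handle_packets(std::vector<Packet> packets) {
    std::vector<unsigned long> results(packets.size());
    std::vector<std::thread> threads;
    threads.reserve(packets.size());
    for (std::size_t i = 0; i < packets.size(); ++i) {
        // The packet is moved into the new thread, so the only memory shared
        // after launch is the single result slot that thread writes once.
        threads.emplace_back(
            [&results, i](Packet packet) { results[i] = process_packet(packet); },
            std::move(packets[i]));
    }
    // Joining here is an assumption made to collect the results.
    for (std::thread& t : threads) t.join();
    return results;
}
```

Because each thread owns its packet, no locking is needed while processing; the cost is one thread launch per packet, which matters less the larger the packets are.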
I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads, it's possible that the packets may be processed 'out of order'. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that thread might encounter a problem and terminate, leaving no threads to process any data.
Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also, to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Synchronizing data hand-off and processing, as well as detecting errors, may add just as much overhead as, if not more than, simply creating a new thread. It all depends on your design and requirements.
It is indeed very system dependent. I tested @Nafnlaus's code:
#include <thread>

int main(int argc, char** argv)
{
    for (volatile int i = 0; i < 500000; i++)
        std::thread([](){}).detach();
    return 0;
}
On my desktop (Ryzen 5 2600):
Windows 10, compiled with MSVC 2019 (release build), with std::chrono calls added around the loop to time it. Idle (only Firefox with 217 tabs):
It took around 20 seconds (20.274, 19.910, 20.608) (also ~20 seconds with Firefox closed)
Ubuntu 18.04 compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
It took around 5 seconds (5.595, 5.230, 5.297)
The same code on my Raspberry Pi 3B, compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
took around 15 seconds (16.225, 14.689, 16.235)
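For platforms without a `time` command, the loop can be wrapped in std::chrono calls as was done for the Windows run. The exact timing code used there is not shown, so this is only a sketch of one way to do it; the function name `microseconds_per_thread` is made up for this example.

```cpp
#include <chrono>
#include <thread>

// Launch `count` empty detached threads and return the average
// wall-clock cost per launch in microseconds.
double microseconds_per_thread(int count) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < count; ++i)
        std::thread([] {}).detach();
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / count;
}
```

With `count` set to 500000, the totals above correspond to roughly 40 microseconds per thread on Windows 10, about 11 on Ubuntu, and about 30 on the Raspberry Pi 3B.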