Short question but for me its difficult to understand.
Why exactly does ePoll scale better than Poll?
While Damon's reason is correct for the unusual case where you never block on a socket, in typical real-world programs, the reason is completely different. A typical program looks like this:
1) Do all the work we can do now.
2) Check if any network connections need service, blocking if there's nothing to do.
3) Service any network connections discovered.
4) Go to step 1.
Typically, because you just did all the work you can do, when you come around to step 2, there is no work for you to do. So you'll have to wait a bit. Now, imagine there are 800 sockets you are interested in. The kernel has to put on the wait queue for each of those 800 sockets. And, a split-second later when data arrives on one of those 800 sockets, the kernel has to remove you from those 800 wait queues. Placing a task on a wait queue requires creating a 'thunk' to link that task to that wait queue. No good optimizations are possible because the kernel has no idea which 800 sockets you'll be waiting for.
With epoll
, the epoll socket itself has a wait queue, and the process is only put on that one single wait queue. A thunk is needed to link each of the 800 connections to the epoll wait queue, but that thunk is persistent. You create it by adding a socket to an epoll set, and it remains there until you remove the socket from the set.
When there's activity on the socket, the kernel handles it in the task that detects the activity. When you wait, the kernel already knows if there's a detected event and the kernel only has to put you on that one wait queue. When you wake, it only has to remove you from that one queue.
So it's not so much the copying that's the killer with select
or poll
, it's the fact that the kernel has to manipulate a massive number of wait queues on each blocking operation.