Question
I'm trying to write a comm module in C that can handle connected or unconnected sockets completely transparently. Being lazy, I imagined I could bind()/connect() UDP sockets to get "one socket per client" using UDP and the send()/recv() primitives.
The scheme is simple: I have a "server socket" bound on *:PORT with SO_REUSEPORT, on which I recvfrom().
From there, I create a new socket with the SO_REUSEPORT socket option and use the 'from' parameter info to bind() to *:PORT and connect() to my new client.
I maintain that list of clients (aka sockets) and can send() to them without any problems. recv()'ing is another thing... I would have imagined that the UDP "fanout" would take the connected end of the socket into account when deciding which socket to deliver the packet to. Most of the time it does work, and the "proper" socket receives the data. But from time to time, the unconnected "server" socket receives the data instead.
I see in the kernel code:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/udp.c#n420
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/sock_reuseport.c#n265
(I couldn't find a working LXR at the time of posting, sorry for the "non-browsable" links.)
I'm a bit lost at this reciprocal_scale() thing when there's no BPF program...
First question: am I mistaken when I say that reuseport_select_sock() selects a socket based on the 4-tuple hash (local address/port + foreign address/port) passed as a parameter, but completely disregards the 4-tuple of the connected sockets? The idea seems to be to consistently select a socket based on the packet's 4-tuple hash, which does not necessarily match the socket connected to that endpoint?
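For reference, reciprocal_scale() on its own is just a multiply-and-shift that maps a 32-bit value onto the range [0, ep_ro) without a division. Below is a minimal userspace sketch of that arithmetic, assuming the kernel's definition is the usual ((u64)val * ep_ro) >> 32. The name reciprocal_scale_model() is made up for this illustration, and this is not the kernel source.

```c
#include <stdint.h>

/* Userspace model (assumption: mirrors the kernel helper's arithmetic).
 * Scales a 32-bit value into [0, ep_ro) using a 64-bit multiply. */
static uint32_t reciprocal_scale_model(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}
```

With a group of 4 sockets, a hash of 0x80000000 gives index 2: the result depends only on the hash and the group size.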
Second question: I'm trying to wrap my head around writing an eBPF program in order to "fix" that behavior (because it would make my userland code so much simpler). There seem to be two helper functions that could get me close to what I want to achieve:
int bpf_sk_select_reuseport()
struct bpf_sock *bpf_sk_lookup_udp()
bpf_sk_lookup_udp() works perfectly for what I want to do: it does find the proper socket, based on the 4-tuple, in the list of reuseport socks.
The problem is that bpf_sk_select_reuseport() wants an index in the reuse->socks[] array to update the selected_sk field, whereas the lookup returns a pointer to a bpf_sock... and I have no idea how to extract that socket's index in the reuse->socks[] array from it.
Sorry if this was a bit too long. Even if I can't get things working the way I want, I hope it can be informative to others.
Edit: So, after a few days of toying with eBPF, I came to the realization that there's a lot in the way of doing what I wanted in the first place:
- declaring a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY requires CAP_SYS_ADMIN, which kinda kills the idea...
- even with CAP_SYS_ADMIN, it's impossible to look up elements in the REUSEPORT_SOCKARRAY...
- and even with the ability to look up elements, eBPF doesn't permit loops, as jumps with negative offsets are prohibited (comments in the verifier explicitly say the code should be a DAG).
A solution using eBPF would be to use a hash table associating the 5-tuple of the packet with the index at which its socket is stored in the REUSEPORT_SOCKARRAY. But since we have no way to iterate over the REUSEPORT_SOCKARRAY, we would have to rely on "guesstimating" that index by re-implementing the algorithm used in the kernel:
- sockets are stored in order of creation in the REUSEPORT_SOCKARRAY, numbered 0 to N-1.
- when socket A is closed, the last socket in the array (index N-1) takes its place, and the array is shrunk by one element.
The eBPF program would only have to look up the hash table and return the associated index... but that seems too fragile, as it depends on the kernel keeping the same ordering algorithm forever, which is quite unlikely.
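The bookkeeping userland would have to mirror can be sketched as follows. This is only a model of the ordering described above (append on creation, last element moves into the freed slot on close), not kernel code. The struct and function names are made up for this illustration.

```c
#include <stddef.h>

#define MODEL_MAX_SOCKS 16

/* Hypothetical userspace mirror of the socket ordering described above. */
struct socks_model {
    int fds[MODEL_MAX_SOCKS];
    size_t num;
};

/* Append a socket; returns the index it was stored at, or -1 if full. */
static int socks_model_add(struct socks_model *m, int fd)
{
    if (m->num >= MODEL_MAX_SOCKS)
        return -1;
    m->fds[m->num] = fd;
    return (int)m->num++;
}

/* Remove the socket at idx: the last socket takes its place and the
 * array shrinks by one. Returns 0 on success, -1 if idx is out of range. */
static int socks_model_remove(struct socks_model *m, size_t idx)
{
    if (idx >= m->num)
        return -1;
    m->fds[idx] = m->fds[m->num - 1];
    m->num--;
    return 0;
}
```

After each removal, the hash table entry of the socket that was moved would have to be updated to its new index, which is exactly the coupling to kernel internals that makes the approach fragile.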
Source: https://stackoverflow.com/questions/56683832/emulating-tcp-using-bind-connect-on-udp-sockets-with-so-reuseport