runsc: PacketMMAP dispatch mode is buggy. #210
Comments
This CL fixes the following bugs:
- Uses atomics to set/read status instead of binary.LittleEndian.PutUint32 etc., which are not atomic.
- Increments ringOffsets for frames that are truncated (i.e. status is tpStatusCopy).
- Does not ignore frames with the tpStatusLost bit set, as they are valid frames; the bit only indicates that some frames were lost before this one, and metrics can be retrieved with a getsockopt call.
- Adds checks to make sure blockSize is a multiple of the page size. This is required as the kernel allocates each block in whole pages and rejects sizes that are not page aligned with EINVAL.

Updates #210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
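For illustration, here is a minimal sketch of the atomic status handling that change describes. This is not the actual gVisor code; treating tp_status as a uint32 (as in the v2 frame header) and the helper names are assumptions.

```go
package dispatch

import (
	"sync/atomic"
	"unsafe"
)

// Status bits from <linux/if_packet.h>.
const (
	tpStatusKernel = 0      // frame owned by the kernel
	tpStatusUser   = 1 << 0 // frame ready for user space
	tpStatusCopy   = 1 << 1 // frame was truncated
	tpStatusLost   = 1 << 2 // packets were dropped before this frame
)

// frameStatus reads the tp_status word at the start of a ring frame. The
// kernel updates this word from another thread, so it must be read with an
// atomic load; decoding it with binary.LittleEndian on the raw bytes gives
// no atomicity or ordering guarantees.
func frameStatus(frame []byte) uint32 {
	return atomic.LoadUint32((*uint32)(unsafe.Pointer(&frame[0])))
}

// releaseFrame hands a consumed frame back to the kernel.
func releaseFrame(frame []byte) {
	atomic.StoreUint32((*uint32)(unsafe.Pointer(&frame[0])), tpStatusKernel)
}
```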
PacketMMap mode has issues due to a kernel bug. This change reverts us to using recvmmsg instead of a shared ring buffer to dispatch inbound packets. This will reduce performance but should be more stable under heavy load till PacketMMap is updated to use TPacketv3. See #210 for details.

Perf difference between recvmmsg vs packetmmap.

RecvMMsg:
iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[ 4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201
[ ID] Interval        Transfer     Bandwidth       Retr   Cwnd
[ 4]  0.00-1.00  sec  778 MBytes   6.53 Gbits/sec  4349   188 KBytes
[ 4]  1.00-2.00  sec  786 MBytes   6.59 Gbits/sec  4395   212 KBytes
[ 4]  2.00-3.00  sec  756 MBytes   6.34 Gbits/sec  3655   161 KBytes
[ 4]  3.00-4.00  sec  782 MBytes   6.56 Gbits/sec  4419   175 KBytes
[ 4]  4.00-5.00  sec  755 MBytes   6.34 Gbits/sec  4317   187 KBytes
[ 4]  5.00-6.00  sec  774 MBytes   6.49 Gbits/sec  4002   173 KBytes
[ 4]  6.00-7.00  sec  737 MBytes   6.18 Gbits/sec  3904   191 KBytes
[ 4]  7.00-8.00  sec  530 MBytes   4.44 Gbits/sec  3318   189 KBytes
[ 4]  8.00-9.00  sec  487 MBytes   4.09 Gbits/sec  2627   188 KBytes
[ 4]  9.00-10.00 sec  770 MBytes   6.46 Gbits/sec  4221   170 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer     Bandwidth       Retr
[ 4]  0.00-10.00 sec  6.99 GBytes  6.00 Gbits/sec  39207  sender
[ 4]  0.00-10.00 sec  6.99 GBytes  6.00 Gbits/sec         receiver
iperf Done.

PacketMMap:
bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[ 4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201
[ ID] Interval        Transfer     Bandwidth       Retr   Cwnd
[ 4]  0.00-1.00  sec  657 MBytes   5.51 Gbits/sec  0      1.01 MBytes
[ 4]  1.00-2.00  sec  1021 MBytes  8.56 Gbits/sec  0      1.01 MBytes
[ 4]  2.00-3.00  sec  1.21 GBytes  10.4 Gbits/sec  45     1.01 MBytes
[ 4]  3.00-4.00  sec  1018 MBytes  8.54 Gbits/sec  15     1.01 MBytes
[ 4]  4.00-5.00  sec  1.28 GBytes  11.0 Gbits/sec  45     1.01 MBytes
[ 4]  5.00-6.00  sec  1.38 GBytes  11.9 Gbits/sec  0      1.01 MBytes
[ 4]  6.00-7.00  sec  1.34 GBytes  11.5 Gbits/sec  45     856 KBytes
[ 4]  7.00-8.00  sec  1.23 GBytes  10.5 Gbits/sec  0      901 KBytes
[ 4]  8.00-9.00  sec  1010 MBytes  8.48 Gbits/sec  0      923 KBytes
[ 4]  9.00-10.00 sec  1.39 GBytes  11.9 Gbits/sec  0      960 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer     Bandwidth       Retr
[ 4]  0.00-10.00 sec  11.4 GBytes  9.83 Gbits/sec  150    sender
[ 4]  0.00-10.00 sec  11.4 GBytes  9.83 Gbits/sec         receiver

Updates #210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
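For context, a recvmmsg-based dispatcher has roughly the shape sketched below. This is only an illustrative sketch, not the gVisor implementation; the hand-rolled mmsghdr layout (linux/amd64 assumed) and the batch size of 8 are assumptions.

```go
package dispatch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// mmsghdr mirrors struct mmsghdr from the Linux ABI (layout assumed for
// linux/amd64).
type mmsghdr struct {
	hdr unix.Msghdr
	len uint32
	_   [4]byte
}

const batchSize = 8

// recvBatch reads up to batchSize packets with a single recvmmsg(2) call and
// returns how many messages were filled in. views must hold at least
// batchSize pre-allocated buffers.
func recvBatch(fd int, views [][]byte) (int, error) {
	iovecs := make([]unix.Iovec, batchSize)
	msgs := make([]mmsghdr, batchSize)
	for i := range msgs {
		iovecs[i].Base = &views[i][0]
		iovecs[i].SetLen(len(views[i]))
		msgs[i].hdr.Iov = &iovecs[i]
		msgs[i].hdr.Iovlen = 1
	}
	n, _, errno := unix.Syscall6(unix.SYS_RECVMMSG,
		uintptr(fd), uintptr(unsafe.Pointer(&msgs[0])),
		uintptr(batchSize), 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(n), nil
}
```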
Just an update. I implemented a tpacket_v3 based dispatcher; while it does not suffer from the bug in tpacket v1/v2 and achieves pretty good throughput on a loaded network, it does suffer from latency issues when the network is not busy. This is because each block is filled one at a time by the kernel, and skb->data_ready() is only invoked when the block is full or the block timeout expires. The lowest resolution for the block timeout is 1ms, which is far too high for high speed networking and introduces a large amount of delay. A simple ping of the host from gVisor can see latencies go from 0.4ms to 7ms. TPACKET_V3 is not usable unless a finer-resolution timer is added. The only path forward for PACKET_RX_RING will be to wait for a fix to the kernel bug. I believe in the short term we should stick with recvmmsg. I am going to downgrade the priority of this bug, as we don't really have a path forward at the moment to use PACKET_RX_RING.
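To make the latency point concrete, here is a hedged sketch of configuring a TPACKET_V3 RX ring. The ring geometry is invented, but it shows where the block retire timeout (in milliseconds) comes in; its 1ms floor is the source of the delay described above.

```go
package dispatch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// tpacketReq3 mirrors struct tpacket_req3 from <linux/if_packet.h>.
type tpacketReq3 struct {
	blockSize      uint32
	blockNr        uint32
	frameSize      uint32
	frameNr        uint32
	retireBlkTov   uint32 // block retire timeout, in milliseconds
	sizeofPriv     uint32
	featureReqWord uint32
}

// Values from <linux/if_packet.h>.
const (
	packetVersion = 10 // PACKET_VERSION
	packetRxRing  = 5  // PACKET_RX_RING
	tpacketV3     = 2  // TPACKET_V3
)

// Illustrative ring geometry; real sizes would be tuned.
const (
	blockSize = 1 << 16 // must be a multiple of the page size
	frameSize = 1 << 11
	blockNr   = 64
	frameNr   = (blockSize / frameSize) * blockNr
)

// setupRing switches an AF_PACKET socket to TPACKET_V3 and installs an RX
// ring (the mmap of the ring itself is omitted). The kernel only wakes the
// reader when a block fills up or retireBlkTov milliseconds pass, and 1ms
// is the smallest usable timeout, hence the extra latency on quiet links.
func setupRing(fd int) error {
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, packetVersion, tpacketV3); err != nil {
		return err
	}
	req := tpacketReq3{
		blockSize:    blockSize,
		blockNr:      blockNr,
		frameSize:    frameSize,
		frameNr:      frameNr,
		retireBlkTov: 1, // 1ms floor
	}
	if _, _, errno := unix.Syscall6(unix.SYS_SETSOCKOPT, uintptr(fd),
		unix.SOL_PACKET, packetRxRing,
		uintptr(unsafe.Pointer(&req)), unsafe.Sizeof(req), 0); errno != 0 {
		return errno
	}
	return nil
}
```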
@hbhasker I keep coming back to this.
Do you happen to know when exactly this problem might happen with PACKET_MMAP v1/v2? Is it on RX or TX?
The bug itself is fixed in the latest kernels. I never implemented the TX part, so this happens on RX. That said, @kevinGC has already implemented AF_XDP based RX/TX and I would just use that instead of this. See https://github.com/google/gvisor/blob/master/pkg/tcpip/link/xdp/endpoint.go
Interesting. I'm currently investigating vhost-net, and this might be an option as well. It allows very fast tap device reads/writes; however, it undeniably exposes a larger host attack surface.
Vhost-net would be an interesting alternative to AF_XDP, since AF_XDP, if I remember correctly, can't support frames larger than 4k, so you lose host GRO/GSO support. That said, @kevinGC has added GRO support to Netstack, which combined with AF_XDP might solve a large part of that. Another alternative would be to use the sharedmem link endpoint. It is similar to vhost-net and allows setting up shared memory queues with another process running outside the sandbox, which can then handle packets and send them to the host or redirect them. You would need to add support in docker/containers for a CNI plugin that creates a NIC backed by an FD connected to the external process. Currently the server endpoint waits on an eventfd to be notified when packets are available, but if a single process were to serve many sandboxes, you could burn a single core to spin across all sandboxes and process packets, reducing delays due to wakeup latencies (see the sketch below).
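Purely for illustration, a sketch of the two wakeup strategies mentioned above for a sharedmem-style queue: block on an eventfd per sandbox, or spin a dedicated core across all queues. All names here are hypothetical.

```go
package dispatch

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// newNotifier creates the eventfd a sharedmem-style server endpoint could
// block on while waiting for the producer to enqueue packets.
func newNotifier() (int, error) {
	return unix.Eventfd(0, unix.EFD_CLOEXEC)
}

// notify is what the producer would call after writing descriptors to the
// shared queue.
func notify(efd int) error {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], 1)
	_, err := unix.Write(efd, buf[:])
	return err
}

// wait blocks until notified; each wakeup pays scheduler latency, which is
// what spinning a dedicated core across all sandboxes would avoid.
func wait(efd int) error {
	var buf [8]byte
	_, err := unix.Read(efd, buf[:])
	return err
}
```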
Yes, AF_XDP is limited to 4K frames. Something to watch out for: the supported frame size depends on the driver. Note that our AF_XDP support is not "battle-hardened" like PACKET_MMAP is. If you do use it, we'd definitely appreciate feedback and bug reports. Ideally it becomes the default someday. GRO can be enabled via a flag.
This issue has been closed due to lack of activity.
I tried my own hand at getting more speed with TPACKET_V3 and ran into a similar latency/throughput tradeoff. Even with the kernel fix @hbhasker mentioned, packet_mmap v1/v2 is still slower than recvmmsg: my testing shows it pulls 1-3 packets before blocking, while recvmmsg always gets 8.
Closing this and opening a feature request at #10966. Dispatch is no longer buggy, but could use improvements. |
TPACKET_V1/V2 PACKET_RX_RING/PACKET_TX_RING buffers have a kernel bug where multiple kernel threads can incorrectly assume control of a ring buffer entry, resulting in the ring buffer getting out of sync with user space. This manifests in our dispatcher as getting stuck in a poll loop: poll returns readable, but the frame at our ring offset never has the user bit (TP_STATUS_USER) set.
This completely stalls the network. It seems to be easier to trigger when there are lots of connections to an external destination streaming large amounts of data in.
The bug is described in this thread:
https://www.mail-archive.com/[email protected]/msg237222.html
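For concreteness, a TPACKET_V1/V2 receive loop has roughly the shape sketched below (a simplified illustration, not the gVisor dispatcher). The stall manifests as poll reporting readable while the status word at ringOffset never gains TP_STATUS_USER.

```go
package dispatch

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

// tpStatusUser is TP_STATUS_USER from <linux/if_packet.h>; the status word
// is treated as a uint32 here, as in the v2 frame header.
const tpStatusUser = 1

// rxLoop walks the mmap'd ring one frame at a time. Under the kernel bug,
// poll keeps reporting the socket readable while the frame at ringOffset
// stays kernel-owned, so the loop spins without making progress.
func rxLoop(fd int, ring []byte, frameSize, frameNr int) error {
	ringOffset := 0
	for {
		frame := ring[ringOffset*frameSize:]
		status := (*uint32)(unsafe.Pointer(&frame[0]))
		if atomic.LoadUint32(status)&tpStatusUser == 0 {
			// Frame still owned by the kernel: wait to be told there is
			// something to read, then re-check the status word.
			pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
			if _, err := unix.Poll(pfd, -1); err != nil && err != unix.EINTR {
				return err
			}
			continue
		}
		// ... parse and deliver the packet in this frame ...
		atomic.StoreUint32(status, 0) // TP_STATUS_KERNEL: give it back
		ringOffset = (ringOffset + 1) % frameNr
	}
}
```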
This is not fixed in the kernel. There are ways to sidestep it by increasing the number of frames, which reduces the likelihood of being hit by this bug. But from my experiments, using smaller frames is not worth it: it requires GRO to be disabled, and in a VM it is almost impossible to disable GRO/TSO completely. See
http://patchwork.ozlabs.org/patch/1016373/
For now we should revert runsc to the RecvMMsg dispatch mode while changing the ring code to use TPACKET_V3 instead of v1.