Skip to content

Commit

Permalink
af-packet: fix soft lockup issues
Browse files Browse the repository at this point in the history
The Suricata AF_PACKET code opens a socket per thread, then after some minor
setup enters a loop where the socket is poll()'d with a timeout. When the
poll() call returns a non zero positive value, the AF_PACKET ring will be
processed.

The ringbuffer processing logic has a pointer into the ring where we last
checked the ring. From this position we will inspect each frame until we
find a frame with tp_status == TP_STATUS_KERNEL (so essentially 0). This
means the frame is currently owned by the kernel.

There is a special case handling for starting the ring processing but
finding a TP_STATUS_KERNEL immediately. This logic then skip to the next
frame, rerun the check, etc until it either finds an initialized frame or
the last frame of the ringbuffer.

The problem was, however, that the initial uninitialized frame was possibly
(likely?) still being initialized by the kernel. A data race between the
notification through the socket (the poll()) and the updating of the
`tp_status` field in the frame could lead to a valid frame getting skipped.

Of note is that for example libpcap does not do frame scanning. Instead it
simply exits it ring processing loop. Also interesting is that libpcap uses
atomic loads and stores on the tp_status field.

This skipping of frames had 2 bad side effects:

1. in most cases, the buffer would be full enough that the frame would
   be processed in the next pass of the ring, but now the frame would
   out of order. This might have lead to packets belong to the same
   flow getting processed in the wrong order.

2. more severe is the soft lockup case. The skipped frame sits at ring
   buffer index 0. The rest of the ring has been cleared, after the
   initial frame was skipped. As our pass of the ring stops at the end
   of the ring (ptv->frame_offset + 1 == ptv->req.v2.tp_frame_nr) the code
   exits the ring processing loop at goes back to poll(). However, poll()
   will not indicate that there is more data, as the stale frame in the
   ring blocks the kernel from populating more frames beyond it. This
   is now a dead lock, as the kernel waits for Suricata and Suricata
   never touches the ring until it hears from the kernel.

   The scan logic will scan the whole ring at most once, so it won't
   reconsider the stale frame either.

This patch addresses the issues in several ways:

1. the startup "discard" logic was fixed to not skip over kernel
   frames. Doing so would get us in a bad state at start up.

2. Instead of scanning the ring, we now enter a busy wait loop
   when encountering a kernel frame where we didn't expect one. This
   means that if we got a > 0 poll() result, we'll busy wait until
   we get at least one frame.

3. Error handling is unified and cleaned up. Any frame error now
   returns the frame to the kernel and progresses the frame pointer.

4. If we find a frame that is owned by us (TP_STATUS_USER_BUSY) we
   yield to poll() immediately, as the next expected status of that
   frame is TP_STATUS_KERNEL.

5. the ring is no longer processed until the "end" of the ring (so
   highest index), but instead we process at most one full ring size
   per run.

6. Work with a copy of `tp_status` instead of accessing original touched
   also by the kernel.

Bug: OISF#4785.
  • Loading branch information
victorjulien committed Oct 31, 2021
1 parent 4a60f80 commit 14d61a2
Showing 1 changed file with 169 additions and 160 deletions.
Loading

0 comments on commit 14d61a2

Please sign in to comment.