
🔙 Improved NACKs #135

Merged: 59 commits from fix_extended_sequence_counter into master on Jan 18, 2022

Conversation

@danstiner (Collaborator) commented Dec 9, 2021

Checklist

  • Test on a production server (with and without NACKs enabled) (live on do-sfo3-ingest1.ksfo.live.glimesh.tv with NACKs enabled)
  • Doc comments
  • Cleanup code
  • Overnight stability test
  • Cleanup tests (remove randomized tests, they should be reviewed separately)
  • Test if MAX_OUTSTANDING_NACKS should be increased

Background

Shoutout to Hayden for writing the original NACK (negative acknowledgement of missing packets) feature. It mostly worked but had some issues with sequence number wrap-around (to be fair, I introduced a number of bugs when I touched the NACK code recently). If I remember correctly, streams could break if there was significant packet loss (and would not recover when the packet loss went away). This left the feature unusable, and it was turned off.

The NACK feature has always been difficult to test and is tied together with the keyframe packet collection. This PR is my second take at fixing the NACK feature: it keeps the core ideas of the old code while starting to pull the logic into separate classes for clarity and testability.

Goals

  • Bring NACK code under unit tests
  • Mitigate up to 1% packet loss between streamer <-> ingest when packet arrival jitter is less than 20ms
  • Handle higher packet loss gracefully; the stream should recover within a couple of seconds (one keyframe interval) if packet loss drops back under 1%

I think meeting those goals will yield a "good enough" solution, but if I had unlimited time I'd aim for something frame-aware that can do things like prioritize keyframe packet loss and estimate when a packet is "late" based on the frame timestamps, jitter estimates, etc. Those are all things the client does with its "playout buffer" data structure; we may be able to take inspiration from that and adjust the ideas to be relevant on the server.

Status

Tested locally, and as of 0d23274 it seems effective and stable. Overnight stability tests show it is capable of reducing 1% packet loss to about 0.001%, roughly in line with the 1% * 1% residual loss expected (re-transmitted packets can still be lost, the 1% loss rate in this local test also applies to them, and this code does not try to re-transmit a second time). This should be the difference between bad frames every couple of seconds and more like once a minute.

Details

There are roughly two parts to the tracking and NACK process:

First, when a packet arrives it is both fed to the sequence tracker instance for the source and relayed immediately to all client sessions. This follows the previous behavior of this code: no latency is added to packet relaying. The sequence tracker checks if the sequence number is a re-transmit due to a prior NACK. If so, that NACK is marked as received and nothing else is done. If this is a new packet, it is added to an internal buffer in sequence order.

Second, when it's time to send NACKs, we iterate through the buffer of received packets and look for gaps in the sequence numbers. Any gaps are marked as "missing" packets and NACKs are sent for the missing packets. As mentioned above, when NACK'd packets are re-transmitted and received they are removed from the missing list. If we never get the re-transmit, eventually we time out the NACK and stop tracking the missing packet.
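To make that flow concrete, here is a minimal C++ sketch of the receive path. It assumes sequence numbers have already been extended so wrap-around can be ignored (roughly what ExtendedSequenceCounter is for), and the names are illustrative rather than the actual SequenceTracker interface:

```cpp
#include <cstdint>
#include <set>

// Illustrative sketch only; the real SequenceTracker has more state and
// works with the rest of the RTP pipeline.
class SequenceTrackerSketch
{
public:
    void ReceivePacket(uint64_t extendedSeq)
    {
        // Re-transmit in response to an earlier NACK: mark that NACK as
        // received and do nothing else.
        if (outstandingNacks_.erase(extendedSeq) > 0)
        {
            return;
        }
        // New packet: add to the internal buffer, kept in sequence order
        // by std::set.
        received_.insert(extendedSeq);
    }

private:
    std::set<uint64_t> received_;         // packets received, in sequence order
    std::set<uint64_t> outstandingNacks_; // packets we have NACK'd and are still waiting on
};
```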

That's the gist of it, but there are some additional complexities:

  • There is an allowance for re-ordered RTP packets: just like the old logic, we do not check the highest 16 sequence numbers in the packet buffer for gaps/missing packets (a sketch of this check follows the list)
    • aka a sliding window based on sequence number
    • Note this allowance is not guaranteed to cover the last 16 packets received; it is sequence number based. E.g. if we receive a packet with seq=16 and then the next packet has seq=32, we will check for gaps in packets with seq<=16. Sequence numbers 17-31 will not be considered missing as they are still in the "re-order window", at least until yet newer packets arrive and slide the window forward.
  • There is a watermark (checkForMissingWatermark) tracking the highest sequence number that has been checked for gaps. This ensures we do not count the same gap as missing twice.
    • Watermark is not a great name for this, open to alternatives
  • There is a limit on the number of in-flight/outstanding NACKs to prevent excessive re-transmit bandwidth usage. If we are dropping packets to the ingest due to bandwidth limits, it will not be good to use even more bandwidth! (new in this PR I think)
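A rough sketch of the gap check described in this list, covering the re-order window, the watermark, and the outstanding-NACK cap. The function signature and the kMaxOutstandingNacks value are illustrative, not the real code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr uint64_t kReorderWindow = 16;      // do not check the highest 16 sequence numbers
constexpr size_t kMaxOutstandingNacks = 128; // illustrative cap, not the real constant's value

// Scan the received buffer for gaps below the re-order window and above the
// watermark, recording and returning the sequence numbers to NACK.
// (Assumes checkForMissingWatermark was initialized just below the first
// received sequence number when the stream started.)
std::vector<uint64_t> CheckForMissing(
    const std::set<uint64_t>& received,
    std::set<uint64_t>& outstandingNacks,
    uint64_t& checkForMissingWatermark)
{
    std::vector<uint64_t> toNack;
    if (received.empty())
    {
        return toNack;
    }

    // Ignore the highest kReorderWindow sequence numbers to allow for
    // re-ordered packets that may still arrive.
    uint64_t highest = *received.rbegin();
    uint64_t checkUpTo = (highest > kReorderWindow) ? (highest - kReorderWindow) : 0;

    for (uint64_t seq = checkForMissingWatermark + 1; seq <= checkUpTo; ++seq)
    {
        if (received.count(seq) == 0 &&
            outstandingNacks.count(seq) == 0 &&
            outstandingNacks.size() < kMaxOutstandingNacks)
        {
            // Gap: treat the packet as missing and queue a NACK, up to the cap.
            outstandingNacks.insert(seq);
            toNack.push_back(seq);
        }
    }

    // The watermark ensures the same gap is never counted as missing twice.
    checkForMissingWatermark = std::max(checkForMissingWatermark, checkUpTo);
    return toNack;
}
```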

Also, I snuck in a change to the pending/current keyframe packet collection logic in FtlMediaConnection. It now checks that the packets are sequential without gaps and that the final packet has a marker bit before it considers a keyframe valid. This should prevent most cases of corrupt frames being used as stream thumbnails.
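A rough sketch of that keyframe validity check; the RtpPacketInfo struct here is illustrative, not the actual FtlMediaConnection types:

```cpp
#include <cstdint>
#include <vector>

struct RtpPacketInfo
{
    uint16_t sequenceNumber;
    bool marker;
};

// A pending keyframe is only considered valid if its packets are sequential
// with no gaps and the final packet carries the marker bit.
bool IsKeyframeComplete(const std::vector<RtpPacketInfo>& packets)
{
    if (packets.empty())
    {
        return false;
    }
    for (size_t i = 1; i < packets.size(); ++i)
    {
        // uint16_t arithmetic wraps naturally, so this also handles
        // sequence number roll-over at 65535 -> 0.
        if (static_cast<uint16_t>(packets[i - 1].sequenceNumber + 1) !=
            packets[i].sequenceNumber)
        {
            return false;
        }
    }
    return packets.back().marker;
}
```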

Closes #129

Potential Future Improvements

  • Better name for the SequenceTracker class?
  • Combine ExtendedSequenceCounter with SequenceTracker, their logic is substantially similar
  • Look at whether extended sequence numbers are needed outside of SequenceTracker; maybe we can limit their use or get rid of that concept entirely
  • Bring back NACK bitmasks to reduce the number of packets sent / only send NACKs periodically in batches vs processing them on every packet received (testing on a wifi network showed packets are often lost in bursts of ~3-20 sequential packets); a rough sketch of the bitmask packing follows this list
    • Note that for simplicity this code runs the check for gaps/missing packets and sends any ready NACKs every time a new packet is received. This could lead to sliding the "re-order window" forward by one sequence number at a time, meaning that even if a burst of sequential packets was lost, the window would only slide to reveal one of them at a time, and thus singular NACKs would still be sent even with bitmask support. I don't have a great solution for this; probably the check for missing packets and sending of NACKs should be on some kind of timer instead of being done per-packet.
    • TODO open issue for this
  • Group packets in buffer into frames by their RTP timestamps (allows e.g. prioritize NACKs of keyframes, if refactored enough this could replace the logic in FtlMediaConnection that captures keyframe packets)
  • Redefine timeouts to be based on RTP timestamps and expected arrival time (may be for a future PR as this would require extra logic for jitter/arrival time estimation)
  • Estimate expected arrival time for packets, to allow better time-based NACKing. Right now we have to wait until a newer packet arrives before we can determine that packets are missing and send NACKs. (more detail in the comment below)
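For reference on the bitmask idea in the list above: an RTCP generic NACK (RFC 4585) carries a 16-bit packet ID (PID) plus a 16-bit bitmask of following lost packets (BLP), so a burst of up to 17 consecutive losses fits in one entry. A rough, standalone packing sketch (not tied to this codebase's RTCP writer):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenericNackEntry
{
    uint16_t pid; // first lost sequence number
    uint16_t blp; // bit i set => pid + i + 1 is also lost
};

// Pack a sorted list of lost sequence numbers into RFC 4585 generic NACK
// FCI entries. Assumes the list is sorted in RTP order (good enough for a sketch).
std::vector<GenericNackEntry> PackNacks(const std::vector<uint16_t>& lost)
{
    std::vector<GenericNackEntry> entries;
    size_t i = 0;
    while (i < lost.size())
    {
        GenericNackEntry entry{lost[i], 0};
        size_t j = i + 1;
        while (j < lost.size())
        {
            uint16_t offset = static_cast<uint16_t>(lost[j] - entry.pid);
            if (offset == 0 || offset > 16)
            {
                break; // does not fit in this entry's bitmask
            }
            entry.blp |= static_cast<uint16_t>(1u << (offset - 1));
            ++j;
        }
        entries.push_back(entry);
        i = j;
    }
    return entries;
}
```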

@danstiner changed the title from "Improved NACKs" to "🔙 Improved NACKs" on Dec 9, 2021
@danstiner danstiner marked this pull request as ready for review December 17, 2021 06:46
@danstiner danstiner requested a review from haydenmc December 17, 2021 06:46
@clone1018 (Member) commented

Hey @danstiner, I deployed this to the infrequently used do-sfo3-ingest1.ksfo.live.glimesh.tv server and was able to test NACKs working successfully on a minimally unstable network. I'll keep it running on the server for a bit if you want to do any testing on it. Once you're comfortable we can get some more servers / users testing out this new improvement!

Thanks again for your hard work on this!

@danstiner (Collaborator, Author) commented Jan 6, 2022

Thanks for deploying @clone1018, good to see you and excited to see this out in the wild! Looks like SmashBets is using it right now; we should ask them for feedback. I also tested a bit and can confirm it appears to be helping. It isn't as good in the real world as in my local testing (sometimes re-transmitted packets arrive too late and video/audio skips can still happen), but I think it is an improvement over the current behavior.

I don't have any issues with letting do-sfo3-ingest1 sit for a bit until Hayden reviews this. Then after merge the change can slowly be rolled out more widely. However, I'd love to also apply the playout-delay patch at some point, maybe as a combined rollout. Something like a 400ms delay should both enhance this change by giving more time for late arriving re-transmits and should also help with the issue of large keyframes causing stutters.

Test 1

This chart from chrome://webrtc-internals/ when watching https://glimesh.tv/SmashBets shows video packets being marked as lost by Chrome and then unmarked when the re-transmit arrives. The loss is not completely mitigated but this change is helping quite a bit:

[screenshot: webrtc-internals chart of video packets marked lost, then recovered]

A similar thing happens for audio, but due to issues discussed below the audio packet re-transmits arrive too late and are discarded anyway most of the time. This is unfortunate, but due to a few factors this loss is generally not very noticeable.

[screenshot: webrtc-internals chart of audio packets marked lost]

Test 2

For the fun of it I also did some packet capturing. From the viewer side, the following is an example where Seq 62850-62852 for SSRC 0x5EFD5FAE did not arrive initially:


    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1029968   22:23:31.353413   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62848, Time=1046226000        
  1029969   22:23:31.353438   RTP          1437   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62849, Time=1046226000, Mark  

  1029989   22:23:31.473773   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62853, Time=1046229000        
  1029990   22:23:31.473806   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62854, Time=1046229000        

Then from the streamer's side, OBS logs show it received the NACK from do-sfo3-ingest1 and re-transmitted those three missing packets.
Side note: client and server times are both in UTC but not quite in sync; that's pretty normal and does not hurt anything.

22:23:31.295: [3] [12677] resent sn 62850, request delay was 196 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62851, request delay was 197 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62852, request delay was 197 ms, was part of iframe? 0

Then a bit later on the viewer side we see the re-transmitted packets arrive just after Seq 62929, about 190ms after they originally would have arrived, which matches the OBS logs. In this case it is likely the re-transmit arrived in time and there were no lost frames given my jitterBufferDelay/jitterBufferEmittedCount_in_ms was >400ms. However the jitter buffer delay can vary quite a bit depending on the viewer's network conditions, so that won't always be true. One way to mitigate the variance would be to apply my playout-delay patch with a minimum playout delay of 400ms+. That should be enough in cases like this to ensure the re-transmitted packets arrive in time.

    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1030150   22:23:31.539324   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62928, Time=1046245500        
  1030151   22:23:31.539324   RTP           539   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62929, Time=1046245500, Mark  

  1030163   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62850, Time=1046227500        
  1030164   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62851, Time=1046227500        
  1030165   22:23:31.541572   RTP           724   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62852, Time=1046227500, Mark  
 
  1030175   22:23:31.568032   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62930, Time=1046247000        
  1030176   22:23:31.568067   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62931, Time=1046247000        

A future improvement would be to lower the re-transmit delay. This can easily be tracked in OBS logs by looking for the request delay part of "resent" log lines. In my OBS logs the delay is often in the 500-750ms range, which is probably too slow to be of any use because the video will have already advanced past that frame by the time such a late re-transmit arrives.

I suspect the long re-transmit delay is mostly a problem for low motion p-frames (~3 packets per frame) and audio (~1 packet per interval). That is because the current code waits until a certain number of newer packets (currently sixteen) have been seen before considering a packet lost. This is to allow for out-of-order packet arrival. My initial thought is to do a time-based calculation of when a packet is missing/late. Ideally we should be able to send the NACK and get the re-transmit in under 100ms. That allows for a 30ms reorder window + transmit latency of 15ms + 16.6ms frame at 60fps + 30ms round trip on the re-transmit. However, the code for that calculation is a fair bit more complicated than "wait sixteen packets", so it's something to address in a future PR.
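As a rough illustration of that idea, the budget above works out to about 30 + 15 + 16.6 + 30 ≈ 92ms, which is why under 100ms seems feasible. A minimal sketch of a time-based lateness check follows; the names and threshold are illustrative, and the real implementation would still need the jitter/arrival-time estimation this paragraph mentions:

```cpp
#include <chrono>

// Illustrative re-order allowance taken from the latency budget above.
constexpr std::chrono::milliseconds kReorderAllowance{30};

// A packet is considered late (and worth NACKing immediately) once a packet
// with a newer RTP timestamp has arrived and the re-order allowance has
// elapsed since the packet's expected arrival time, rather than waiting for
// sixteen newer sequence numbers to show up.
bool IsPacketLate(std::chrono::steady_clock::time_point expectedArrival,
                  std::chrono::steady_clock::time_point now,
                  bool newerPacketSeen)
{
    return newerPacketSeen && (now - expectedArrival) > kReorderAllowance;
}
```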

@clone1018 (Member) commented

Sounds like a good approach to me. It is worth mentioning that SmashBets has notoriously finicky internet when it comes to FTL, with documented issues going back to the Mixer days. It's a good test, but they will sometimes have significantly more packet loss than this change can cover.

I have no problem deploying the playout-delay as well, but a combined rollout seems easiest from a downtime perspective. Do you have any tests you want to run on production specifically for playout-delay? We could always deploy it to one server for now and then do a combined rollout later.

@danstiner (Collaborator, Author) commented

Yeah, I worked with SmashBets some on their connection issues, but it looks like they are now more stable at a higher resolution/bitrate than they could manage before, so that's good news! Even if it isn't perfect.

Sounds good on deployment, deploying playout-delay to just one edge to start with would be good. I opened a quick draft PR to document the steps needed to deploy / have a place to discuss deployment stuff: Glimesh/ops#41

@haydenmc (Member) left a comment

Approved w/ suggestions 😁

Review threads on src/Rtp/ExtendedSequenceCounter.cpp, src/Rtp/SequenceTracker.cpp, and src/Rtp/SequenceTracker.h (all resolved, some outdated)
@danstiner danstiner merged commit d6d27cc into master Jan 18, 2022
@danstiner danstiner deleted the fix_extended_sequence_counter branch January 18, 2022 23:40
Linked issue: Test packet handling, especially keyframes and lost packets