
🔙 Improved NACKs #135

Merged: 59 commits from fix_extended_sequence_counter into master on Jan 18, 2022

Conversation

@danstiner (Collaborator) commented Dec 9, 2021

Checklist

  • Test on a production server (with and without NACKs enabled) (live on do-sfo3-ingest1.ksfo.live.glimesh.tv with NACKs enabled)
  • Doc comments
  • Cleanup code
  • Overnight stability test
  • Cleanup tests (remove randomized tests, they should be reviewed separately)
  • Test if MAX_OUTSTANDING_NACKS should be increased

Background

Shoutout to Hayden for writing the original NACK (negative acknowledgement of missing packets) feature. It mostly worked but had some issues with sequence number wrap-around (to be fair, I introduced a number of bugs when I touched the NACK code recently). If I remember correctly, streams could break if there was significant packet loss (and would not recover when the packet loss went away). This left the feature unusable, and it was turned off.

The NACK feature has always been difficult to test and is tied together with the keyframe packet collection. This PR is my second take at fixing the NACK feature: it keeps the core ideas of the old code while starting to pull the logic into separate classes for clarity and testability.

Goals

  • Bring NACK code under unit tests
  • Mitigate up to 1% packet loss between streamer <-> ingest when packet arrival jitter is less than 20ms
  • Handle higher packet loss gracefully; the stream should recover within a couple of seconds (one keyframe interval) if packet loss drops back under 1%

I think meeting those goals will yield a "good enough" solution, but if I had unlimited time I'd aim for something frame-aware that can do things like prioritize keyframe packet loss and estimate when a packet is "late" based on the frame timestamps, jitter estimates, etc. Those are all things the client does with its "playout buffer" data structure; we may be able to take inspiration from that and adjust the ideas to be relevant on the server.

Status

Tested locally, and as of 0d23274 it seems effective and stable. Overnight stability tests show it is capable of reducing 1% packet loss to about 0.001%, roughly in line with the 1% * 1% residual loss expected (re-transmitted packets can still be lost, the 1% loss rate in this local test also applies to them, and this code does not try to re-transmit a second time). This should be the difference between bad frames every couple of seconds and more like once a minute.

Details

There are roughly two parts to the tracking and NACK process:

First, when a packet arrives it is both fed to the sequence tracker instance for the source and relayed immediately to all client sessions. This follows the previous behavior of this code: no latency is added to packet relaying. The sequence tracker checks if the sequence number is a re-transmit due to a prior NACK. If so, that NACK is marked as received and nothing else is done. If this is a new packet, it is added to an internal buffer in sequence order.

Second, when it's time to send NACKs, we iterate through the buffer of received packets and look for gaps in the sequence numbers. Any gaps are marked as "missing" packets and NACKs are sent for the missing packets. As mentioned above, when NACK'd packets are re-transmitted and received they are removed from the missing list. If we never get the re-transmit, eventually we time out the NACK and stop tracking the missing packet.
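To make that flow concrete, here is a minimal C++ sketch of the receive path. It assumes sequence numbers have already been extended so wrap-around can be ignored (roughly what ExtendedSequenceCounter is for), and the names are illustrative rather than the actual SequenceTracker interface:

```cpp
#include <cstdint>
#include <set>

// Illustrative sketch only; the real SequenceTracker has more state and
// works with the rest of the RTP pipeline.
class SequenceTrackerSketch
{
public:
    void ReceivePacket(uint64_t extendedSeq)
    {
        // Re-transmit in response to an earlier NACK: mark that NACK as
        // received and do nothing else.
        if (outstandingNacks_.erase(extendedSeq) > 0)
        {
            return;
        }
        // New packet: add to the internal buffer, kept in sequence order
        // by std::set.
        received_.insert(extendedSeq);
    }

private:
    std::set<uint64_t> received_;         // packets received, in sequence order
    std::set<uint64_t> outstandingNacks_; // packets we have NACK'd and are still waiting on
};
```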

That's the gist of it, but there are some additional complexities:

  • There is an allowance for re-ordered RTP packets: just like the old logic, we do not check the highest 16 sequence numbers in the packet buffer for gaps/missing packets (a sketch of this check follows the list)
    • aka a sliding window based on sequence number
    • Note this allowance is not guaranteed to cover the last 16 packets received; it is sequence number based. E.g. if we receive a packet with seq=16 and then the next packet has seq=32, we will check for gaps in packets with seq<=16. Sequence numbers 17-31 will not be considered missing as they are still in the "re-order window", at least until yet newer packets arrive and slide the window forward.
  • There is a watermark (checkForMissingWatermark) tracking the highest sequence number that has been checked for gaps. This ensures we do not count the same gap as missing twice.
    • Watermark is not a great name for this, open to alternatives
  • There is a limit on the number of in-flight/outstanding NACKs to prevent excessive re-transmit bandwidth usage. If we are dropping packets to the ingest due to bandwidth limits, it will not be good to use even more bandwidth! (new in this PR I think)
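A rough sketch of the gap check described in this list, covering the re-order window, the watermark, and the outstanding-NACK cap. The function signature and the kMaxOutstandingNacks value are illustrative, not the real code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr uint64_t kReorderWindow = 16;      // do not check the highest 16 sequence numbers
constexpr size_t kMaxOutstandingNacks = 128; // illustrative cap, not the real constant's value

// Scan the received buffer for gaps below the re-order window and above the
// watermark, recording and returning the sequence numbers to NACK.
// (Assumes checkForMissingWatermark was initialized just below the first
// received sequence number when the stream started.)
std::vector<uint64_t> CheckForMissing(
    const std::set<uint64_t>& received,
    std::set<uint64_t>& outstandingNacks,
    uint64_t& checkForMissingWatermark)
{
    std::vector<uint64_t> toNack;
    if (received.empty())
    {
        return toNack;
    }

    // Ignore the highest kReorderWindow sequence numbers to allow for
    // re-ordered packets that may still arrive.
    uint64_t highest = *received.rbegin();
    uint64_t checkUpTo = (highest > kReorderWindow) ? (highest - kReorderWindow) : 0;

    for (uint64_t seq = checkForMissingWatermark + 1; seq <= checkUpTo; ++seq)
    {
        if (received.count(seq) == 0 &&
            outstandingNacks.count(seq) == 0 &&
            outstandingNacks.size() < kMaxOutstandingNacks)
        {
            // Gap: treat the packet as missing and queue a NACK, up to the cap.
            outstandingNacks.insert(seq);
            toNack.push_back(seq);
        }
    }

    // The watermark ensures the same gap is never counted as missing twice.
    checkForMissingWatermark = std::max(checkForMissingWatermark, checkUpTo);
    return toNack;
}
```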

Also, I snuck in a change to the pending/current keyframe packet collection logic in FtlMediaConnection. It now checks that the packets are sequential without gaps and that the final packet has a marker bit before it considers a keyframe valid. This should prevent most cases of corrupt frames being used as stream thumbnails.
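A rough sketch of that keyframe validity check; the RtpPacketInfo struct here is illustrative, not the actual FtlMediaConnection types:

```cpp
#include <cstdint>
#include <vector>

struct RtpPacketInfo
{
    uint16_t sequenceNumber;
    bool marker;
};

// A pending keyframe is only considered valid if its packets are sequential
// with no gaps and the final packet carries the marker bit.
bool IsKeyframeComplete(const std::vector<RtpPacketInfo>& packets)
{
    if (packets.empty())
    {
        return false;
    }
    for (size_t i = 1; i < packets.size(); ++i)
    {
        // uint16_t arithmetic wraps naturally, so this also handles
        // sequence number roll-over at 65535 -> 0.
        if (static_cast<uint16_t>(packets[i - 1].sequenceNumber + 1) !=
            packets[i].sequenceNumber)
        {
            return false;
        }
    }
    return packets.back().marker;
}
```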

Closes #129

Potential Future Improvements

  • Better name for the SequenceTracker class?
  • Combine ExtendedSequenceCounter with SequenceTracker, their logic is substantially similar
  • Look at whether extended sequence numbers are needed outside of SequenceTracker; maybe we can limit their use or get rid of that concept entirely
  • Bring back NACK bitmasks to reduce the number of packets sent / only send NACKs periodically in batches vs processing them on every packet received (testing on a wifi network showed packets are often lost in bursts of ~3-20 sequential packets); a rough sketch of the bitmask packing follows this list
    • Note that for simplicity this code runs the check for gaps/missing packets and sends any ready NACKs every time a new packet is received. This could lead to sliding the "re-order window" forward by one sequence number at a time, meaning that even if a burst of sequential packets was lost, the window would only slide to reveal one of them at a time, and thus singular NACKs would still be sent even with bitmask support. I don't have a great solution for this; probably the check for missing packets and sending of NACKs should be on some kind of timer instead of being done per-packet.
    • TODO open issue for this
  • Group packets in buffer into frames by their RTP timestamps (allows e.g. prioritize NACKs of keyframes, if refactored enough this could replace the logic in FtlMediaConnection that captures keyframe packets)
  • Redefine timeouts to be based on RTP timestamps and expected arrival time (may be for a future PR as this would require extra logic for jitter/arrival time estimation)
  • Estimate expected arrival time for packets, to allow better time-based NACKing. Right now we have to wait until a newer packet arrives before we can determine that packets are missing and send NACKs. (more detail in the comment below)
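For reference on the bitmask idea in the list above: an RTCP generic NACK (RFC 4585) carries a 16-bit packet ID (PID) plus a 16-bit bitmask of following lost packets (BLP), so a burst of up to 17 consecutive losses fits in one entry. A rough, standalone packing sketch (not tied to this codebase's RTCP writer):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenericNackEntry
{
    uint16_t pid; // first lost sequence number
    uint16_t blp; // bit i set => pid + i + 1 is also lost
};

// Pack a sorted list of lost sequence numbers into RFC 4585 generic NACK
// FCI entries. Assumes the list is sorted in RTP order (good enough for a sketch).
std::vector<GenericNackEntry> PackNacks(const std::vector<uint16_t>& lost)
{
    std::vector<GenericNackEntry> entries;
    size_t i = 0;
    while (i < lost.size())
    {
        GenericNackEntry entry{lost[i], 0};
        size_t j = i + 1;
        while (j < lost.size())
        {
            uint16_t offset = static_cast<uint16_t>(lost[j] - entry.pid);
            if (offset == 0 || offset > 16)
            {
                break; // does not fit in this entry's bitmask
            }
            entry.blp |= static_cast<uint16_t>(1u << (offset - 1));
            ++j;
        }
        entries.push_back(entry);
        i = j;
    }
    return entries;
}
```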

@danstiner changed the title from "Improved NACKs" to "🔙 Improved NACKs" on Dec 9, 2021
@danstiner danstiner marked this pull request as ready for review December 17, 2021 06:46
@danstiner danstiner requested a review from haydenmc December 17, 2021 06:46
@clone1018 (Member) commented

Hey @danstiner, I deployed this to the infrequently used do-sfo3-ingest1.ksfo.live.glimesh.tv server and was able to test NACKs working successfully on a minimally unstable network. I'll keep it running on the server for a bit if you want to do any testing on it. Once you're comfortable we can get some more servers / users testing out this new improvement!

Thanks again for your hard work on this!

@danstiner (Collaborator, Author) commented Jan 6, 2022

Thanks for deploying @clone1018, good to see you and excited to see this out in the wild! Looks like SmashBets is using it right now; we should ask them for feedback. I also tested a bit and can confirm it appears to be helping. It isn't as good in the real world as in my local testing (sometimes re-transmitted packets arrive too late and video/audio skips can still happen), but I think it is an improvement over the current behavior.

I don't have any issues with letting do-sfo3-ingest1 sit for a bit until Hayden reviews this. Then after merge the change can slowly be rolled out more widely. However, I'd love to also apply the playout-delay patch at some point, maybe as a combined rollout. Something like a 400ms delay should both enhance this change by giving more time for late arriving re-transmits and should also help with the issue of large keyframes causing stutters.

Test 1

This chart from chrome://webrtc-internals/ when watching https://glimesh.tv/SmashBets shows video packets being marked as lost by Chrome and then unmarked when the re-transmit arrives. The loss is not completely mitigated but this change is helping quite a bit:

[screenshot: webrtc-internals chart of video packets marked lost, then recovered]

A similar thing happens for audio, but due to issues discussed below the audio packet re-transmits arrive too late and are discarded anyway most of the time. This is unfortunate, but due to a few factors this loss is generally not very noticeable.

[screenshot: webrtc-internals chart of audio packets marked lost]

Test 2

For the fun of it I also did some packet capturing. From the viewer side, the following is an example where Seq 62850-62852 for SSRC 0x5EFD5FAE did not arrive initially:


    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1029968   22:23:31.353413   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62848, Time=1046226000        
  1029969   22:23:31.353438   RTP          1437   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62849, Time=1046226000, Mark  

  1029989   22:23:31.473773   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62853, Time=1046229000        
  1029990   22:23:31.473806   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62854, Time=1046229000        

Then from the streamer's side, OBS logs show it received the NACK from do-sfo3-ingest1 and re-transmitted those three missing packets.
Side note: client and server times are both in UTC but not quite in sync; that's pretty normal and does not hurt anything.

22:23:31.295: [3] [12677] resent sn 62850, request delay was 196 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62851, request delay was 197 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62852, request delay was 197 ms, was part of iframe? 0

Then a bit later on the viewer side we see the re-transmitted packets arrive just after Seq 62929, about 190ms after they originally would have arrived, which matches the OBS logs. In this case it is likely the re-transmit arrived in time and there were no lost frames given my jitterBufferDelay/jitterBufferEmittedCount_in_ms was >400ms. However the jitter buffer delay can vary quite a bit depending on the viewer's network conditions, so that won't always be true. One way to mitigate the variance would be to apply my playout-delay patch with a minimum playout delay of 400ms+. That should be enough in cases like this to ensure the re-transmitted packets arrive in time.

    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1030150   22:23:31.539324   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62928, Time=1046245500        
  1030151   22:23:31.539324   RTP           539   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62929, Time=1046245500, Mark  

  1030163   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62850, Time=1046227500        
  1030164   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62851, Time=1046227500        
  1030165   22:23:31.541572   RTP           724   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62852, Time=1046227500, Mark  
 
  1030175   22:23:31.568032   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62930, Time=1046247000        
  1030176   22:23:31.568067   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62931, Time=1046247000        

A future improvement would be to lower the re-transmit delay. This can easily be tracked in OBS logs by looking for the request delay part of "resent" log lines. In my OBS logs the delay is often in the 500-750ms range, which is probably too slow to be of any use because the video will have already advanced past that frame by the time such a late re-transmit arrives.

I suspect the long re-transmit delay is mostly a problem for low motion p-frames (~3 packets per frame) and audio (~1 packet per interval). That is because the current code waits until a certain number of newer packets (currently sixteen) have been seen before considering a packet lost. This is to allow for out-of-order packet arrival. My initial thought is to do a time-based calculation of when a packet is missing/late. Ideally we should be able to send the NACK and get the re-transmit in under 100ms. That allows for a 30ms reorder window + transmit latency of 15ms + 16.6ms frame at 60fps + 30ms round trip on the re-transmit. However, the code for that calculation is a fair bit more complicated than "wait sixteen packets", so it's something to address in a future PR.
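As a rough illustration of that idea, the budget above works out to about 30 + 15 + 16.6 + 30 ≈ 92ms, which is why under 100ms seems feasible. A minimal sketch of a time-based lateness check follows; the names and threshold are illustrative, and the real implementation would still need the jitter/arrival-time estimation this paragraph mentions:

```cpp
#include <chrono>

// Illustrative re-order allowance taken from the latency budget above.
constexpr std::chrono::milliseconds kReorderAllowance{30};

// A packet is considered late (and worth NACKing immediately) once a packet
// with a newer RTP timestamp has arrived and the re-order allowance has
// elapsed since the packet's expected arrival time, rather than waiting for
// sixteen newer sequence numbers to show up.
bool IsPacketLate(std::chrono::steady_clock::time_point expectedArrival,
                  std::chrono::steady_clock::time_point now,
                  bool newerPacketSeen)
{
    return newerPacketSeen && (now - expectedArrival) > kReorderAllowance;
}
```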

@clone1018 (Member) commented

Sounds like a good approach to me. It is worth mentioning that SmashBets has notoriously finicky internet when it comes to FTL, with documented issues going back to the Mixer days. It's a good test, but they will sometimes have significantly more packet loss than this change can cover.

I have no problem deploying the playout-delay as well, but a combined rollout seems easiest from a downtime perspective. Do you have any tests you want to run on production specifically for playout-delay? We could always deploy it to one server for now and then do a combined rollout later.

@danstiner (Collaborator, Author) commented

Yeah, I worked with SmashBets some on their connection issues, but it looks like they are now more stable at a higher resolution/bitrate than they could manage before, so that's good news! Even if it isn't perfect.

Sounds good on deployment, deploying playout-delay to just one edge to start with would be good. I opened a quick draft PR to document the steps needed to deploy / have a place to discuss deployment stuff: Glimesh/ops#41

@haydenmc (Member) left a comment

Approved w/ suggestions 😁

Review threads on src/Rtp/ExtendedSequenceCounter.cpp, src/Rtp/SequenceTracker.cpp, and src/Rtp/SequenceTracker.h (all resolved, some outdated)
@danstiner danstiner merged commit d6d27cc into master Jan 18, 2022
@danstiner danstiner deleted the fix_extended_sequence_counter branch January 18, 2022 23:40
Linked issue: Test packet handling, especially keyframes and lost packets