🔙 Improved NACKs #135
Conversation
Passing a reference implies the caller already holds the lock, and it avoids re-fetching the same data reference over and over.
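For readers following along, a minimal sketch of the locking pattern this review comment describes (hypothetical names, not the PR's actual classes): the public method acquires the lock and resolves the state entry once, then hands a reference to a private helper that assumes the lock is held.

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical illustration of the pattern: lock once, look up once,
// then pass a reference so the helper never re-locks or re-fetches.
class NackSender
{
public:
    void ProcessSsrc(uint32_t ssrc)
    {
        std::lock_guard<std::mutex> lock(mutex);  // lock acquired once here
        SequenceState& state = stateBySsrc[ssrc]; // single lookup
        sendNacksLocked(state); // reference implies the lock is already held
    }

private:
    struct SequenceState { /* tracked sequence numbers, NACK state, etc. */ };

    // Taking a reference (rather than re-fetching by ssrc) documents that
    // the caller holds `mutex` and has already resolved the state entry.
    void sendNacksLocked(SequenceState& state)
    {
        // ... inspect `state` and send NACKs without touching the map ...
    }

    std::mutex mutex;
    std::map<uint32_t, SequenceState> stateBySsrc;
};
```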
Hey @danstiner, I deployed this to the infrequently used do-sfo3-ingest1.ksfo.live.glimesh.tv. Thanks again for your hard work on this!
Thanks for deploying @clone1018, good to see you and excited to see this out in the wild! Looks like SmashBets is using it right now; should ask them for feedback. I also tested a bit and can confirm it appears to be helping. It isn't as good in the real world as in my local testing; sometimes re-transmitted packets arrive too late and video/audio skips can still happen, but I think it is an improvement over the current behavior. I don't have any issues with letting

Test 1

This chart from

A similar thing happens for audio, but due to issues discussed below the audio packet re-transmits arrive too late and are discarded most of the time anyway. This is unfortunate, but due to a few factors this loss is generally not very noticeable.

Test 2

For the fun of it I also did some packet capturing. From the viewer side, the following is an example where Seq 62850-62852 for SSRC 0x5EFD5FAE did not arrive initially:
Then from the streamer's side, OBS logs show we got the NACK from
Then a bit later on the viewer side we see the re-transmitted packets arrive just after Seq 62929, about 190ms after they originally would have arrived, which matches the OBS logs. In this case it is likely the re-transmit arrived in time and there were no lost frames given my
A future improvement would be to lower the re-transmit delay. This can easily be tracked in OBS logs by looking at the request delay part of "resent" log lines. In my OBS logs the delay was in the 500-750ms range, which is probably too slow to be of any use, because the video will have already advanced past that frame by the time such a late re-transmit arrives. I suspect the long re-transmit delay is mostly a problem for low-motion p-frames (~3 packets per frame) and audio (~1 packet per interval), because the current code waits until a certain number of newer packets (currently sixteen) have been seen before considering a packet lost. This is to allow for out-of-order packet arrival. My initial thought is to do a time-based calculation of when a packet is missing/late. Ideally we should be able to send the NACK and get the re-transmit in under 100ms: that allows for a 30ms reorder window + 15ms of transmit latency + one 16.6ms frame at 60fps + a 30ms round trip on the re-transmit. However, the code for that calculation is a fair bit more complicated than "wait sixteen packets", so it's something to address in a future PR.
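To make the proposed time-based check concrete, here is a minimal sketch assuming the latency budget above; the constant and function names are hypothetical and not part of this PR.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Out-of-order arrival allowance from the budget above (hypothetical name).
constexpr std::chrono::milliseconds kReorderWindow{30};

bool IsPacketLate(Clock::time_point expectedArrival, Clock::time_point now)
{
    // Budget: 30ms reorder window + 15ms transmit latency + 16.6ms frame
    // at 60fps + 30ms re-transmit round trip ≈ 92ms, under the ~100ms goal.
    // Once a packet is more than the reorder window past its expected
    // arrival time, declare it missing and NACK it immediately instead of
    // waiting for sixteen newer packets to show up.
    return (now - expectedArrival) > kReorderWindow;
}
```

The hard part, as noted, is estimating `expectedArrival` per packet from sequence numbers and arrival times, which is why this is deferred to a future PR.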
Sounds like a good approach to me. It is worth mentioning that SmashBets has notoriously finicky internet when it comes to FTL, with documented issues going back to the Mixer days. It's a good test, but they will sometimes have significantly more packet loss than this change can cover. I have no problem deploying the playout-delay as well, but a combined rollout seems easiest from a downtime perspective. Do you have any tests you want to run on production specifically for playout-delay? We could always deploy it to one server for now, and then do a combined rollout later.
Yeah, I worked with SmashBets some on their connection issues, but it looks like they are more stable at a higher resolution/bitrate than they could manage before, so that's good news, even if it isn't perfect. Sounds good on deployment; deploying playout-delay to just one edge to start with would be good. I opened a quick draft PR to document the steps needed to deploy and to have a place to discuss deployment: Glimesh/ops#41
Approved w/ suggestions 😁
Checklist

- Deploy to do-sfo3-ingest1.ksfo.live.glimesh.tv (with NACKs enabled)

Background
Shoutout to Hayden for writing the original NACK (negative acknowledgement of missing packets) feature. It mostly worked but had some issues with sequence number wrap-around (to be fair, I introduced a number of bugs when I touched the NACK code recently). If I remember correctly, streams could break if there was significant packet loss (and would not recover when the packet loss went away). This left the feature unusable, and it was turned off.
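As an aside, wrap-around-safe comparison of 16-bit RTP sequence numbers is a well-known one-liner; a minimal sketch (not code from this PR) of the kind of check the old code needed:

```cpp
#include <cstdint>

// Returns true if sequence number `a` is newer than `b`, treating the
// 16-bit space as circular per common RTP practice: the signed cast makes
// small forward differences positive even across the 65535 -> 0 wrap.
inline bool SeqGreaterThan(uint16_t a, uint16_t b)
{
    return static_cast<int16_t>(static_cast<uint16_t>(a - b)) > 0;
}

// Example: SeqGreaterThan(0, 65535) is true (0 follows 65535 after a wrap),
// while a naive `a > b` would get this case backwards.
```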
The NACK feature has always been difficult to test and is tied together with the keyframe packet collection. This PR is my second take at fixing the NACK feature, keeping the core ideas of the old code while starting to pull the logic into separate classes for clarity and testability.
Goals
I think meeting those goals will yield a "good enough" solution, but if I had unlimited time I'd aim for something frame-aware that can do things like prioritize keyframe packet loss and estimate when a packet is "late" based on frame timestamps, jitter estimates, etc. Those are all things the client does with its "playout buffer" data structure; we may be able to take inspiration from that and adjust the ideas to be relevant on the server.
Status
Tested locally, and as of 0d23274 it seems effective and stable. Overnight stability tests show it is capable of reducing 1% packet loss to about 0.01%, which is exactly 1% * 1%, as expected (re-transmitted packets can still be lost; the 1% loss rate in this local test also applies to them, and this code does not try to re-transmit a second time). This should be the difference between bad frames every couple of seconds and bad frames more like once a minute.
Details
There are roughly two parts to the tracking and NACK process:
First, when a packet arrives it is both fed to the sequence tracker instance for the source and relayed immediately to all client sessions. This follows the previous behavior of this code; no latency is added to packet relaying. The sequence tracker checks whether the sequence number is a re-transmit due to a prior NACK. If so, that NACK is marked as received and nothing else is done. If this is a new packet, it is added to an internal buffer in sequence order.
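A minimal sketch of that receive path (hypothetical names and container choices, not the PR's actual implementation):

```cpp
#include <cstdint>
#include <set>

// Sketch of the per-source tracker's receive path described above. A real
// implementation would order the buffer with a wrap-aware comparator like
// SeqGreaterThan from the earlier sketch; plain uint16_t ordering is used
// here for brevity.
class SequenceTracker
{
public:
    void OnPacketReceived(uint16_t seq)
    {
        if (outstandingNacks.erase(seq) > 0)
        {
            // Re-transmit of a packet we NACK'd: mark it received and stop.
            return;
        }
        // New packet: record it in sequence order for later gap scanning.
        receivedSeqs.insert(seq);
    }

    const std::set<uint16_t>& Received() const { return receivedSeqs; }

private:
    std::set<uint16_t> receivedSeqs;     // ordered buffer of received packets
    std::set<uint16_t> outstandingNacks; // NACKs still awaiting a re-transmit
};
```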
Second, when it's time to send NACKs, we iterate through the buffer of received packets and look for gaps in the sequence numbers. Any gaps are marked as "missing" packets, and NACKs are sent for them. As mentioned above, when NACK'd packets are re-transmitted and received, they are removed from the missing list. If we never get the re-transmit, we eventually time out the NACK and stop tracking the missing packet.
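Continuing the sketch, the gap scan might look like this (again hypothetical; wrap-around handling and NACK timeouts omitted for brevity):

```cpp
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Walk the ordered buffer and collect every sequence number that falls in
// a gap between two received packets; each result becomes a NACK candidate.
std::vector<uint16_t> FindMissingSeqs(const std::set<uint16_t>& received)
{
    std::vector<uint16_t> missing;
    if (received.size() < 2)
    {
        return missing;
    }
    auto prev = received.begin();
    for (auto it = std::next(prev); it != received.end(); ++it, ++prev)
    {
        // Sequence numbers strictly between consecutive received packets
        // were never seen and are presumed lost.
        for (uint16_t seq = static_cast<uint16_t>(*prev + 1); seq != *it; ++seq)
        {
            missing.push_back(seq);
        }
    }
    return missing;
}
```

On the wire, the missing list would be packed into RTCP Generic NACK feedback messages (RFC 4585), where each FCI entry carries a packet ID plus a 16-bit bitmask covering the following sixteen sequence numbers.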
That's the gist of it, but there are some additional complexities:
Also, I snuck in a change to the pending/current keyframe packet collection logic in FtlMediaConnection. It now checks that packets are sequential without gaps and that the final packet has the marker bit set before it considers a keyframe valid. This should prevent most cases of corrupt frames being used as stream thumbnails.
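A minimal sketch of that validity check (illustrative types, not the actual FtlMediaConnection code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyframePacket
{
    uint16_t seq; // RTP sequence number
    bool marker;  // RTP marker bit, set on the final packet of a frame
};

// A keyframe is only considered valid if its packets are strictly
// sequential (no gaps) and the last packet carries the marker bit.
bool IsKeyframeComplete(const std::vector<KeyframePacket>& packets)
{
    if (packets.empty())
    {
        return false;
    }
    for (std::size_t i = 1; i < packets.size(); ++i)
    {
        if (packets[i].seq != static_cast<uint16_t>(packets[i - 1].seq + 1))
        {
            return false; // gap: a packet is missing, frame may be corrupt
        }
    }
    return packets.back().marker; // marker bit flags the end of the frame
}
```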
Closes #129
Potential Future Improvements