Research and create proposal for addressing the session limit problem when fast verification is used #2323
Comments
Findings

From looking through the code and running some tests:
Proposed solution

I'm leaning toward a fairly simple solution with minimal changes to the code. We need to extend the logic in point 5 above to send a session teardown signal to O (via RPC). O will then invoke logic similar to the timeout teardown of point 4 on the T session, using the supplied SessionId (a rough sketch of this signal follows after the notes below).

Other notes
It's unknown whether a session is a verification session or a 'real' one before B makes its choice based on transcoding results, so we'll need to
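A minimal sketch of what that teardown signal and the O-side handler could look like, assuming hypothetical type, field, and function names (the real go-livepeer RPC messages and session bookkeeping will differ):

```go
// Hypothetical sketch: B tells O to tear down a session it no longer needs
// (e.g. after a verification-only segment), instead of waiting for the idle
// timeout. All names here are placeholders, not the real go-livepeer API.
package sketch

import (
	"sync"
	"time"
)

// TeardownSessionRequest is what B would send to O over the existing RPC channel.
type TeardownSessionRequest struct {
	SessionId string // same SessionId B received when the session was created
}

// sessionRegistry stands in for O's bookkeeping of active transcode (T) sessions.
type sessionRegistry struct {
	mu       sync.Mutex
	sessions map[string]*transcodeSession
}

type transcodeSession struct {
	lastUsed time.Time // what the idle-timeout reaper would normally check
	cleanup  func()    // releases the T session and its HW (lpms) context
}

// HandleTeardown reuses the same cleanup path that the idle-timeout reaper runs,
// just triggered explicitly by B's signal instead of after ~1 minute of inactivity.
func (r *sessionRegistry) HandleTeardown(req *TeardownSessionRequest) {
	r.mu.Lock()
	sess, ok := r.sessions[req.SessionId]
	if ok {
		delete(r.sessions, req.SessionId)
	}
	r.mu.Unlock()
	if ok {
		sess.cleanup()
	}
}
```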
Interesting idea. How would we balance load in this case? If we pick Os for verification so that their ratio to regular Os matches the verification frequency, will those Os be statistically 'luckier' in terms of getting a steadier load and payouts?
Great writeup, thanks @cyberj0g! Not having to worry about the HW context teardown makes a big difference, so I think your proposal to avoid complexity and extra accounting here makes sense. The extra monitoring/insights work we're looking to do should help us keep an eye on whether this issue grows again to the point (i.e. an excessive amount of verification relative to orchestrator numbers) where we need to revisit and refine the solution.
Regarding point 2 (HW context init): I found a check for differing pixel formats that triggers an expensive decoder re-init on a pixel format change. What we want to introduce is selecting an idle session for a new segment that matches the same pixel format. I would also expect the selection to match the session's input and output codecs with the segment's input and output codecs, since we added support for VP8, VP9, and HEVC. Furthermore, I would expect a reused session to need an expensive re-init of its encoders and decoders when a new segment has a larger resolution than the session handled before. I haven't found such checks in the code yet.
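To make the matching idea concrete, here is a rough sketch built around an illustrative session descriptor; the struct and field names are hypothetical, not the actual lpms/go-livepeer types:

```go
// Hypothetical sketch of picking an idle transcode session whose decoder/encoder
// setup already matches the incoming segment, so we avoid an expensive HW re-init.
// These types and fields are illustrative only.
package sketch

type sessionParams struct {
	PixelFormat string // e.g. "yuv420p"
	InCodec     string // e.g. "h264", "vp8", "vp9", "hevc"
	OutCodec    string
	MaxWidth    int
	MaxHeight   int
}

// matches reports whether an idle session can take the segment without
// re-initializing its decoder/encoder: same pixel format, same codecs,
// and a resolution no larger than what the session was set up for.
func (s sessionParams) matches(seg sessionParams) bool {
	return s.PixelFormat == seg.PixelFormat &&
		s.InCodec == seg.InCodec &&
		s.OutCodec == seg.OutCodec &&
		seg.MaxWidth <= s.MaxWidth &&
		seg.MaxHeight <= s.MaxHeight
}

// pickIdleSession returns the index of a compatible idle session, or -1 if the
// segment needs a new (or re-initialized) session.
func pickIdleSession(idle []sessionParams, seg sessionParams) int {
	for i, s := range idle {
		if s.matches(seg) {
			return i
		}
	}
	return -1
}
```

The resolution check is deliberately "no larger than", reflecting the expectation above that only a larger segment resolution would force an encoder/decoder re-init.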
@AlexKordic, good point on pixel format, but are we sure there's a need to re-initialize the HW context in other cases, like different resolutions? Also, the network is currently quite uniform in terms of transcoding parameters, so these re-initializations shouldn't happen often, and we can skip implementing T session 'matching' for now.
Thanks @AlexKordic for bringing up T session accounting again on the call. It looks like I didn't consider the T load-balancing logic in the initial implementation proposal. That's one more layer of sessions, which are linked 1:1 to incoming streams and torn down after a 1-minute timeout. Finally, lpms sessions (which hold the HW context) are linked 1:1 with these T sessions, which means the HW context is initialized and then torn down after 1 minute of inactivity for each incoming stream. We will need to signal teardown for these sessions as well, so that frequent verification won't exhaust GPU resources (memory) - though that doesn't seem likely with 1-minute timeouts. Separately, it's also a good idea to refactor the load balancer so that transcoding sessions are selected based on the input/output parameters of the stream instead of the stream id; this would allow more meaningful session re-use and fewer initializations. This is probably not a high-priority task.
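A hedged sketch of that load-balancer refactoring idea, keying idle sessions by transcode parameters instead of the stream id; all names here are illustrative, not the existing load-balancer code:

```go
// Hypothetical sketch of keying the load balancer by transcode parameters instead
// of the incoming stream id, so compatible streams can reuse the same T session
// (and its HW context) rather than each stream holding its own until the timeout.
package sketch

import "fmt"

type streamParams struct {
	PixelFormat string
	InCodec     string
	OutCodec    string
	Width       int
	Height      int
}

// paramKey collapses the parameters that would require a distinct HW context
// into a single map key; two streams with the same key can share idle sessions.
func paramKey(p streamParams) string {
	return fmt.Sprintf("%s/%s->%s/%dx%d",
		p.PixelFormat, p.InCodec, p.OutCodec, p.Width, p.Height)
}

// sessionPool maps a parameter key to idle sessions able to serve any stream
// with matching parameters.
type sessionPool map[string][]*pooledSession

type pooledSession struct{} // would hold the lpms/HW context in the real code

func (p sessionPool) acquire(params streamParams) *pooledSession {
	key := paramKey(params)
	if idle := p[key]; len(idle) > 0 {
		sess := idle[len(idle)-1]
		p[key] = idle[:len(idle)-1]
		return sess // reuse: no HW context re-init needed
	}
	return &pooledSession{} // new session: pays the init cost once
}
```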
I think it's very important to remove most of the timeouts from the protocol. 1 minute is a good choice for a timeout; on the other hand, it amounts to 14700 frames of idle time that we need to compensate for in our load-balancing and capacity logic.
Just for clarification - this session teardown design will in fact require HW context re-initialization, because in the latest design the HW context is destroyed upon receiving the session teardown signal, correct? And the discussion between @cyberj0g and @AlexKordic is highlighting that a separate task could be to explore keeping HW contexts around and re-using them if O receives another stream whose params are compatible with an existing HW context?
Correct. Take into consideration that timeouts in our protocol cost 10 to 100 times more than re-initialization. The difference is that timeouts introduce idle time, while re-initialization causes some resource usage and a small delay.
Did a quick measurement of decode performance. Decoding a segment yields 850 frames per second. Decoding an I-frame while re-creating the decoder after each frame yields 31 frames per second. According to these results, we pay about a 32-millisecond penalty for recreating the decoder, while the timeout is 60000 milliseconds (nearly 2000 times more).
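A quick back-of-the-envelope check of those figures, using only the numbers quoted above:

```go
// Back-of-the-envelope check of the figures above: per-frame decode time with and
// without recreating the decoder, and how the penalty compares to the 60 s timeout.
package main

import "fmt"

func main() {
	const (
		fpsReuse    = 850.0 // frames/sec when the decoder is kept alive
		fpsRecreate = 31.0  // frames/sec when the decoder is recreated per frame
		timeoutMs   = 60000.0
	)
	perFrameReuse := 1000.0 / fpsReuse       // ~1.2 ms
	perFrameRecreate := 1000.0 / fpsRecreate // ~32.3 ms
	penalty := perFrameRecreate - perFrameReuse
	fmt.Printf("re-init penalty ≈ %.1f ms, timeout/penalty ≈ %.0fx\n",
		penalty, timeoutMs/penalty)
	// Prints roughly: re-init penalty ≈ 31.1 ms, timeout/penalty ≈ 1930x
}
```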
Closed by #2381
Fast verification is run at random intervals based on a configurable frequency (i.e. 1 out of every 5 segments). As a result, whenever it runs, B will send a source segment to multiple Os, but afterwards will continue sending the stream's segments to just one of those Os. The other Os will have created sessions in order to handle the segment, but afterwards these sessions sit idle, waiting to be cleaned up after a timeout. If this happens consistently, session usage on Os can increase and diverge significantly from actual resource (i.e. hardware) utilization. Many Os would see spikes in sessions that do not reflect actual resource utilization, and eventually many Os could hit session limits, preventing them from accepting additional jobs.
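For context, a minimal sketch of the per-segment decision B makes under a configurable verification frequency; the function and parameter names are assumptions, not the actual go-livepeer implementation:

```go
// Hypothetical sketch of the per-segment verification decision on B: with a
// frequency of N, roughly 1 out of every N segments is also sent to extra Os
// for fast verification. Names are illustrative only.
package sketch

import "math/rand"

// shouldVerify returns true for roughly 1/freq of segments; freq <= 1 verifies
// every segment.
func shouldVerify(freq int) bool {
	if freq <= 1 {
		return true
	}
	return rand.Intn(freq) == 0
}
```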
Some starter ideas: