
Research and create proposal for addressing the session limit problem when fast verification is used #2323

Closed
yondonfu opened this issue Mar 16, 2022 · 10 comments

@yondonfu
Member

Fast verification is run at random intervals based on a configurable frequency (i.e. 1 out of every 5 segments). As a result, whenever it is run, B will send a source segment to multiple Os, but afterwards will continue sending segments of the stream to just one of those Os. The other Os will have created sessions in order to handle the segment, but afterwards these sessions will sit idle, waiting to be cleaned up after a timeout. If this happens consistently, session usage on Os can increase and diverge a lot from actual resource (i.e. hardware) utilization. Many Os would see spikes in sessions that do not reflect actual resource utilization, and eventually many Os could hit session limits that prevent them from accepting additional jobs.
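
To get a feel for the scale, here is a rough back-of-envelope; the segment length, number of extra Os per verification run and idle timeout below are assumptions for illustration, not measured values:

```go
package main

import "fmt"

func main() {
	// Assumed numbers for illustration only.
	verificationFreq := 1.0 / 5.0 // verify 1 out of every 5 segments
	extraOsPerRun := 2.0          // Os that receive the segment but don't keep the stream
	segmentLenSec := 2.0
	idleTimeoutSec := 60.0

	// Idle sessions created per stream during one timeout window. Since each
	// idle session lives for roughly one timeout, this is also the approximate
	// number of idle sessions alive at any moment, per stream.
	segmentsPerTimeout := idleTimeoutSec / segmentLenSec
	idleSessions := verificationFreq * segmentsPerTimeout * extraOsPerRun

	fmt.Printf("~%.0f idle sessions per stream at steady state\n", idleSessions) // ~12
}
```

Under those assumptions, a single stream keeps on the order of a dozen idle sessions alive across Os at any given time, which is comparable to a single T's default session capacity.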

Some starter ideas:

  • Perhaps we could have B signal to O to tear down the session instead of waiting for the timeout
    • Additional consideration: initializing sessions with Nvidia can be expensive, so what is the implication of initializing + tearing down very often if we have Bs signal to Os to tear down sessions?
  • Reserve Os specifically for fast verification so transcoding capacity doesn’t get impacted by verification capacity
    • Or reserve certain sessions on O for one type of task vs. the other
    • If we do this, we probably want clear delineations/boundaries
      • Maybe just default to split O + T, with some Ts dedicated to one and other Ts dedicated to the other
  • Have some way of differentiating these sessions so that they don’t count / count less towards the limit
  • Make the algorithm for how frequently verification runs take into account the number of Orchestrators and their session capacity (in addition to the target frequency interval)
@cyberj0g
Contributor

Findings

From looking through the code and running some tests:

  1. Transcoding session accounting happens entirely on the O side. Transcoder capacity is 10 by default, controlled by the maxSessions parameter, and is reported when T connects to O.
  2. T initializes the HW context (if acceleration is enabled) once transcoding of the first segment starts on a Transcoder instance (instance meaning the object, not the node). A separate transcoder instance is associated with each HW accelerator. The HW context is never destroyed, thus we don't have to worry about the expense of initializing sessions with Nvidia.
  3. This piece in RemoteTranscoderManager.selectTranscoder() (runs on O) will prevent O from sending more sessions to a T that is overbooked by idle sessions (a simplified sketch of this check follows the list).
  4. transcodeSegmentLoop() contains the actual T session teardown, but it happens only on timeout.
  5. There is unused-session teardown logic on B here, but it doesn't send any signal to O, so the teardown doesn't propagate to transcoding sessions.
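
A simplified sketch of the kind of capacity check described in 3., with illustrative names rather than the actual go-livepeer code:

```go
package sketch

// remoteTranscoder mirrors the idea of finding 3: O tracks each T's reported
// capacity and its current load, and idle verification sessions count
// towards that load just like active ones.
type remoteTranscoder struct {
	capacity int // reported via maxSessions when T connects to O
	load     int // sessions currently assigned, including idle verification sessions
}

// selectTranscoder-style check: a T whose load has reached capacity is
// skipped, even if its GPU is doing no real work because the load is
// inflated by idle verification sessions.
func selectTranscoder(pool []*remoteTranscoder) *remoteTranscoder {
	for _, t := range pool {
		if t.load < t.capacity {
			t.load++
			return t
		}
	}
	return nil // every T appears fully booked
}
```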

Proposed solution

I'm leaning in favor of a fairly simple solution with minimal modifications to the code. We need to extend the logic in 5. above to send a session teardown signal to O (via RPC), which will then invoke logic similar to the timeout teardown in 4. on the T session, using the supplied SessionId. A sketch of what this could look like follows.
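
A minimal sketch of the B -> O teardown signal; the endpoint name, header and handler wiring are assumptions for discussion, not the existing go-livepeer RPC surface. The O side deliberately reuses the existing timeout-teardown path rather than adding new accounting:

```go
package sketch

import "net/http"

// B side: tell O that this session (e.g. a verification session that lost the
// selection) won't be used again, instead of letting it idle until the timeout.
func signalSessionTeardown(orchURL, sessionID string) error {
	req, err := http.NewRequest(http.MethodPost, orchURL+"/endSession", nil)
	if err != nil {
		return err
	}
	req.Header.Set("Livepeer-Session-Id", sessionID)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// O side: look up the session by ID and run the same cleanup that the
// timeout path (finding 4.) already performs.
func handleEndSession(w http.ResponseWriter, r *http.Request) {
	sessionID := r.Header.Get("Livepeer-Session-Id")
	completeSession(sessionID) // assumed wrapper around the existing timeout-teardown logic
	w.WriteHeader(http.StatusOK)
}

func completeSession(sessionID string) { /* existing O-side teardown */ }
```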

Other notes

Have some way of differentiating these sessions so that they don’t count / count less towards the limit

It's unknown whether this is a verification session or a 'real' session before B makes a choice based on transcoding results, so we would need to:

  • signal to O that this session won't be used and shouldn't be counted anymore
  • modify the session accounting logic on O to respect 'unused' sessions

This sounds like a more complex solution than just a teardown signal, without any immediate benefits.

Reserve Os specifically for fast verification so transcoding capacity doesn’t get impacted by verification capacity

Interesting idea. How would we balance load in this case? If we pick Os for verification so that their ratio to normal Os matches the verification frequency, will those Os be statistically 'luckier' in terms of getting a steadier load and payouts?

@thomshutt
Contributor

Great writeup, thanks @cyberj0g!

Not having to worry about the HW context teardown makes a big difference, so I think your proposal to avoid complexity and not introduce any extra accounting here makes sense.

I think the extra monitoring / insights work we're looking to do should help us keep an eye on whether this issue grows again to the point where we need to revisit and refine the solution (i.e. an excessive amount of verification relative to the number of orchestrators).

@AlexKordic
Contributor

AlexKordic commented Mar 23, 2022

Regarding point 2 (HW context init): I found a check for a different pixel format that triggers an expensive decoder init on pixel format change.

What we want to introduce is selecting an idle session for a new segment that matches the same pixel format.

Furthermore, I would expect similar selection to take place to match the session's input and output codecs with the segment's input and output codecs, because we added support for VP8, VP9 and HEVC.

I would also expect a reused session to need an expensive re-init of encoders & decoders when a new segment has a larger resolution than the session handled before. I haven't found such checks in the code yet.

@cyberj0g
Contributor

cyberj0g commented Mar 24, 2022

@AlexKordic, good point on pixel format, but are we sure there's a need to re-initialize the HW context in other cases, like different resolutions? Also, I think the network is currently quite uniform in terms of transcoding parameters, so these re-initializations shouldn't happen often, and we can skip implementing T session 'matching' for now.

@cyberj0g
Contributor

cyberj0g commented Mar 28, 2022

Thanks @AlexKordic for bringing up T session accounting again on the call.

It looks like I didn't consider the T load balancing logic in the initial implementation proposal. That's one more layer of sessions, which are linked 1:1 to incoming streams and torn down after a 1 minute timeout. Finally, lpms sessions (which hold the HW context) are linked 1:1 with these T sessions. This means the HW context is initialized and then torn down after 1 minute of inactivity for each incoming stream. We will need to signal teardown for these sessions as well, so that frequent verification doesn't exhaust GPU resources (memory) - though that doesn't seem likely with 1 minute timeouts.

Separately, it's also a good idea to refactor the load balancer so that transcoding sessions are selected based on the input/output parameters of the stream instead of the stream ID; this would allow more meaningful session re-use and fewer initializations (see the sketch below). This is probably not a high-priority task.
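
A rough sketch of that refactor idea: key the load-balancer sessions by the stream's transcoding parameters instead of its stream ID, so compatible streams reuse an already-initialized lpms/HW session. The types and fields here are illustrative, not the current go-livepeer structures:

```go
package sketch

// sessionKey captures the transcoding parameters that actually determine
// whether an lpms/HW session can be reused; fields are illustrative.
type sessionKey struct {
	inPixelFormat string // e.g. "yuv420p"
	inCodec       string // e.g. "h264", "hevc", "vp8", "vp9"
	outCodec      string
	maxWidth      int
	maxHeight     int
}

// lbSession stands in for a T load-balancer session holding an lpms session / HW context.
type lbSession struct{}

// sessions keyed by parameters rather than by stream ID.
var sessions = map[sessionKey]*lbSession{}

// sessionFor returns an existing compatible session, or creates one and pays
// the HW initialization cost once.
func sessionFor(key sessionKey) *lbSession {
	if s, ok := sessions[key]; ok {
		return s // reuse: no decoder/encoder re-initialization
	}
	s := &lbSession{}
	sessions[key] = s
	return s
}
```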

@AlexKordic
Contributor

I think it's very important to remove most of the timeouts from the protocols. 1 minute is a good choice for a timeout; on the other hand, it amounts to 14700 frames of idle time that we need to compensate for in our load balancing and capacity logic.

@yondonfu
Copy link
Member Author

This means the HW context is initialized and then torn down after 1 minute of inactivity for each incoming stream.

Just for clarification - this session teardown design will in fact require HW context re-initialization, because in the latest design the HW context is destroyed upon receiving the session teardown signal, correct?

And the discussion b/w @cyberj0g and @AlexKordic is highlighting that a separate task could be to explore keeping HW contexts around and re-using them if O receives another stream whose params are compatible with an existing HW context?

@AlexKordic
Contributor

And the discussion b/w @cyberj0g and @AlexKordic is highlighting that a separate task could be to explore keeping HW contexts around and re-using them if O receives another stream whose params are compatible with an existing HW context?

Correct. Also take into consideration that timeouts in our protocol cost 10 to 100 times more than re-initialization. The difference is that timeouts introduce idle time, while re-initialization costs some resource usage and a small delay.

@AlexKordic
Contributor

timeouts in our protocol cost 10 to 100 times more than re-initialization.

I did a quick measurement of decode performance. Decoding a segment yields 850 frames per second. Decoding I-frames while re-creating the decoder after each frame yields 31 frames per second.

According to these results, we pay a ~32 millisecond penalty for re-creating the decoder, while the timeout is 60000 milliseconds (nearly 2000 times more).
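
For reference, the ~32 ms figure follows directly from the two throughput numbers; a quick check using only the measurements above:

```go
package main

import "fmt"

func main() {
	// Figures from the measurement above.
	decodeOnlyFPS := 850.0    // frames/s when the decoder is reused
	decodeRecreateFPS := 31.0 // frames/s when the decoder is re-created per frame
	timeoutMs := 60000.0      // current idle-session timeout

	perFrameMs := 1000.0 / decodeOnlyFPS                 // ~1.2 ms to decode a frame
	perFrameWithRecreateMs := 1000.0 / decodeRecreateFPS // ~32 ms with re-creation
	recreatePenaltyMs := perFrameWithRecreateMs - perFrameMs

	fmt.Printf("decoder re-creation penalty: ~%.0f ms/frame\n", recreatePenaltyMs) // ~31 ms
	fmt.Printf("timeout / penalty: ~%.0fx\n", timeoutMs/recreatePenaltyMs)         // ~1900x
}
```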

@yondonfu
Member Author

Closed by #2381
