
Research and create proposal for addressing the session limit problem when fast verification is used #2323

Closed
yondonfu opened this issue Mar 16, 2022 · 10 comments

@yondonfu
Member

Fast verification is run at random intervals based on a configurable frequency (i.e. 1 out of every 5 segments). As a result, whenever it is run, B will send a source segment to multiple Os, but afterwards will continue sending segments of the stream to just one of those Os. The other Os will have created sessions in order to handle the segment, but afterwards these sessions will sit idle, waiting to be cleaned up after a timeout. If this happens consistently, session usage on Os can increase and diverge a lot from actual resource (i.e. hardware) utilization. Many Os would see spikes in sessions that do not reflect actual resource utilization, and eventually many Os could hit session limits that prevent them from accepting additional jobs.
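
To get a feel for the scale, here is a rough back-of-envelope; the segment length, number of extra Os per verification run and idle timeout below are assumptions for illustration, not measured values:

```go
package main

import "fmt"

func main() {
	// Assumed numbers for illustration only.
	verificationFreq := 1.0 / 5.0 // verify 1 out of every 5 segments
	extraOsPerRun := 2.0          // Os that receive the segment but don't keep the stream
	segmentLenSec := 2.0
	idleTimeoutSec := 60.0

	// Idle sessions created per stream during one timeout window. Since each
	// idle session lives for roughly one timeout, this is also the approximate
	// number of idle sessions alive at any moment, per stream.
	segmentsPerTimeout := idleTimeoutSec / segmentLenSec
	idleSessions := verificationFreq * segmentsPerTimeout * extraOsPerRun

	fmt.Printf("~%.0f idle sessions per stream at steady state\n", idleSessions) // ~12
}
```

Under those assumptions, a single stream keeps on the order of a dozen idle sessions alive across Os at any given time, which is comparable to a single T's default session capacity.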

Some starter ideas:

  • Perhaps we could have B signal to O to tear down the session instead of waiting for the timeout
    • Additional consideration: initializing sessions with Nvidia can be expensive, so what is the implication of initializing + tearing down very often if we have Bs signal to Os to tear down sessions?
  • Reserve Os specifically for fast verification so transcoding capacity doesn’t get impacted by verification capacity
    • Or reserve certain sessions on O for one type of task vs. the other
    • If we do this, we probably want clear delineations/boundaries
      • Maybe just default to split O + T, with some Ts dedicated to one and other Ts dedicated to the other
  • Have some way of differentiating these sessions so that they don’t count / count less towards the limit
  • Make the algorithm for how frequently verification runs take into account the number of Orchestrators and their session capacity (in addition to the target frequency interval)
@cyberj0g
Contributor

Findings

From looking through the code and running some tests:

  1. Transcoding session accounting happens entirely on the O side. Transcoder capacity is 10 by default, controlled by the maxSessions parameter, and is reported when T connects to O.
  2. T initializes the HW context (if acceleration is enabled) once transcoding of the first segment starts on a Transcoder instance (instance meaning the object, not the node). A separate transcoder instance is associated with each HW accelerator. The HW context is never destroyed, thus we don't have to worry about the expense of initializing sessions with Nvidia.
  3. This piece in RemoteTranscoderManager.selectTranscoder() (runs on O) will prevent O from sending more sessions to a T that is overbooked by idle sessions (a simplified sketch of this check follows the list).
  4. transcodeSegmentLoop() contains the actual T session teardown, but it happens only on timeout.
  5. There is unused-session teardown logic on B here, but it doesn't send any signal to O, so the teardown doesn't propagate to transcoding sessions.
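
A simplified sketch of the kind of capacity check described in 3., with illustrative names rather than the actual go-livepeer code:

```go
package sketch

// remoteTranscoder mirrors the idea of finding 3: O tracks each T's reported
// capacity and its current load, and idle verification sessions count
// towards that load just like active ones.
type remoteTranscoder struct {
	capacity int // reported via maxSessions when T connects to O
	load     int // sessions currently assigned, including idle verification sessions
}

// selectTranscoder-style check: a T whose load has reached capacity is
// skipped, even if its GPU is doing no real work because the load is
// inflated by idle verification sessions.
func selectTranscoder(pool []*remoteTranscoder) *remoteTranscoder {
	for _, t := range pool {
		if t.load < t.capacity {
			t.load++
			return t
		}
	}
	return nil // every T appears fully booked
}
```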

Proposed solution

I'm leaning in favor of a fairly simple solution with minimal modifications to the code. We need to extend the logic in 5. above to send a session teardown signal to O (via RPC), which will then invoke logic similar to the timeout teardown in 4. on the T session, using the supplied SessionId. A sketch of what this could look like follows.
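
A minimal sketch of the B -> O teardown signal; the endpoint name, header and handler wiring are assumptions for discussion, not the existing go-livepeer RPC surface. The O side deliberately reuses the existing timeout-teardown path rather than adding new accounting:

```go
package sketch

import "net/http"

// B side: tell O that this session (e.g. a verification session that lost the
// selection) won't be used again, instead of letting it idle until the timeout.
func signalSessionTeardown(orchURL, sessionID string) error {
	req, err := http.NewRequest(http.MethodPost, orchURL+"/endSession", nil)
	if err != nil {
		return err
	}
	req.Header.Set("Livepeer-Session-Id", sessionID)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// O side: look up the session by ID and run the same cleanup that the
// timeout path (finding 4.) already performs.
func handleEndSession(w http.ResponseWriter, r *http.Request) {
	sessionID := r.Header.Get("Livepeer-Session-Id")
	completeSession(sessionID) // assumed wrapper around the existing timeout-teardown logic
	w.WriteHeader(http.StatusOK)
}

func completeSession(sessionID string) { /* existing O-side teardown */ }
```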

Other notes

Have some way of differentiating these sessions so that they don’t count / count less towards the limit

It's unknown whether this is a verification session or a 'real' session before B makes a choice based on transcoding results, so we would need to:

  • signal to O that this session won't be used and shouldn't be counted anymore
  • modify the session accounting logic on O to respect 'unused' sessions

This sounds like a more complex solution than just a teardown signal, without any immediate benefits.

Reserve Os specifically for fast verification so transcoding capacity doesn’t get impacted by verification capacity

Interesting idea. How would we balance load in this case? If we pick Os for verification so that their ratio to normal Os matches the verification frequency, will those Os be statistically 'luckier' in terms of getting a steadier load and payouts?

@thomshutt
Contributor

Great writeup, thanks @cyberj0g!

Not having to worry about the HW context teardown makes a big difference, so I think your proposal to avoid complexity and not introduce any extra accounting here makes sense.

I think the extra monitoring / insights work we're looking to do should help us keep an eye on whether this issue grows again to the point where we need to revisit and refine the solution (i.e. an excessive amount of verification relative to the number of orchestrators).

@AlexKordic
Contributor

AlexKordic commented Mar 23, 2022

Regarding point 2 (HW context init): I found a check for a different pixel format that triggers an expensive decoder init on pixel format change.

What we want to introduce is selecting an idle session for a new segment that matches the same pixel format.

Furthermore, I would expect similar selection to take place to match the session's input and output codecs with the segment's input and output codecs, because we added support for VP8, VP9 and HEVC.

I would also expect a reused session to need an expensive re-init of encoders & decoders when a new segment has a larger resolution than the session handled before. I haven't found such checks in the code yet.

@cyberj0g
Contributor

cyberj0g commented Mar 24, 2022

@AlexKordic, good point on pixel format, but are we sure there's a need to re-initialize the HW context in other cases, like different resolutions? Also, I think the network is currently quite uniform in terms of transcoding parameters, so these re-initializations shouldn't happen often, and we can skip implementing T session 'matching' for now.

@cyberj0g
Contributor

cyberj0g commented Mar 28, 2022

Thanks @AlexKordic for bringing up T session accounting again on the call.

It looks like I didn't consider the T load balancing logic in the initial implementation proposal. That's one more layer of sessions, which are linked 1:1 to incoming streams and torn down after a 1 minute timeout. Finally, lpms sessions (which hold the HW context) are linked 1:1 with these T sessions. This means the HW context is initialized and then torn down after 1 minute of inactivity for each incoming stream. We will need to signal teardown for these sessions as well, so that frequent verification doesn't exhaust GPU resources (memory) - though that doesn't seem likely with 1 minute timeouts.

Separately, it's also a good idea to refactor the load balancer so that transcoding sessions are selected based on the input/output parameters of the stream instead of the stream ID; this would allow more meaningful session re-use and fewer initializations (see the sketch below). This is probably not a high-priority task.
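
A rough sketch of that refactor idea: key the load-balancer sessions by the stream's transcoding parameters instead of its stream ID, so compatible streams reuse an already-initialized lpms/HW session. The types and fields here are illustrative, not the current go-livepeer structures:

```go
package sketch

// sessionKey captures the transcoding parameters that actually determine
// whether an lpms/HW session can be reused; fields are illustrative.
type sessionKey struct {
	inPixelFormat string // e.g. "yuv420p"
	inCodec       string // e.g. "h264", "hevc", "vp8", "vp9"
	outCodec      string
	maxWidth      int
	maxHeight     int
}

// lbSession stands in for a T load-balancer session holding an lpms session / HW context.
type lbSession struct{}

// sessions keyed by parameters rather than by stream ID.
var sessions = map[sessionKey]*lbSession{}

// sessionFor returns an existing compatible session, or creates one and pays
// the HW initialization cost once.
func sessionFor(key sessionKey) *lbSession {
	if s, ok := sessions[key]; ok {
		return s // reuse: no decoder/encoder re-initialization
	}
	s := &lbSession{}
	sessions[key] = s
	return s
}
```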

@AlexKordic
Contributor

I think it's very important to remove most of the timeouts from the protocols. 1 minute is a good choice for a timeout; on the other hand, it amounts to 14700 frames of idle time that we need to compensate for in our load balancing and capacity logic.

@yondonfu
Copy link
Member Author

This means the HW context is initialized and then torn down after 1 minute of inactivity for each incoming stream.

Just for clarification - this session teardown design will in fact require HW context re-initialization, because in the latest design the HW context is destroyed upon receiving the session teardown signal, correct?

And the discussion b/w @cyberj0g and @AlexKordic is highlighting that a separate task could be to explore keeping HW contexts around and re-using them if O receives another stream whose params are compatible with an existing HW context?

@AlexKordic
Contributor

And the discussion b/w @cyberj0g and @AlexKordic is highlighting that a separate task could be to explore keeping HW contexts around and re-using them if O receives another stream whose params are compatible with an existing HW context?

Correct. Also take into consideration that timeouts in our protocol cost 10 to 100 times more than re-initialization. The difference is that timeouts introduce idle time, while re-initialization costs some resource usage and a small delay.

@AlexKordic
Contributor

timeouts in our protocol cost 10 to 100 times more than re-initialization.

I did a quick measurement of decode performance. Decoding a segment yields 850 frames per second. Decoding I-frames while re-creating the decoder after each frame yields 31 frames per second.

According to these results, we pay a ~32 millisecond penalty for re-creating the decoder, while the timeout is 60000 milliseconds (nearly 2000 times more).
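
For reference, the ~32 ms figure follows directly from the two throughput numbers; a quick check using only the measurements above:

```go
package main

import "fmt"

func main() {
	// Figures from the measurement above.
	decodeOnlyFPS := 850.0    // frames/s when the decoder is reused
	decodeRecreateFPS := 31.0 // frames/s when the decoder is re-created per frame
	timeoutMs := 60000.0      // current idle-session timeout

	perFrameMs := 1000.0 / decodeOnlyFPS                 // ~1.2 ms to decode a frame
	perFrameWithRecreateMs := 1000.0 / decodeRecreateFPS // ~32 ms with re-creation
	recreatePenaltyMs := perFrameWithRecreateMs - perFrameMs

	fmt.Printf("decoder re-creation penalty: ~%.0f ms/frame\n", recreatePenaltyMs) // ~31 ms
	fmt.Printf("timeout / penalty: ~%.0fx\n", timeoutMs/recreatePenaltyMs)         // ~1900x
}
```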

@yondonfu
Member Author

Closed by #2381
