-
Notifications
You must be signed in to change notification settings - Fork 461
MFE Overview
-
GPU supports only single context execution, limitation is - in single context, at any given time, workload (encoder kernel threads) from only one frame can be executed by EU/VME.
-
In current AVC/HEVC encoder design, MBENC kernel thread for each MB/LCU is launched following 26 degree wavefront dependency pattern because of the dependency on left, above and above right MBs. Left/above/above right MBs are used to calculate initial search starting point and neighboring INTRA prediction mode. This dependency is defined by H.264/H.265 ITU recommendation specification. Due to this dependency, there is inherent limitation in amount of parallel threads that can be dispatched for the computation. The bigger the frame for processing, more parallel threads (going from top of frame to center diagonal); and smaller frames have fewer parallel threads. Due to this, the utilization of the resource (VME) is not efficient for smaller frames (<1080).
- thread dispatching - the overhead for thread dispatching is pretty high comparing to one macroblock
In this section, we will look into the relations between the frame size for encode and the VME utilization. As discussed previously, MB dependencies limit the number of parallel threads (wavefront dependency pattern), and the number of parallel threads is related to its frame size. Now, we will tie this understanding to VME utilization; consider the case below:
-
Assuming 1080p frame and each MB takes the 1 unit of time to process.
-
Assuming we are not limited by resources, meaning we have enough threads and VMEs.
-
EU/VME utilization is very low
-
Peak # of parallel kernel thread is about 60.
-
Takes about 256 unit of time to finish the whole frame.
-
Takes about 120 unit of time to ramp up to peak state.
-
Takes about 120 unit of time to ramp down from peak state.
-
- Result of algorithm and thread dispatching limitations - we can't utilize VME for 100% at high end GPU SKUs, and even at low-end SKUs for low resolutions.
- On a SKL GT4 system, which has 3 slices and 9 VMEs, when encoding 1080p in "best speed" mode, we can only utilize around 30% of the VME available (that is equivalent to 1 slice worth of VMEs!). When we have a single stream being encoded, as a result, disabling 2 slices and having only 1 slice on starts to become beneficial - (1) to reduce thread scheduling, (2) and power consumption by concentrating the execution on one slice instead of spreading across 3.
In most transcode use-cases, we are bound by the EUs (due to MBEnc kernel timings), and combine that with poor scalability with VMEs and single context limitation, transcode workloads esp ABR or multi transcode use-cases are impacted. This impact is exacerbated on low resolutions. Consider the pipeline below with 4 parallel transcodes:
Frames from different streams/sessions are combined together as single batch buffer of the ENC kernel (MF-ENC). This helps to increase number of independent wave fronts executed in parallel, creating more parallelism for ENC operation and thus increasing VME utilization. As a result the EU array can process several frames within the same or near to the same execution time as a single frame.
VME utilization increased, within the same or near ENC execution time as single frame.
As multiple frames execute at a time of single frame, system utilization improved, through improved EU Array time utilization.
- Media SDK for Linux
- Media SDK for Windows
- FFmpeg QSV
- GStreamer MSDK
- Docker
- Usage guides
- Building Media SDK
- Running Media SDK CI tests
- Additional information
- Multi-Frame Encode