Question
Problem statement:
The Intel hardware MFT is not honoring the GOP setting, resulting in higher bandwidth consumption in real-time applications. The same code works fine with the Nvidia hardware MFT.
Background:
I'm trying to encode NV12 samples captured through the Desktop Duplication APIs into a video stream using the Media Foundation H264 hardware encoder on a Windows 10 machine, then stream and render it in real time over LAN.
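(For reference, this is roughly how the hardware encoder MFT gets picked up: a minimal sketch using standard Media Foundation enumeration, where FindHardwareH264Encoder is an illustrative name and not my actual function.)
// Minimal sketch (not my exact code): enumerate and activate a hardware H264 encoder MFT
// that accepts NV12 input. Error handling is trimmed for brevity.
#include <windows.h>
#include <mfapi.h>
#include <mftransform.h>
#include <mferror.h>

HRESULT FindHardwareH264Encoder(IMFTransform** ppEncoder)
{
    MFT_REGISTER_TYPE_INFO inputType  = { MFMediaType_Video, MFVideoFormat_NV12 };
    MFT_REGISTER_TYPE_INFO outputType = { MFMediaType_Video, MFVideoFormat_H264 };

    IMFActivate** ppActivate = nullptr;
    UINT32 count = 0;
    HRESULT hr = MFTEnumEx(MFT_CATEGORY_VIDEO_ENCODER,
                           MFT_ENUM_FLAG_HARDWARE | MFT_ENUM_FLAG_SORTANDFILTER,
                           &inputType, &outputType, &ppActivate, &count);
    if (SUCCEEDED(hr))
    {
        // Activate the first (highest-ranked) hardware encoder found.
        hr = (count > 0) ? ppActivate[0]->ActivateObject(IID_PPV_ARGS(ppEncoder))
                         : MF_E_NOT_FOUND;
        for (UINT32 i = 0; i < count; ++i)
            ppActivate[i]->Release();
        CoTaskMemFree(ppActivate);
    }
    return hr;
}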
Initially, I was facing too much buffering at the encoder, as it was buffering up to 25 frames (the GOP size) before delivering an output sample. After some research, I figured out that setting CODECAPI_AVLowLatencyMode would reduce the latency at the cost of a bit of quality and bandwidth.
Setting the CODECAPI_AVLowLatencyMode property improved the performance somewhat, but not up to the real-time requirement. The encoder still seems to buffer at least 15 frames before producing samples (introducing around 2 seconds of delay in the output). This behavior is noticeable only when a low frame rate is configured; at 60 FPS the output is almost real time, with no visually noticeable delay.
In fact, the buffering is noticeable to the human eye only when the frame rate is set below 30 FPS, and the delay increases in inverse proportion to the configured FPS: at 25 FPS the delay is a few hundred milliseconds, and it goes up to 3 seconds when the FPS is configured at 10 (constant rate). I guess setting the FPS above 30 (say, 60 FPS) causes the encoder buffer to fill quickly enough to produce samples with no noticeable delay.
Lately, I also tried the CODECAPI_AVEncCommonRealTime property (https://docs.microsoft.com/en-us/windows/win32/directshow/avenccommonrealtime-property) to check whether it improves performance when lowering the input frame rate to avoid bandwidth consumption, but that call fails with a "parameter incorrect" error.
My Experiments:
To maintain a constant frame rate, and also to force the encoder to produce real-time output, I'm feeding the same sample (a previously saved sample) to the encoder at a constant rate of 30 FPS/60 FPS. I do this by capturing at most 10 FPS (or whatever FPS is required) and faking 30/60 FPS by re-feeding the same sample at exactly constant intervals, based on the EMULATED_FRAME_RATE/ACTUAL_FRAME_RATE ratio (e.g. 30/10, 60/15, 60/20), to fill the gaps. For example, when nothing changes for 10 seconds, I will have fed the encoder the same sample 30 * 10 times (at 30 FPS). I learned about this approach from some open-source GitHub projects and from Chromium's experimental code samples; I was also told (primarily on SO, and also on other forums) that this is the only way to push the encoder for real-time output, and that there is no way around it.
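A rough sketch of that pacing loop (illustrative only; running, CaptureIfAvailable and SendSampleToEncoder are placeholders for my actual capture and ProcessInput paths, while EMULATED_FRAME_RATE and nEmulatedWaitTime match the configuration further below):
// Illustrative pacing loop: re-feed the most recent captured sample so the encoder always
// sees a constant EMULATED_FRAME_RATE, even when the desktop rarely changes.
// Requires <windows.h> and <mfobjects.h>; placeholder names are noted above.
void PacingLoop()
{
    const LONGLONG frameDuration = 10ll * 1000ll * 1000ll / EMULATED_FRAME_RATE; // 100ns units
    LONGLONG sampleTime = 0;
    IMFSample* lastSample = nullptr;

    while (running) // placeholder stop flag
    {
        IMFSample* fresh = CaptureIfAvailable(); // returns nullptr when nothing new was captured
        if (fresh)
        {
            if (lastSample) lastSample->Release();
            lastSample = fresh;
        }
        if (lastSample)
        {
            lastSample->SetSampleTime(sampleTime);   // timestamps must keep increasing
            lastSample->SetSampleDuration(frameDuration);
            SendSampleToEncoder(lastSample);         // ends up in IMFTransform::ProcessInput
        }
        sampleTime += frameDuration;
        Sleep((DWORD)nEmulatedWaitTime);             // ~1000 / EMULATED_FRAME_RATE milliseconds
    }
    if (lastSample) lastSample->Release();
}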
The above-mentioned approach produces near-real-time output, but it consumes more data than I expected, even though I'm feeding only the previously saved sample to the encoder.
The output bitrate consistently stays between 350 KBps and 500 KBps on the Intel MFT, and varies between 80 KBps and 400 KBps on the Nvidia MFT (with a 30 FPS and 500 KBps configuration), no matter whether the screen content changes at 30 FPS or 0 FPS (idle). The Nvidia hardware encoder seems to be somewhat better in this case.
In fact, during screen idle time the encoder was producing far more data per second than the rate mentioned above. I have been able to cut the data consumption on Nvidia devices by setting a larger GOP size (the currently configured GOP size is 16384). But the screen idle-time data consumption still stays around 300 KBps on Intel Graphics 620 hardware, and 50 KBps to 80 KBps on an Nvidia GTX 1070 (config: 500 KBps bitrate and 30 FPS), which is unacceptable. I guess the Intel hardware MFT is not honoring the GOP setting at all, or the improvement is unnoticeable.
I have also been able to bring the idle-time data consumption down to ~130 KBps and ~40 KBps on Intel and Nvidia hardware respectively by setting very low bitrates, but this is still unacceptable, and it also deteriorates the video quality.
Is there a way to configure the encoder to produce less than ~10 KBps of output when no changes happen between input samples? I'm actually aiming for ~0 KB of output when nothing changes, but ~10 KBps is somewhat acceptable.
Update:
By tweaking some parameters, I'm able to bring the idle-time data consumption on the Nvidia MFT down to less than ~20 KBps with a 400 KBps bitrate configuration, and below ~10 KBps with a 100 KBps configuration. This is convincing. But the same code with the same encoder configuration produces 20 to 40 times more data on Intel machines. Intel (Intel Graphics 620) is surely not honoring the GOP setting. I have even tried varying the GOP between 256 and INT_MAX; nothing seems to change in the Intel hardware MFT's output.
Update 2:
After playing around with the encoder properties (the only change was configuring CODECAPI_AVEncCommonRateControlMode with eAVEncCommonRateControlMode_UnconstrainedVBR instead of eAVEncCommonRateControlMode_CBR), I can now see that the Intel MFT produces about 3 KBps during screen idle time, but only for the first few seconds (roughly 3 to 8 seconds); then it goes back to the same story. I guess that after a few seconds the encoder loses the reference keyframe against which it compares the samples, and it does not seem to recover after that point. The behavior is the same whether the GOP is 16/128/256/512/1024 or INT_MAX.
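For clarity, the only change relative to the rate-control line in the configuration listed below was this (same SetValue pattern):
// Rate control changed for this experiment; everything else stayed as configured below.
var = { 0 };
var.vt = VT_UI4;
var.ulVal = eAVEncCommonRateControlMode_UnconstrainedVBR; // previously eAVEncCommonRateControlMode_CBR
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonRateControlMode, &var), "Failed to set rate control");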
Encoder configurations:
Reference: http://alax.info/blog/1586
const int EMULATED_FRAME_RATE = 30; // rate at which samples are actually fed to the encoder
const int TARGET_FPS = 10;
const int FPS_DENOMINATOR = 1;
const unsigned long long time_between_capture = 1000 / TARGET_FPS;
const unsigned long long nEmulatedWaitTime = 1000 / EMULATED_FRAME_RATE;
const unsigned long long TARGET_AVERAGE_BIT_RATE = 4000000; // Adjusting this affects the quality of the H264 bit stream.
const LONGLONG VIDEO_FRAME_DURATION = 10ll * 1000ll * 1000ll / ((long long)EMULATED_FRAME_RATE); // frame duration in 100ns units
const UINT32 KEY_FRAME_SPACING = 16384;
const UINT32 GOP_SIZE = 16384;
const UINT32 BPICTURECOUNT = 2;
VARIANT var = { 0 };
// no failure on either Nvidia or Intel, but Intel does not seem to behave as expected
var.vt = VT_UI4;
var.ulVal = GOP_SIZE;
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncMPVGOPSize, &var), "Failed to set GOP size");
var.vt = VT_BOOL;
var.boolVal = VARIANT_TRUE;
// fails with "parameter incorrect" error.
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonRealTime, &var), "Failed to set realtime mode");
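// Note (unverified): "parameter incorrect" from ICodecAPI::SetValue often indicates either an
// unsupported property or a VARIANT type mismatch; the expected type for CODECAPI_AVEncCommonRealTime
// (VT_UI4 vs. VT_BOOL) is worth re-checking against the docs page linked above.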
var = { 0 };
var.vt = VT_BOOL;
var.boolVal = VARIANT_TRUE;
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVLowLatencyMode, &var), "Failed to set low latency mode");
var = { 0 };
var.vt = VT_BOOL;
var.boolVal = VARIANT_TRUE;
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonLowLatency, &var), "Failed to set common low latency mode");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = BPICTURECOUNT; // B-picture count (2 here; 0 would avoid latency and buffering at both encoder and decoder)
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncMPVDefaultBPictureCount, &var), "Failed to set B-Picture count");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = 100; // 0 - 100 (100 for best quality, 0 for low delay)
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonQualityVsSpeed, &var), "Failed to set Quality-speed ratio");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = 20;
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonQuality, &var), "Failed to set picture quality");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = eAVEncCommonRateControlMode_CBR; // This too fails on some hardware
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonRateControlMode, &var), "Failed to set rate control");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = 4000000; // mean bit rate in bits per second (TARGET_AVERAGE_BIT_RATE)
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncCommonMeanBitRate, &var), "Failed to set mean bit rate");
var = { 0 };
var.vt = VT_UI4;
var.ulVal = eAVEncAdaptiveMode_FrameRate;
CHECK_HR(mpCodecAPI->SetValue(&CODECAPI_AVEncAdaptiveMode, &var), "Failed to set Adaptive mode");
I tried retrieving the supported parameter range for the GOP size with the following code, but it just returns an E_NOTIMPL error.
VARIANT ValueMin = { 0 };
VARIANT ValueMax = { 0 };
VARIANT SteppingDelt = { 0 };
HRESULT hr = S_OK;
if (!mpCodecAPI) {
CHECK_HR(_pTransform->QueryInterface(IID_PPV_ARGS(&mpCodecAPI)), "Failed to get codec api");
}
hr = mpCodecAPI->GetParameterRange(&CODECAPI_AVEncMPVGOPSize, &ValueMin, &ValueMax, &SteppingDelt);
CHECK_HR(hr, "Failed to get GOP range");
VariantClear(&ValueMin);
VariantClear(&ValueMax);
VariantClear(&SteppingDelt);
Am I missing something? Are there any other properties I could experiment with to obtain real-time performance while consuming as little bandwidth as possible when there is no screen content change?
Answer 1:
Some miracle has happened. While playing around with the encoder configurations, I accidentally changed my primary monitor to a different one on my machine, and now the problem is gone. Switching back to the previously selected primary monitor leads to the same problem. I suspect the d3ddevice is the troublemaker. I'm not yet sure why this happens only on that device/monitor; I have to experiment some more.
Note: I'm not marking this as the answer because I have yet to find out why the problem happens only on that monitor/d3ddevice. I'm posting this as a reference for other people who may come across a similar situation. I will update the answer once I'm able to find the reason for the strange behavior on that particular d3d11device instance.
This is how I'm creating the d3ddevice and reusing it for the Desktop Duplication image capturer, for the video processor used for color conversion, and also for the hardware transform, through the MFT_MESSAGE_SET_D3D_MANAGER property (a sketch of that last step follows the device-creation code below).
Options:
const D3D_DRIVER_TYPE m_DriverTypes[] = {
//Hardware based Rasterizer
D3D_DRIVER_TYPE_HARDWARE,
//High performance Software Rasterizer
D3D_DRIVER_TYPE_WARP,
//Software Rasterizer (Low performance but more accurate)
D3D_DRIVER_TYPE_REFERENCE,
//TODO: Explore other driver types
};
const D3D_FEATURE_LEVEL m_FeatureLevel[] = {
D3D_FEATURE_LEVEL_11_1,
D3D_FEATURE_LEVEL_11_0,
D3D_FEATURE_LEVEL_10_1,
D3D_FEATURE_LEVEL_10_0,
D3D_FEATURE_LEVEL_9_3,
D3D_FEATURE_LEVEL_9_2,
D3D_FEATURE_LEVEL_9_1
//TODO: Explore other features levels as well
};
int m_DriversCount = ARRAYSIZE(m_DriverTypes);
int m_FeatureLevelsCount = ARRAYSIZE(m_FeatureLevel);
Create d3ddevice:
DWORD errorCode = ERROR_SUCCESS;
if (m_FnD3D11CreateDevice == NULL)
{
errorCode = loadD3D11FunctionsFromDll();
}
if (m_Id3d11Device)
{
m_Id3d11Device = NULL;
m_Id3d11DeviceContext = NULL;
}
UINT uiD3D11CreateFlag = (0 * D3D11_CREATE_DEVICE_SINGLETHREADED) | D3D11_CREATE_DEVICE_VIDEO_SUPPORT;
if (errorCode == ERROR_SUCCESS)
{
if (m_FnD3D11CreateDevice) {
for (UINT driverTypeIndex = 0; driverTypeIndex < m_DriversCount; ++driverTypeIndex)
{
// Use the function pointer loaded from the DLL above.
m_LastErrorCode = m_FnD3D11CreateDevice(nullptr, m_DriverTypes[driverTypeIndex], nullptr, uiD3D11CreateFlag,
m_FeatureLevel, m_FeatureLevelsCount, D3D11_SDK_VERSION, &m_Id3d11Device, &m_SelectedFeatureLevel, &m_Id3d11DeviceContext);
if (SUCCEEDED(m_LastErrorCode))
{
break;
}
}
}
}
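For completeness, a rough sketch of how the resulting device is handed to the transform via MFT_MESSAGE_SET_D3D_MANAGER (MFCreateDXGIDeviceManager and ResetDevice are the standard Media Foundation calls; names other than m_Id3d11Device and _pTransform are illustrative):
// Share the same d3d11 device with the encoder MFT (sketch; error handling as elsewhere).
UINT resetToken = 0;
IMFDXGIDeviceManager* pDeviceManager = nullptr; // illustrative local names
CHECK_HR(MFCreateDXGIDeviceManager(&resetToken, &pDeviceManager), "Failed to create DXGI device manager");
CHECK_HR(pDeviceManager->ResetDevice(m_Id3d11Device, resetToken), "Failed to associate the d3d11 device");
CHECK_HR(_pTransform->ProcessMessage(MFT_MESSAGE_SET_D3D_MANAGER, reinterpret_cast<ULONG_PTR>(pDeviceManager)), "Failed to set D3D manager on the encoder MFT");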
Source: https://stackoverflow.com/questions/59051443/intel-h264-hardware-mft-poor-performance-compared-to-nvidia