I have a video player application and use multiple threads to keep the user interaction smooth.
The decoding thread originally wrote the resulting frames as BGRA into a RAM buffer, which was uploaded to VRAM via glTexSubImage2D. That worked well enough for standard-definition videos but, as expected, got slow for HD (especially 1920x1080).
To improve that, I implemented a different kind of pool class which has its own GL context (an NSOpenGLContext, as I am on Mac) that shares its resources with the main context. Furthermore, I changed the code so that it uses
glTextureRangeAPPLE( GL_TEXTURE_RECTANGLE_ARB, m_mappedMemSize, m_mappedMem );
and
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_SHARED_APPLE);
for the textures that I use, in order to improve the performance of uploading to VRAM. Instead of uploading BGRA textures (which weigh in at about 8 MB per frame for 1920x1080), I now upload three individual textures for Y, U and V (each GL_LUMINANCE / GL_UNSIGNED_BYTE; the Y texture at the original size, U and V at half the dimensions), thereby reducing the upload size to about 3 MB, which already showed some improvement.
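The size arithmetic behind those numbers can be checked directly (a quick sketch; the constants are just the frame dimensions from above):

```cpp
#include <cstddef>

// Per-frame upload sizes for 1920x1080: packed BGRA (4 bytes/pixel)
// versus planar 8-bit YUV with chroma at half resolution in both
// dimensions (i.e. 4:2:0 subsampling).
constexpr std::size_t kW = 1920, kH = 1080;
constexpr std::size_t bgraBytes = kW * kH * 4;              // ~8 MB
constexpr std::size_t yuvBytes  = kW * kH                   // Y plane
                                + 2 * (kW / 2) * (kH / 2);  // U + V planes
```

So the planar upload moves roughly 3.1 MB instead of roughly 8.3 MB per frame.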
I created a pool of those YUV textures; depending on the size of the video it typically holds between 3 and 8 surfaces (times three, as each consists of Y, U and V components). Each texture is mapped into its own area of the m_mappedMem buffer mentioned above.
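The pool bookkeeping can be sketched roughly like this (a minimal, hypothetical sketch; names such as `YUVSurface` and `SurfacePool` are mine, not from the actual code): each surface records its three texture handles and its offset into the mapped memory, and a mutex-guarded free list hands surfaces out to the decoder thread and takes them back after rendering.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical names; the real pool class will differ in detail.
struct YUVSurface {
    unsigned texY = 0, texU = 0, texV = 0; // GL texture handles
    std::size_t offset = 0;                // offset into m_mappedMem
    bool inUse = false;
};

class SurfacePool {
public:
    explicit SurfacePool(std::size_t count) : m_surfaces(count) {}

    // Called by the decoder thread to grab a free YUV surface set.
    YUVSurface* acquire() {
        std::lock_guard<std::mutex> lock(m_mutex);
        for (auto& s : m_surfaces)
            if (!s.inUse) { s.inUse = true; return &s; }
        return nullptr; // pool exhausted; caller must wait or drop the frame
    }

    // Called after the frame has been rendered to the target surface.
    void release(YUVSurface* s) {
        std::lock_guard<std::mutex> lock(m_mutex);
        s->inUse = false;
    }

private:
    std::mutex m_mutex;
    std::vector<YUVSurface> m_surfaces;
};
```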
When I receive a newly decoded video frame I find a set of free YUV surfaces and update the three components each with this code:
glActiveTexture(m_textureUnits[texUnit]);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, planeInfo->m_texHandle);
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_SHARED_APPLE);
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
memcpy( planeInfo->m_buffer, srcData, planeInfo->m_planeSize );
glTexSubImage2D( GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0,
                 planeInfo->m_width, planeInfo->m_height,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, planeInfo->m_buffer );
(As a side question: I am not sure whether I should use a different texture unit for each of the textures? [I am using unit 0 for Y, 1 for U and 2 for V, by the way.])
Once this is done, the textures that were used are marked as in use, a VideoFrame object is filled with their info (i.e. the texture handles, which area of the buffer they occupy, etc.) and pushed onto a queue to be rendered. Once the minimum queue size is reached, the main application is notified that it can start rendering the video.
The main rendering thread meanwhile (after ensuring correct state, etc.) accesses this queue (the queue class protects its access internally with a mutex) and renders the top frame.
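The hand-off between the decoder and render threads can be sketched as a mutex-protected queue with a condition variable (again a hypothetical sketch; `FrameQueue` and its `VideoFrame` payload are my names, and the actual queue class may differ):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical payload; the real VideoFrame carries texture handles
// and buffer-area info as described in the text.
struct VideoFrame { int frameNumber = 0; };

class FrameQueue {
public:
    explicit FrameQueue(std::size_t minSize) : m_minSize(minSize) {}

    // Decoder thread: push a ready frame; once the minimum queue depth
    // is reached, playback may begin and the renderer is notified.
    void push(const VideoFrame& f) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(f);
        if (m_queue.size() >= m_minSize)
            m_started = true;
        if (m_started)
            m_ready.notify_one();
    }

    // Render thread: block until playback has started and a frame is
    // available, then pop the top frame.
    VideoFrame pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return m_started && !m_queue.empty(); });
        VideoFrame f = m_queue.front();
        m_queue.pop();
        return f;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<VideoFrame> m_queue;
    std::size_t m_minSize;
    bool m_started = false;
};
```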
That main rendering thread has two framebuffers, each with a texture attached via glFramebufferTexture2D, to implement a kind of double buffering. In the main rendering loop it checks which one is the front buffer, and then renders that front buffer to the screen using texture unit 0:
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, frontTexHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glPushClientAttrib( GL_CLIENT_VERTEX_ARRAY_BIT );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
Before rendering the current frame to the screen (as the usual frame rate is about 24 fps for videos, a frame may be drawn several times before the next video frame is ready; that is why I use this approach), I call the video decoder class to check whether a new frame is available (it is responsible for syncing to the timeline and updating the back buffer with a new frame). If a frame is available, I render it to the back-buffer texture from inside the video decoder class (this happens on the same thread as the main rendering thread):
glBindFramebuffer(GL_FRAMEBUFFER, backbufferFBOHandle);
glPushAttrib(GL_VIEWPORT_BIT); // need to set viewport all the time?
glViewport(0,0,m_surfaceWidth,m_surfaceHeight);
glMatrixMode(GL_MODELVIEW);
glPushMatrix();
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glPushMatrix();
glLoadIdentity();
glMatrixMode(GL_TEXTURE);
glPushMatrix();
glLoadIdentity();
glScalef( (GLfloat)m_surfaceWidth, (GLfloat)m_surfaceHeight, 1.0f );
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_Y);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_U);
glActiveTexture(GL_TEXTURE2);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_V);
glUseProgram(m_yuv2rgbShader->GetProgram());
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glEnableVertexAttribArray(m_attributePos);
glVertexAttribPointer(m_attributePos, 4, GL_FLOAT, GL_FALSE, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glEnableVertexAttribArray(m_attributeTexCoord);
glVertexAttribPointer(m_attributeTexCoord, 2, GL_FLOAT, GL_FALSE, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glUseProgram(0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glPopMatrix();
glMatrixMode(GL_PROJECTION);
glPopMatrix();
glMatrixMode(GL_MODELVIEW);
glPopMatrix();
glPopAttrib();
glBindFramebuffer(GL_FRAMEBUFFER, 0);
[Please note that I omitted certain safety checks and comments for brevity]
After the above calls, the video decoder sets a flag indicating that the buffers can be swapped; after the main-thread rendering loop from above, the main thread checks that flag and swaps frontBuffer/backBuffer accordingly. The used surfaces are then marked as free and available again.
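The swap handshake just described can be sketched like this (a hypothetical sketch; the flag and index members are my names). An atomic flag is used defensively here, though since the decoder's FBO rendering happens on the render thread itself, a plain bool would also do:

```cpp
#include <atomic>
#include <utility>

// Minimal double-buffer handshake: the decoder marks the back buffer
// as ready, and the render loop swaps front/back exactly once per flag.
class DoubleBuffer {
public:
    // Decoder side: called after rendering into the back-buffer texture.
    void markSwapReady() { m_swapReady.store(true, std::memory_order_release); }

    // Render-loop side: called once per displayed frame.
    // Returns true if the buffers were swapped.
    bool trySwap() {
        if (m_swapReady.exchange(false, std::memory_order_acquire)) {
            std::swap(m_frontIndex, m_backIndex);
            return true;
        }
        return false;
    }

    int frontIndex() const { return m_frontIndex; }
    int backIndex() const { return m_backIndex; }

private:
    std::atomic<bool> m_swapReady{false};
    int m_frontIndex = 0;
    int m_backIndex = 1;
};
```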
In my original code, which used BGRA uploads via glTexSubImage2D and glBegin/glEnd, I didn't experience any problems. But once I started improving things (using a shader to convert the YUV components to BGRA, those DMA transfers, and glDrawArrays), these issues started showing up.
Basically it looks partly like a tearing effect (by the way, I set the GL swap interval to 1 to sync with refreshes), and partly as if playback jumps back a few frames in between.
I expected that having a pool of surfaces to render from, which get freed after rendering to the target surface, plus double buffering that target surface, would be enough; but obviously more synchronization is needed elsewhere, and I don't really know how to solve that.
I assume that because glTexSubImage2D is now handled by DMA (according to the documentation, the function is supposed to return immediately), the upload might not be finished yet when the next frame renders over it, or that I forgot (or don't know) about some other synchronization mechanism which I need for OpenGL on the Mac.
According to OpenGL profiler before I started optimizing the code:
- almost 70% GLTime in glTexSubImage2D (ie uploading the 8MB BGRA to VRAM)
- almost 30% in CGLFlushDrawable
and after I changed the code to the above it now says:
- about 4% GLTime in glTexSubImage2D (so DMA seems to work well)
- 16% in CGLFlushDrawable
- almost 75% in glDrawArrays (which came as a big surprise to me)
Any comments on those results?
If you need any further info about how my code is set up, please let me know. Hints on how to solve this would be much appreciated.
Edit: Here are my shaders for reference
#version 110
attribute vec2 texCoord;
attribute vec4 position;
// the tex coords for the fragment shader
varying vec2 texCoordY;
varying vec2 texCoordUV;
//the shader entry point is the main method
void main()
{
texCoordY = texCoord;
texCoordUV = texCoordY * 0.5;
gl_Position = gl_ModelViewProjectionMatrix * position;
}
and fragment:
#version 110
uniform sampler2DRect texY;
uniform sampler2DRect texU;
uniform sampler2DRect texV;
// the incoming tex coord for this vertex
varying vec2 texCoordY;
varying vec2 texCoordUV;
// RGB coefficients
const vec3 R_cf = vec3(1.164383, 0.000000, 1.596027);
const vec3 G_cf = vec3(1.164383, -0.391762, -0.812968);
const vec3 B_cf = vec3(1.164383, 2.017232, 0.000000);
// YUV offset
const vec3 offset = vec3(-0.0625, -0.5, -0.5);
void main()
{
// get the YUV values
vec3 yuv;
yuv.x = texture2DRect(texY, texCoordY).r;
yuv.y = texture2DRect(texU, texCoordUV).r;
yuv.z = texture2DRect(texV, texCoordUV).r;
yuv += offset;
// set up the rgb result
vec3 rgb;
// YUV to RGB transform
rgb.r = dot(yuv, R_cf);
rgb.g = dot(yuv, G_cf);
rgb.b = dot(yuv, B_cf);
gl_FragColor = vec4(rgb, 1.0);
}
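To sanity-check the shader's coefficients, the same BT.601 video-range transform can be evaluated on the CPU (a sketch that simply mirrors the offsets and dot products from the fragment shader above; `yuvToRgb` is my name for it):

```cpp
#include <cmath>

// CPU mirror of the fragment shader's YUV -> RGB conversion:
// identical offsets and coefficients, applied to normalized [0,1]
// Y, U, V values.
struct RGB { float r, g, b; };

RGB yuvToRgb(float y, float u, float v) {
    // Video-range offsets, as in the shader (-0.0625, -0.5, -0.5).
    y -= 0.0625f; u -= 0.5f; v -= 0.5f;
    RGB rgb;
    rgb.r = 1.164383f * y + 0.000000f * u + 1.596027f * v;
    rgb.g = 1.164383f * y - 0.391762f * u - 0.812968f * v;
    rgb.b = 1.164383f * y + 2.017232f * u + 0.000000f * v;
    return rgb;
}
```

Video-range white (Y=235, U=V=128, normalized by 255) comes out close to (1, 1, 1), and Y=16 comes out close to black, which matches the intent of the coefficients.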
Edit 2: As a side note, I have another rendering pipeline which uses a VDADecoder object for decoding, which works super-nicely performance-wise but shows the same flickering issues. So there is definitely some problem with the threading in my code; so far I just couldn't figure out what exactly. But I also need to provide a software-decoder fallback for machines that don't support VDA; in that case the CPU load is quite high, which is why I tried to offload the YUV-to-RGB conversion to the GPU.
From what I see (namely the glPushMatrix calls etc.) I assume you are not using the latest hardware, and quite possibly you are running into issues with older video cards, like the CGLFlushDrawable problem described in "Why is CGLFlushDrawable so slow? (I am using VBOs)".
The second thing is the YUV->RGB shader, which obviously accesses the source textures many times; this must be slow on any video card, especially an older one. So the large timings for the glDrawArrays() call actually reflect the fact that you're using a very heavy shader program (in terms of memory access), even if the shader code looks "innocent".
The shader accesses the textures (and thus, with client storage, system RAM), which in performance terms (for this video card) is the same as doing the RAM->VRAM copy.
General advice: try to avoid non-rectangular and non-power-of-two textures; they can also kill performance, as can any non-standard texture formats and extensions. The simpler, the better. Try using something like a 2048x1024 or 2048x2048 texture if you really need the Full-HD resolution (which, by pure arithmetic, is bound to be slow anyway).
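For reference, the padded sizes this advice implies can be computed with the classic bit-twiddling round-up (a small sketch; `nextPow2` is my helper name). Note that 1920x1080 rounds up to 2048x2048, roughly doubling the pixel count, which is part of the arithmetic argument above:

```cpp
#include <cstdint>

// Round a texture dimension up to the next power of two, as classic
// (pre-NPOT) hardware requires. Standard bit-smearing trick: fill all
// bits below the highest set bit, then add one.
std::uint32_t nextPow2(std::uint32_t v) {
    if (v == 0) return 1;
    --v;
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;
    return v + 1;
}
```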
Ok, after a lot more testing and research, I finally managed to solve my problems:
What happened was that I first rendered to the target texture through a framebuffer (bound to that texture via glFramebufferTexture2D as color attachment 0), and in the same frame then tried to read from it when rendering the frame to the window framebuffer.
Basically, I wrongly assumed that, being issued in the same frame and directly after one another, the first batch of calls would finish writing to the framebuffer before the next batch read from it. A call to glFlush (for the class using the VDADecoder) and glFinish (for the class using the software decoder) did the trick.
On a side note: as indicated in the comments above, I changed my whole code so it no longer uses the fixed pipeline, and to make it look cleaner. Performance tests under OpenGL Profiler (on Mac OS X 10.7) have shown that the changes from my original code to the current code reduced the time OpenGL uses from almost 50% of the total application time down to about 15% (freeing up more resources for the actual video decoding, in case a VDADecoder object is not available).
Source: https://stackoverflow.com/questions/10680883/multithreaded-video-rendering-in-opengl-on-mac-shows-severe-flickering-issues