Is OS X/iOS VideoToolBox decoding -> vmem vout a valid path in theory?

oviano · Postby **oviano** » 13 Dec 2016 18:19

I noticed on both OS X and iOS when I built libVLC that it would try and decode my h.264 source with the video toolbox module.

Unfortunately, on OS X it crashes while copying some frame data, and on iOS it just shows a green screen and emits a bunch of warnings.

Before I consider debugging this - may I ask the developers if this pipeline theoretically should work?

Everything works fine if I block/remove the video toolbox plugin and force it to use libavcodec.

oviano · Postby **oviano** » 14 Dec 2016 19:54

Ok, an update on this on the vague chance anyone is remotely interested - I got it working, it needed "videotoolbox-zero-copy" to be disabled so that the decoded frames reach the vmem output.

I also noticed something a little unsatisfactory; the native decoding format on iOS is NV12, yet there is a rather cumbersome function in videotoolbox.m which blindly converts to I420, and not very efficiently. I profiled it in Xcode and something like 60% of execution time was spent in the function splitPlanes called via copy420YpCbCr8Planar (it's not hard to see why this function is slow).

So my suggestion is not to convert to I420 and leave it up to whatever is using the decoder to worry about any conversion. For example, turning this into a straightforward NV12-to-NV12 copy and leaving the chroma conversion to I420 up to swscale brought the conversion cost down to something like 3% of execution time (spent in swscale), versus the 60% spent in the function mentioned above. Since then I've adapted my video player to work directly in NV12 anyway (since SDL2 supports this format), so I can now eliminate any conversion done by the CPU entirely.
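To make the difference concrete, here is a minimal C sketch (not the VLC code) contrasting a plain NV12-to-NV12 plane copy with the chroma de-interleave that an NV12-to-I420 conversion needs. The names `copy_plane`, `nv12_copy` and `split_uv` are made up for illustration, and 8-bit samples with even dimensions are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one plane row by row; source and destination pitches may differ. */
void copy_plane(uint8_t *dst, size_t dst_pitch,
                const uint8_t *src, size_t src_pitch,
                size_t row_bytes, size_t rows)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (size_t y = 0; y < rows; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}

/* NV12 -> NV12: one memcpy per row for Y and for the interleaved UV plane.
 * The UV plane holds width/2 (U,V) pairs per row, i.e. `width` bytes,
 * and has height/2 rows. */
void nv12_copy(uint8_t *dst_y, size_t dst_y_pitch,
               uint8_t *dst_uv, size_t dst_uv_pitch,
               const uint8_t *src_y, size_t src_y_pitch,
               const uint8_t *src_uv, size_t src_uv_pitch,
               size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    copy_plane(dst_y, dst_y_pitch, src_y, src_y_pitch, width, height);
    copy_plane(dst_uv, dst_uv_pitch, src_uv, src_uv_pitch, width, height / 2);
}

/* NV12 -> I420 chroma: split UVUV... into separate U and V planes.
 * This touches every chroma byte individually instead of using memcpy. */
void split_uv(uint8_t *dst_u, size_t dst_u_pitch,
              uint8_t *dst_v, size_t dst_v_pitch,
              const uint8_t *src_uv, size_t src_uv_pitch,
              size_t chroma_width, size_t chroma_height)
{
    for (size_t y = 0; y < chroma_height; y++) {
        const uint8_t *s = src_uv + y * src_uv_pitch;
        uint8_t *u = dst_u + y * dst_u_pitch;
        uint8_t *v = dst_v + y * dst_v_pitch;
        for (size_t x = 0; x < chroma_width; x++) {
            u[x] = s[2 * x];
            v[x] = s[2 * x + 1];
        }
    }
}
```

A naive per-byte loop like `split_uv` is only one possible way to do the split; an optimised converter such as swscale does the same job with vectorised code, which is presumably why it was so much cheaper in the profile.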

To put this change into perspective for my project: it now lets me decode a 1080p60 4 Mbit/s H.264 transport stream smoothly on an iPad Air 2 without stutter, whereas before my changes this wasn't possible; it would play a couple of seconds, freeze, play some more seconds, freeze, and so on. Bear in mind I'm using vmem (I know, I know) and copying back to a texture to blend with my UI.

Here is my patch anyway.

https://www.dropbox.com/s/e0dqj6kirvh29 ... patch?dl=0

Postby **fkuehne** » 14 Dec 2016 23:15

This is very nice work!

Can you email me a proper patch with your name and email address for correct attribution in case we decide to merge this in part or full with mainline VLC?

Postby **Jean-Baptiste Kempf** » 14 Dec 2016 23:20

This patch does two different things: it moves to using the cache, and it uses NV12 instead of I420.

While using the cache will bring a very large improvement, there are not many reasons why using NV12 instead of I420 should change anything. If you are doing this, there is a mistake somewhere else, and you are hiding it.

Also, you could be even faster by asking vmem to give you an opaque CVPX pointer and displaying it in your own code.

Postby **Jean-Baptiste Kempf** » 14 Dec 2016 23:21

Not to mention that I would argue that we need to remove this code from the video output and move it to a CVPX -> I420 chroma converter.

oviano · Postby **oviano** » 15 Dec 2016 06:51

Yes there are some different approaches that could be taken.

The most important thing is to remove the SplitPlanes call.

I'm not sure why videotoolbox.m wasn't written so that the decoder simply outputs I420 instead of NV12. Then it would be a straight I420 -> I420 copy. Maybe because of the Apple docs that seem to hint that the optimum decoder output is NV12? But it seems pointless if it's then going to do an inefficient copy at the end of it. We might as well leave it to the decoder to deal with in the first place.

oviano · Postby **oviano** » 23 Dec 2016 10:13

Just following on from this.

I wrote a variation on "vmem.c" for iOS (vmem.m) which deals with the CVPixelBuffer directly and means I've now eliminated all copying in my code except for when I update my SDL texture.

I noticed something though - the CVPixelBuffer has double the width I was expecting, due to this line in videotoolbox.m:

    VTDictionarySetInt32(p_sys->destinationPixelBufferAttributes,
                         kCVPixelBufferBytesPerRowAlignmentKey,
                         i_video_width * 2);

This has two effects when I eventually use the pixel buffer in SDL_UpdateTexture.

1) SDL has some code which ensures the planes are tightly packed (seems like this is a requirement of GLES 2). It checks the source and dest pitch and if they differ does a malloc and a copy, which is of course horrible because this is before it even calls glTexSubImage2D. I can get around this by creating a larger texture, but this leads onto...

2) ...the glTexSubImage2D function being slower than it needs to be for the same reason - it's copying double the width.
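The pitch check described in point 1 can be sketched roughly as follows. This is not SDL's actual code, just a minimal illustration of the pattern: if the source pitch equals the packed row size the buffer is used as-is, otherwise a tightly packed copy is allocated. The function name `tight_rows` and the `owned` flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer to tightly packed rows (pitch == row_bytes).
 * If the source is already tight, it is returned unchanged and *owned is 0.
 * Otherwise a packed copy is allocated, *owned is set to 1 and the caller
 * must free() the result. Returns NULL if the allocation fails. */
const uint8_t *tight_rows(const uint8_t *src, size_t src_pitch,
                          size_t row_bytes, size_t rows, int *owned)
{
    assert(row_bytes <= src_pitch);
    *owned = 0;
    if (src_pitch == row_bytes)
        return src;                 /* fast path: no malloc, no copy */

    uint8_t *packed = malloc(row_bytes * rows);
    if (packed == NULL)
        return NULL;
    for (size_t y = 0; y < rows; y++)
        memcpy(packed + y * row_bytes, src + y * src_pitch, row_bytes);
    *owned = 1;
    return packed;
}
```

With a pixel buffer whose pitch is double the row size, every frame takes the slow path, which is the extra malloc and copy described above.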

If I replace the "i_video_width" with just "1" then these problems go away. The pitches match, and glTexSubImage2D is faster.

What is the purpose of having a pitch that is double the pixel width per plane?
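For reference, the arithmetic behind the doubled pitch can be sketched like this, assuming the alignment key simply means "bytes per row is rounded up to a multiple of this value" (Core Video may apply further alignment of its own, so treat this as an approximation). `align_up` is a hypothetical helper, not a Core Video function.

```c
#include <assert.h>
#include <stddef.h>

/* Round `value` up to the next multiple of `alignment` (alignment > 0). */
size_t align_up(size_t value, size_t alignment)
{
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}
```

For a 1920-pixel-wide 8-bit NV12 frame, each row of the Y plane (and of the interleaved UV plane) is 1920 bytes. `align_up(1920, 1920 * 2)` gives 3840, i.e. double the width, while `align_up(1920, 1 * 2)` stays at 1920, which matches what I see when I change the multiplier.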

PS I'm aware that I may be able to actually skip the glTexSubImage2D step altogether by using CVOpenGLESTextureCacheCreateTextureFromImage and hopefully using this directly in SDL, but I'm doing things one step at a time so that I fully understand everything! Maybe this work will eventually morph into some kind of iOS/SDL integration module which of course I will make available.

oviano · Postby **oviano** » 26 Dec 2016 12:47

So I got this all working.

In case anyone wants to do this in future, here is a link to the SDL forums where I've described a method for getting a CVPixelBufferRef into an SDL_Texture:

https://forums.libsdl.org/viewtopic.php?t=12258

Here are my replacements for vmem.c and vmem.m. Personally I think this is a better interface to vmem than the existing lock/unlock. It is far more flexible to simply pass down the data for the application to handle in the manner it wants. SDL integration required one important tweak though - it no longer releases the picture immediately; instead each Display releases *the one before*. Allowing the SDL app to access the picture until the next frame arrives is vital, because SDL calls can only be made in the main thread, so this makes things a lot easier with regard to thread synchronisation.
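The "release the one before" idea can be sketched in plain C as below. This is not the code from my vmem replacement; `frame`, `display_state` and the functions are hypothetical stand-ins for the picture and the vout state, with a simple counter in place of real reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoded picture with a reference count. */
typedef struct frame {
    int refs;
} frame;

void frame_hold(frame *f)    { f->refs++; }
void frame_release(frame *f) { assert(f->refs > 0); f->refs--; }

/* Hypothetical display state: remembers the frame shown last time. */
typedef struct display_state {
    frame *previous;
} display_state;

/* Show `current`: keep it alive for the application, and only now release
 * the frame that was handed over on the previous call. */
void display_frame(display_state *st, frame *current)
{
    frame_hold(current);            /* hold first, in case current == previous */
    if (st->previous != NULL)
        frame_release(st->previous);
    st->previous = current;
}

/* On shutdown, release the last frame still being held. */
void display_close(display_state *st)
{
    if (st->previous != NULL) {
        frame_release(st->previous);
        st->previous = NULL;
    }
}
```

The application's main thread can then read the current frame at any point until the next display call, without extra locking around the picture's lifetime.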

https://www.dropbox.com/s/mgpjgz0rtt3vxx5/vmem.m?dl=0
https://www.dropbox.com/s/shhymqldh9qoykz/vmem.c?dl=0

vmem.m is a drop-in replacement for vmem.c, but I couldn't work out how to modify the Makefile.am so that it chooses one or the other depending on the platform being compiled. I tried using "if HAVE_IOS" to specify the source etc., but however I structured it, it just complained that I was specifying the source file twice.

I can turn these into patches if necessary but maybe these should become "vmem2" or something if you don't want to change the interface.

Or maybe you don't want them at all, which is also fine. Just sharing...
