better utilization of multi-core machines
Posted: 14 Apr 2009 03:25
Currently vlc only uses a maximum of about 1.5 cores while playing even the most heavily encoded H.264 videos I can find. Often, on machines with multiple cores, there are rendering issues for HD content that come down to the time it takes to decode each frame via the ffmpeg libraries. If a frame takes TOO LONG to decode, it gets dropped [and often many others that follow].
The problem (as I see it) is that there is ONLY ONE THREAD responsible for both decoding AND rendering. Since there are also only two work buffers, which alternate every other frame, this may be unavoidable under the current design.
I have discussed this with others, and I believe optimization for 2 or 4 cores is definitely possible. This would require a serious architectural change in the main 'rendering' thread.
On a side note, it appears that threads are being used effectively when streaming, to keep the queue filled with incoming packets from a separate thread. That being said, the CPU time used for packetizing and placing the data into the picture queue is trivial compared to what rendering takes, and rendering is still in a single thread.
Therefore, I am suggesting that vlc be modified to decode and display frames asynchronously in two separate threads. The first thread would decode via ffmpeg, copy the video buffer into shared video memory, and queue the copy [with the appropriate "when to render" timestamp], _THEN_ immediately begin decoding the next frame, keeping the 'shared video memory' queue full. A second thread [running on a different CPU, of course] would pull from the queue and do synchronous frame display: wait for the display time, show a frame, wait for the next display time, show a frame, and so on.
If frames were decoded in one thread via ffmpeg, and _THEN_ an MMX-or-equivalent memory move into shared video memory blocks were done (in 'video driver' format as needed), this would be _MUCH_ faster than waiting on "the system" before beginning to decode the next frame. In fact, copying a 1080p frame 32 bytes at a time using a 'REP' prefix would require at most on the order of 1920x1080 clock cycles, and I would expect MMX operations to be a bit faster still. On a 2 GHz machine, I expect it to take less than 1 millisecond to copy that buffer with MMX or similar methods (a regular 'memcpy' call takes about 6 ms). Compare this to the time needed to put the frame on the screen (several milliseconds, maybe up to 20 on X11 systems). So, if one thread decoded with ffmpeg, made a copy, then continued decoding, while another thread read the buffers from a queue and sent them to the video driver, vlc could process TWICE as many frames per second as it currently does. It would also be possible to decode ahead of time, allowing some slack for 'slow to decode' frames.
Software that _really_ takes advantage of multiple cores isn't widely available yet, in my opinion. Wouldn't it be fun if vlc were "the first" to _REALLY_ max out the multi-core advantage?