summaryrefslogtreecommitdiffstats
path: root/video/out/opengl/hwdec_cuda.c
Commit message (Collapse)AuthorAgeFilesLines
* vo/gpu: hwdec_cuda: Refactor gpu api specific code into separate filesPhilip Langdale2019-05-031-749/+0
| | | | | | | | The amount of code now present that's specific to Vulkan or OpenGL has reached the point where we really want to split it out to avoid a mess of #ifdefs. At the same time, I'm moving the code to an api neutral location.
* vo_gpu/hwdec_cuda: fixup compilation with vulkan disabledJan Ekström2019-04-221-0/+2
| | | | | | | The actual code utilizing this enum was seemingly properly if'd, but not the enum in the struct itself. Fixes compilation.
* vo/gpu: hwdec_cuda: Reorganise backend-specific codePhilip Langdale2019-04-211-151/+223
| | | | | This tries to tidy up the GL vs Vulkan code to be a bit cleaner and easier to read.
* vo_gpu: hwdec_cuda: Implement interop for placeboPhilip Langdale2019-04-211-145/+224
| | | | | | | | | | | | | | | | | | | | This change updates the vulkan interop code to work with the libplacebo based ra_vk, but also introduces direct VkImage sharing to avoid the use of the intermediate buffer. It is also necessary and desirable to introduce explicit semaphore bsed synchronisation for operations on the shared images. Synchronisation means we can safely reuse the same VkImage for every mapped frame, by ensuring the frame is copied to the VkImage before mapping the next frame. This functionality requires a 417.xx or newer nvidia driver, due to bugs in the VkImage interop in the earlier 411 and 415 drivers. It's definitely worth the effort, as the raw throughput is about twice that of implementation using an intermediate buffer.
* vo_gpu: vulkan: use libplacebo insteadNiklas Haas2019-04-211-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | This commit rips out the entire mpv vulkan implementation in favor of exposing lightweight wrappers on top of libplacebo instead, which provides much of the same except in a more up-to-date and polished form. This (finally) unifies the code base between mpv and libplacebo, which is something I've been hoping to do for a long time. Note: The ra_pl wrappers are abstract enough from the actual libplacebo device type that we can in theory re-use them for other devices like d3d11 or even opengl in the future, so I moved them to a separate directory for the time being. However, the rest of the code is still vulkan-specific, so I've kept the "vulkan" naming and file paths, rather than introducing a new `--gpu-api` type. (Which would have been ended up with significantly more code duplicaiton) Plus, the code and functionality is similar enough that for most users this should just be a straight-up drop-in replacement. Note: This commit excludes some changes; specifically, the updates to context_win and hwdec_cuda are deferred to separate commits for authorship reasons.
* vo_gpu: hwdec_cuda: Guard GL and Vulkan headers properlyPhilip Langdale2018-11-181-0/+5
| | | | | | | | We are currently unnecessarily including vulkan headers even when not building with vulkan support. I also guarded the GL header inclusion even though this doesn't appear to break anything today. Fixes #6330.
* vo_gpu: hwdec_cuda: Clean up init() error handlingPhilip Langdale2018-10-311-10/+15
| | | | | | | | | | | | Currently, the error paths in init() are a bit confusing, and we can end up trying to pop the current context when there is no context, which leads to distracting error messages. I also added an explicit path to return early if the GPU backend is not OpenGL or Vulkan. It's pointless to do any other cuda init after that point. (Of course, someone could write more interops.) Fixes #6256
* vo_gpu: vulkan: hwdec_cuda: Add support for Vulkan interopPhilip Langdale2018-10-221-58/+285
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite their place in the tree, hwdecs can be loaded and used just fine by the vulkan GPU backend. In this change we add Vulkan interop support to the cuda/nvdec hwdec. The overall process is mostly straight forward, so the main observation here is that I had to implement it using an intermediate Vulkan buffer because the direct VkImage usage is blocked by a bug in the nvidia driver. When that gets fixed, I will revist this. Nevertheless, the intermediate buffer copy is very cheap as it's all device memory from start to finish. Overall CPU utilisiation is pretty much the same as with the OpenGL GPU backend. Note that we cannot use a single intermediate buffer - rather there is a pool of them. This is done because the cuda memcpys are not explicitly synchronised with the texture uploads. In the basic case, this doesn't matter because the hwdec is not asked to map and copy the next frame until after the previous one is rendered. In the interpolation case, we need extra future frames available immediately, so we'll be asked to map/copy those frames and vulkan will be asked to render them. So far, harmless right? No. All the vulkan rendering, including the upload steps, are batched together and end up running very asynchronously from the CUDA copies. The end result is that all the copies happen one after another, and only then do the uploads happen, which means all textures are uploaded the same, final, frame data. Whoops. Unsurprisingly this results in the jerky motion because every 3/4 frames are identical. The buffer pool ensures that we do not overwrite a buffer that is still waiting to be uploaded. The ra_buf_pool implementation automatically checks if existing buffers are available for use and only creates a new one if it really has to. It's hard to say for sure what the maximum number of buffers might be but we believe it won't be so large as to make this strategy unusable. The highest I've seen is 12 when using interpolation with tscale=bicubic. A future optimisation here is to synchronise the CUDA copies with respect to the vulkan uploads. This can be done with shared semaphores that would ensure the copy of the second frames only happens after the upload of the first frame, and so on. This isn't trivial to implement as I'd have to first adjust the hwdec code to use asynchronous cuda; without that, there's no way to use the semaphore for synchronisation. This should result in fewer intermediate buffers being required.
* vo_gpu: hwdec: Use ffnvcodec to load CUDA symbolsPhilip Langdale2018-04-151-33/+45
| | | | | | The CUDA dynamic loader was broken out of ffmpeg into its own repo and package. This gives us an opportunity to re-use it in mpv and remove our custom loader logic.
* hwdec: don't require setting legacy hwdec fieldswm42017-12-021-5/+2
| | | | | | | | | | With the recent changes, mpv's internal mechanisms got synced to libavcodec's once more. Some things are still needed for filters (until the mechanism gets replaced), but there's no need to require other hwdec methods to use these fields. So remove them where they are unnecessary. Also fix some minor leaks in the dxva2 backends, and set the driver_name field in the Apple ones. Untested on Apple crap.
* vo_gpu: hwdec: remove redundant fieldswm42017-12-011-2/+1
| | | | | | | | | | | | | The testing_only field is not referenced anymore with vaglx removed and the previous commit dropping all uses. The ra_hwdec_driver.api field became unused with the previous commit, but all hwdec interop drivers still initialized it. Since this touches highly OS-specific code, build regressions are possible (plus the previous commit might break hw decoding at runtime). At least hwdec_cuda.c still used the .api field, other than initializing it.
* vd_lavc: remove need for duplicated cuda GL interop backendwm42017-10-301-17/+1
| | | | | | | This is just a dumb consequence of HWDEC_ types somehow being part of both decoder and VO. Obviously, the VO should only care about supporting specific hardware surface types or providing specific device types, but until they are separated, stupid unintuitive mismatches will occur.
* vd_lavc: add support for nvdec hwaccelwm42017-10-281-1/+17
| | | | | | | | See manpage additions. (In ffmpeg-mpv and Libav, this is still called "cuvid". Libav won't work yet, because it has no frame params support yet, but this could get fixed soon.)
* vo_opengl: refactor into vo_gpuNiklas Haas2017-09-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is done in several steps: 1. refactor MPGLContext -> struct ra_ctx 2. move GL-specific stuff in vo_opengl into opengl/context.c 3. generalize context creation to support other APIs, and add --gpu-api 4. rename all of the --opengl- options that are no longer opengl-specific 5. move all of the stuff from opengl/* that isn't GL-specific into gpu/ (note: opengl/gl_utils.h became opengl/utils.h) 6. rename vo_opengl to vo_gpu 7. to handle window screenshots, the short-term approach was to just add it to ra_swchain_fns. Long term (and for vulkan) this has to be moved to ra itself (and vo_gpu altered to compensate), but this was a stop-gap measure to prevent this commit from getting too big 8. move ra->fns->flush to ra_gl_ctx instead 9. some other minor changes that I've probably already forgotten Note: This is one half of a major refactor, the other half of which is provided by rossy's following commit. This commit enables support for all linux platforms, while his version enables support for all non-linux platforms. Note 2: vo_opengl_cb.c also re-uses ra_gl_ctx so it benefits from the --opengl- options like --opengl-early-flush, --opengl-finish etc. Should be a strict superset of the old functionality. Disclaimer: Since I have no way of compiling mpv on all platforms, some of these ports were done blindly. Specifically, the blind ports included context_mali_fbdev.c and context_rpi.c. Since they're both based on egl_helpers, the port should have gone smoothly without any major changes required. But if somebody complains about a compile error on those platforms (assuming anybody actually uses them), you know where to complain.
* vo_opengl: separate hwdec context and mapping, port it to use rawm42017-08-101-75/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This does two separate rather intrusive things: 1. Make the hwdec context (which does initialization, provides the device to the decoder, and other basic state) and frame mapping (getting textures from a mp_image) separate. This is more flexible, and you could map multiple images at once. It will help removing some hwdec special-casing from video.c. 2. Switch all hwdec API use to ra. Of course all code is still GL specific, but in theory it would be possible to support other backends. The most important change is that the hwdec interop returns ra objects, instead of anything GL specific. This removes the last dependency on GL-specific header files from video.c. I'm mixing these separate changes because both requires essentially rewriting all the glue code, so better do them at once. For the same reason, this change isn't done incrementally. hwdec_ios.m is untested, since I can't test it. Apart from superficial mistakes, this also requires dealing with Apple's texture format fuckups: they force you to use GL_LUMINANCE[_ALPHA] instead of GL_RED and GL_RG. We also need to report the correct format via ra_tex to the renderer, which is done by find_la_variant(). It's unknown whether this works correctly. hwdec_rpi.c as well as vo_rpi.c are still broken. (I need to pull my RPI out of a dusty pile of devices and cables, so, later.)
* vo_opengl: hwdec_cuda: fix filtering modewm42017-08-091-1/+1
| | | | Probably explains quality issues in some cases.
* vo_opengl: hwdec_cuda: Support separate decode and display devicesPhilip Langdale2017-06-031-12/+44
| | | | | | | | | | | | | | | | | In a multi GPU scenario, it may be desirable to use different GPUs for decode and display responsibilities. For example, if a secondary GPU has better video decoding capabilities. In such a scenario, we need to initialise a separate context for each GPU, and use the display context in hwdec_cuda, while passing the decode context to avcodec. Once that's done, the actually hand-off between the two GPUs is transparent to us (It happens during the cuMemcpy2D operation which copies the decoded frame from a cuda buffer to the OpenGL texture). In the end, the bulk of the work is around introducing a new configuration option to specify the decode device.
* vo_opengl: hwdec_cuda: use new format setup functionwm42017-02-171-34/+8
| | | | Gives us automatically support for all formats vo_opengl supports.
* vo_opengl: hwdec_cuda: add yuv420p supportwm42017-01-161-19/+35
| | | | | | | | | Because it allows easier testing of filters + hwdec. Make the texture setup code a bit more generic so it doesn't get too much of a mess. We also use the GL renderer utility function gl_find_unorm_format(), which saves us additional work with OpenGL's semi-redundant format specifiers.
* vo_opengl: hwdec_cuda: export AVHWDeviceContextwm42017-01-161-6/+31
| | | | So we can use it for filtering later.
* cuda: use libavutil functions for copying hw surfaces to memorywm42017-01-121-67/+0
| | | | | | | | mp_image_hw_download() is a libavutil wrapper added in the previous commit. We drop our own code completely, as everything is provided by libavutil and our helper wrapper. This breaks the screenshot code, so that has to be adjusted as well.
* vo_opengl: hwdec_cuda: Don't include hwcontext headersPhilip Langdale2016-12-041-4/+0
| | | | | After various simplifications, these includes simply aren't needed now.
* vo_opengl: hwdec_cuda: make some init errors verbosewm42016-11-241-2/+2
| | | | | Improves autoprobe behavior. This is equivalent to other hwdec interop wrappers. If CUDA is just not available, it should remain silent.
* vo_opengl: hwdec_cuda: fix crash when trying to use hwdec=cuda if cuda SDK ↵pavelxdd2016-11-241-0/+1
| | | | | | is not present If CUDA SDK wasn't installed, mpv crashed immediately with the message "Failed to load CUDA symbols"
* vo_opengl: hwdec_cuda: Use dynamic loading for cuda functionsPhilip Langdale2016-11-231-2/+9
| | | | | This change applies the pattern used in ffmpeg to dynamically load cuda, to avoid requiring the CUDA SDK at build time.
* vo_opengl: hwdec_cuda: Support P016 output surfacesPhilip Langdale2016-11-221-8/+45
| | | | | | | | | The latest 375.xx nvidia drivers add support for P016 output surfaces. In combination with an ffmpeg change to return those surfaces, we can display them. The bulk of the work is related to knowing which format you're dealing with at the right time. Once you know, it's straight forward.
* vo_opengl: hwdec_cuda: get the cuda device from the GL contextPhilip Langdale2016-09-241-3/+3
| | | | | | | | | | Obviously, in the vast majority of cases, there's only one device in the system, but doing this means we're more likely to get a usable device in the multi-device case. cuda would support decoding on one device and displaying on another but the peer memory handling is not transparent and I have no way to test it so I can't really write it.
* vo_opengl: hwdec_cuda: directly map GL textures and skip using PBOsPhilip Langdale2016-09-241-65/+20
| | | | | | | | | | | | | | | | | | | The documentation around this stuff is poor, but I found an nvidia sample that demonstrates how to use the interop API most efficiently. (https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st) Key lessons are: 1) you can register the texture itself and have cuda write to it, thereby skipping an additional copy through the PBO. 2) You don't have to be mapped when you do the copy - once you get a mapped pointer, it remains valid. Magic! This lets us throw out the PBOs as well as much of the explicit alignment and stride handling. CPU usage is slightly (~3%) lower for 4K content in one test case, so it makes a detectable difference, and presumably saves memory.
* hwdec_cuda: Implement download_image for screenshotsPhilip Langdale2016-09-101-0/+53
| | | | Data has to be copied to system memory for screenshots.
* hwdec_cuda: Use the non-deprecated CUDA-GL interop APIPhilip Langdale2016-09-101-12/+26
| | | | | | | | | | The nvidia examples use the old (as in CUDA 3.x) interop API which is deprecated, and I think not even functional on recent versions of CUDA for windows. As I was following the examples, I used this old API. So, let's update to the new API, and hopefully, it'll start working on windows too.
* hwdec/opengl: Add support for CUDA and cuvid/NvDecodePhilip Langdale2016-09-081-0/+282
Nvidia's "NvDecode" API (up until recently called "cuvid" is a cross platform, but nvidia proprietary API that exposes their hardware video decoding capabilities. It is analogous to their DXVA or VDPAU support on Windows or Linux but without using platform specific API calls. As a rule, you'd rather use DXVA or VDPAU as these are more mature and well supported APIs, but on Linux, VDPAU is falling behind the hardware capabilities, and there's no sign that nvidia are making the investments to update it. Most concretely, this means that there is no VP8/9 or HEVC Main10 support in VDPAU. On the other hand, NvDecode does export vp8/9 and partial support for HEVC Main10 (more on that below). ffmpeg already has support in the form of the "cuvid" family of decoders. Due to the design of the API, it is best exposed as a full decoder rather than an hwaccel. As such, there are decoders like h264_cuvid, hevc_cuvid, etc. These decoders support two output paths today - in both cases, NV12 frames are returned, either in CUDA device memory or regular system memory. In the case of the system memory path, the decoders can be used as-is in mpv today with a command line like: mpv --vd=lavc:h264_cuvid foobar.mp4 Doing this will take advantage of hardware decoding, but the cost of the memcpy to system memory adds up, especially for high resolution video (4K etc). To avoid that, we need an hwdec that takes advantage of CUDA's OpenGL interop to copy from device memory into OpenGL textures. That is what this change implements. The process is relatively simple as only basic device context aquisition needs to be done by us - the CUDA buffer pool is managed by the decoder - thankfully. The hwdec looks a bit like the vdpau interop one - the hwdec maintains a single set of plane textures and each output frame is repeatedly mapped into these textures to pass on. The frames are always in NV12 format, at least until 10bit output supports emerges. The only slightly interesting part of the copying process is that CUDA works by associating PBOs, so we need to define these for each of the textures. TODO Items: * I need to add a download_image function for screenshots. This would do the same copy to system memory that the decoder's system memory output does. * There are items to investigate on the ffmpeg side. There appears to be a problem with timestamps for some content. Final note: I mentioned HEVC Main10. While there is no 10bit output support, NvDecode can return dithered 8bit NV12 so you can take advantage of the hardware acceleration. This particular mode requires compiling ffmpeg with a modified header (or possibly the CUDA 8 RC) and is not upstream in ffmpeg yet. Usage: You will need to specify vo=opengl and hwdec=cuda. Note that hwdec=auto will probably not work as it will try to use vdpau first. mpv --hwdec=cuda --vo=opengl foobar.mp4 If you want to use filters that require frames in system memory, just use the decoder directly without the hwdec, as documented above.