- Uses a fixed 64MB for the cache instead of an ever growing map.
- Slightly faster by using atomics instead of a single mutex for access.
- Thanks for Rodrigo for the idea.
Some games benefit from skipping caches (Pokémon Sword), and others
don't (Animal Crossing: New Horizons). Add an heuristic to decide this
at runtime.
The cache hit ratio has to be ~98% or better to not skip the cache.
There are 16 frames of buffer.
This commit removes early placeholders for an implementation of async nvdec. With recent changes to the source code, the placeholders are no longer accurate, and can cause a nullptr dereference due to the nature of the cdma_pusher lifetime.
src/video_core/shader_notify.cpp: In member function 'void VideoCore::ShaderNotify::MarkShaderComplete()':
src/video_core/shader_notify.cpp:33:10: error: 'unique_lock' is not a member of 'std'
33 | std::unique_lock lock{mutex};
| ^~~~~~~~~~~
src/video_core/shader_notify.cpp:6:1: note: 'std::unique_lock' is defined in header '<mutex>'; did you forget to '#include <mutex>'?
5 | #include "video_core/shader_notify.h"
+++ |+#include <mutex>
6 |
src/video_core/shader_notify.cpp: In member function 'void VideoCore::ShaderNotify::MarkSharderBuilding()':
src/video_core/shader_notify.cpp:38:10: error: 'unique_lock' is not a member of 'std'
38 | std::unique_lock lock{mutex};
| ^~~~~~~~~~~
src/video_core/shader_notify.cpp:38:10: note: 'std::unique_lock' is defined in header '<mutex>'; did you forget to '#include <mutex>'?
This creates non-sRGB texture views for sRGB texture formats to allow for interfacing with these views in compute shaders using imageLoad and imageStore.
Co-Authored-By: Rodrigo Locatti <reinuseslisp@airmail.cc>
Load the current tick to a local variable, moving it out of an atomic
and allowing us to compare the value without going through a pointer
each time. This should make the loop more optimizable.
Fix a tragic off-by-one condition that causes Vulkan's stream buffer to
think it's always full, using fallback memory. The OpenGL was also
affected by this bug to a lesser extent.
We are already using robustness2 features without requiring it
explicitly, causing potential crashes on drivers without the extension.
Requiring this at boot allows better diagnostics for it and formalizes
our usage on the extension.
There was still a code path that could wait on a timeline semaphore tick
that would never be signalled.
While we are at it, make use of more STL algorithms.
Games can bind a null index buffer (size=0) where all indices are
evaluated as zero. VK_EXT_robustness2 doesn't support this and all
drivers segfault when a null index buffer is passed to
vkCmdBindIndexBuffer.
Workaround this by creating a 4 byte buffer and filling it with zeroes.
If it's read out of bounds, robustness takes care of returning zeroes as
indices.
Bind extra bytes beyond the guest API's bound range.
This is due to some games like Astral Chain operating out of bounds.
Binding the whole map range would be technically correct, but games
have large maps that make this approach unaffordable for now.
Avoids waiting idle while the GPU finishes to do work, and fixes an
issue where we'd wait forever if a single command buffer (logic tick)
all the data.
Detect when a memory region has been joined several times and increase
the size of the created buffer on those instances. The buffer is assumed
to be a "stream buffer", increasing its size should stop us from
constantly recreating it and fragmenting memory.
Ports from OpenGL the optimization to skip small 3D uniform buffer
uploads. This will take advantage of the previously introduced stream
buffer.
Fixes instances where the staging buffer offset was being ignored.
This uses a ring buffer similar to OpenGL's stream buffer for small
uploads. This stops us from allocating several small buffers, reducing
memory fragmentation and cache locality.
It uses dedicated allocations when possible.
Reimplement the buffer cache using cached bindings and page level
granularity for modification tracking. This also drops the usage of
shared pointers and virtual functions from the cache.
- Bindings are cached, allowing to skip work when the game changes few
bits between draws.
- OpenGL Assembly shaders no longer copy when a region has been modified
from the GPU to emulate constant buffers, instead GL_EXT_memory_object
is used to alias sub-buffers within the same allocation.
- OpenGL Assembly shaders stream constant buffer data using
glProgramBufferParametersIuivNV, from NV_parameter_buffer_object. In
theory this should save one hash table resolve inside the driver
compared to glBufferSubData.
- A new OpenGL stream buffer is implemented based on fences for drivers
that are not Nvidia's proprietary, due to their low performance on
partial glBufferSubData calls synchronized with 3D rendering (that
some games use a lot).
- Most optimizations are shared between APIs now, allowing Vulkan to
cache more bindings than before, skipping unnecesarry work.
This commit adds the necessary infrastructure to use Vulkan object from
OpenGL. Overall, it improves performance and fixes some bugs present on
the old cache. There are still some edge cases hit by some games that
harm performance on some vendors, this are planned to be fixed in later
commits.
Workaround an issue on Nvidia where creating a Vulkan instance from an
active OpenGL thread disables threaded optimization on the driver.
This optimization is important to have good performance on Nvidia
OpenGL.
Instead of using a two step initialization to report errors, initialize
the GPU renderer and rasterizer on the constructor and report errors
through std::runtime_error.
Some games usually write memory pages currently used by the GPU, causing
rendering issues (e.g. flashing geometry and shadows on Link's
Awakening). To workaround this issue, Guest CPU writes are delayed until
the command buffer finishes processing, but the pages are updated
immediately.
The overall behavior is:
- CPU writes are cached until they are flushed, they update the page
state, but don't change the modification state. Cached writes stop
pages from being flushed, in case games have meaningful data in it.
- Command processing writes (e.g. push constants) update the page state
and are marked to the command processor as dirty. They don't remove
the state of cached writes.
Also renames related CMake variables to match both the Find*FFmpeg* and
variables defined within the file. Fixes odd errors produced by the old
FindFFmpeg.
Citra's FindFFmpeg is slightly modified here: adds Citra's copyright at
the beginning, renames FFmpeg_INCLUDES to FFmpeg_INCLUDE_DIR, disables a
few components in _FFmpeg_ALL_COMPONENTS, and adds the missing avutil
component to the comment above.
For Linux, instructs CMake to use the FFmpeg submodule in externals.
This is HEAVILY based on our usage of the late Unicorn. Minimal change
to MSVC as it uses the yuzu-emu/ext-windows-bin. MinGW now targets the
same ext-windows-bin libraries as MSVC for FFmpeg. Adds FFMPEG_LIBRARIES
to WIN32 and simplifies video_core/CMakeLists.txt a bit.
Vulkan 1.0 didn't support creating sRGB image views on an ABGR8 VkImage
with storage usage bits. VK_KHR_maintenance2 addressed this allowing to
reduce the usage bits on a VkImageView.
To allow image store on non-sRGB image views when the VkImage is created
with sRGB, always create VkImages without sRGB and add the sRGB format
on the view.
The VertexA stage is not yet implemented, but Vulkan is adding its
descriptors, causing a discrepancy in the pushed descriptors and the
template. This generally ends up in a driver side crash.
Bypass the VertexA stage for now.
When we copy into a buffer, it might contain data modified from the GPU
on the same pages. Because of this, we have to flush the contents before
writing new data.
An alternative approach would be to write the data in place, but games
can also write data in other ways, invalidating our contents.
Fixes geometry in Zombie Panic in Wonderland DX.
Setting __GL_SHADER_DISK_CACHE_PATH we can force the cache directory to
be in yuzu's user directory to stop commonly distributed malware from
deleting our driver shader cache. And by setting
__GL_SHADER_DISK_CACHE_SKIP_CLEANUP we can have an unbounded shader
cache size.
This has only been implemented on Windows, mostly because previous tests
didn't seem to work on Linux.
Disable the precompiled cache on Nvidia's driver. There's no need to
hide information the driver already has in its own cache.
Silence the new validation layer error about SPIR-V not allowing OpUndef
on a OpTypeVoid, even when the SPIR-V spec doesn't say anything against
it.
They will be inserted as an undefined int to avoid SPIRV-Cross and
validation errors, but only when a debugging tool is attached.
Allow users of the allocator to hint memory usage for downloads. This
removes the non-descriptive boolean passed for "host visible" or not
host visible memory commits, and uses an enum to hint device local,
upload and download usages.
Fix a bug where the memory allocator could leave gaps between commits.
To fix this the allocation algorithm was reworked, although it's still
short in number of lines of code.
Rework the allocation API to self-contained movable objects instead of
naively using an unique_ptr to do the job for us. Remove the VK prefix.
Stops us from merging code with unused functions in the future.
If something is invoked behind conditionally evaluated code in
a way that the language can't see it (e.g. preprocessor macros), the
potentially unused function should use [[maybe_unused]].
It keeps track of the modified CPU and GPU ranges on a CPU page
granularity, notifying the given rasterizer about state changes
in the tracking behavior of the buffer.
Use a small vector optimization to store buffers smaller than 256 KiB
locally instead of using free store memory allocations.
With timeline semaphores we can avoid creating objects. Instead of
creating an event, grab the current tick from the scheduler and flush
the current command buffer. When the fence has to be queried/waited, we
can do so against the master semaphore instead of spinning on an event.
If Vulkan supported NVN like events or fences, we could signal from the
command buffer and wait for that without splitting things in two
separate command buffers.