NVIDIA and DirectX 12 Bottleneck? MSI GeForce RTX 3090 SUPRIM vs. MSI Radeon RX 6900XT Gaming X and its own drivers

What is behind Asynchronous Compute?

I already wrote a longer piece about this more than five years ago, but I would like to refresh it a little, because it is important. What exactly is this about? Many in-game effects such as shadow casting, lighting, artificial intelligence, physics, and lens effects often require several computational steps before it is even determined what a GPU’s graphics hardware will render to the screen. In DirectX 11, these steps had to be done sequentially.

The graphics card followed the API’s process step by step to render a frame from start to finish. You know the effect from a traffic jam on a motorway: every delay early on sends an ever-growing wave of delays down the line. These delays in the pipeline are somewhat flippantly referred to as “bubbles”, and each one represents a moment in which part of the GPU hardware has to pause and wait for new instructions.

The following graphic shows the visual representation of DirectX 11 threading. All graphics, memory, and computational operations are combined into a long sequence of processing (“pipeline”) that is extremely susceptible to delays:

These so-called “pipeline bubbles” happen all the time, of course, and on every graphics card, because no game in the world can take perfect advantage of all the power or hardware a GPU has to offer. Nor can any game consistently avoid creating such bubbles when the user moves and acts freely in the game world. And now comes the trick with “Asynchronous Compute”. What if, instead of waiting, you could fill those bubbles with other tasks to make better use of the hardware, so that less processing power sits idle?

For example, if there is a rendering bubble when rendering complex lighting, you could let the AI compute in the meantime. So you can do several things in parallel or simply bring forward pending, suitable tasks. The next graphic is the visual representation of the flow of asynchronous computations under DirectX 12. The graphics, memory, and compute operations are decoupled into independent task packages that can then even be executed in parallel:
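To put rough numbers on the idea, here is a minimal Python sketch of a toy model, not of how a real GPU scheduler or DirectX 12 actually works. All names (`serial_frame_time`, `async_frame_time`) and all timings are invented for illustration: each graphics step is a pair of busy time and the bubble that follows it, in milliseconds, and a compute task may only slot into a bubble if it fits completely.

```python
# Toy model only: names and timings are hypothetical, not measured values.

def serial_frame_time(graphics_steps, compute_tasks):
    """Everything runs in one long sequence; bubbles stay empty."""
    busy = sum(work for work, _bubble in graphics_steps)
    idle = sum(bubble for _work, bubble in graphics_steps)
    return busy + idle + sum(compute_tasks)


def async_frame_time(graphics_steps, compute_tasks):
    """Compute tasks are moved into bubbles where they fit completely."""
    pending = sorted(compute_tasks, reverse=True)  # try the largest task first
    total = 0.0
    for work, bubble in graphics_steps:
        total += work + bubble          # the graphics timeline itself is unchanged
        gap = bubble                    # idle time available in this bubble
        for task in list(pending):      # iterate over a copy while removing
            if task <= gap:
                gap -= task
                pending.remove(task)
    # Whatever did not fit into any bubble still has to run afterwards.
    return total + sum(pending)


# (busy time, following bubble) per graphics step, in milliseconds
graphics_steps = [(4.0, 1.5), (3.0, 0.0), (5.0, 2.0)]
# e.g. AI, physics, post-processing, in milliseconds
compute_tasks = [1.0, 1.5, 2.5]

print(f"serial: {serial_frame_time(graphics_steps, compute_tasks)} ms")  # serial: 20.5 ms
print(f"async:  {async_frame_time(graphics_steps, compute_tasks)} ms")   # async:  18.0 ms
```

In this made-up frame, two of the three compute tasks disappear into the bubbles, while the 2.5 ms task fits nowhere and still has to run at the end. That is also why the gain depends so heavily on the game and engine: the bubbles and the tasks have to match.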

Summary and conclusion

As the resolution decreases, the number of rendered frames increases and the latency decreases in return. But what does this have to do with the CPU and the limit it sets? Many things run on the CPU, and the number of draw calls per second alone increases dramatically as the resolution decreases, because more frames have to be prepared. The CPU must always deliver so that the graphics hardware is always optimally utilized. Here, however, something seems to be jamming a bit in NVIDIA’s drivers. I wouldn’t go so far as to suggest that NVIDIA still has issues with asynchronous pipeline processing, but it’s probably still not optimal, especially when engines are optimized for AMD’s hardware (e.g. the single pass downsampling in HZD).
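The CPU side of this can be made tangible with a bit of arithmetic. The following Python sketch assumes, purely for illustration, that a scene needs the same number of draw calls per frame regardless of resolution and that each draw call costs the CPU a fixed amount of time. The function names and all numbers are hypothetical, not measurements from these tests.

```python
# Hypothetical numbers for illustration only, not measured values.

def draw_calls_per_second(draw_calls_per_frame, fps):
    """Draw calls the CPU has to submit per second at a given frame rate."""
    return draw_calls_per_frame * fps


def cpu_limited_fps(draw_calls_per_frame, cpu_cost_per_call_us):
    """Frame rate ceiling if the CPU did nothing but submit draw calls."""
    cpu_time_per_frame_us = draw_calls_per_frame * cpu_cost_per_call_us
    return 1_000_000 / cpu_time_per_frame_us


CALLS_PER_FRAME = 5000  # assumed constant across resolutions

print(draw_calls_per_second(CALLS_PER_FRAME, 60))   # 300000
print(draw_calls_per_second(CALLS_PER_FRAME, 144))  # 720000

print(cpu_limited_fps(CALLS_PER_FRAME, 2.0))  # 100.0
print(cpu_limited_fps(CALLS_PER_FRAME, 2.5))  # 80.0
```

In this simplified picture, a card that could render 144 FPS at a low resolution would be held back at 100 FPS by the CPU, while the same per-call cost would go unnoticed at 60 FPS in a high resolution.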

The dependence on game and engine really calls for a more detailed investigation of this problem, but as a lone warrior with two launches coming up, I don’t have the time for it at the moment. I would actually rule out a general performance loss of the GeForce drivers at lower resolutions due to a sweepingly declared “overhead”, because even if the programmers of both teams are a bit scatterbrained once in a while, NVIDIA is certainly not that badly off the mark.

Whether it was Horizon Zero Dawn or Watch Dogs Legion, whenever the FPS dropped on the GeForce (especially in the measurements with only 2 cores), the slower pop-in of content, delayed loading of textures, or errors with lighting and shadows were less severe on the Radeon than on the GeForce. This is also an indicator that the pipeline was simply congested (bubbles) and the multi-threading on the GPU was not really optimal. This is supported by the fact that the percentage gaps between the two cards always stay the same as the core count increases and the CPU limit recedes (see page two). I see the problem less in the CPU than in the processing of the pipelines on the GPU. A limiting CPU only makes the effect more obvious, but is not the real cause.

Of course, there must be solid reasons why all this is happening, software- or maybe even hardware-related. But then the cause is not a general one, as in generally bad drivers, but something very specific and narrowly confined, perhaps even platform-dependent, with older systems suffering from a sum of negative factors. After all, we have also seen that tests on one and the same current platform with PCIe 4.0 do not show differences as large as those the colleagues found.

The current conclusion is that the graphics hardware should match the rest of the system and, of course, the screen resolution in use, and that you won’t gain anything from cards as potent as the GeForce RTX 3080 or RTX 3090 on (older) systems with rather weak CPUs at low resolutions anyway, with or without limits. Unless you want to mine cryptic black money on the side. But that’s not my problem anymore.

