The end of this year has been extremely fruitful for the IT industry and those who follow it. The debut of Ryzen 5000 series chips, NVIDIA graphics cards based on Ampere silicon and, more recently, Apple's mobile SoCs, which intend to shake up the established hierarchy in the world of consumer CPUs: all of this happened within just a month or two. Let's not forget the new-generation consoles, which are also directly tied to AMD's initiatives. Before our eyes is perhaps the hottest season of new products in the past few years, both in the number of events and in their importance. The only pity is that all this cutting-edge hardware instantly became, to some extent, a scarce commodity.
Until now, only one important element was missing from the picture: the long-promised Radeon graphics cards based on the large Navi chip, which, in our view, deserve a central place in this series of autumn releases. The fact is that ever since the confrontation between the Radeon R9 Fury X and the GeForce GTX 980 Ti, AMD has effectively withdrawn from the battle for the performance crown among gaming GPUs. The chipmaker attempted a return to the ring from time to time, accompanied by generous promises, but each time the hopes of the red brand's supporters were dashed. However, thanks to the RDNA architecture, AMD had every prerequisite for closing the technological gap with NVIDIA silicon, and in some respects even taking the lead.
The Radeon 5000 series graphics cards, even though they bear the marks of transitional products, have already proved their competitiveness in the mid-range and upper-mid-range price categories, and now AMD's ambitions have once again extended to the highest price segment. And this is not just about competition within the traditional performance metrics, but also about the return of functional parity between NVIDIA and AMD silicon, because "big Navi" also performs hardware-accelerated ray tracing.
AMD has announced three graphics cards based on the Navi 21 chip. Today we will look at the two lower models, the Radeon RX 6800 and Radeon RX 6800 XT, which should be on sale at recommended prices of $579 and $649, respectively, by the time you read this article. Their direct competitors on NVIDIA's side are the GeForce RTX 3070 and RTX 3080, priced at $499 and $699. The flagship of the family, the Radeon RX 6900 XT, is delayed until December 8 and promises performance in the same class as the GeForce RTX 3090 for an estimated $999.
The Radeon RX 6800 and Radeon RX 6800 XT debut as reference-design devices, which have changed beyond recognition compared to AMD's previous and, frankly, not very successful reference solutions; partner versions are expected later. The reference cards will represent the Navi 21 chip in our benchmarks, but first we will have a thorough discussion of its architecture, which is far from a simple quantitative scaling of the previous-generation RDNA core.
Geometry processor with Mesh Shader support
The logic of RDNA 2, although it brings a lot of new things, has not undergone as extensive a transformation of the graphics processor's operating principles as the one that separated RDNA 1 from the Graphics Core Next architecture, which served AMD for almost eight years, from the Radeon HD 7970 up to the release of the Radeon 5000 series. We analyzed the differences between RDNA and GCN in detail in the Radeon RX 5700 XT review and will not dwell on them today, because GCN has since not only left AMD's consumer products (only Ryzen's integrated graphics processors still contain GCN blocks), but is also about to retire from server GP-GPU solutions in favor of the next CDNA architecture. Instead, let's walk through the structure of the chip from the front end to the back end of the rendering pipeline, refresh our knowledge of the RDNA principles that remained unchanged, and highlight the innovations that distinguish the logic of Navi 21 from Navi 10, the die that underlies the Radeon RX 5600, RX 5700 and RX 5700 XT.
At the beginning of the chain are the command processors for shaders and general-purpose computing, the geometry processor and the hardware scheduler, which recent versions of Windows have learned to use. There is also a DMA block for direct access to GPU memory over the PCI Express bus when several accelerators work together, as well as some of Navi 21's innovative features: Smart Access Memory on the Ryzen 5000 platform and DirectStorage support, which we will return to later. However, the main qualitative change in the GPU front end concerns the way Navi 21 handles geometry processing.
AMD uses a distributed topology of the blocks involved in preparing primitives for shading: the geometry processor performs the common steps, while most of the operations before and after tessellation fall to the Primitive Unit blocks located inside the scalable GPU partitions. While the geometry processor in the old GCN architecture was closely tied to the Direct3D programming model, RDNA has learned to cull invisible triangles in the early stages of rendering and can take in up to eight primitives per clock, passing four on to rasterization. The precondition for this is so-called primitive shaders: highly efficient programs that take the place of Direct3D domain and geometry shaders in the sequence of graphics API operations.
Primitive shaders are enabled in RDNA devices at the compiler level, and most of the corresponding calculations pass through them. In addition, Navi chips implement fast Surface Shaders, which, at the compiler's discretion, replace the part of the shader chain involved in tessellation (Vertex Shaders and Hull Shaders) before the data is handed over to the tessellator itself, a fixed-function block. These capabilities, which in the chipmaker's terminology are called NGG (Next Generation Geometry), replace the problematic portions of the Direct3D geometry pipeline, recompiling the old types of shaders into the new ones on the fly, and thus do the same work more efficiently without requiring explicit support from developers of graphics applications.
However, there is no hiding the fact that the proportion, familiar from the Navi 10 chip, between the throughput of the geometry front end and the scalable GPU resources that do the texture and shader work has not changed for the better. The Navi 21 GPU, with a doubled array of shader ALUs, still has to settle for four primitives per clock after invisible surfaces have been culled. Fortunately, in addition to the optimizations introduced in the first generation of RDNA, Navi 21 can take advantage of a completely new geometry programming model: Mesh Shaders. Mesh Shaders were first implemented in NVIDIA's Turing GPUs, but have since become an industry standard and part of the hardware requirements of DirectX 12 Ultimate (feature level 12_2).
In previous versions of Direct3D, the polygon mesh of a scene object is processed in its entirety: all the entries of the index buffer defining the positions of the vertices are walked in sequential order. Thus, the load on the early stages of the rendering pipeline grows linearly with the geometric complexity of the scene. The Mesh Shader model parallelizes this task by dividing a polygon mesh into fragments (meshlets), each of which is handled by a separate thread group. Hand in hand with Mesh Shaders go Amplification Shaders, which determine how many thread groups will be launched and supply them with the necessary data. As the name of this rendering stage implies, Amplification Shaders allow geometry to be multiplied on the fly by quickly duplicating polygon meshes, a capability NVIDIA previously demonstrated as the headline feature of Mesh Shaders. In addition, Mesh Shaders backed by Amplification Shaders perform early removal of invisible surfaces and automatic selection of the required level of detail (LOD) directly on the GPU.
In the future, the new model should completely replace the tessellation stage now performed by fixed-function blocks, or at least complement it where a more flexible solution is appropriate. The only thing that currently hinders the migration of game graphics to Mesh Shaders is the need for explicit support for the technology in each new game.
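As a rough illustration of the meshlet idea described above, an index buffer can be greedily split into fragments small enough for one thread group each. This is a sketch under our own assumptions, not AMD's or Microsoft's actual algorithm; the 64-vertex / 126-triangle limits are common illustrative defaults, not figures from this article.

```python
# Split a triangle index buffer into meshlets with bounded vertex and
# triangle counts, so each meshlet can be handled by one thread group.
def build_meshlets(indices, max_verts=64, max_tris=126):
    meshlets = []
    verts, tris = {}, []   # local vertex remap (global -> local) and triangles
    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_verts = [v for v in tri if v not in verts]
        # Close the current meshlet if this triangle would overflow it.
        if len(verts) + len(new_verts) > max_verts or len(tris) + 1 > max_tris:
            meshlets.append({"vertices": list(verts), "triangles": tris})
            verts, tris = {}, []
            new_verts = tri
        for v in new_verts:
            verts[v] = len(verts)
        tris.append(tuple(verts[v] for v in tri))
    if tris:
        meshlets.append({"vertices": list(verts), "triangles": tris})
    return meshlets

# A strip-like mesh of 200 triangles over shared vertices:
idx = []
for t in range(200):
    idx += [t, t + 1, t + 2]
mls = build_meshlets(idx)
print(len(mls), [len(m["vertices"]) for m in mls])   # → 4 [64, 64, 64, 16]
```

Each resulting fragment carries its own compact vertex list, which is exactly what lets the GPU launch independent thread groups instead of walking the whole index buffer sequentially.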
Doubled array of shader ALUs and full DirectX 12 Ultimate support
Despite the many qualitative, and in places frankly revolutionary, changes that the RDNA 2 architecture brings, the headline item in the specifications of the Navi 21 chip is the doubled array of shader ALUs and texture mapping units compared to Navi 10, the basis of the previous generation's flagship, the Radeon RX 5700 XT.
AMD's engineers have done exactly what the rumors about "big Navi" indicated from the start: a return to a configuration with four Shader Engines, the hallmark of top "red" GPUs since the Hawaii chip (Radeon R9 290 series). The Shader Engine is the largest scalable component of the GCN and RDNA architectures, accommodating a number of Compute Units, essentially the analogue of a single CPU core. The number of CUs varies from one GPU to another and in this case is 20. A fully enabled Navi 21 chip thus contains a total of 80 active CUs, which corresponds to 5120 shader ALUs.
"Big Navi" has thus indeed become AMD's largest discrete GPU in terms of raw processing power, measured in FP32 operations per clock. This is not the first time AMD has produced a GPU of this caliber: the Vega 10/20 chips already reached the 4096 shader ALU mark, which still makes a strong impression in retrospect. However, RDNA is an architecture incomparably better optimized for 3D rendering, not to mention the second generation's innovations and the high clock frequencies promised for the new silicon.
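The raw-throughput arithmetic behind these figures is easy to check. The 2.25 GHz boost clock used below is purely illustrative and not a figure from this article; each vector ALU performs one FMA (2 FLOPs) per clock.

```python
cus = 80              # active Compute Units in a full Navi 21
alus_per_cu = 64      # two 32-lane SIMDs per CU
clock_ghz = 2.25      # hypothetical boost clock, for illustration only

shader_alus = cus * alus_per_cu
tflops = shader_alus * 2 * clock_ghz / 1000   # FMA = 2 FLOPs per ALU per clock
print(shader_alus, round(tflops, 2))   # → 5120 23.04
```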
|  | Navi 14 | Navi 10 | Navi 21 |
| --- | --- | --- | --- |
| Graphics cards | Radeon RX 5300 – Radeon RX 5500 XT | Radeon RX 5600 XT; Radeon RX 5700; Radeon RX 5700 XT | Radeon RX 6800; Radeon RX 6800 XT; Radeon RX 6900 XT |
| Process | 7 nm FinFET | 7 nm FinFET | 7 nm FinFET |
| Number of transistors, million | 6400 | 10300 | 26800 |
| Die area, mm² | 158 | 251 | 519.8 |
| **Number of CU/WGP/SA/SE** |  |  |  |
| Compute Units (CU) | 24 | 40 | 80 |
| Workgroup Processors (WGP) | 12 | 20 | 40 |
| Shader Arrays (SA) | 2 (?) | 4 | None |
| Shader Engines (SE) | 1 | 2 | 4 |
| **Compute Unit configuration** |  |  |  |
| Vector ALUs | 2 × 32 | 2 × 32 | 2 × 32 |
| Special-function ALUs (SFU) | 2 × 4 | 2 × 4 | 2 × 4 |
| Ray tracing units (Ray Accelerators) | No | No | 1 |
| Texture mapping units (TMU) | 4 | 4 | 4 |
| Vector (vGPR)/scalar registers | 1024/2560 | 1024/2560 | 1024/2560 |
| L0 cache, KB | 16 | 16 | 16 |
| **Workgroup Processor (WGP) configuration** |  |  |  |
| Local Data Share (LDS), KB | 128 (one per 2 CU) | 128 (one per 2 CU) | 128 (one per 2 CU) |
| Instruction cache, KB | 32 | 32 | 32 |
| Scalar cache, KB | 16 | 16 | 16 |
| L1 cache, KB | 128 (one per 12 CU) | 128 (one per 10 CU) | 256 (one per 20 CU) |
| **Programmable GPU compute units** |  |  |  |
| Shader ALUs | 1536 | 2560 | 5120 |
| Special-function ALUs (SFU) | 192 | 320 | 640 |
| **Fixed-function GPU units** |  |  |  |
| Texture mapping units (TMU) | 96 | 160 | 320 |
| Raster operation units (ROP) | 32 | 64 | 128 |
| L2 cache, KB | 2048 | 4096 | 4096 |
| Infinity Cache, MB | No | No | 128 |
| Memory bus width, bits | 128 | 256 | 256 |
| Memory type | GDDR6 | GDDR6 | GDDR6 |
| PCI Express interface | 4.0 x8 | 4.0 x16 | 4.0 x16 |
The Shader Engine also includes a slice of the L1 cache, a primitive unit responsible for assembling triangles from vertices and for tessellation, and a rasterizer that handles the transition from geometry operations to pixel operations. Finally, there are also the combined Render Backends, which act as the familiar ROPs. And now those few readers who can draw the block diagram of previous AMD chips from memory should notice something in the Navi 21 structure: inside the Shader Engine there is no longer a split into two Shader Arrays, each containing 10 Compute Units with their associated primitive units, RBs and rasterizers. In other words, the proportion between the number of these elements and the shader ALUs has changed dramatically.
It appears that the rasterizers have become a potential bottleneck in the Navi 21 pipeline. Since AMD says nothing about any changes in their throughput, it is reasonable to assume that each rasterizer still produces 16 pixels per clock, which would establish a suboptimal 1:2 ratio between rasterization speed and pixel fill rate. Whether this is actually the case, we cannot say for sure until AMD publishes the whitepaper for the new architecture.
As for the RBs, Navi 21 sustains the pixel fill rate demanded by its powerful shader core because each block is now capable of sampling, testing and blending eight 32-bit pixels per clock instead of four, which gives the GPU the equivalent of 128 separate ROPs.
In addition, thanks to the changes in the RBs, the accelerator has acquired another function required by the DirectX 12 Ultimate standard: Variable Rate Shading. It provides the ability to flexibly and selectively adjust the computational resources allocated to individual fragments of the image, which may demand increased accuracy or, conversely, tolerate a drop in quality. The principle of VRS is similar to MSAA full-screen antialiasing, where there are several color samples per screen pixel, but in reverse: the number of shading samples decreases, so that one pixel shader invocation covers several pixels. The VRS implementation in Navi 21 supports both tiers of Direct3D feature level 12_2, i.e. the shading rate can be specified for an entire draw call on a polygon mesh, for a separate primitive, or for a region of the screen. Various sparse sample grids are allowed, including 1 × 2, 2 × 1 and 2 × 2.
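A toy model makes the trade-off concrete (our illustration of the general idea, not AMD's implementation): with a 2 × 2 coarse shading rate, one "pixel shader" invocation is broadcast to a whole tile, cutting invocations fourfold.

```python
def shade(x, y):
    # Stand-in pixel shader: all the per-pixel cost lives here.
    return (x * 31 + y * 17) % 256

def render(width, height, rate=(1, 1)):
    rx, ry = rate
    invocations = 0
    image = [[0] * width for _ in range(height)]
    for ty in range(0, height, ry):
        for tx in range(0, width, rx):
            color = shade(tx, ty)          # one invocation per coarse tile
            invocations += 1
            for y in range(ty, min(ty + ry, height)):
                for x in range(tx, min(tx + rx, width)):
                    image[y][x] = color    # broadcast to the whole tile
    return image, invocations

_, full = render(64, 64, (1, 1))
_, coarse = render(64, 64, (2, 2))
print(full, coarse)   # → 4096 1024
```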
In addition, RDNA 2 supports shading in texture space. Standard forward rendering means the GPU rasterizes geometry into screen pixels, runs a pixel shader for each pixel and sends the result to the frame buffer. With Texture Space Shading, the output of the pixel shaders is saved as texels in texture space, which allows it to be reused many times and thus saves clock cycles on further shader calculations. Rendering quality can also improve in some situations, because TSS breaks the binding to screen-space coordinates, and the shaders can be executed outside the graphics pipeline itself.
The precondition for effective TSS is determining in advance which objects are worth shading in texture space. To solve this problem, and also to reduce the renderer's overall demands on the graphics card's local memory capacity, DirectX 12 Ultimate provides the Sampler Feedback mechanism. Each time the GPU requests a certain tile (when tiled resources are used) or MIP level of a texture, a note is made in a special buffer that this resource was requested at a certain level of detail, and it is loaded into VRAM. Otherwise, memory holds only low-detail copies of the resources.
RDNA 2 against Turing and Ampere
Now let's dive deeper into the RDNA 2 chips, to the level of the logic engaged in shader calculations. A single Compute Unit contains 64 so-called vector ALUs, divided into two SIMD (Single Instruction, Multiple Data) blocks 32 lanes wide. Each SIMD executes, per clock, one instruction for a group of 32 threads (a wavefront in AMD's terminology). Thanks to separate schedulers serving their own SIMDs and issuing instructions every clock, RDNA is characterized by lower instruction execution latency than the old GCN architecture. At the same time, RDNA can still work with the old 64-thread wavefront format, in which case a SIMD executes one instruction over two clocks, and this has its advantages: the CU gets a time window in which to wait for the data needed by the next instruction, for example from GPU memory. The wavefront width for program execution is chosen by the compiler: compute shaders are usually compiled in Wave32 format, pixel shaders in Wave64.
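A back-of-the-envelope issue model (a sketch under our own assumptions, not a cycle-accurate simulation) shows why the format choice is about latency hiding rather than throughput: a Wave32 instruction occupies a 32-lane SIMD for one clock, a Wave64 instruction for two, so the same work takes the same number of clocks either way.

```python
def simd_cycles(num_wavefronts, instructions_per_wave, wave_width):
    # One RDNA SIMD32: Wave32 issues in 1 clock, Wave64 in 2.
    clocks_per_instr = 1 if wave_width == 32 else 2
    return num_wavefronts * instructions_per_wave * clocks_per_instr

# The same 2560 threads running a 100-instruction shader, packed either way:
w32 = simd_cycles(num_wavefronts=80, instructions_per_wave=100, wave_width=32)
w64 = simd_cycles(num_wavefronts=40, instructions_per_wave=100, wave_width=64)
print(w32, w64)   # → 8000 8000
```

Throughput comes out identical; Wave64 simply gives the CU a spare clock between issues of the same wavefront, which helps cover memory latency.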
RDNA executes FP16 operations on its SIMDs at double rate, and the same applies to INT16 integer operations. In addition, the architecture supports several varieties of mixed precision, where an FMA instruction multiplies two reduced-precision operands (FP16 for floating-point numbers, down to INT4 for integers) and then accumulates the result into FP32 or INT32. Such calculations proceed at a rate that grows in proportion to the reduction in operand precision (see illustration). Depending on the specific RDNA GPU, it is additionally equipped with dedicated ALUs for double-precision (FP64) calculations, from 2 to 16 per vector SIMD, with the instruction rate varying accordingly, from 1/2 to 1/16 of FP32. The latter figure applies to Navi 21, since it is first and foremost a gaming GPU; high FP64 throughput should be sought in devices based on the recently introduced CDNA architecture.
In fact, AMD here provides data on the throughput of a whole WGP (Workgroup Processor), i.e. a pair of CUs, not a single one.
In addition to the vector SIMDs, each Compute Unit has a dual scalar pipeline used for conditional branches, jumps and similar integer arithmetic. In such cases, the operand values are identical across the entire wavefront, so the instruction is scalarized into a single operation rather than wasting energy loading the SIMDs with identical calculations. RDNA's scalar blocks can receive their instructions in parallel with the vector instruction of the corresponding SIMD; we recently managed to get AMD's engineers to confirm that such parallelism really does exist in RDNA.
Finally, another type of execution unit present in the Compute Unit, the SFU, handles so-called special-purpose operations: trigonometric functions, which are often used in 3D rendering tasks. The SFU is a separate SIMD bound to each of the main vector SIMDs. It serves as an alternative path for wavefront instructions and executes them at 1/4 the rate of ordinary vector instructions. To let the CU issue work to the SFU, the vector SIMD skips one clock and is then ready to accept and execute instructions in the standard mode.
The table below summarizes the theoretical CU throughput of the RDNA and RDNA 2 architectures compared with NVIDIA's modern Turing and Ampere logic and with GCN-based solutions that are suitable candidates for replacement by the fresh Radeon 6000-series accelerators. We took 8 GPU clocks as the unit of time to minimize the number of fractional values, and did not try to cover all possible combinations of instruction types: here we are only interested in standard-precision floating-point and integer operations (FP32 and INT32), as well as trigonometric functions (SF) and reduced-precision floating-point arithmetic (FP16).
RDNA 2 did not get a separate column in the table, because AMD reports no deep changes in how the Compute Unit operates; it is only said that the CU has gained 30% performance per watt thanks to higher clock frequencies. But that already concerns the energy efficiency of the Navi 21 chip, which we will discuss (and then measure independently) later.
The RDNA architecture is equivalent to Turing in theoretical FP32/INT32, FP16/INT16 and SFU throughput when the load consists of only one type of instruction. It is striking, however, that RDNA/RDNA 2 lacks the ability to execute FP32 and INT32 calculations in parallel, which Turing gained, or to simply double FP32 throughput, as Ampere does. On the other hand, RDNA has unique advantages of its own. Scalarization of instructions does not increase the performance of NVIDIA chips and serves only to save power, while in RDNA scalar and vector instructions are issued for execution simultaneously. Furthermore, the scheduler's clients in "green" GPUs also include the tensor cores, the branch unit and the group of load/store units; to use any of them, the scheduler must forgo issuing instructions to the CUDA shader cores in that clock cycle. RDNA avoids this situation thanks to the large number of scheduler ports (it can issue up to five instructions of different types per clock): memory requests and branches are executed in parallel with the issue of instructions to the vector and scalar ALUs.
Not surprisingly, in real applications, as opposed to theoretical performance estimates, it is not easy to fully realize the potential of Ampere chips. Results approaching the nameplate figures can only be expected under a purely compute-bound load. Indeed, no game benchmark has ever reproduced the kind of huge performance gap between Ampere products and previous-generation accelerators that, for example, Blender rendering tests show. It seems the graphics architectures developed by NVIDIA and AMD have swapped places: once it was the "reds" who held the advantage in the number of shader ALUs, which showed most clearly in GP-GPU tasks. Now the same can be said of Ampere, which clearly gravitates toward compute-style loads despite the primarily gaming orientation of the GA102 and GA104 chips.
But what AMD still has no answer to is the aforementioned tensor cores, designed for extremely fast FMA operations on reduced-precision data, used mainly in the inference (and, for some time now, training) of neural networks. RDNA-based chips are forced to do this work comparatively slowly on their vector SIMDs. Even though RDNA executes half-precision instructions at double rate, one CU performs no more than 128 FP16 operations per clock, while one NVIDIA streaming multiprocessor manages as many as 512, provided the operations are in tensor format. In addition, while the tensor cores are engaged, the shader ALUs also have to idle.
Hardware ray tracing
Strange as it may sound, there is not much we can say about hardware-accelerated ray tracing, one of the headline features of the second-generation RDNA architecture. Just like the rival Turing and Ampere architectures from NVIDIA, RDNA 2 relies on the popular Bounding Volume Hierarchy algorithm to optimize the search for intersections between a ray and the polygons of the scene. BVH sorts the primitives into nested boxes in advance. Thus, to quickly find the intersection point of a ray with a primitive's surface, the program recursively descends the BVH tree structure and only then calculates the barycentric coordinates of the ray's intersection with the plane, instead of performing an extremely inefficient exhaustive test of every triangle in the scene.
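The principle can be sketched in a few lines. This illustrates hierarchical culling in general, not AMD's hardware; for brevity it works in 2D, and leaf "primitives" are represented by their own bounding boxes rather than real triangles.

```python
def ray_hits_box(origin, direction, box):
    # Classic slab test: intersect the ray with each axis interval.
    (x0, y0, x1, y1) = box
    tmin, tmax = 0.0, float("inf")
    for o, d, lo, hi in ((origin[0], direction[0], x0, x1),
                         (origin[1], direction[1], y0, y1)):
        if d == 0.0:
            if not (lo <= o <= hi):
                return False
        else:
            t1, t2 = (lo - o) / d, (hi - o) / d
            tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(node, origin, direction):
    # Descend only into boxes the ray actually crosses,
    # instead of testing every primitive in the scene.
    if not ray_hits_box(origin, direction, node["box"]):
        return []
    if "prims" in node:   # leaf: test the primitives inside
        return [p for p in node["prims"] if ray_hits_box(origin, direction, p)]
    hits = []
    for child in node["children"]:
        hits += traverse(child, origin, direction)
    return hits

bvh = {"box": (0, 0, 8, 8), "children": [
    {"box": (0, 0, 4, 4), "prims": [(1, 1, 2, 2), (3, 3, 4, 4)]},
    {"box": (4, 4, 8, 8), "prims": [(5, 5, 6, 6), (7, 7, 8, 8)]},
]}
# A diagonal ray through the whole scene hits every primitive:
print(len(traverse(bvh, (0.0, 0.0), (1.0, 1.0))))   # → 4
# A ray confined to the lower-left quadrant never enters the other subtree:
print(len(traverse(bvh, (0.0, 1.5), (1.0, 0.0))))   # → 1
```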
Nevertheless, even with BVH, ray tracing remains an extremely resource-intensive task when performed on universal shader ALUs, which is why NVIDIA's Turing architecture introduced (and Ampere chips then reinforced) fixed-function blocks designed specifically to traverse the BVH structure and compute the intersection of a ray with the primitives inside the smallest box. Blocks of similar purpose, called Ray Accelerators (RA), were added to the RDNA architecture in the Navi 21 chip. One part of the RA handles the BVH, the other searches for coordinates on the polygon, at a rate of 4 and 1 ray intersections per clock, respectively. Alas, we have no comparable information about the RT blocks in NVIDIA chips, but we know at least that Ampere (unlike Turing and RDNA 2) allows the two parts of an RT block to track different rays simultaneously. In addition, Ampere contains specific optimizations to accelerate ray-traced motion blur, which are also absent in RDNA 2.
According to internal testing, an AMD accelerator based on Navi 21 delivers 10 times the performance through hardware ray tracing compared with purely software rendering on the same hardware. However, these numbers say nothing yet about performance in real applications. The chipmaker avoids direct comparisons of the new graphics cards with NVIDIA's offerings in game benchmarks with ray tracing, and the absolute frame-rate figures it does give are not accompanied by sufficiently detailed comments on the testing methodology to let us compare them with the already well-studied Turing and Ampere results. Still, we will soon find out how things really stand. The main thing is that hardware-accelerated ray tracing, previously impossible on "red" hardware, has now become a reality, and lagging behind the pioneer is quite forgivable for AMD's early solutions.
A more serious problem for Radeon with hardware ray tracing is the lack of frame-upscaling tools as effective as DLSS 2.0 on NVIDIA graphics cards. Even the top GeForce RTX 30-series models do not always sustain a comfortable frame rate at high resolution without upscaling, and game developers, it seems, are already getting used to factoring the speedup DLSS provides into the requirements of their graphics engines. This puts AMD hardware, clearly weaker in ray tracing, at a disadvantage. The company is currently working on another extension of the FidelityFX libraries called Super Resolution, which promises better detail reconstruction than the existing FidelityFX CAS algorithm, but however you look at it, the game market remains fragmented, and not every studio will spend the effort to integrate two competing technologies.
Like the RT cores in Turing and Ampere chips, RDNA 2 enables hardware-accelerated ray tracing through the standard DXR programming interface and the corresponding Vulkan extensions. In professional visualization tasks, the Ray Accelerator blocks are activated by the Radeon ProRender plugin version 3.0 (still in beta) for several 3D modeling programs, and in the future support will extend to the Cycles renderer in the Blender package. Here again, NVIDIA has come a long way since the debut of the Turing chips, and ray-tracing acceleration on GeForce and Quadro graphics cards is now used almost everywhere. AMD will have to catch up, but on the other hand, NVIDIA has already played the role of icebreaker, and as a result the software infrastructure of games and professional applications is now much better adapted to hardware of this kind.
By the way, we would not be surprised if one of the first companies to provide RDNA 2 chips with a software platform for ray tracing turns out to be Apple. The company's in-house chips look promising, but Apple will probably not manage without third-party graphics processors in its high-performance desktops and workstations.
Navi 21 memory hierarchy and Infinity Cache
Now it is finally time to discuss what is unquestionably the main innovation of the RDNA 2 architecture: the local memory stack, which includes a huge third-level cache. All the other storage closest to the shader ALUs has undergone almost no changes compared with the first-generation Navi chips. RDNA Compute Units are linked in pairs (Workgroup Processors) sharing a 128 KB LDS (Local Data Share), the fastest type of memory after the vector SIMD registers (had it grown, AMD would surely have said so), as well as a 32 KB instruction cache and a 16 KB scalar cache. The zero-level cache is exclusive to each CU. The amount of L1 cache within each Shader Engine also remains the same; the four slices together provide a total of 1 MB of first-level cache.
There is even some regression: the bandwidth from L0 to L1 has been halved, and the L2 cache shared by the Shader Engines and command processors has kept its 4 MB capacity despite Navi 21's doubled compute potential. However, all of this is more than made up for by the measures AMD has taken at the outermost memory levels available to the GPU.
A GPU of Navi 21's caliber needs high-speed access to large amounts of data. To meet this need, AMD could have gone down the beaten path of HBM2 memory, which gave the Radeon VII, still in service among consumer graphics cards, a record 1 TB/s of memory bandwidth, or used a 512-bit bus, which combined with ordinary GDDR6 chips would guarantee no less bandwidth. However, both of these solutions are problematic for different reasons, the common one being price. Finally, there are GDDR6X chips, hardly less costly in the end product; and although that standard is not formally an NVIDIA exclusive, Micron developed it together with NVIDIA, and a third party in this story would probably be superfluous.
Fortunately, AMD has found its own, and seemingly the most promising, approach to the problem. The company gave up trying to speed up external dynamic memory and uses plain 16 Gbps GDDR6 chips on a 256-bit bus in the new accelerators, which together yields 512 GB/s. The 6000-series devices are thus close to the Radeon RX 5700 XT with its 448 GB/s. However, in addition to discrete DRAM, the GPU die itself now houses a massive 128 MB third-level cache called Infinity Cache.
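The memory bandwidth arithmetic from the paragraph above is straightforward: data rate per pin times bus width, divided by 8 bits per byte.

```python
def gddr6_bandwidth(data_rate_gbps, bus_width_bits):
    # Per-pin data rate (Gbps) × bus width (bits) / 8 bits per byte = GB/s.
    return data_rate_gbps * bus_width_bits / 8

navi21 = gddr6_bandwidth(16, 256)   # Radeon RX 6800 / 6800 XT
navi10 = gddr6_bandwidth(14, 256)   # Radeon RX 5700 XT
print(navi21, navi10)   # → 512.0 448.0
```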
Infinity Cache differs significantly from the L1 and L2 caches that are an integral part of any modern GPU. The internal stores of the CUs and Shader Engines are tuned for high bandwidth rather than capacity: the RDNA channel between L1 and the Compute Units within a Shader Engine carries a total of 4096 bytes per clock, and 2048 bytes pass between all the L1 slices and L2. For the GPU as a whole, these figures translate into tens of TB/s, but because of the modest cache sizes the hit rates are relatively low. Most importantly, scaling such a structure up would be extremely expensive in terms of die area. Instead, Navi 21 received a last-level cache as a buffer between L2 and external dynamic memory, built on the model of the L3 in Zen-architecture CPUs and with an extremely high layout density (a quarter of the area, relative to capacity, of the L2 in Navi chips).
The solid horizontal blocks at the top and bottom are the Infinity Cache.
In fact, solutions like Infinity Cache are not new. A large array of static (or high-speed dynamic) memory serving the GPU, either on the die itself or as a separate chip, has repeatedly appeared in game console SoCs, and until recently was used in Intel's mobile CPUs. But among discrete PC graphics processors, Navi 21 is nevertheless a pioneer.
It is a pity AMD did not tell us what share of the die area the Infinity Cache occupies, and we have no photos of the bare die to measure it even approximately (we cannot bring ourselves, sorry, to sand down the chip of our test graphics card). But the die itself turned out, as expected, to be large for this many compute blocks. Navi 21 contains a total of 26.8 billion transistors and is thus not far behind the flagship among NVIDIA's consumer chips, the GA102 (28.3 billion). Notably, Navi 21 not only exceeds Navi 10's transistor count by a factor of 2.6 but also has a higher component density, which should be credited to the compact SRAM library the Infinity Cache is built from, since both processors are manufactured on the same TSMC 7 nm node.
The L3 cache is connected to the second-level cache by an Infinity Fabric bus consisting of 16 channels 64 bytes wide, running at a clock speed of up to 1.94 GHz. The interface's bandwidth is significantly lower than that of the deeper layers of the memory stack, but still reaches an impressive 1.99 TB/s, almost four times that of the GDDR6 chips on the 256-bit bus (512 GB/s for the new AMD accelerators). Infinity Cache sits in its own clock domain, whose frequency drops to 1.4 GHz when L3 is accessed infrequently. In addition, the high bandwidth of Infinity Fabric, whose clients also include the GDDR6 controllers, speeds up even non-cached access to RAM.
Numerical estimates of Infinity Cache performance suggest that its latency is 48% lower than that of the 14 Gbps GDDR6 memory used in the Radeon RX 5700 XT, while the averaged latency of L3 and VRAM in the 6000-series accelerators is 34% lower than in the previous generation. Infinity Cache also has enough capacity to hold much of the data the ray-tracing units need to traverse BVH structures. According to the chipmaker, the overall hit rate in L3 reaches 58% in 4K games; that is, up to 58% of GPU requests are served at extremely high speed, yielding an effective memory bandwidth second only to the latest server GPUs with HBM2 memory.
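A crude way to see why the 58% hit rate matters is to blend the L3 link bandwidth (16 × 64 B × 1.94 GHz per the stated parameters) with the 512 GB/s GDDR6 subsystem in proportion to hits and misses. This is our own back-of-envelope model, not AMD's methodology, and it ignores latency effects entirely:

```python
# Hit/miss-weighted effective bandwidth: a simplistic illustration of the
# benefit of a 58% L3 hit rate. Not an AMD-published model.
hit_rate = 0.58
l3_gbps = 16 * 64 * 1.94        # ~1987 GB/s over the Infinity Fabric link
vram_gbps = 512                 # 256-bit GDDR6

effective = hit_rate * l3_gbps + (1 - hit_rate) * vram_gbps
print(f"Effective bandwidth: {effective:.0f} GB/s")   # ~1367 GB/s
```

Even this simplistic blend lands at roughly 2.7 times the raw GDDR6 bandwidth, which is in the territory of wide HBM2 configurations.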
Of course, 58% is the most optimistic estimate, and high hit rates here are guaranteed by driver algorithms. AMD does not require software to explicitly specify which data should be placed in L3 (although such a capability exists), so all existing applications automatically benefit from the new memory architecture.
We expect that Navi 21’s performance in different games and professional software will not be uniformly high and will surely grow in the future as drivers and applications are optimized. In any case, Infinity Cache has already allowed AMD to boldly double the number of shader ALUs and dramatically raise GPU clock speeds without worrying about memory bandwidth, while at the same time being more efficient than alternative solutions involving a 384-bit or 512-bit GDDR6 bus.
Finally, AMD announced support in Windows for the DirectStorage API, which enables game resources to be loaded from SSDs directly into the GPU’s local memory. Unfortunately, the chipmaker is far less forthcoming about the details of its DirectStorage implementation than its competitor is about the similar RTX IO technology. In particular, there is no mention of data decompression on the shader ALUs, which is an important part of the NVIDIA solution. Moreover, DirectStorage gives the impression of a software product that does not rely on fundamentally new hardware: RTX IO, for example, will run not only on the fresh GeForce 30-series graphics cards but also on the GeForce RTX 20. AMD, in turn, did not specify whether DirectStorage compatibility extends to first-generation RDNA chips, although we see no technical reasons that would prevent it.
Nor can we ignore another RDNA 2 feature, Smart Access Memory, whose operating principle is not yet fully clear. Judging by how its creators characterize SAM, it gives the CPU direct access to the full amount of the graphics card’s local memory. In a typical gaming PC, a small part of VRAM is always mapped into the system’s address space, but SAM, we assume, simply allows the entire VRAM to be mapped, avoiding unnecessary data copying. Software optimization is welcome to get the most out of it, but even now the technology promises, by AMD’s averaged estimates, an extra 6% FPS in popular games, with a maximum of up to 11%. SAM is available on Radeon 5000- and 6000-series accelerators, but there is a catch: it only works with Ryzen 5000 processors and motherboards based on the B550 or X570 chipset (you will need to update the BIOS and enable SAM in its settings).
It recently became known that NVIDIA is working on its own analogue of Smart Access Memory, which will work with Intel chips and, if AMD does not object, with Ryzen as well. If so, it is possible that AMD will lift the restrictions on the graphics-card side of SAM, and full access to VRAM will, we hope, sooner or later be open to any combination of CPU and GPU.
AV1 decoding and HDMI 2.1 output
According to the official specifications, the Navi 21 multimedia engine has not changed compared to the corresponding block of first-generation Navi chips in terms of decoding and encoding the H.264, HEVC and VP9 video standards, although it must be said that AMD understates its potential: in our own benchmarks of the Radeon RX 5700 XT we obtained better results. However, Navi 21 has gained the ability to encode HEVC at 8K resolution, which the “red” GPUs lacked before, as well as support for B-frames (one of the types of intermediate frames) in H.264. More importantly, Navi 21 has learned to decode the advanced and extremely resource-hungry AV1 standard at 30 FPS in 8K resolution (even if that is not 60 FPS or more, as in NVIDIA’s Ampere silicon).
Finally, the advanced HDMI 2.1 video interface, which has already reached commercial implementation in TVs and monitors, is now being adopted by graphics cards. The Navi 21 display controller uses the full bandwidth of HDMI 2.1 to output an 8K image at 60 Hz or, more practically, 4K at 120 Hz without resorting to data compression.
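It is easy to verify that uncompressed 4K at 120 Hz fits into HDMI 2.1. The sketch below is our own back-of-envelope calculation: it compares the raw pixel payload against the link's effective capacity (a 48 Gbit/s FRL link with 16b/18b coding) and ignores blanking intervals, which add some overhead but do not change the conclusion:

```python
# Raw video payload vs. HDMI 2.1 link capacity (blanking intervals ignored).

def raw_gbps(width: int, height: int, hz: int, bits_per_pixel: int) -> float:
    """Raw pixel data rate in Gbit/s for an uncompressed video mode."""
    return width * height * hz * bits_per_pixel / 1e9

# HDMI 2.1 FRL: 48 Gbit/s aggregate link rate, 16b/18b coding -> ~42.7 Gbit/s payload
link_payload = 48 * 16 / 18

uhd120 = raw_gbps(3840, 2160, 120, 24)   # 4K120, RGB at 8 bits per channel
print(f"4K120 RGB: {uhd120:.1f} Gbit/s vs {link_payload:.1f} Gbit/s link payload")
```

At roughly 23.9 Gbit/s of raw pixel data against ~42.7 Gbit/s of payload capacity, 4K120 in full RGB fits with room to spare even once blanking overhead is accounted for.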