Over a year ago, AMD laid out a two-pronged strategy for its 14nm GPU refreshes. First, it would refresh its entry-level and midrange GPUs with new 14nm hardware based on an updated Graphics Core Next GPU, codenamed Polaris. These cards would be followed by a full high-end refresh based on a new GPU design, codenamed Vega. Polaris arrived on schedule and delivered a significant performance boost in the areas where AMD needed it most, but details on Vega have been slow to materialize.
Today, that changes. We don’t have review hardware in hand yet, but AMD has finally pulled back the curtain and shared some significant information on what Vega can do and what it changes compared with GCN. Vega will use second-generation High Bandwidth Memory (HBM2) rather than HBM or GDDR5. HBM2 delivers two substantial improvements over HBM: It doubles the data rate per pin, meaning twice the memory bandwidth is provided by the same number of “stacks,” and it significantly increases how much RAM can fit into each stack. HBM, if you recall, topped out at four 1GB stacks, or 4GB of RAM total. This was already a bit of a tight squeeze for AMD’s Fury X family in June 2015, but HBM2 demolishes this limitation.
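To put the bandwidth claim in numbers, here is a rough back-of-the-envelope sketch. The 1,024-bit interface per stack and the 1Gbps and 2Gbps per-pin rates come from the published HBM specifications, not from AMD’s presentation, and the function name is ours.

```python
def stack_bandwidth_gbs(pins: int, gbps_per_pin: float) -> float:
    """Peak bandwidth of one memory stack in GB/s (bits converted to bytes)."""
    return pins * gbps_per_pin / 8

PINS_PER_STACK = 1024   # assumption: HBM/HBM2 interface width per stack
HBM1_GBPS = 1.0         # assumption: first-generation HBM per-pin data rate
HBM2_GBPS = 2.0         # assumption: HBM2 peak per-pin data rate

hbm1 = stack_bandwidth_gbs(PINS_PER_STACK, HBM1_GBPS)
hbm2 = stack_bandwidth_gbs(PINS_PER_STACK, HBM2_GBPS)
print(f"HBM : {hbm1:.0f} GB/s per stack, {4 * hbm1:.0f} GB/s with four stacks")
print(f"HBM2: {hbm2:.0f} GB/s per stack, {2 * hbm2:.0f} GB/s with two stacks")
```

Under these assumptions, two HBM2 stacks would match the peak bandwidth of a four-stack HBM configuration like the one on Fury X.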
Consumer cards with HBM2 will likely start at 8GB of RAM, with the standard capable of supporting at least 32GB. Any cards with that much RAM that appear in 2017 will be workstation or server-oriented, but the headroom is there when AMD eventually needs it for consumer cards. Rumors that AMD would release both HBM2 and GDDR5X versions of Vega appear to be wrong, much like the rumors of an 8GB Fury X that circulated in the run-up to that GPU’s launch and never materialized.
AMD isn’t just relying on HBM2 for traditional memory, however. Vega will also introduce two new HBM2-related features: a High Bandwidth Cache (HBC) and a High Bandwidth Cache Controller (HBCC).
The HBC and HBCC give Vega a large (in comparison to on-die caches, though exact size isn’t known) memory pool that it can use in a variety of ways. AMD isn’t giving out the exact details on how this cache functions yet, but the goal is to enable fine-grained data movement and keep important data local to the GPU without having to pull it out of memory. It can also be accessed without stalling other workloads — normally the GPU will stall if pulling texture data out of main memory, whereas AMD’s HBCC avoids this problem.
The High Bandwidth Cache Controller provides 512TB of virtual address space, and it uses relatively small pages to ensure the GPU gets fed the data it needs rather than a bunch of information that ultimately won’t be used. There are also algorithms in place to monitor the rate at which data is loaded into or evicted from the cache.
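A small sketch helps show why page size matters here. AMD hasn’t disclosed the HBCC’s actual page sizes or policies, so the page sizes, the access pattern, and the function names below are all hypothetical. The only figure taken from AMD is the 512TB address space, read here as binary terabytes.

```python
ADDRESS_SPACE_BYTES = 512 * 2**40  # 512TB, interpreted as binary terabytes

def address_bits(space_bytes: int) -> int:
    """Number of address bits needed to cover a power-of-two address space."""
    return (space_bytes - 1).bit_length()

def bytes_fetched(touched_offsets, page_size: int) -> int:
    """Bytes moved if each touched offset pulls in its whole page exactly once."""
    pages = {offset // page_size for offset in touched_offsets}
    return len(pages) * page_size

# Hypothetical sparse access pattern: 16 reads spaced 1MiB apart.
touched = [i * 2**20 for i in range(16)]

print(address_bits(ADDRESS_SPACE_BYTES), "address bits")
for page_size in (4 * 2**10, 2 * 2**20):  # hypothetical 4KiB and 2MiB pages
    print(f"{page_size:>8} B pages -> {bytes_fetched(touched, page_size)} B moved")
```

With this made-up pattern, 4KiB pages move 64KiB in total, while 2MiB pages move 16MiB to satisfy the same 16 reads. That is the kind of wasted transfer that small pages are meant to avoid.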
One of the most common misconceptions about GPU RAM allocation is that the popular freeware utility GPU-Z can tell you how much RAM the GPU is actually using. As we first covered in our tests of whether 4GB of RAM was enough for the Fury X, it cannot. GPU-Z and all of the other utilities that report VRAM usage under DirectX 11 cannot tell you how much RAM the GPU is actually utilizing, because the DirectX 11 API does not expose that information. Instead, they report how much RAM has been allocated by the GPU, not whether the GPU is actually making any use of that VRAM. According to AMD’s slide, the gap between how much VRAM has been allocated and how much VRAM is actually in use is quite significant, even in popular titles. The goal of AMD’s HBC + HBCC cluster is to allow the GPU to load and access data more efficiently.
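The difference between allocation and use can be illustrated with a toy model. This is not how any driver or monitoring tool works internally. The class, the resource names, and the sizes are invented purely to show why an allocation counter overstates real usage.

```python
class VramTracker:
    """Toy bookkeeping for reserved memory versus memory actually touched."""

    def __init__(self):
        self.allocated = {}  # resource name -> bytes reserved
        self.touched = {}    # resource name -> bytes actually read or written

    def allocate(self, name: str, size: int) -> None:
        self.allocated[name] = size
        self.touched[name] = 0

    def touch(self, name: str, size: int) -> None:
        # Simplification: touches are treated as non-overlapping and are
        # capped at the size reserved for the resource.
        self.touched[name] = min(self.allocated[name], self.touched[name] + size)

    def totals(self):
        """Return (bytes allocated, bytes touched) across all resources."""
        return sum(self.allocated.values()), sum(self.touched.values())

MIB = 2**20
tracker = VramTracker()
tracker.allocate("terrain_textures", 2048 * MIB)    # hypothetical resource
tracker.allocate("character_textures", 1024 * MIB)  # hypothetical resource
tracker.touch("terrain_textures", 512 * MIB)
tracker.touch("character_textures", 256 * MIB)

allocated, used = tracker.totals()
print(f"allocated {allocated // MIB} MiB, touched {used // MIB} MiB")
```

A tool that can only see the first number would report 3GB in this example, even though only a quarter of it was ever accessed.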
Meet the NCU:
From 2012 to the present day, AMD’s GPUs have all been built around Graphics Core Next and its Compute Units. With Vega, AMD is debuting its New Compute Units (NCUs). There are 128 cores per NCU (double what GCN offered), and each NCU is capable of 512 8-bit operations per clock, 256 16-bit operations per clock, or 128 32-bit operations per clock.
NCUs can pack multiple 8-bit or 16-bit operations into the same execution window, allowing the GPU to double or quadruple its throughput depending on its workload. Our understanding is that the ALU doesn’t dynamically reconfigure itself on the fly, but it can execute variable-width instructions (1x 32-bit, 2x 16-bit, etc.). This gives AMD a potent hand to play in emerging fields like AI and deep learning by boosting throughput.
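The idea behind packed math can be sketched in software. The code below only emulates the concept with integers. It says nothing about how Vega’s ALUs are actually built, and every function name is ours. The one figure taken from AMD is the 128 32-bit operations per clock per NCU.

```python
BASE_OPS_32 = 128  # 32-bit operations per clock per NCU, as quoted by AMD

def ops_per_clock(width_bits: int) -> int:
    """Peak operations per clock if narrower values share a 32-bit lane."""
    return BASE_OPS_32 * (32 // width_bits)

def pack16(lo: int, hi: int) -> int:
    """Pack two unsigned 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack16(word: int) -> tuple:
    """Split a 32-bit word back into its (low, high) 16-bit halves."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

def packed_add16(a: int, b: int) -> int:
    """Add the two 16-bit lanes of each word; no carry crosses between lanes."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = ((a >> 16) + (b >> 16)) & 0xFFFF
    return (hi << 16) | lo

for width in (32, 16, 8):
    print(f"{width:>2}-bit: {ops_per_clock(width)} ops/clock")

print(unpack16(packed_add16(pack16(1, 2), pack16(3, 4))))
```

One 32-bit word carries two independent 16-bit additions here, which is where the doubling from 128 to 256 operations per clock comes from. Halving the width again yields the 512 figure.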
One of the weak spots of GCN was that the core dramatically favored width over clock speed. This worked well early in its life, when it competed against Kepler and to some extent Maxwell, but Pascal gave Nvidia an enormous amount of clock speed headroom that AMD’s wider RX 480 design didn’t counter effectively. AMD isn’t releasing its target clock speeds or IPC rates just yet, but Vega is designed to give improvements on both fronts, with both higher clock rates and higher IPC efficiency.
Finally, AMD is connecting Vega’s ROPs directly to its L2 cache. This will boost performance in games that use deferred rendering because it allows the GPU’s render backends to write directly to L2 rather than moving data through main memory first.
There’s a lot we still don’t know about Vega, including TDP, number of cores, price, and performance figures. AMD has played its cards close to the vest with both Vega and Ryzen, disclosing information only bit by bit. We still don’t have a release date for Vega. AMD has previously said H1, but the company may be holding back specifics here as well. For now, these aren’t areas where I’m willing to speculate.
As for the GPU design itself, these look like the sort of improvements Vega needs to make. Nvidia’s Maxwell was a huge efficiency leap over Kepler, and its tiled renderer is thought to have been a large part of the reason. AMD adopting a similar approach makes good sense, while the core’s high-bandwidth cache and cache controller offer capabilities we haven’t seen on a GPU before.
We know AMD needed both higher IPC and faster clock speeds, and the company is promising Vega delivers both. While throughput figures don’t tell us everything, being able to issue up to 11 polygons per clock instead of four is a substantial improvement to geometry processing, even before we take relative efficiency into account.
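As a quick illustration of what the polygon figure means, peak geometry rate is simply polygons per clock multiplied by clock speed. AMD hasn’t announced Vega’s clocks, so the 1GHz value below is a placeholder chosen only to make the comparison easy to read.

```python
def peak_polygons_per_second(polygons_per_clock: int, clock_hz: float) -> float:
    """Theoretical peak geometry rate, ignoring all real-world bottlenecks."""
    return polygons_per_clock * clock_hz

CLOCK_HZ = 1.0e9  # hypothetical clock, identical for both designs

old = peak_polygons_per_second(4, CLOCK_HZ)
new = peak_polygons_per_second(11, CLOCK_HZ)
print(f"{new / old:.2f}x peak geometry rate at equal clocks")
```

At equal clocks the peak rate rises 2.75x, and any clock speed increase Vega delivers would multiply on top of that.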
There are some other interesting parts of the block diagram, like the “Network Storage” block. This could be a reference to the SSD+GPU concept AMD unveiled at SIGGRAPH 2016, or even an on-die low-latency storage pool that bypasses the need to pull data in via PCI Express. Meanwhile, the variable ALU width that supports 8-bit, 16-bit, and 32-bit data gives AMD the opportunity to duke it out in the high-end HPC, AI, and deep learning markets, where Nvidia has dominated to date.
Paper specs can only tell us so much, and I’m not going to render a verdict on Team Green versus Team Red performance until we’ve got hardware in hand. But based on what we’ve seen, AMD has made the right moves with Vega. After five years with GCN, AMD needed a dramatically new approach. It looks like it has one.