How do you get past reticle limits? Split into multiple dies… maybe?

When designing an integrated circuit, you are attempting to fit as much complexity as possible within your budget of space, power, and so forth. One harsh limitation for GPUs is that, while your workloads could theoretically benefit from more and more processing units, the number of usable chips from a batch shrinks as designs grow, and the reticle limit of a fab's manufacturing node is basically a brick wall.
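To put a number on that shrinkage, here's a quick back-of-envelope using the classic Poisson yield model. To be clear, the defect density and die areas below are illustrative values I picked, not figures from NVIDIA's paper.

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Classic Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-defects_per_mm2 * die_area_mm2)

DEFECT_DENSITY = 0.001  # defects per mm^2, an assumed value for illustration

# One big die versus the smaller dies you could split it into.
for area_mm2 in (150, 300, 600):  # die areas chosen arbitrarily
    print(f"{area_mm2} mm^2 die: ~{poisson_yield(area_mm2, DEFECT_DENSITY):.1%} yield")
```

At these made-up numbers, a 150 mm² module yields about 86% of good dies while a single 600 mm² die yields only about 55%, which is the argument for splitting in a nutshell.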
What's one way around it? Split your design across multiple dies! NVIDIA published a research paper discussing just that. In their diagram, they show two examples. In the first, the GPU is a single, typical die that's surrounded by four stacks of HBM, like GP100. The second configuration breaks the GPU into five dies: four GPU modules and an I/O controller, with each GPU module attached to a pair of HBM stacks.
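If it helps to picture the two layouts, here's a tiny sketch that just tallies HBM stacks per configuration. The ~180 GB/s per-stack bandwidth is my assumption (roughly GP100-era HBM2), not a number from the paper.

```python
from dataclasses import dataclass

@dataclass
class PackageConfig:
    name: str
    gpu_dies: int            # dies carrying compute units (the I/O die isn't counted)
    hbm_stacks_per_die: int

    def total_stacks(self) -> int:
        return self.gpu_dies * self.hbm_stacks_per_die

# Config 1: one monolithic die ringed by four HBM stacks, GP100-style.
monolithic = PackageConfig("monolithic", gpu_dies=1, hbm_stacks_per_die=4)
# Config 2: four GPU modules, each paired with two HBM stacks.
mcm = PackageConfig("multi-chip", gpu_dies=4, hbm_stacks_per_die=2)

PER_STACK_GBPS = 180  # assumed per-stack bandwidth, in the GP100 ballpark
for cfg in (monolithic, mcm):
    print(f"{cfg.name}: {cfg.total_stacks()} stacks, "
          f"~{cfg.total_stacks() * PER_STACK_GBPS} GB/s aggregate HBM bandwidth")
```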
NVIDIA ran simulations to determine how this multi-die chip would perform, and, in various workloads, they found that it out-performed the largest possible single-chip GPU by about 45.5%. It was also faster than the multi-card equivalent by 26.8%. They then scaled up the single-chip design until it had the same number of compute units as the multi-die design, even though this wouldn't work in the real world because no fab could actually lithograph it. Regardless, that hypothetical, impossible design was only ~10% faster than the actually-possible multi-chip one, showing that the overhead of splitting the design is only around that much, according to their simulation.
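To keep those three comparisons straight, here's the same data normalized to the largest buildable single-chip GPU. The percentages are the ones quoted above; the composition is just my arithmetic.

```python
# Normalize everything to the largest buildable monolithic GPU = 1.00x.
buildable_monolithic = 1.00
mcm = buildable_monolithic * 1.455   # multi-die design: ~45.5% faster
hypothetical = mcm * 1.10            # unbuildable scaled-up die: ~10% faster still
multi_card = mcm / 1.268             # the MCM beats multi-card by ~26.8%

print(f"Multi-die (MCM):          {mcm:.2f}x")
print(f"Hypothetical monolithic:  {hypothetical:.2f}x")
print(f"Multi-card:               {multi_card:.2f}x")
```

So, roughly: multi-card lands at ~1.15x, the buildable single chip at 1.00x, the MCM at ~1.46x, and the impossible mega-die at ~1.60x.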
I don't know how the design would automatically account for fetching data that's associated with other GPU modules, as this would probably be a huge stall. While NVIDIA's simulations, run on 48 different benchmarks, have accounted for this, I still can't visualize how it would work in an automated way. The paper touches on the topic several times, but I didn't really see anything explicit about what they were doing. That said, they spent quite a bit of time discussing how much bandwidth is required within the package, and figures of 768 GB/s to 3 TB/s were mentioned, so it's possible that it's just the same tricks as fetching from global memory. If you've been following the site over the last couple of months, you'll note that this is basically the same thing AMD is doing with Threadripper and EPYC. The main difference is that CPU cores are isolated, so sharing data between them is explicit.
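To illustrate why the automatic part worries me, here's a toy model of a flat address space interleaved across modules. The page-granularity round-robin mapping is a strawman of my own, not the scheme from the paper.

```python
MODULES = 4
PAGE_BYTES = 4096  # assumed interleaving granularity, purely illustrative

def home_module(addr: int) -> int:
    """Toy mapping: which GPU module's HBM backs a given address.

    In a flat global address space the kernel just issues a load and the
    hardware decides where the data lives; here, pages are dealt out
    round-robin across the four modules.
    """
    return (addr // PAGE_BYTES) % MODULES

# A thread block running on module 0 walks a 64 KB buffer: how many of
# the pages it touches live on some other module's HBM?
pages = range(0, 64 * 1024, PAGE_BYTES)
remote = sum(home_module(addr) != 0 for addr in pages)
print(f"{remote} of {len(pages)} pages are remote to module 0")
```

With naive interleaving like this, three quarters of the fetches cross the package links, which is exactly the kind of stall-prone traffic those 768 GB/s to 3 TB/s figures would have to absorb.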