The world's fastest 4nm GPU and the first with HBM3 memory

At GTC 2022, NVIDIA unveiled the Hopper H100 GPU, a powerhouse designed for the next generation of data centers. It has been a while since we last covered this chip, but it seems NVIDIA has now given selected media a close-up look at its flagship.

NVIDIA Hopper H100 GPU: First with 4nm and HBM3 technology, shown in high-resolution images

CNET has managed to get a close look not only at the board on which the H100 GPU is mounted but also at the H100 chip itself. The H100 is a monster chip built on the latest 4nm process, packing 80 billion transistors along with state-of-the-art HBM3 memory. According to the outlet, the H100 is built on the PG520 PCB, which carries over 30 VRM power phases and a huge interposer that uses TSMC's CoWoS technology to combine the Hopper H100 GPU with a 6-stack HBM3 design.


NVIDIA Hopper H100 GPU pictured (Image credits: CNET)

Of the six stacks, two are retained to safeguard yield integrity. The new HBM3 standard allows capacities of up to 80 GB at speeds of 3 TB/s, which is staggering. By comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers just 1 TB/s of bandwidth and 24 GB of VRAM. Beyond that, the Hopper H100 GPU also supports the latest FP8 data format, and its new SXM connection helps accommodate the 700W power design around which the chip is built.
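
As a rough sanity check on those bandwidth figures: peak memory bandwidth is simply bus width times effective per-pin data rate. The snippet below is a minimal sketch; the per-pin rates are assumptions chosen to reproduce the headline numbers, not official specifications.

```python
# Minimal sketch: peak memory bandwidth = bus width (bits) x pin speed (Gbps) / 8.
# The per-pin data rates below are assumptions picked to match the headline figures.

def peak_bandwidth_gb_s(bus_width_bits: int, pin_speed_gbps: float) -> float:
    """Return peak memory bandwidth in GB/s."""
    return bus_width_bits * pin_speed_gbps / 8

# H100 (SXM5): 5120-bit HBM3 interface; ~4.8 Gbps/pin is an assumed effective rate.
h100 = peak_bandwidth_gb_s(5120, 4.8)      # ~3072 GB/s, i.e. ~3 TB/s

# RTX 3090 Ti: 384-bit GDDR6X interface at 21 Gbps/pin.
rtx3090ti = peak_bandwidth_gb_s(384, 21)   # 1008 GB/s, i.e. ~1 TB/s

print(f"H100: {h100:.0f} GB/s, RTX 3090 Ti: {rtx3090ti:.0f} GB/s, "
      f"ratio: {h100 / rtx3090ti:.1f}x")
```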

NVIDIA Hopper H100 GPU Specifications at a Glance

So, in terms of specifications, the NVIDIA Hopper GH100 GPU consists of a massive 144 SM (Streaming Multiprocessor) layout arranged in a total of 8 GPCs. Each GPC houses 9 TPCs, and each TPC in turn consists of 2 SM units. This gives us 18 SMs per GPC and 144 SMs in the full 8-GPC configuration. Each SM packs up to 128 FP32 units, which should give a total of 18,432 CUDA cores. Here are some of the configurations you can expect from the H100 chip (a quick arithmetic sketch follows the lists below):

The full implementation of the GH100 GPU includes the following units:


  • 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
  • 128 FP32 CUDA cores per SM, 18,432 FP32 CUDA cores per full GPU
  • 4 fourth-generation Tensor cores per SM, 576 per full GPU
  • 6 HBM3 or HBM2e stacks, 12 512-bit memory controllers
  • 60 MB L2 cache
  • Fourth-generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU in the SXM5 board form factor includes the following units:

  • 8 GPCs, 66 TPCs, 2 SMs/TPC, 132 SMs per GPU
  • 128 FP32 CUDA cores per SM, 16,896 FP32 CUDA cores per GPU
  • 4 fourth-generation Tensor cores per SM, 528 per GPU
  • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit memory controllers
  • 50 MB L2 cache
  • Fourth-generation NVLink and PCIe Gen 5
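
The core counts above follow directly from the GPC/TPC/SM hierarchy. Here is a minimal sketch of that arithmetic, using only the configuration numbers from the lists above:

```python
# Minimal sketch: derive SM and CUDA core counts from the TPC/SM hierarchy.
# Configuration numbers come from the lists above.

def gpu_totals(tpcs: int, sms_per_tpc: int = 2,
               fp32_per_sm: int = 128, tensor_per_sm: int = 4):
    """Return (SMs, FP32 CUDA cores, Tensor cores) for a given TPC count."""
    sms = tpcs * sms_per_tpc
    return sms, sms * fp32_per_sm, sms * tensor_per_sm

# Full GH100: 8 GPCs x 9 TPCs = 72 TPCs.
print(gpu_totals(72))   # (144, 18432, 576)

# H100 SXM5: 66 enabled TPCs.
print(gpu_totals(66))   # (132, 16896, 528)
```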

This is a 2.25x increase over the full GA100 configuration. NVIDIA is also packing more FP64, FP16 and Tensor cores into the Hopper GPU, which should boost performance. And that will be necessary to compete with Intel's Ponte Vecchio, which is also expected to feature a 1:1 FP64-to-FP32 ratio.

Cache is another area where NVIDIA has paid close attention, raising it to 50 MB on the Hopper GH100 GPU. That is a 25% increase over the 40 MB cache found on the Ampere GA100 GPU and roughly 3 times the size of the L2 on AMD's flagship Aldebaran MCM GPU, the MI250X.

Rounding out the performance figures, the NVIDIA GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 compute performance. These record numbers outclass all previous HPC accelerators. In FP64, that is 3.3 times faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X. In FP16, the H100 GPU is 3 times faster than the A100 and 5.2 times faster than the MI250X, which is simply staggering.
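
Those speedup claims are straightforward ratios of peak-throughput numbers. A minimal sketch follows; the H100 figures come from this article, while the A100 and MI250X baselines are commonly cited peak rates that should be treated as assumptions, so the exact ratios depend on which baseline you pick.

```python
# Minimal sketch: the quoted speedups are ratios of peak TFLOPs figures.
# H100 numbers come from this article; the A100 and MI250X baselines below
# are commonly cited peak rates and should be treated as assumptions.

h100   = {"FP64": 60,   "FP16": 2000}  # TFLOPs (FP16 via Tensor cores)
a100   = {"FP64": 19.5, "FP16": 624}   # TFLOPs (Tensor core peaks, FP16 w/ sparsity)
mi250x = {"FP64": 47.9, "FP16": 383}   # TFLOPs (peak matrix rates)

for precision in ("FP64", "FP16"):
    print(f"{precision}: {h100[precision] / a100[precision]:.1f}x vs A100, "
          f"{h100[precision] / mi250x[precision]:.1f}x vs MI250X")
```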

The PCIe variant, which is a cut-down model, was recently listed in Japan for over $30,000, so one can imagine the better-configured SXM variant easily costing around $50,000.

NVIDIA Tesla / data center GPU specifications compared:

| NVIDIA Tesla graphics card | H100 (SXM5) | H100 (PCIe) | A100 (SXM4) | A100 (PCIe4) | V100S (PCIe) | V100 (SXM2) | P100 (SXM2) | P100 (PCIe) | M40 (PCIe) | K40 (PCIe) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | GH100 (Hopper) | GH100 (Hopper) | GA100 (Ampere) | GA100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process node | 4 nm | 4 nm | 7 nm | 7 nm | 12 nm | 12 nm | 16 nm | 16 nm | 28 nm | 28 nm |
| Transistors | 80 billion | 80 billion | 54.2 billion | 54.2 billion | 21.1 billion | 21.1 billion | 15.3 billion | 15.3 billion | 8 billion | 7.1 billion |
| GPU die size | 814 mm² | 814 mm² | 826 mm² | 826 mm² | 815 mm² | 815 mm² | 610 mm² | 610 mm² | 601 mm² | 551 mm² |
| SMs | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| FP32 CUDA cores per SM | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA cores per SM | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA cores | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA cores | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor cores | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture units | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost clock | TBD | TBD | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 2000 TOPs / 4000 TOPs with sparsity | 1600 TOPs / 3200 TOPs with sparsity | 1248 TOPs / 2496 TOPs with sparsity | 1248 TOPs / 2496 TOPs with sparsity | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 compute | 2000 TFLOPs | 1600 TFLOPs | 312 TFLOPs / 624 TFLOPs with sparsity | 312 TFLOPs / 624 TFLOPs with sparsity | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 compute | 1000 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 compute | 60 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory interface | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory size | Up to 80 GB HBM3 @ 3.0 Gbps | Up to 80 GB HBM2e @ 2.0 Gbps | Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 1.6 TB/s | Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| L2 cache size | 50 MB | 50 MB | 40 MB | 40 MB | 6 MB | 6 MB | 4 MB | 4 MB | 3 MB | 1.5 MB |
| TDP | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |
