NVIDIA Hopper H100 GPU shown in all its glory: the world’s fastest 4nm GPU and the world’s first with HBM3 memory

NVIDIA's flagship data center GPU, the Hopper H100, has been pictured in all its glory. (Image Credits: CNET)

At GTC 2022, NVIDIA unveiled its Hopper H100 GPU, a computing powerhouse designed for the next generation of data centers. It's been a while since we've talked about this powerful chip, but it now looks like NVIDIA has given select media a close-up look at its flagship chip.

NVIDIA Hopper H100 GPU: the first chip with 4nm technology and HBM3 memory, pictured in high resolution

CNET managed to get its hands on not only the graphics card that the H100 GPU comes integrated on, but also the chip itself. The H100 is a massive chip built on the latest 4nm process, packing 80 billion transistors and the latest HBM3 memory technology. According to the outlet, the H100 is built on a PG520 PCB that carries over 30 power VRMs and a massive integrated interposer that uses TSMC's CoWoS technology to combine the Hopper GPU with a 6-stack HBM3 design.

NVIDIA Hopper H100 GPU pictured (Image credits: CNET)

Of the six stacks on the package, one is disabled on the shipping SXM5 part to preserve yields, leaving five active stacks. But the new HBM3 standard allows capacities of up to 80 GB at speeds of 3 TB/s, which is crazy. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers just 1 TB/s of bandwidth and 24 GB of VRAM capacity. Beyond that, the H100 Hopper GPU also incorporates the latest FP8 data format, and through its new SXM connection it helps accommodate the 700W power design the chip is built around.
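For the curious, peak memory bandwidth falls out of bus width and per-pin data rate. Here is a minimal Python sketch; the 3090 Ti's 21 Gbps GDDR6X rate is public, while the roughly 4.7 Gbps effective HBM3 rate is simply back-solved from the quoted 3 TB/s and should be read as an assumption:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8

# RTX 3090 Ti: 384-bit GDDR6X at 21 Gbps per pin
print(peak_bandwidth_gbs(384, 21.0))   # 1008.0 GB/s (~1 TB/s)

# H100 SXM5: 5120-bit HBM3; ~4.7 Gbps effective back-solves to the quoted 3 TB/s
print(peak_bandwidth_gbs(5120, 4.7))   # 3008.0 GB/s (~3 TB/s)
```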

NVIDIA Hopper H100 GPU Specs at a Glance

As for the specs, the NVIDIA Hopper GH100 GPU is composed of a massive 144 SM (Streaming Multiprocessor) configuration spread across a total of 8 GPCs. Each GPC houses 9 TPCs, each of which is in turn composed of 2 SM units. That gives 18 SMs per GPC and 144 SMs in the full 8-GPC configuration. Each SM carries up to 128 FP32 units, for a total of 18,432 CUDA cores (a short sketch after the spec lists below works through this arithmetic). Here are some of the configurations you can expect from the H100 chip:

The full GH100 GPU implementation includes the following units:

  • 8 GPC, 72 TPC (9 TPC/GPC), 2 SM/TPC, 144 SM per full GPU
  • 128 FP32 CUDA cores per SM, 18,432 FP32 CUDA cores per full GPU
  • 4 fourth-generation Tensor cores per SM, 576 per full GPU
  • 6 HBM3 or HBM2e stacks, 12 512-bit memory controllers
  • 60 MB L2 cache
  • Gen 4 NVLink and PCIe Gen 5

The NVIDIA H100 GPU with SXM5 card form factor includes the following units:

  • 8 GPC, 66 TPC, 2 SM/TPC, 132 SM per GPU
  • 128 FP32 CUDA cores per SM, 16,896 FP32 CUDA cores per GPU
  • 4 fourth-generation Tensor cores per SM, 528 per GPU
  • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit memory controllers
  • 50 MB L2 cache
  • Gen 4 NVLink and PCIe Gen 5
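As promised above, here is a minimal Python sketch that derives both core counts from the GPC/TPC/SM hierarchy; the helper and its names are illustrative, not NVIDIA tooling:

```python
FP32_CORES_PER_SM = 128  # Hopper SM width, per the lists above

def fp32_cuda_cores(gpcs: int, tpcs_per_gpc: int, sms_per_tpc: int = 2,
                    disabled_sms: int = 0) -> int:
    """FP32 CUDA core count from the GPC -> TPC -> SM hierarchy."""
    sms = gpcs * tpcs_per_gpc * sms_per_tpc - disabled_sms
    return sms * FP32_CORES_PER_SM

print(fp32_cuda_cores(8, 9))                   # full GH100: 144 SMs -> 18432
print(fp32_cuda_cores(8, 9, disabled_sms=12))  # H100 SXM5: 132 SMs -> 16896
print(18432 / 8192)                            # 2.25x the full GA100's 8192 FP32 cores
```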

In FP32 core count, this is a 2.25x increase over the complete GA100 GPU configuration (8,192 FP32 cores on the full die). NVIDIA is also packing more FP64, FP16, and Tensor cores into its Hopper GPU, which should boost performance immensely. And that will be a necessity to compete with Intel's Ponte Vecchio, which is also expected to feature 1:1 FP64.

The cache is another area where NVIDIA has paid a lot of attention, raising it to 50 MB of L2 on the H100 SXM5 (60 MB on the full GH100 die). This is a 25% increase over the Ampere GA100 GPU's 40 MB cache and roughly 3x the size of the cache on AMD's flagship Aldebaran MCM GPU, the MI250X.

NVIDIA GH100 Hopper GPU: Performance Summary

To sum up the performance, the NVIDIA GH100 Hopper GPU will offer up to 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, and 60 TFLOPs of FP64 compute performance. These figures crush every HPC accelerator that came before it. In FP64 compute, that is 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X. In FP16 compute, the H100 is 3x faster than the A100 and 5.2x faster than the MI250X, which is insane.
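As a quick sanity check, the FP16 ratios quoted above fall out of the peak figures cited in this piece. A minimal Python sketch follows; the MI250X's 383 TFLOPs peak FP16 is taken from AMD's public specs and is an assumption here, as is comparing against the A100's sparsity figure:

```python
# Peak FP16 throughput in TFLOPs, per the figures quoted in this article
h100_fp16   = 2000   # H100 SXM5
a100_fp16   = 624    # A100 with sparsity (312 TFLOPs dense)
mi250x_fp16 = 383    # MI250X peak FP16 (assumed from AMD's public specs)

print(f"H100 vs A100:   {h100_fp16 / a100_fp16:.1f}x")    # ~3.2x -> the quoted "3x"
print(f"H100 vs MI250X: {h100_fp16 / mi250x_fp16:.1f}x")  # ~5.2x
```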

The PCIe variant, which is a cut-down model, was recently listed in Japan for over US$30,000, so one can imagine that the SXM variant with its beefier configuration will easily cost around $50,000.

NVIDIA data center GPU specs at a glance:

| Graphics card | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCIe) | Tesla M40 (PCIe) | Tesla K40 (PCIe) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPU | GH100 (Hopper) | GH100 (Hopper) | GA100 (Ampere) | GA100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process node | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 80 billion | 80 billion | 54.2 billion | 54.2 billion | 21.1 billion | 21.1 billion | 15.3 billion | 15.3 billion | 8 billion | 7.1 billion |
| GPU die size | 814mm² | 814mm² | 826mm² | 826mm² | 815mm² | 815mm² | 610mm² | 610mm² | 601mm² | 551mm² |
| SMs | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| FP32 CUDA cores per SM | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA cores per SM | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA cores | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA cores | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor cores | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture units | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost clock | TBD | TBD | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 2000 TOPs (4000 TOPs with sparsity) | 1600 TOPs (3200 TOPs with sparsity) | 1248 TOPs (2496 TOPs with sparsity) | 1248 TOPs (2496 TOPs with sparsity) | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 compute | 2000 TFLOPs | 1600 TFLOPs | 312 TFLOPs (624 TFLOPs with sparsity) | 312 TFLOPs (624 TFLOPs with sparsity) | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 compute | 1000 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 compute | 60 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory interface | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory size | Up to 80 GB HBM3 @ 3.0 Gbps | Up to 80 GB HBM2e @ 2.0 Gbps | Up to 40 GB HBM2 @ 1.6 TB/s, up to 80 GB HBM2 @ 1.6 TB/s | Up to 40 GB HBM2 @ 1.6 TB/s, up to 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s, 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| L2 cache size | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| TDP | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |
