Deep Learning GPU Benchmarks

GPU training/inference speeds using PyTorch®/TensorFlow for computer vision (CV), NLP, text-to-speech (TTS), etc.

PyTorch Training GPU Benchmarks

[Chart: Relative Training Throughput w.r.t. 1x Lambda Cloud V100 16GB (All Models); data in the table below]

GPU  Speedup (vs. 1x V100 16GB)
LambdaCloud 1x H100 80GB SXM5 24.03 8.54
H100 80GB SXM5 7.78
LambdaCloud 1x H100 80GB PCIe 24.03 5.73
H100 80GB PCIe Gen5 5.55
LambdaCloud H100 80GB PCIe Gen5 5.31
A100 80GB SXM4 4.64
A100 80GB PCIe 4.45
LambdaCloud A100 40GB PCIe 3.56
RTX 4090 2.94
RTX 6000 Ada 2.82
RTX A6000 2.15
RTX 3090 1.82
LambdaCloud A10 1.39
LambdaCloud V100 16GB 1
Quadro RTX 8000 0.96

PyTorch Training GPU Benchmarks 2022

[Chart: Relative Training Throughput w.r.t. 1x V100 32GB (All Models); data in the table below]

GPU  Speedup (vs. 1x V100 32GB)
A100 80GB SXM4 3.89
A100 80GB PCIe 3.76
A100 40GB SXM4 3.1
A100 40GB PCIe 2.85
RTX A6000 1.83
Lambda Cloud — RTX A6000 1.8
RTX A5500 1.53
RTX 3090 1.49
RTX A40 1.36
RTX A5000 1.19
RTX A4500 1.1
V100 32GB 1
Quadro RTX 8000 0.88
RTX 3080 0.86
Titan RTX 0.85
Quadro RTX 6000 0.83
RTX A4000 0.75
RTX 2080Ti 0.66
RTX 3080 Max-Q 0.58
Quadro RTX 5000 0.55
GTX 1080Ti 0.5
RTX 3070 0.49
RTX 2080 SUPER MAX-Q 0.37
RTX 2080 MAX-Q 0.34
RTX 2070 MAX-Q 0.33

YOLOv5 Inference GPU Benchmarks

[Chart: Relative Inference Latency w.r.t. 1x RTX 8000 (All Models); data in the table below]

GPU  Relative Latency (vs. 1x RTX 8000)
RTX 8000 1
3080 0.94
A100 80GB PCIe 0.73
RTX A6000 0.7

GPU Benchmark Methodology

To measure the relative effectiveness of GPUs for training neural networks, we use training throughput as the yardstick. Training throughput is the number of samples (e.g., tokens or images) processed per second by the GPU.

Using throughput instead of peak Floating Point Operations per Second (FLOPS) ties GPU performance to the actual work of training neural networks. Training throughput is strongly correlated with time to solution: the higher the throughput, the faster the GPU can push a dataset through the model and the sooner training converges.
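
As a concrete illustration, training throughput can be measured by counting the samples pushed through full forward/backward/optimizer steps and dividing by wall-clock time. The sketch below is a minimal, self-contained example with a placeholder model and synthetic data; it is not Lambda's benchmark code.

```python
# Minimal throughput sketch: a placeholder model and synthetic data stand in
# for the real benchmark models and dataloaders.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch_size, num_steps = 64, 50
samples = 0

if device == "cuda":
    torch.cuda.synchronize()  # start timing only after pending GPU work finishes
start = time.perf_counter()
for _ in range(num_steps):
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    samples += batch_size
if device == "cuda":
    torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
elapsed = time.perf_counter() - start

print(f"training throughput: {samples / elapsed:.1f} samples/sec")
```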

To maximize training throughput, it's important to saturate GPU resources with large batch sizes, use faster GPUs, or parallelize training across multiple GPUs. It's also important to benchmark with state-of-the-art (SOTA) model implementations across frameworks, since throughput depends heavily on how a model is implemented.
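
For the first of those levers, one rough way to pick a batch size that saturates the GPU is to keep doubling it until a training step no longer fits in memory. This is a hedged sketch with a placeholder model, not the harness used for the published numbers.

```python
# Doubling probe for the largest batch size that completes a training step.
# The model here is a stand-in; a real probe would use the benchmark model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def step_fits(model, batch_size, device="cuda"):
    """Return True if one forward/backward pass fits in GPU memory."""
    try:
        x = torch.randn(batch_size, 3, 224, 224, device=device)
        y = torch.randint(0, 1000, (batch_size,), device=device)
        F.cross_entropy(model(x), y).backward()
        model.zero_grad(set_to_none=True)
        return True
    except RuntimeError as e:
        if "out of memory" in str(e):
            torch.cuda.empty_cache()  # release cached memory from the failed step
            return False
        raise

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda()
largest_ok, candidate = 0, 16
while step_fits(model, candidate):
    largest_ok = candidate
    candidate *= 2  # keep doubling until a step no longer fits
print(f"largest batch size that fit: {largest_ok}")
```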


PyTorch®

We are working on new benchmarks using the same software version across all GPUs. Lambda's PyTorch® benchmark code is available here.

The 2023 benchmarks were run using NGC's PyTorch® 22.10 Docker image with Ubuntu 20.04, PyTorch® 1.13.0a0+d0d6b1f, CUDA 11.8.0, cuDNN 8.6.0.163, NVIDIA driver 520.61.05, and our fork of NVIDIA's optimized model implementations.

The 2022 benchmarks were run using NGC's PyTorch® 21.07 Docker image with Ubuntu 20.04, PyTorch® 1.10.0a0+ecc3718, CUDA 11.4.0, cuDNN 8.2.2.26, NVIDIA driver 470, and NVIDIA's optimized model implementations inside the NGC container.
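
When reproducing these runs, it can help to confirm that the container actually exposes the versions listed above. A small sketch using standard PyTorch® introspection (not part of the benchmark suite):

```python
# Quick environment check; the printed versions should match the stack described
# above (e.g. PyTorch 1.13.0a0+d0d6b1f, CUDA 11.8, cuDNN 8.6 for the 2023 runs).
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```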

PyTorch® is a registered trademark of The Linux Foundation. pytorch.org/


YOLOv5

YOLOv5 is a family of SOTA object detection architectures and models pretrained by Ultralytics. We use the open-source implementation in this repo to benchmark the inference latency of YOLOv5 models across various GPUs and model formats (PyTorch®, TorchScript, ONNX, TensorRT, TensorFlow, TensorFlow GraphDef). Details on input resolutions and model accuracies can be found here.
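
To give a sense of the measurement, the sketch below times YOLOv5s inference in its native PyTorch® format by averaging repeated forward passes after a warm-up. The published numbers come from the Ultralytics benchmarking tooling across all of the formats listed above, so treat this only as an illustration; loading via torch.hub downloads the yolov5s weights.

```python
# Rough latency sketch for YOLOv5s in PyTorch format: warm-up, then averaged timing.
import time
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model = model.to("cuda").eval()

img = torch.zeros(1, 3, 640, 640, device="cuda")  # dummy 640x640 input

with torch.no_grad():
    for _ in range(10):  # warm-up so CUDA init and first-run overhead aren't timed
        model(img)
    torch.cuda.synchronize()
    start = time.perf_counter()
    runs = 100
    for _ in range(runs):
        model(img)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"mean inference latency: {latency_ms:.2f} ms")
```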