I don’t trust the magic numbers in the datasheets, so I measured them myself. The script just matmuls a bunch of random matrices of a given size and precision and reports the achieved throughput.
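The precision labels read like PyTorch matmul settings: "+TF32" presumably toggles `torch.backends.cuda.matmul.allow_tf32`, and FP16_REDUCED presumably allows FP16 (rather than FP32) accumulation via `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`. Here's a minimal sketch of that kind of benchmark, assuming PyTorch with CUDA events for timing; the actual script isn't shown, so all names and parameters here are illustrative:

```python
import torch

def matmul_tflops(n: int, dtype: torch.dtype,
                  allow_tf32: bool = False,
                  fp16_reduced: bool = False,
                  iters: int = 30, warmup: int = 5) -> float:
    # Tensor-core knobs: TF32 affects FP32 matmuls on Ampere and newer;
    # the reduced-precision flag lets FP16 matmuls accumulate in FP16.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = fp16_reduced

    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    for _ in range(warmup):  # warm up kernels and clocks before timing
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time() is in ms
    return 2 * n**3 / seconds / 1e12                 # 2*n^3 FLOPs per n-by-n matmul

if __name__ == "__main__":
    for dtype in (torch.float32, torch.float16, torch.bfloat16):
        print(dtype, f"{matmul_tflops(4096, dtype):.2f} TFLOPS")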
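```

Each configuration is run multiple times; I take the ± values below to be the spread across repeated runs.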
### NVIDIA RTX A4000
Size | Precision | TFLOPS |
---- | --------- | ------ |
1024 | FP32 | 9.70 ± 0.19 |
1024 | FP32+TF32 | 18.45 ± 0.18 |
1024 | FP16 | 35.59 ± 1.80 |
1024 | FP16+TF32 | 37.56 ± 0.62 |
1024 | FP16_REDUCED | 38.20 ± 0.18 |
1024 | BF16 | 34.71 ± 1.62 |
1024 | BF16+TF32 | 36.17 ± 0.29 |
2048 | FP32 | 11.41 ± 0.10 |
2048 | FP32+TF32 | 29.47 ± 0.61 |
2048 | FP16 | 57.91 ± 0.21 |
2048 | FP16+TF32 | 54.99 ± 0.05 |
2048 | FP16_REDUCED | 54.89 ± 0.11 |
2048 | BF16 | 55.04 ± 0.10 |
2048 | BF16+TF32 | 55.12 ± 0.05 |
4096 | FP32 | 9.07 ± 0.35 |
4096 | FP32+TF32 | 30.33 ± 0.22 |
4096 | FP16 | 65.45 ± 0.03 |
4096 | FP16+TF32 | 65.88 ± 0.31 |
4096 | FP16_REDUCED | 63.56 ± 0.10 |
4096 | BF16 | 65.33 ± 0.73 |
4096 | BF16+TF32 | 64.83 ± 0.07 |
8192 | FP32 | 11.84 ± 0.07 |
8192 | FP32+TF32 | 30.27 ± 0.46 |
8192 | FP16 | 60.78 ± 1.12 |
8192 | FP16+TF32 | 61.47 ± 0.66 |
8192 | FP16_REDUCED | 61.00 ± 0.93 |
8192 | BF16 | 59.32 ± 0.76 |
8192 | BF16+TF32 | 57.89 ± 0.43 |
### NVIDIA GeForce RTX 3090
Size | Precision | TFLOPS |
---- | --------- | ------ |
1024 | FP32 | 17.18 ± 0.50 |
1024 | FP32+TF32 | 22.42 ± 0.80 |
1024 | FP16 | 36.30 ± 1.90 |
1024 | FP16+TF32 | 38.56 ± 0.61 |
1024 | FP16_REDUCED | 39.17 ± 0.13 |
1024 | BF16 | 36.75 ± 2.05 |
1024 | BF16+TF32 | 39.26 ± 0.15 |
2048 | FP32 | 23.22 ± 0.47 |
2048 | FP32+TF32 | 27.73 ± 0.59 |
2048 | FP16 | 54.59 ± 0.12 |
2048 | FP16+TF32 | 54.29 ± 0.04 |
2048 | FP16_REDUCED | 61.00 ± 0.08 |
2048 | BF16 | 60.83 ± 0.08 |
2048 | BF16+TF32 | 61.44 ± 0.27 |
4096 | FP32 | 23.17 ± 0.65 |
4096 | FP32+TF32 | 30.30 ± 0.47 |
4096 | FP16 | 66.67 ± 0.50 |
4096 | FP16+TF32 | 66.65 ± 0.07 |
4096 | FP16_REDUCED | 67.63 ± 0.01 |
4096 | BF16 | 67.81 ± 0.11 |
4096 | BF16+TF32 | 67.25 ± 0.11 |
8192 | FP32 | 21.67 ± 0.18 |
8192 | FP32+TF32 | 36.24 ± 0.27 |
8192 | FP16 | 64.77 ± 1.16 |
8192 | FP16+TF32 | 64.08 ± 0.35 |
8192 | FP16_REDUCED | 62.99 ± 0.60 |
8192 | BF16 | 66.12 ± 0.31 |
8192 | BF16+TF32 | 68.06 ± 0.43 |
### NVIDIA A40
Size | Precision | TFLOPS |
---- | --------- | ------ |
1024 | FP32 | 14.79 ± 0.39 |
1024 | FP32+TF32 | 32.12 ± 2.10 |
1024 | FP16 | 49.54 ± 3.30 |
1024 | FP16+TF32 | 53.04 ± 0.31 |
1024 | FP16_REDUCED | 52.95 ± 0.46 |
1024 | BF16 | 47.15 ± 3.09 |
1024 | BF16+TF32 | 44.51 ± 2.10 |
2048 | FP32 | 20.29 ± 0.33 |
2048 | FP32+TF32 | 44.98 ± 1.16 |
2048 | FP16 | 93.13 ± 0.38 |
2048 | FP16+TF32 | 90.48 ± 0.93 |
2048 | FP16_REDUCED | 88.76 ± 0.27 |
2048 | BF16 | 88.96 ± 0.36 |
2048 | BF16+TF32 | 89.25 ± 0.31 |
4096 | FP32 | 22.98 ± 0.09 |
4096 | FP32+TF32 | 55.94 ± 0.74 |
4096 | FP16 | 111.99 ± 0.20 |
4096 | FP16+TF32 | 114.65 ± 0.24 |
4096 | FP16_REDUCED | 114.80 ± 0.24 |
4096 | BF16 | 114.89 ± 0.30 |
4096 | BF16+TF32 | 114.90 ± 0.25 |
8192 | FP32 | 22.83 ± 0.05 |
8192 | FP32+TF32 | 59.55 ± 0.17 |
8192 | FP16 | 79.35 ± 0.76 |
8192 | FP16+TF32 | 79.39 ± 0.55 |
8192 | FP16_REDUCED | 79.54 ± 0.54 |
8192 | BF16 | 113.85 ± 1.36 |
8192 | BF16+TF32 | 112.31 ± 1.25 |
### NVIDIA A100 80GB PCIe
Size | Precision | TFLOPS |
---- | --------- | ------ |
1024 | FP32 | 14.53 ± 0.22 |
1024 | FP32+TF32 | 42.12 ± 3.65 |
1024 | FP16 | 61.55 ± 4.43 |
1024 | FP16+TF32 | 66.37 ± 0.67 |
1024 | FP16_REDUCED | 66.90 ± 0.65 |
1024 | BF16 | 60.51 ± 5.38 |
1024 | BF16+TF32 | 66.36 ± 0.55 |
2048 | FP32 | 17.02 ± 0.25 |
2048 | FP32+TF32 | 86.62 ± 4.41 |
2048 | FP16 | 191.27 ± 2.31 |
2048 | FP16+TF32 | 194.15 ± 0.99 |
2048 | FP16_REDUCED | 193.76 ± 0.86 |
2048 | BF16 | 174.12 ± 0.73 |
2048 | BF16+TF32 | 176.25 ± 0.55 |
4096 | FP32 | 18.65 ± 0.06 |
4096 | FP32+TF32 | 120.11 ± 5.62 |
4096 | FP16 | 246.15 ± 0.27 |
4096 | FP16+TF32 | 245.20 ± 0.30 |
4096 | FP16_REDUCED | 245.46 ± 0.34 |
4096 | BF16 | 249.14 ± 0.32 |
4096 | BF16+TF32 | 249.27 ± 0.31 |
8192 | FP32 | 17.35 ± 1.68 |
8192 | FP32+TF32 | 121.59 ± 2.78 |
8192 | FP16 | 234.64 ± 0.82 |
8192 | FP16+TF32 | 232.71 ± 0.46 |
8192 | FP16_REDUCED | 233.38 ± 0.42 |
8192 | BF16 | 237.60 ± 0.46 |
8192 | BF16+TF32 | 244.52 ± 1.93 |