# Dumping ground for GPU benchmarks

I don’t trust the magic numbers in the datasheets, so I measure instead. The script just multiplies pairs of random square matrices of a given size and precision and reports the achieved throughput in TFLOP/s (mean ± variation across repeated runs).
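The benchmark script itself isn't reproduced here, so as a rough sketch of the method, here is a minimal CPU-side version in NumPy. It counts a square matmul as 2·n³ floating-point operations and converts wall-clock time into TFLOP/s. The function name, iteration count, and warm-up step are illustrative choices, not taken from the actual script; the GPU runs presumably go through cuBLAS (e.g. via PyTorch) and would additionally need device synchronization around the timers.

```python
import time
import numpy as np

def bench_matmul(n, dtype=np.float32, iters=10):
    """Time n x n matmuls and return (mean, std) throughput in TFLOP/s."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up: exclude one-time setup cost from the timed runs

    tflops = []
    for _ in range(iters):
        t0 = time.perf_counter()
        a @ b
        dt = time.perf_counter() - t0
        # A dense n x n matmul does ~2*n^3 FLOPs (n multiplies + n adds
        # per output element, n^2 output elements).
        tflops.append(2 * n**3 / dt / 1e12)

    tflops = np.asarray(tflops)
    return tflops.mean(), tflops.std()

mean, std = bench_matmul(256)
print(f"256  FP32  {mean:.2f} ± {std:.2f}")
```

The precision modes in the tables below map onto GPU-library knobs rather than NumPy dtypes: in PyTorch, for instance, TF32 is toggled with `torch.backends.cuda.matmul.allow_tf32`, and FP16_REDUCED presumably corresponds to `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction` (an assumption on my part, since the script isn't shown).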

## NVIDIA RTX A4000

| Size | Precision    | TFLOP/s      |
|-----:|--------------|--------------|
| 1024 | FP32         | 9.70 ± 0.19  |
| 1024 | FP32+TF32    | 18.45 ± 0.18 |
| 1024 | FP16         | 35.59 ± 1.80 |
| 1024 | FP16+TF32    | 37.56 ± 0.62 |
| 1024 | FP16_REDUCED | 38.20 ± 0.18 |
| 1024 | BF16         | 34.71 ± 1.62 |
| 1024 | BF16+TF32    | 36.17 ± 0.29 |
| 2048 | FP32         | 11.41 ± 0.10 |
| 2048 | FP32+TF32    | 29.47 ± 0.61 |
| 2048 | FP16         | 57.91 ± 0.21 |
| 2048 | FP16+TF32    | 54.99 ± 0.05 |
| 2048 | FP16_REDUCED | 54.89 ± 0.11 |
| 2048 | BF16         | 55.04 ± 0.10 |
| 2048 | BF16+TF32    | 55.12 ± 0.05 |
| 4096 | FP32         | 9.07 ± 0.35  |
| 4096 | FP32+TF32    | 30.33 ± 0.22 |
| 4096 | FP16         | 65.45 ± 0.03 |
| 4096 | FP16+TF32    | 65.88 ± 0.31 |
| 4096 | FP16_REDUCED | 63.56 ± 0.10 |
| 4096 | BF16         | 65.33 ± 0.73 |
| 4096 | BF16+TF32    | 64.83 ± 0.07 |
| 8192 | FP32         | 11.84 ± 0.07 |
| 8192 | FP32+TF32    | 30.27 ± 0.46 |
| 8192 | FP16         | 60.78 ± 1.12 |
| 8192 | FP16+TF32    | 61.47 ± 0.66 |
| 8192 | FP16_REDUCED | 61.00 ± 0.93 |
| 8192 | BF16         | 59.32 ± 0.76 |
| 8192 | BF16+TF32    | 57.89 ± 0.43 |

## NVIDIA GeForce RTX 3090

| Size | Precision    | TFLOP/s      |
|-----:|--------------|--------------|
| 1024 | FP32         | 17.18 ± 0.50 |
| 1024 | FP32+TF32    | 22.42 ± 0.80 |
| 1024 | FP16         | 36.30 ± 1.90 |
| 1024 | FP16+TF32    | 38.56 ± 0.61 |
| 1024 | FP16_REDUCED | 39.17 ± 0.13 |
| 1024 | BF16         | 36.75 ± 2.05 |
| 1024 | BF16+TF32    | 39.26 ± 0.15 |
| 2048 | FP32         | 23.22 ± 0.47 |
| 2048 | FP32+TF32    | 27.73 ± 0.59 |
| 2048 | FP16         | 54.59 ± 0.12 |
| 2048 | FP16+TF32    | 54.29 ± 0.04 |
| 2048 | FP16_REDUCED | 61.00 ± 0.08 |
| 2048 | BF16         | 60.83 ± 0.08 |
| 2048 | BF16+TF32    | 61.44 ± 0.27 |
| 4096 | FP32         | 23.17 ± 0.65 |
| 4096 | FP32+TF32    | 30.30 ± 0.47 |
| 4096 | FP16         | 66.67 ± 0.50 |
| 4096 | FP16+TF32    | 66.65 ± 0.07 |
| 4096 | FP16_REDUCED | 67.63 ± 0.01 |
| 4096 | BF16         | 67.81 ± 0.11 |
| 4096 | BF16+TF32    | 67.25 ± 0.11 |
| 8192 | FP32         | 21.67 ± 0.18 |
| 8192 | FP32+TF32    | 36.24 ± 0.27 |
| 8192 | FP16         | 64.77 ± 1.16 |
| 8192 | FP16+TF32    | 64.08 ± 0.35 |
| 8192 | FP16_REDUCED | 62.99 ± 0.60 |
| 8192 | BF16         | 66.12 ± 0.31 |
| 8192 | BF16+TF32    | 68.06 ± 0.43 |

## NVIDIA A40

| Size | Precision    | TFLOP/s       |
|-----:|--------------|---------------|
| 1024 | FP32         | 14.79 ± 0.39  |
| 1024 | FP32+TF32    | 32.12 ± 2.10  |
| 1024 | FP16         | 49.54 ± 3.30  |
| 1024 | FP16+TF32    | 53.04 ± 0.31  |
| 1024 | FP16_REDUCED | 52.95 ± 0.46  |
| 1024 | BF16         | 47.15 ± 3.09  |
| 1024 | BF16+TF32    | 44.51 ± 2.10  |
| 2048 | FP32         | 20.29 ± 0.33  |
| 2048 | FP32+TF32    | 44.98 ± 1.16  |
| 2048 | FP16         | 93.13 ± 0.38  |
| 2048 | FP16+TF32    | 90.48 ± 0.93  |
| 2048 | FP16_REDUCED | 88.76 ± 0.27  |
| 2048 | BF16         | 88.96 ± 0.36  |
| 2048 | BF16+TF32    | 89.25 ± 0.31  |
| 4096 | FP32         | 22.98 ± 0.09  |
| 4096 | FP32+TF32    | 55.94 ± 0.74  |
| 4096 | FP16         | 111.99 ± 0.20 |
| 4096 | FP16+TF32    | 114.65 ± 0.24 |
| 4096 | FP16_REDUCED | 114.80 ± 0.24 |
| 4096 | BF16         | 114.89 ± 0.30 |
| 4096 | BF16+TF32    | 114.90 ± 0.25 |
| 8192 | FP32         | 22.83 ± 0.05  |
| 8192 | FP32+TF32    | 59.55 ± 0.17  |
| 8192 | FP16         | 79.35 ± 0.76  |
| 8192 | FP16+TF32    | 79.39 ± 0.55  |
| 8192 | FP16_REDUCED | 79.54 ± 0.54  |
| 8192 | BF16         | 113.85 ± 1.36 |
| 8192 | BF16+TF32    | 112.31 ± 1.25 |