Dumping ground for benchmarks on GPUs

I don’t trust the magic numbers in the datasheets. The script just multiplies a bunch of random matrices of a given size and precision and reports the measured throughput.
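
For reference, here is a minimal PyTorch sketch of this kind of benchmark (not necessarily the exact script behind the numbers below). It assumes the "+TF32" rows toggle `torch.backends.cuda.matmul.allow_tf32` and the "FP16_REDUCED" rows toggle `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`; the function name `benchmark_matmul` and its parameters are made up for illustration. Throughput comes from the usual ~2n³ FLOPs for an n×n matmul.

```python
import torch

def benchmark_matmul(n, dtype, allow_tf32=False, reduced_fp16=False,
                     iters=20, warmup=5):
    """Time n x n matmuls on random data; return TFLOP/s. (Hypothetical helper.)"""
    # TF32 affects FP32 matmuls; reduced-precision reduction affects FP16 matmuls.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = reduced_fp16

    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    for _ in range(warmup):  # let cuBLAS settle on kernels before timing
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time() is in ms
    return 2 * n ** 3 / seconds / 1e12  # one n x n matmul costs ~2n^3 FLOPs

if __name__ == "__main__":
    configs = [
        ("FP32",         torch.float32,  {}),
        ("FP32+TF32",    torch.float32,  {"allow_tf32": True}),
        ("FP16",         torch.float16,  {}),
        ("FP16_REDUCED", torch.float16,  {"reduced_fp16": True}),
        ("BF16",         torch.bfloat16, {}),
    ]
    for n in (1024, 2048, 4096, 8192):
        for label, dtype, flags in configs:
            tflops = benchmark_matmul(n, dtype, **flags)
            print(f"{n:5d}  {label:13s}{tflops:7.2f} TFLOP/s")
```

Repeating each measurement over several runs and reporting the mean and spread gives numbers in the shape of the tables below.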

NVIDIA RTX A4000
Size  Precision     TFLOP/s
1024  FP32            9.70 ± 0.19
1024  FP32+TF32      18.45 ± 0.18
1024  FP16           35.59 ± 1.80
1024  FP16+TF32      37.56 ± 0.62
1024  FP16_REDUCED   38.20 ± 0.18
1024  BF16           34.71 ± 1.62
1024  BF16+TF32      36.17 ± 0.29
2048  FP32           11.41 ± 0.10
2048  FP32+TF32      29.47 ± 0.61
2048  FP16           57.91 ± 0.21
2048  FP16+TF32      54.99 ± 0.05
2048  FP16_REDUCED   54.89 ± 0.11
2048  BF16           55.04 ± 0.10
2048  BF16+TF32      55.12 ± 0.05
4096  FP32            9.07 ± 0.35
4096  FP32+TF32      30.33 ± 0.22
4096  FP16           65.45 ± 0.03
4096  FP16+TF32      65.88 ± 0.31
4096  FP16_REDUCED   63.56 ± 0.10
4096  BF16           65.33 ± 0.73
4096  BF16+TF32      64.83 ± 0.07
8192  FP32           11.84 ± 0.07
8192  FP32+TF32      30.27 ± 0.46
8192  FP16           60.78 ± 1.12
8192  FP16+TF32      61.47 ± 0.66
8192  FP16_REDUCED   61.00 ± 0.93
8192  BF16           59.32 ± 0.76
8192  BF16+TF32      57.89 ± 0.43

NVIDIA GeForce RTX 3090
Size  Precision     TFLOP/s
1024  FP32           17.18 ± 0.50
1024  FP32+TF32      22.42 ± 0.80
1024  FP16           36.30 ± 1.90
1024  FP16+TF32      38.56 ± 0.61
1024  FP16_REDUCED   39.17 ± 0.13
1024  BF16           36.75 ± 2.05
1024  BF16+TF32      39.26 ± 0.15
2048  FP32           23.22 ± 0.47
2048  FP32+TF32      27.73 ± 0.59
2048  FP16           54.59 ± 0.12
2048  FP16+TF32      54.29 ± 0.04
2048  FP16_REDUCED   61.00 ± 0.08
2048  BF16           60.83 ± 0.08
2048  BF16+TF32      61.44 ± 0.27
4096  FP32           23.17 ± 0.65
4096  FP32+TF32      30.30 ± 0.47
4096  FP16           66.67 ± 0.50
4096  FP16+TF32      66.65 ± 0.07
4096  FP16_REDUCED   67.63 ± 0.01
4096  BF16           67.81 ± 0.11
4096  BF16+TF32      67.25 ± 0.11
8192  FP32           21.67 ± 0.18
8192  FP32+TF32      36.24 ± 0.27
8192  FP16           64.77 ± 1.16
8192  FP16+TF32      64.08 ± 0.35
8192  FP16_REDUCED   62.99 ± 0.60
8192  BF16           66.12 ± 0.31
8192  BF16+TF32      68.06 ± 0.43

NVIDIA A40
Size  Precision     TFLOP/s
1024  FP32           14.79 ± 0.39
1024  FP32+TF32      32.12 ± 2.10
1024  FP16           49.54 ± 3.30
1024  FP16+TF32      53.04 ± 0.31
1024  FP16_REDUCED   52.95 ± 0.46
1024  BF16           47.15 ± 3.09
1024  BF16+TF32      44.51 ± 2.10
2048  FP32           20.29 ± 0.33
2048  FP32+TF32      44.98 ± 1.16
2048  FP16           93.13 ± 0.38
2048  FP16+TF32      90.48 ± 0.93
2048  FP16_REDUCED   88.76 ± 0.27
2048  BF16           88.96 ± 0.36
2048  BF16+TF32      89.25 ± 0.31
4096  FP32           22.98 ± 0.09
4096  FP32+TF32      55.94 ± 0.74
4096  FP16          111.99 ± 0.20
4096  FP16+TF32     114.65 ± 0.24
4096  FP16_REDUCED  114.80 ± 0.24
4096  BF16          114.89 ± 0.30
4096  BF16+TF32     114.90 ± 0.25
8192  FP32           22.83 ± 0.05
8192  FP32+TF32      59.55 ± 0.17
8192  FP16           79.35 ± 0.76
8192  FP16+TF32      79.39 ± 0.55
8192  FP16_REDUCED   79.54 ± 0.54
8192  BF16          113.85 ± 1.36
8192  BF16+TF32     112.31 ± 1.25

NVIDIA A100 80GB PCIe
Size  Precision     TFLOP/s
1024  FP32           14.53 ± 0.22
1024  FP32+TF32      42.12 ± 3.65
1024  FP16           61.55 ± 4.43
1024  FP16+TF32      66.37 ± 0.67
1024  FP16_REDUCED   66.90 ± 0.65
1024  BF16           60.51 ± 5.38
1024  BF16+TF32      66.36 ± 0.55
2048  FP32           17.02 ± 0.25
2048  FP32+TF32      86.62 ± 4.41
2048  FP16          191.27 ± 2.31
2048  FP16+TF32     194.15 ± 0.99
2048  FP16_REDUCED  193.76 ± 0.86
2048  BF16          174.12 ± 0.73
2048  BF16+TF32     176.25 ± 0.55
4096  FP32           18.65 ± 0.06
4096  FP32+TF32     120.11 ± 5.62
4096  FP16          246.15 ± 0.27
4096  FP16+TF32     245.20 ± 0.30
4096  FP16_REDUCED  245.46 ± 0.34
4096  BF16          249.14 ± 0.32
4096  BF16+TF32     249.27 ± 0.31
8192  FP32           17.35 ± 1.68
8192  FP32+TF32     121.59 ± 2.78
8192  FP16          234.64 ± 0.82
8192  FP16+TF32     232.71 ± 0.46
8192  FP16_REDUCED  233.38 ± 0.42
8192  BF16          237.60 ± 0.46
8192  BF16+TF32     244.52 ± 1.93