I don’t trust the magic numbers in the datasheets.
The script just matmuls a bunch of random matrices of a given size and precision and reports the measured throughput.
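For reference, here is a minimal sketch of that kind of measurement, assuming PyTorch on CUDA (the actual script may differ). An n×n matmul costs 2n³ FLOPs; the "+TF32" rows presumably toggle `torch.backends.cuda.matmul.allow_tf32`, and FP16_REDUCED likely corresponds to the reduced-precision FP16 accumulation switch.

```python
# Hypothetical sketch of the benchmark, not the original script:
# time repeated n x n matmuls at a given dtype and convert to TFLOP/s.
import time
import torch

def bench_matmul(n, dtype, allow_tf32=False, iters=50, warmup=5):
    """Return measured TFLOP/s for repeated n x n matmuls at the given dtype."""
    # Assumption: the "+TF32" table rows flip this switch for FP32 inputs.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(warmup):              # warm up kernels and clocks
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    flops = 2 * n ** 3 * iters           # 2*n^3 FLOPs per n x n matmul
    return flops / elapsed / 1e12

if __name__ == "__main__":
    for n in (1024, 2048, 4096, 8192):
        for label, dtype, tf32 in [("FP32", torch.float32, False),
                                   ("FP32+TF32", torch.float32, True),
                                   ("FP16", torch.float16, False),
                                   ("BF16", torch.bfloat16, False)]:
            print(f"{n:5d}  {label:10s}  {bench_matmul(n, dtype, tf32):6.2f} TFLOP/s")
```

Repeating the loop a few times and taking the spread gives the ± figures in the tables below.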
NVIDIA RTX A4000
| Size | Precision | TFLOP/s |
|------|-----------|---------|
| 1024 | FP32 | 9.70 ± 0.19 |
| 1024 | FP32+TF32 | 18.45 ± 0.18 |
| 1024 | FP16 | 35.59 ± 1.80 |
| 1024 | FP16+TF32 | 37.56 ± 0.62 |
| 1024 | FP16_REDUCED | 38.20 ± 0.18 |
| 1024 | BF16 | 34.71 ± 1.62 |
| 1024 | BF16+TF32 | 36.17 ± 0.29 |
| 2048 | FP32 | 11.41 ± 0.10 |
| 2048 | FP32+TF32 | 29.47 ± 0.61 |
| 2048 | FP16 | 57.91 ± 0.21 |
| 2048 | FP16+TF32 | 54.99 ± 0.05 |
| 2048 | FP16_REDUCED | 54.89 ± 0.11 |
| 2048 | BF16 | 55.04 ± 0.10 |
| 2048 | BF16+TF32 | 55.12 ± 0.05 |
| 4096 | FP32 | 9.07 ± 0.35 |
| 4096 | FP32+TF32 | 30.33 ± 0.22 |
| 4096 | FP16 | 65.45 ± 0.03 |
| 4096 | FP16+TF32 | 65.88 ± 0.31 |
| 4096 | FP16_REDUCED | 63.56 ± 0.10 |
| 4096 | BF16 | 65.33 ± 0.73 |
| 4096 | BF16+TF32 | 64.83 ± 0.07 |
| 8192 | FP32 | 11.84 ± 0.07 |
| 8192 | FP32+TF32 | 30.27 ± 0.46 |
| 8192 | FP16 | 60.78 ± 1.12 |
| 8192 | FP16+TF32 | 61.47 ± 0.66 |
| 8192 | FP16_REDUCED | 61.00 ± 0.93 |
| 8192 | BF16 | 59.32 ± 0.76 |
| 8192 | BF16+TF32 | 57.89 ± 0.43 |
NVIDIA GeForce RTX 3090
| Size | Precision | TFLOP/s |
|------|-----------|---------|
| 1024 | FP32 | 17.18 ± 0.50 |
| 1024 | FP32+TF32 | 22.42 ± 0.80 |
| 1024 | FP16 | 36.30 ± 1.90 |
| 1024 | FP16+TF32 | 38.56 ± 0.61 |
| 1024 | FP16_REDUCED | 39.17 ± 0.13 |
| 1024 | BF16 | 36.75 ± 2.05 |
| 1024 | BF16+TF32 | 39.26 ± 0.15 |
| 2048 | FP32 | 23.22 ± 0.47 |
| 2048 | FP32+TF32 | 27.73 ± 0.59 |
| 2048 | FP16 | 54.59 ± 0.12 |
| 2048 | FP16+TF32 | 54.29 ± 0.04 |
| 2048 | FP16_REDUCED | 61.00 ± 0.08 |
| 2048 | BF16 | 60.83 ± 0.08 |
| 2048 | BF16+TF32 | 61.44 ± 0.27 |
| 4096 | FP32 | 23.17 ± 0.65 |
| 4096 | FP32+TF32 | 30.30 ± 0.47 |
| 4096 | FP16 | 66.67 ± 0.50 |
| 4096 | FP16+TF32 | 66.65 ± 0.07 |
| 4096 | FP16_REDUCED | 67.63 ± 0.01 |
| 4096 | BF16 | 67.81 ± 0.11 |
| 4096 | BF16+TF32 | 67.25 ± 0.11 |
| 8192 | FP32 | 21.67 ± 0.18 |
| 8192 | FP32+TF32 | 36.24 ± 0.27 |
| 8192 | FP16 | 64.77 ± 1.16 |
| 8192 | FP16+TF32 | 64.08 ± 0.35 |
| 8192 | FP16_REDUCED | 62.99 ± 0.60 |
| 8192 | BF16 | 66.12 ± 0.31 |
| 8192 | BF16+TF32 | 68.06 ± 0.43 |
NVIDIA A40
| Size | Precision | TFLOP/s |
|------|-----------|---------|
| 1024 | FP32 | 14.79 ± 0.39 |
| 1024 | FP32+TF32 | 32.12 ± 2.10 |
| 1024 | FP16 | 49.54 ± 3.30 |
| 1024 | FP16+TF32 | 53.04 ± 0.31 |
| 1024 | FP16_REDUCED | 52.95 ± 0.46 |
| 1024 | BF16 | 47.15 ± 3.09 |
| 1024 | BF16+TF32 | 44.51 ± 2.10 |
| 2048 | FP32 | 20.29 ± 0.33 |
| 2048 | FP32+TF32 | 44.98 ± 1.16 |
| 2048 | FP16 | 93.13 ± 0.38 |
| 2048 | FP16+TF32 | 90.48 ± 0.93 |
| 2048 | FP16_REDUCED | 88.76 ± 0.27 |
| 2048 | BF16 | 88.96 ± 0.36 |
| 2048 | BF16+TF32 | 89.25 ± 0.31 |
| 4096 | FP32 | 22.98 ± 0.09 |
| 4096 | FP32+TF32 | 55.94 ± 0.74 |
| 4096 | FP16 | 111.99 ± 0.20 |
| 4096 | FP16+TF32 | 114.65 ± 0.24 |
| 4096 | FP16_REDUCED | 114.80 ± 0.24 |
| 4096 | BF16 | 114.89 ± 0.30 |
| 4096 | BF16+TF32 | 114.90 ± 0.25 |
| 8192 | FP32 | 22.83 ± 0.05 |
| 8192 | FP32+TF32 | 59.55 ± 0.17 |
| 8192 | FP16 | 79.35 ± 0.76 |
| 8192 | FP16+TF32 | 79.39 ± 0.55 |
| 8192 | FP16_REDUCED | 79.54 ± 0.54 |
| 8192 | BF16 | 113.85 ± 1.36 |
| 8192 | BF16+TF32 | 112.31 ± 1.25 |