Benchmarking Matrix-Matrix Multiply (via BLAS)
NOTE: One way to compare different BLAS implementations is through conda-forge:

    conda install "libblas=*=*openblas"
    conda install "libblas=*=*netlib"
In [1]:
import os
# For fun: try without this. Develop a theory that explains your observation.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import numpy as np
from time import process_time
- What is process_time?
- What are the different time options available?
- How suitable are they for timing experiments?
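The questions above can be explored directly. A sketch that queries the properties of the standard-library clocks via time.get_clock_info, then demonstrates the key difference: process_time counts CPU time of this process only, so it ignores time spent sleeping.

```python
import time

# Compare the clocks Python offers for timing experiments:
# - perf_counter: wall-clock, high resolution, includes time spent sleeping
# - process_time: CPU time of this process only, excludes sleep
# - monotonic: wall-clock, never goes backwards
for name in ["perf_counter", "process_time", "monotonic"]:
    info = time.get_clock_info(name)
    print(f"{name}: resolution={info.resolution}, monotonic={info.monotonic}")

# process_time ignores sleep: the process uses (almost) no CPU while asleep.
t0 = time.process_time()
w0 = time.perf_counter()
time.sleep(0.1)
cpu = time.process_time() - t0
wall = time.perf_counter() - w0
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

One caveat for the benchmarks below: with a multithreaded BLAS, process_time sums CPU time across all threads, which is one reason to pin OPENBLAS_NUM_THREADS to 1 above.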
In [2]:
if 0:
    A = np.random.randn(25, 25)
    B = np.random.randn(25, 800000)
else:
    A = np.random.randn(2048, 2048)
    B = np.random.randn(2048, 2048)
In [3]:
start = process_time()
A@B
elapsed = process_time() - start
print(elapsed)
0.35817869300000005
- What criteria would you apply to this type of measurement?
- Was that... efficient?
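One criterion: a single one-shot timing is noisy (cache warm-up, frequency scaling, interference from other processes). A common remedy is to repeat the measurement and report the minimum; a sketch, using smaller 512×512 operands (an assumption, just to keep the run short):

```python
import numpy as np
from time import process_time

n = 512  # assumed smaller size so the repeated runs finish quickly
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# Repeat the measurement; the minimum is the least-perturbed run.
times = []
for _ in range(5):
    start = process_time()
    A @ B
    times.append(process_time() - start)

print(f"min: {min(times):.4f} s, max: {max(times):.4f} s")
```

The first run is often the slowest (cold caches, lazy library initialization), which is why reporting only the first measurement can mislead.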
In [4]:
print(f"{A.size * B.shape[1] * 2/1e9/elapsed} GFlops/s")
print(f"{A.nbytes*3/elapsed/1e9} GB/s")
47.96452027926742 GFlops/s
0.28104211101133253 GB/s
How would we come up with reference quantities with which to compare the attained performance?
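One common reference is a back-of-the-envelope machine peak from the core clock, SIMD width, and FMA throughput. A sketch; every number below is an assumption to be replaced with your machine's actual values (e.g. from the CPU vendor's datasheet):

```python
# Hypothetical single-core peak (all numbers are assumptions):
clock_ghz = 3.0        # assumed core clock
simd_lanes = 4         # assumed: AVX2, 4 doubles per 256-bit vector
fma_units = 2          # assumed: two FMA execution ports per core
flops_per_fma = 2      # one fused multiply-add counts as 2 flops

peak_gflops = clock_ghz * simd_lanes * fma_units * flops_per_fma
print(f"assumed single-core peak: {peak_gflops} GFlops/s")
```

A memory-bandwidth reference (from the memory type and channel count, also datasheet-derived) plays the same role for the GB/s figure.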
In [ ]:
- Try different BLAS implementations.
- Explore the underlying technology stack. (perf top -z may help.)
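To see which BLAS implementation NumPy is actually linked against, NumPy ships an introspection helper (the exact output format varies across NumPy versions):

```python
import numpy as np

# Print NumPy's build/link configuration, including the BLAS/LAPACK in use.
np.show_config()
```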
In [ ]:
start = process_time()
for i in range(B.shape[1]):
    A @ B[:, i]
elapsed = process_time() - start
print(elapsed)
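The loop above performs the same total flop count as the single matrix-matrix product, but as many matrix-vector products (BLAS level 2) instead of one matrix-matrix product (BLAS level 3). A sketch comparing the two attained rates directly, on smaller 512×512 operands (an assumption to keep the run short):

```python
import numpy as np
from time import process_time

n = 512  # assumed smaller size; the ratio of the rates is what matters
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# One BLAS 3 call: matrix-matrix product.
start = process_time()
A @ B
t_mm = process_time() - start

# Same total work as n BLAS 2 calls: matrix-vector products.
start = process_time()
for i in range(n):
    A @ B[:, i]
t_mv = process_time() - start

flops = 2 * n**3  # identical flop count for both variants
print(f"matmul:      {flops/t_mm/1e9:.2f} GFlops/s")
print(f"matvec loop: {flops/t_mv/1e9:.2f} GFlops/s")
```

The level-3 variant can reuse each entry of A and B many times from cache, while each matrix-vector call must re-stream A from memory, which is one explanation to test against the measured gap.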
In [ ]: