Benchmarking Matrix-Matrix Multiply (via BLAS)

NOTE: One way to compare different BLAS implementations is to switch between them using conda-forge packages:

conda install "libblas=*=*openblas"
conda install "libblas=*=*netlib"
In [1]:
import os
# For fun: try without this. Develop a theory that explains your observation.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

from time import process_time
  • What is process_time?
  • What other timing functions are available?
  • How suitable is each of them for timing experiments?
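One way to start on these questions is to ask Python what it reports about each clock. The sketch below uses only the standard `time` module. The 0.2 s sleep is an arbitrary choice for illustration, and the reported resolutions depend on the operating system.

```python
import time

# Print the properties Python reports for each standard clock.
for name in ("time", "monotonic", "perf_counter", "process_time"):
    info = time.get_clock_info(name)
    print(f"{name:>13}: resolution={info.resolution:.1e} s, "
          f"monotonic={info.monotonic}, adjustable={info.adjustable}")

# process_time measures CPU time (user + system) of the current process,
# so it barely advances during a sleep, while perf_counter (wall clock) does.
wall_start, cpu_start = time.perf_counter(), time.process_time()
time.sleep(0.2)
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"wall: {wall_elapsed:.3f} s, cpu: {cpu_elapsed:.3f} s")
```

Because `process_time` is process-wide CPU time rather than elapsed time, the two clocks can disagree in either direction, which is worth keeping in mind for the thread-count experiment above.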
In [2]:
if 0:
    # Alternative shapes: small square A, very wide B
    A = np.random.randn(25, 25)
    B = np.random.randn(25, 800000)
else:
    # Default: two square 2048 x 2048 matrices
    A = np.random.randn(2048, 2048)
    B = np.random.randn(2048, 2048)
In [3]:
start = process_time()
A@B
elapsed = process_time() - start
print(elapsed)
0.35817869300000005
  • What criteria would you apply to this type of measurement?
  • Was that... efficient?
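A single timing is easily disturbed by warm-up effects and other activity on the machine. As one possible criterion, the sketch below does an untimed warm-up run, repeats the measurement, and reports the spread. The helper `time_matmul`, the matrix size 512, and the repeat count are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np
from time import perf_counter, process_time

def time_matmul(A, B, repeats=5):
    """Return (wall, cpu) lists of per-run timings for A @ B."""
    A @ B  # warm-up run, not timed
    wall, cpu = [], []
    for _ in range(repeats):
        w0, c0 = perf_counter(), process_time()
        A @ B
        cpu.append(process_time() - c0)
        wall.append(perf_counter() - w0)
    return wall, cpu

# Smaller matrices than above so the repeated runs stay quick.
rng = np.random.default_rng(0)
A_small = rng.standard_normal((512, 512))
B_small = rng.standard_normal((512, 512))

wall, cpu = time_matmul(A_small, B_small)
print(f"wall: min {min(wall):.4f} s, max {max(wall):.4f} s")
print(f"cpu:  min {min(cpu):.4f} s, max {max(cpu):.4f} s")
```

The minimum is often taken as the least-disturbed measurement, while the gap between minimum and maximum gives a rough sense of the noise.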
In [4]:
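print(f"{A.size * B.shape[1] * 2 / 1e9 / elapsed} GFLOP/s")
# Bytes for A, B, and the result (equals A.nbytes*3 in the square case)
nbytes = A.nbytes + B.nbytes + A.shape[0] * B.shape[1] * A.itemsize
print(f"{nbytes / elapsed / 1e9} GB/s")
47.96452027926742 GFLOP/s
0.28104211101133253 GB/s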

How would we come up with reference quantities with which to compare the attained performance?
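One common starting point is a roofline-style estimate: compare the operation's arithmetic intensity (flops per byte moved) with the machine's peak flop rate and memory bandwidth. The sketch below assumes each operand is moved exactly once, which is a lower bound on traffic. The machine numbers `peak_gflops` and `bandwidth_gbs` are hypothetical placeholders, not measurements of any real system, and should be replaced with values for the machine at hand.

```python
def matmul_intensity(m, k, n, itemsize=8):
    """Flops per byte for an (m x k) @ (k x n) product, minimal traffic."""
    flops = 2 * m * k * n
    bytes_moved = itemsize * (m * k + k * n + m * n)
    return flops / bytes_moved

def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s under a simple roofline model."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# HYPOTHETICAL machine numbers, for illustration only.
peak_gflops = 50.0
bandwidth_gbs = 20.0

mm = matmul_intensity(2048, 2048, 2048)  # matrix-matrix
mv = matmul_intensity(2048, 2048, 1)     # matrix-vector
bound_mm = roofline_gflops(mm, peak_gflops, bandwidth_gbs)
bound_mv = roofline_gflops(mv, peak_gflops, bandwidth_gbs)

print(f"matmul: {mm:.1f} flop/byte, bound {bound_mm:.1f} GFLOP/s")
print(f"matvec: {mv:.3f} flop/byte, bound {bound_mv:.1f} GFLOP/s")
```

Under these assumptions the matrix-matrix product is limited by the peak flop rate, while a single matrix-vector product is limited by memory bandwidth.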

  • Try different BLAS implementations.
  • Explore the underlying technology stack. (perf top -z may help.)

Point of comparison: Repeated matrix-vector multiplication

Explain what you observe.

In [ ]:
start = process_time()
for i in range(B.shape[1]):
    A @ B[:,i]
elapsed = process_time() - start
print(elapsed)
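To put the loop on the same footing as the matrix-matrix timing, one option is to convert its elapsed time into the same two rates. The sketch below is a small self-contained version: the size `n = 256` and the names `A_s`, `B_s`, `C_loop` are assumptions for illustration, and it uses `perf_counter` instead of `process_time` so that the elapsed time is never zero on coarse clocks. The byte count assumes that `A` is re-read from memory for every column, which overestimates traffic if `A` stays in cache.

```python
import numpy as np
from time import perf_counter

rng = np.random.default_rng(0)
n = 256  # small hypothetical size so the loop finishes quickly
A_s = rng.standard_normal((n, n))
B_s = rng.standard_normal((n, n))

C_loop = np.empty((n, n))
start = perf_counter()
for i in range(n):
    # B_s[:, i] is a strided (non-contiguous) column view of B_s.
    C_loop[:, i] = A_s @ B_s[:, i]
elapsed_loop = perf_counter() - start

# The loop computes the same product as a single A_s @ B_s.
assert np.allclose(C_loop, A_s @ B_s)

flops = 2 * n**3  # same flop count as the matrix-matrix product
# Per matvec: read A, read one column, write one column.
bytes_loop = n * (A_s.nbytes + 2 * n * A_s.itemsize)
print(f"{flops / 1e9 / elapsed_loop} GFLOP/s")
print(f"{bytes_loop / 1e9 / elapsed_loop} GB/s (if A is re-read each time)")
```

The flop count is unchanged, but the data traffic per flop is much higher than for one matrix-matrix call.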