## The reason

In one of the pervious posts I had described about sparse matrices and why every researcher should understand and use them. That handled one of the big problems of RAM requirement. The next step is to now speed up the computation by developing the code based on good programming practices. I had compiled most of my findings in this presentation: Run-time from 300 years to 300 min: Lessons learned in large-scale modeling in FEniCS. By properly profiling the code for bottlenecks we can figure out ways to increase its speed. In this post I would like to share one such finding.

## The finding

Lets create a test problem to understand

from scipy.sparse import csr_matrix
import numpy as np

n=100000
m=40

t = np.ones(n)
rows_n=np.arange(n)
cols_n=np.arange(n)
val_n = np.ones(n)
mat_n = csr_matrix((val_n, (rows_n, cols_n)), shape=(n, n), dtype=int)
mat_m=csr_matrix(np.random.randint(10, size=(m, n)))/100


Here the multiplication order is

1. $[n \times n]\cdot[n\times1] = [n\times 1]$
2. $[m \times n]\cdot[n\times1] = [m\times 1]$
%%time
mat_m.dot(mat_n.dot(t))

CPU times: user 6.53 ms, sys: 1.54 ms, total: 8.07 ms
Wall time: 6.77 ms

array([4495.66, 4500.96, 4515.95, 4497.58, 4495.07, 4500.8 , 4515.05,
4511.28, 4487.11, 4492.63, 4504.96, 4502.63, 4501.69, 4505.42,
4494.06, 4504.79, 4517.84, 4501.67, 4507.63, 4495.58, 4497.92,
4495.33, 4495.05, 4506.82, 4509.54, 4510.65, 4497.09, 4506.66,
4496.97, 4503.77, 4489.17, 4499.72, 4501.36, 4499.38, 4485.01,
4490.23, 4502.94, 4505.94, 4503.82, 4499.95])


Here the multiplication order is

1. $[m \times n]\cdot[n\times n] = [m\times n]$
2. $[m \times n]\cdot[n\times 1] = [m\times 1]$
%%time
(mat_m.dot(mat_n)).dot(t)

CPU times: user 44.9 ms, sys: 5.98 ms, total: 50.9 ms
Wall time: 57.5 ms

array([4495.66, 4500.96, 4515.95, 4497.58, 4495.07, 4500.8 , 4515.05,
4511.28, 4487.11, 4492.63, 4504.96, 4502.63, 4501.69, 4505.42,
4494.06, 4504.79, 4517.84, 4501.67, 4507.63, 4495.58, 4497.92,
4495.33, 4495.05, 4506.82, 4509.54, 4510.65, 4497.09, 4506.66,
4496.97, 4503.77, 4489.17, 4499.72, 4501.36, 4499.38, 4485.01,
4490.23, 4502.94, 4505.94, 4503.82, 4499.95])


From the simple example presented above we can see that there is almost a $$7\times$$ speed boost by just changing the multiplication order. In the actual problem I achieved a speed boost of around $$100\times$$. The run time for a single iteration reduces from 2 seconds to around 0.02 seconds. The simulation that was taking 2-3 hours to complete now gets completed in around 2mins. 🥳