# Linear algebra performance

Abstract

On this page, we present the performance of various Scilab scripts involving linear algebra. We use Mflops (millions of floating-point operations per second) as the measure of performance of the linear algebra routines used in Scilab. We consider two benchmarks:

• the dense, real, matrix-matrix multiply,
• the solution of dense, real linear systems of equations.

See "Programming in Scilab" [1] for more details on this topic.

## Introduction

To get better performance, users may install ATLAS or the Intel MKL inside Scilab (see [1] for details).

In all cases, comparing performance results requires knowing the following parameters:

• the version of Scilab,
• the version of the operating system,
• the parameters of the CPU (and, if possible, the amount of physical memory),
• the linear algebra library.
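
Part of this information can be collected from within Scilab. The following is a minimal sketch, assuming that `getversion` and `getos` are available in your Scilab version; the CPU, the physical memory and the linear algebra library are not reported by this sketch and have to be recorded by hand.

```scilab
// Collect the parameters that Scilab itself can report.
ver = getversion();   // Scilab version string, e.g. of the form "scilab-5.3.3"
os = getos();         // name of the operating system
mprintf("Scilab: %s\n", ver);
mprintf("OS:     %s\n", os);
// CPU model, physical memory and BLAS/LAPACK library are not queried here:
// note them from the system tools of your platform.
```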

There are (at least) three linear algebra libraries for the benchmarks presented here:

• Reference BLAS,
• ATLAS,
• the Intel MKL.

By default, Scilab uses the Intel MKL on Windows and Reference BLAS on Linux (see [1] for details).

The size n of the matrix is a parameter which can be changed to get higher performance. The time should be kept in a reasonable range, say from 1 second to 10 seconds. To find the value of n for which your machine reaches its best performance, run the two attached scripts, bench_matmul.sce and bench_backslash.sce.

In the Scilab console, we launch the script, which loops over the size of the matrix. The following is a typical session. For each run, the output gives n, the time in seconds and the Mflops.

```-->exec C:\Users\baudin\Desktop\bench_matmul.sce;
Memory: 1085 (MB)
Maximum n: 11646
Run #1: n=  1107, T=0.187 (s), Mflops= 14508
Run #2: n=  1329, T=0.249 (s), Mflops= 18854
Run #3: n=  1595, T=0.811 (s), Mflops= 10006
Run #4: n=  1914, T=0.645 (s), Mflops= 21741
Run #5: n=  2297, T=1.157 (s), Mflops= 20949
Run #6: n=  2757, T=1.929 (s), Mflops= 21727
Run #7: n=  3309, T=3.323 (s), Mflops= 21806
Run #8: n=  3971, T=4.680 (s), Mflops= 26759
Run #9: n=  4766, T=7.878 (s), Mflops= 27483
Best performance: N=4766, T=7.878 (s), MFLOPS=27483```

We see that the performance generally increases with the size of the matrix, although not monotonically (see Run #3). We keep the best performance, that is, the run with the largest Mflops.
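
The loop over n can be sketched as follows. This is not the content of bench_matmul.sce: the starting size, the growth factor, the time limit and the maximum number of runs are assumptions chosen for illustration (the session above suggests that n grows by a factor of about 1.2 between runs).

```scilab
// Sketch of a size sweep for the matrix-matrix product.
// All tuning values below are illustrative assumptions.
n = 200;          // starting matrix size
factor = 1.2;     // geometric growth of n between two runs
tmax = 1;         // stop once a single product takes longer than this (s)
kmax = 12;        // maximum number of runs
best = [0 0 0];   // [n, time, Mflops] of the best run so far
k = 0;
t = 0;
while t < tmax & k < kmax
    k = k + 1;
    A = rand(n, n);
    B = rand(n, n);
    tic();
    C = A * B;
    t = max(toc(), 1.e-6);              // guard against a zero timing
    mflops = round(2 * n^3 / t / 1.e6);
    mprintf("Run #%d: n=%6d, T=%.3f (s), Mflops=%6d\n", k, n, t, mflops);
    if mflops > best(3) then
        best = [n t mflops];            // keep the run with the largest Mflops
    end
    n = ceil(factor * n);
end
mprintf("Best performance: N=%d, T=%.3f (s), MFLOPS=%d\n", best(1), best(2), best(3));
```

Growing n geometrically keeps the number of runs small while still covering a wide range of sizes. For very short timings the Mflops value is unreliable, which is one reason to aim for runs of at least one second.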

## Matrix-Matrix Product

This benchmark computes the product of two square, real, dense matrices of doubles.

### The script

The following is a short benchmark.

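```stacksize("max"); // Use the largest possible stack
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" ); // Draw entries from the normal distribution
n = 1000;
A = rand(n,n);
B = rand(n,n);
tic(); // Measure the elapsed (wall clock) time
C = A * B;
t = toc();
mflops = round(2*n^3/t/1.e6); // The product costs about 2*n^3 flops
disp([n t mflops])```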

A more complete benchmark is available in bench_matmul.sce or [3].
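
The factor 2*n^3 used in the script comes from the usual operation count of the textbook matrix product, which can be written as follows. This is the conventional count for the benchmark, and it says nothing about how a particular BLAS library organizes the computation.

```latex
\begin{align*}
C_{ij} &= \sum_{k=1}^{n} A_{ik} B_{kj}, \qquad 1 \le i, j \le n
  && \text{($n$ multiplications and $n-1$ additions per entry)} \\
\text{flops} &= n^2 (2n - 1) = 2n^3 - n^2 \approx 2n^3 \\
\text{Mflops} &= \frac{2 n^3}{10^6 \, t}
  && \text{($t$ is the elapsed time in seconds)}
\end{align*}
```

For example, the last run of the session shown earlier (n = 4766, T = 7.878 s) gives 2 × 4766^3 / (10^6 × 7.878) ≈ 2.75 × 10^4 Mflops, which agrees with the reported value of 27483.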

### The results

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.4.1 | Windows Vista Business 32 bits | Intel Xeon 8*2.93GHz | 24 GB | Intel MKL | 3971 | 1.794 | 69808 |
| scilab-5.3.0-beta-4-x64 | Windows Seven Ultimate 64 bits | Intel Xeon X5570 16*2.93GHz | 4 GB | Intel MKL | 3309 | 1.248 | 58063 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 4766 | 8.172 | 26494 |
| scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core 2 6600 4*2.4 Ghz | 8 GB | Intel MKL | 3971 | 4.727 | 26493 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 4766 | 8.073 | 26819 |
| scilab-5.3.3 | Windows 7 Prof. 32 bits | Intel i5 2520M 4*2.5GHz | 4 GB | Intel MKL | 3971 | 6.656 | 18815 |
| scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3309 | 8.694 | 8334 |
| scilab-5.4.1 | Windows 7 Prof. 32 bits | Intel Celeron T3100 2*1.90GHz | 4 GB | Intel MKL | 3309 | 10.199 | 7104 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 | 2297 | 5.897 | 4110 |
| scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 Ghz | 8 GB | Intel MKL | 1914 | 5.504 | 2547 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 1595 | 3.698 | 2194 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 533 | 0.162 | 1869 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 444 | 0.125 | 1400 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref BLAS | 444 | 0.129 | 1357 |
| scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |

• The Intel MKL and ATLAS libraries improve performance over the Ref. BLAS. See for example the following experiment, where the performance ratio is about x5 on a single-core processor.

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |

• On a 64 bits system, the 64 bits Scilab improves performance over the 32 bits Scilab. See for example the following experiment, where both versions use the Intel MKL and the performance ratio is about x9 on a dual-core processor.

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
| scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |
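
The two ratios quoted above can be recomputed from the MFLOPS values of the tables. The short sketch below only redoes this arithmetic; the variable names are illustrative, and the ratios are approximate since the Athlon figures are themselves rounded and the two Pentium runs use different values of n.

```scilab
// Speed-up ratios recomputed from the MFLOPS values quoted in the tables.
ratio_lib = 2300 / 500;    // ATLAS or Intel MKL vs Ref. BLAS (AMD Athlon 3200+)
ratio_x64 = 9140 / 1063;   // 64 bits vs 32 bits Scilab (Intel Pent. P6200)
mprintf("Library speed-up: x%.1f\n", ratio_lib);
mprintf("64 bits speed-up: x%.1f\n", ratio_x64);
```

This gives about x4.6 for the library change and about x8.6 for the 64 bits version, which the text rounds to x5 and x9.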

## Backslash

This benchmark computes the solution of a dense linear system of equations with the backslash operator. This is often called the "LINPACK" benchmark [2], although Scilab uses LAPACK.

### The script

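```stacksize("max"); // Use the largest possible stack
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" ); // Draw entries from the normal distribution
n = 1000;
A = rand(n,n);
b = rand(n,1);
tic(); // Measure the elapsed (wall clock) time
x = A\b;
t = toc();
mflops = round((2/3*n^3 + 2*n^2)/t/1.e6); // LU factorization + two triangular solves
disp([n t mflops])```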

A more complete benchmark is available in bench_backslash.sce or [4].
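
The flop count 2/3*n^3 + 2*n^2 used in the script corresponds to Gaussian elimination with partial pivoting (about 2/3*n^3 flops for the LU factorization) followed by two triangular solves (about n^2 flops each). The sketch below makes these steps explicit with `lu` and compares the result with the backslash operator. It illustrates the algorithm only, under the assumption of a small, well-conditioned random matrix: it does not claim to reproduce what backslash does internally, and the triangular systems are themselves solved with backslash, so its own cost may differ from the counts in the comments.

```scilab
// Explicit LU-based solve, compared with the backslash operator.
n = 200;                         // small size, for illustration only
A = rand(n, n, "normal");
b = rand(n, 1, "normal");
[L, U, E] = lu(A);               // factorization with pivoting: E*A = L*U
y = L \ (E * b);                 // solve the lower triangular system L*y = E*b
x_lu = U \ y;                    // solve the upper triangular system U*x = y
x = A \ b;                       // reference solution
relerr = norm(x_lu - x) / norm(x);
flops = 2/3 * n^3 + 2 * n^2;     // conventional count used by the benchmark
mprintf("n=%d, flops=%.3e, relative difference=%.1e\n", n, flops, relerr);
```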

### The results

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core2 6600 4*2.4 GHz | 8 GB | Intel MKL | 6864 | 9.655 | 22339 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 5720 | 6.376 | 19578 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 6864 | 11.304 | 19080 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3971 | 5.498 | 7598 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
| scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 Ghz | 8 GB | Intel MKL | 3309 | 10.802 | 2238 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 1914 | 2.570 | 1821 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 2757 | 10.514 | 1330 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 3309 | 12.074 | 2002 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref. BLAS | 1914 | 3.29 | 1422 |
| scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~700 |
| scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | ATLAS | 3000 | ? | ~1400 |

## Notes

• The backslash operator may use several cores of the machine, depending on the configuration of Scilab.
• Both benchmarks may fail if the maximum stack size is reached.
• The timer function should not be used, because it measures the CPU time and not the elapsed time. On multi-core machines, the CPU time measured by the timer function is the sum of the times of all cores. This is why the tic()/toc() functions should be used instead (see bug #8276: http://bugzilla.scilab.org/show_bug.cgi?id=8276 about the lack of documentation of this point in the help page of timer).

• For large matrices, the backslash test may fail, because the backslash operator switches to a least squares algorithm instead of continuing with Gaussian elimination. This is bug #7497: http://bugzilla.scilab.org/show_bug.cgi?id=7497

• See a message on this topic: http://lists.scilab.org/cgi-bin/ezmlm-browse?list=dev&cmd=showmsg&msgnum=1849

• We have packaged these benchmarks into an ATOMS module:

```atomsInstall("scibench")```

To run the matmul benchmark:

```lines(0);
stacksize("max");
scf();
perftable = scibench_matmul ( %t , %t , 0.1 , 8 , 1.2 )```

To run the backslash benchmark:

```lines(0);
stacksize("max");
scf();
perftable = scibench_backslash ( %t , %t , 0.1 , 8 , 1.2 )```
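
The note above on timer versus tic()/toc() can be checked with a short experiment. The sketch below measures the same matrix product with both functions; the size n = 500 is an arbitrary choice. With a multi-threaded library on a multi-core machine, the CPU time may exceed the elapsed time, while with a single-threaded library the two values should be close.

```scilab
// Compare the CPU time (timer) with the elapsed time (tic/toc)
// for the same matrix-matrix product.
n = 500;               // arbitrary size, for illustration only
A = rand(n, n);
B = rand(n, n);
timer();               // reset the CPU time counter
tic();                 // start the wall clock
C = A * B;
t_wall = toc();        // elapsed time, in seconds
t_cpu = timer();       // CPU time since the previous call to timer
mprintf("Elapsed time: %.3f (s)\n", t_wall);
mprintf("CPU time:     %.3f (s)\n", t_cpu);
```

Only the elapsed time should be used to compute the Mflops reported on this page.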

## References

[1] "Programming in Scilab", Michael Baudin, 2010, (HTTP)

[2] "Benchmarks: LINPACK and MATLAB - Fame and fortune from megaflops", Cleve Moler, 1994, (PDF)

[3] Benchmarking matrix-matrix product, Michael Baudin, 2010, (bench_matmul.sce)

[4] Benchmarking backslash, Michael Baudin, 2010, (bench_backslash.sce)

[5] Benchmark programs and reports, http://www.netlib.org/benchmark/

[6] Automatically tuned linear algebra software, R. Clint Whaley and Jack J. Dongarra. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1-27, Washington, DC, USA, 1998. IEEE Computer Society.

[7] Automated empirical optimizations of software and the ATLAS project, R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra, 2000

2022-09-08 09:27