Linear algebra performances
Abstract
On this page, we present the performance of various Scilab scripts involving linear algebra. We emphasize the use of Mflops (millions of floating point operations per second) as a measure of the performance of the linear algebra routines used in Scilab. We consider two benchmarks:
- the dense, real, matrix-matrix multiply,
- the solution of dense, real linear systems of equations.
See "Programming in Scilab" [1] for more details on this topic.
Introduction
To get better performance, users may install ATLAS or the Intel MKL inside Scilab (see [1] for details).
In all cases, a meaningful comparison of performance results requires knowing the following parameters:
- the version of Scilab,
- the version of the operating system,
- the parameters of the CPU (and, if possible, the amount of physical memory),
- the linear algebra library.
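Part of this information can be collected from within Scilab. The following is a minimal sketch, assuming the standard getversion and getos functions; the CPU, the physical memory and the linear algebra library are not covered by it and have to be noted by hand.

```scilab
// Minimal sketch: print what Scilab itself reports about the environment.
[version, opts] = getversion();   // e.g. "scilab-5.4.1" and the build options
[osname, osversion] = getos();    // operating system name and version
// opts(2) is documented as the architecture (e.g. "x86" or "x64")
mprintf("Scilab : %s (%s)\n", version, opts(2));
mprintf("OS     : %s %s\n", osname, osversion);
```

The printed lines can be pasted next to the benchmark results so that each measurement stays attached to its configuration.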
There are (at least) three linear algebra libraries for the benchmark presented here:
- Reference Blas,
- ATLAS,
- the Intel MKL.
By default, Scilab uses the Intel MKL on Windows and Reference Blas on Linux (see [1] for details).
The size n of the matrix is a parameter which can be changed to get higher performance. The run time should be kept in a reasonable range, say from 1 second to 10 seconds. To find the value of n which allows your machine to reach its best performance, run the two attached scripts, bench_matmul.sce [3] and bench_backslash.sce [4].
In the Scilab console, we can launch the script, which performs a loop over the size of the matrix. The following is the output of a typical session. For each run, the output gives n, then the time in seconds, then the Mflops.
-->exec C:\Users\baudin\Desktop\bench_matmul.sce;
Memory: 1085 (MB)
Maximum n: 11646
Run #1: n= 1107, T=0.187 (s), Mflops= 14508
Run #2: n= 1329, T=0.249 (s), Mflops= 18854
Run #3: n= 1595, T=0.811 (s), Mflops= 10006
Run #4: n= 1914, T=0.645 (s), Mflops= 21741
Run #5: n= 2297, T=1.157 (s), Mflops= 20949
Run #6: n= 2757, T=1.929 (s), Mflops= 21727
Run #7: n= 3309, T=3.323 (s), Mflops= 21806
Run #8: n= 3971, T=4.680 (s), Mflops= 26759
Run #9: n= 4766, T=7.878 (s), Mflops= 27483
Best performance: N=4766, T=7.878 (s), MFLOPS=27483
We see that the performance generally increases with the size of the matrix, although not monotonically (see run #3). We report the best performance, that is, the run with the largest Mflops.
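The loop over the matrix size can be sketched as follows. This is a simplified illustration and not the contents of bench_matmul.sce: the function name matmul_sweep is hypothetical, the growth factor 1.2 is an assumption suggested by the sizes in the session above, and small sizes are used so that the sketch runs quickly.

```scilab
// Hypothetical sweep over the matrix size (NOT the attached bench_matmul.sce).
// perf(k,:) = [n, elapsed time (s), Mflops] for run number k.
function perf = matmul_sweep(n0, growth, nruns)
    perf = zeros(nruns, 3);
    n = n0;
    for k = 1:nruns
        A = rand(n, n);
        B = rand(n, n);
        tic();
        C = A * B;
        t = max(toc(), %eps);            // guard against a zero elapsed time
        perf(k, :) = [n, t, 2*n^3/t/1.e6];
        n = ceil(growth * n);            // next, larger size
    end
endfunction

perf = matmul_sweep(200, 1.2, 5);
[best, kbest] = max(perf(:, 3));         // run with the largest Mflops
mprintf("Best performance: n=%d, T=%.3f (s), Mflops=%.0f\n", perf(kbest, 1), perf(kbest, 2), best);
```

For a real measurement, the starting size and the number of runs would be increased until the time per run falls in the 1 to 10 seconds range.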
Matrix-Matrix Product
This benchmark involves the product of two square, real, dense matrices of doubles.
The script
The following is a short benchmark.
stacksize("max");
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" );
n = 1000;
A = rand(n,n);
B = rand(n,n);
tic();
C = A * B;
t = toc();
mflops = round(2*n^3/t/1.e6);
disp([n t mflops])
A more complete benchmark is available in bench_matmul.sce or [3].
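The 2*n^3 flop count used in the script comes from the fact that each of the n^2 entries of the product requires n multiplications and n-1 additions, that is, about 2*n flops. As a small worked example, the helper below (the name matmul_mflops is hypothetical) recomputes the Mflops from n and the elapsed time, and is applied to the first row of the results table of this section.

```scilab
// Mflops of an n-by-n matrix-matrix product that took t seconds,
// using the same 2*n^3 flop count as the benchmark script.
function m = matmul_mflops(n, t)
    m = round(2*n^3/t/1.e6);
endfunction

// First row of the results table: n = 3971, T = 1.794 s
disp(matmul_mflops(3971, 1.794))   // 69808, as reported in the table
```

Because the times in the table are rounded to three decimals, recomputing other rows this way may differ from the reported Mflops by a few units.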
The results
Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
scilab-5.4.1 | Windows Vista Business 32 bits | Intel Xeon 8*2.93 GHz | 24 GB | Intel MKL | 3971 | 1.794 | 69808 |
scilab-5.3.0-beta-4-x64 | Windows Seven Ultimate 64 bits | Intel Xeon X5570 16*2.93 GHz | 4 GB | Intel MKL | 3309 | 1.248 | 58063 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 4766 | 8.172 | 26494 |
scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core 2 6600 4*2.4 GHz | 8 GB | Intel MKL | 3971 | 4.727 | 26493 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 4766 | 8.073 | 26819 |
scilab-5.3.3 | Windows 7 Prof. 32 bits | Intel i5 2520M 4*2.5 GHz | 4 GB | Intel MKL | 3971 | 6.656 | 18815 |
scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3309 | 8.694 | 8334 |
scilab-5.4.1 | Windows 7 Prof. 32 bits | Intel Celeron T3100 2*1.90 GHz | 4 GB | Intel MKL | 3309 | 10.199 | 7104 |
scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 | 2297 | 5.897 | 4110 |
scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 GHz | 8 GB | Intel MKL | 1914 | 5.504 | 2547 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 1595 | 3.698 | 2194 |
scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 533 | 0.162 | 1869 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 444 | 0.125 | 1400 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref. BLAS | 444 | 0.129 | 1357 |
scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |
Some comments
- The Intel MKL and ATLAS libraries improve performance over the Ref. BLAS. See, for example, the following experiment, where the performance ratio is about 5 (2300/500) on a single-core processor.
Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |
- On a 64 bits system, the 64 bits Scilab improves performance over the 32 bits Scilab. See, for example, the following experiment, where both runs use the Intel MKL and the performance ratio is about 9 (9140/1063) on a dual-core processor.
Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |
Backslash
This benchmark involves the computation of the solution of a dense linear system of equations with the backslash operator. This is often called the "LINPACK" benchmark [2], but Scilab uses LAPACK.
The script
stacksize("max");
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" );
n = 1000;
A = rand(n,n);
b = rand(n,1);
tic();
x = A\b;
t = toc();
mflops = round((2/3*n^3 + 2*n^2)/t/1.e6);
disp([n t mflops])
A more complete benchmark is available in bench_backslash.sce or [4].
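The flop count used for the solve is 2/3*n^3 + 2*n^2, the usual estimate for Gaussian elimination followed by the triangular solves. The sketch below is only an illustration under stated assumptions: the helper name backslash_mflops is hypothetical, the size n = 500 is chosen to run quickly, and the relative residual is printed as a basic sanity check that the timed solve actually produced a solution.

```scilab
// Mflops of a dense n-by-n solve that took t seconds, using the same
// 2/3*n^3 + 2*n^2 flop count as the benchmark script.
function m = backslash_mflops(n, t)
    m = round((2/3*n^3 + 2*n^2)/t/1.e6);
endfunction

// Small, fast example: solve, time, and check the relative residual.
n = 500;
A = rand(n, n);
b = rand(n, 1);
tic();
x = A \ b;
t = max(toc(), %eps);                  // guard against a zero elapsed time
relres = norm(A*x - b) / norm(b);      // should be close to machine precision
mprintf("n=%d, T=%.3f (s), Mflops=%.0f, relative residual=%e\n", n, t, backslash_mflops(n, t), relres);
```

Applied to a row of the results table, backslash_mflops(5720, 6.376) gives 19578, which matches the reported value; other rows may differ by a few units because the times are rounded.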
The results
Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core2 6600 4*2.4 GHz | 8 GB | Intel MKL | 6864 | 9.655 | 22339 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 5720 | 6.376 | 19578 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 6864 | 11.304 | 19080 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3971 | 5.498 | 7598 |
scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 GHz | 8 GB | Intel MKL | 3309 | 10.802 | 2238 |
scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 1914 | 2.570 | 1821 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 2757 | 10.514 | 1330 |
scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 3309 | 12.074 | 2002 |
scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref. BLAS | 1914 | 3.29 | 1422 |
scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~700 |
scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | ATLAS | 3000 | ? | ~1400 |
Notes
- The backslash operator may use several cores of the machine, depending on the configuration of Scilab.
- Both benchmarks may fail if the maximum stack size is reached.
- The timer function should not be used, because it measures the CPU time and not the elapsed time. On multi-core machines, the CPU time measured by the timer function is the sum of the times of all cores. This is why the tic()/toc() functions should be used instead (see bug #8276: http://bugzilla.scilab.org/show_bug.cgi?id=8276 for the lack of documentation of this point in the help page of timer).
- For large matrices, the backslash test may fail, because the backslash operator switches to a least squares algorithm instead of continuing with Gaussian elimination. This is bug #7497: http://bugzilla.scilab.org/show_bug.cgi?id=7497. See also a message on this topic: http://lists.scilab.org/cgi-bin/ezmlm-browse?list=dev&cmd=showmsg&msgnum=1849
- We have packaged these benchmarks into an ATOMS module:
atomsInstall("scibench")
atomsLoad("scibench")
To run the matmul benchmark:
lines(0);
stacksize("max");
scf();
perftable = scibench_matmul ( %t , %t , 0.1 , 8 , 1.2 )
To run the backslash benchmark:
lines(0);
stacksize("max");
scf();
perftable = scibench_backslash ( %t , %t , 0.1 , 8 , 1.2 )
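The note on timer versus tic()/toc() can be checked with a small experiment. The following is a minimal sketch, not part of the attached scripts: it measures the same product with both timers. With a multi-threaded library on a multi-core machine, the CPU time may be noticeably larger than the elapsed time; with a single-threaded library, the two values are expected to be close.

```scilab
// Minimal sketch: measure the same matrix product with both timers.
n = 500;
A = rand(n, n);
B = rand(n, n);
timer();            // reset the CPU time counter
tic();              // start the wall-clock timer
C = A * B;
tcpu = timer();     // CPU time (s) since the previous call to timer()
telapsed = toc();   // elapsed time (s) since the call to tic()
mprintf("CPU time: %.3f (s), elapsed time: %.3f (s)\n", tcpu, telapsed);
```

Only the elapsed time should be used in the Mflops formulas of this page.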
References
[1] "Programming in Scilab", Michael Baudin, 2010, (HTTP)
[2] "Benchmarks: LINPACK and MATLAB - Fame and fortune from megaflops", Cleve Moler, 1994, (PDF)
[3] Benchmarking matrix-matrix product, Michael Baudin, 2010, (bench_matmul.sce)
[4] Benchmarking backslash, Michael Baudin, 2010, (bench_backslash.sce)
[5] Benchmark programs and reports, http://www.netlib.org/benchmark/
[6] Automatically tuned linear algebra software, R. Clint Whaley and Jack J. Dongarra. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1-27, Washington, DC, USA, 1998. IEEE Computer Society.
[7] Automated empirical optimizations of software and the ATLAS project, R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra, 2000