Parallel Computing In Scilab
Abstract
In this page, we give an overview of parallel computing in Scilab.
Various methods for parallel computing
To make the analysis clearer, we distinguish several forms of parallel computing.
First, we distinguish between implicit and explicit parallelization. That is to say, we separate the case where we use our Scilab code as is, without modification, from the case where we have to transform it into another code (in the same language or not).
Second, we distinguish the various kinds of hardware on which the parallel computations are done:
- on the multicore processor of any classical desktop or laptop - this is called multithreading,
- on the graphics card of the computer - this is called GPGPU, for General Purpose GPU,
- on a cluster of physically separated (but network-connected) machines - this is distributed computing.
Implicit Multi-core dense linear algebra
By default, Scilab v5 can use all your processors on Windows, for some operations. We can do the same on Linux, with a little more work (see the document below for details).
For dense linear algebra, the libraries which make this possible are the Intel MKL on Windows and ATLAS on Linux. For example, consider the following matrix-matrix multiplication in Scilab:
C=A*B
where A and B are real dense matrices of doubles which are compatible for the multiplication. If we use the Intel MKL on Windows, then Scilab uses all the cores available on the processor. The same holds for the backslash operator:
C=A\B
where A and B are real dense matrices of doubles which are compatible for the left-division.
The key ingredient is vectorization, a method by which we avoid "for" loops in order to make an efficient use of Scilab. Actually, there is more to it than multicore: even on a single core, the Intel MKL or ATLAS libraries make a very efficient use of the processor.
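To illustrate this, the following sketch (not part of the original text) compares a naive double loop with the vectorized matrix product; the actual timings depend on your machine and on the BLAS library which is used.
// Naive loop: compute each entry of the product explicitly
n = 500;
A = rand(n, n);
B = rand(n, n);
tic();
C1 = zeros(n, n);
for i = 1:n
    for j = 1:n
        C1(i, j) = sum(A(i, :) .* B(:, j)');
    end
end
t_loop = toc();
// Vectorized version: a single call to the optimized (possibly multithreaded) BLAS
tic();
C2 = A * B;
t_vec = toc();
mprintf("Loop: %.3f s, vectorized: %.3f s\n", t_loop, t_vec);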
You can find more details in "Programming in Scilab", section 5 "Performance":
http://forge.scilab.org/index.php/p/docprogscilab/downloads/
Explicit Multi-Core Computing
The parallel_run function makes parallel calls (on a multicore system) to the provided function on the supplied vectors of arguments. The function can be the name of either a compiled foreign function (see ilib_for_link) or a Scilab macro. In the latter case, the macro should not rely on side effects, because some of them will be lost (those performed in processes other than the main Scilab process). The number of calls (and the dimension of the result vectors) is given by the length of the longest vector of arguments.
For example, consider the following for loop:
for i = 1:10
    res(i) = i * i;
end
For parallel_run, we need a function performing the computation:
function a=g(arg1)
    a = arg1 * arg1
endfunction
res = parallel_run(1:10, g);
// res = [1., 4., 9., 16., 25., 36., 49., 64., 81., 100.]
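The help page also describes calls with several vectors of arguments, in which case each parallel call receives one element from each vector. The following is a small sketch under that assumption (the function f is introduced here only for illustration):
function z=f(x, y)
    z = x^2 + y
endfunction
x = 1:10;
y = 10:-1:1;
// The i-th call computes f(x(i), y(i)); the calls are dispatched to the available cores (on Linux only, see below)
z = parallel_run(x, y, f);
// z = [11., 13., 17., 23., 31., 41., 53., 67., 83., 101.]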
More details can be found in the help page of parallel_run:
http://help.scilab.org/parallel_run
Although parallel_run is a powerful function, it also has limitations, the most important of which is that it only runs on Linux (and not on Windows). This has been reported at:
http://bugzilla.scilab.org/show_bug.cgi?id=7697
Distributed Computing with MPI
Distributed computing can be done with Scilab from version 5.5.0 beta 1.
However, because of the nature of the various MPI runtimes, the MPI features are not compiled by default. They can be enabled with the --with-mpi build option, and Scilab can then be used in the normal MPI way:
mpirun -n 4 bin/scilab-cli
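The following is a minimal sketch of an MPI session, assuming the MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv and MPI_Finalize bindings provided by this feature (check the help pages of your version for the exact signatures): the root process sends a matrix to every other process.
MPI_Init();
rnk = MPI_Comm_rank(); // rank of this process (0 is the root)
nb = MPI_Comm_size();  // total number of processes
if rnk == 0 then
    A = rand(4, 4);
    for node = 1:nb-1
        MPI_Send(A, node); // send the matrix to process number "node"
    end
else
    A = MPI_Recv(0); // receive the matrix from the root process
    disp(size(A));
end
MPI_Finalize();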
GPGPU Computing
You can also do GPGPU computing with the sciGPGPU module by Delamarre:
http://atoms.scilab.org/toolboxes/sciGPGPU
This toolbox provides GPU computing capabilities to Scilab. It uses implementations of BLAS (cuBLAS) and FFT (cuFFT) through gpuAdd, gpuMult, gpuFFT and other functions. The toolbox essentially relies on CUDA, but some functions, such as gpuBuild, have been created to build and use kernels developed with OpenCL or CUDA.
This requires that your video card supports CUDA and double precision.
The following is an example of the sciGPGPU toolbox.
stacksize('max');
// Init host data (CPU)
A = rand(1000,1000);
B = rand(1000,1000);
C = rand(1000,1000);
// Set host data on the device (GPU)
dA = gpuSetData(A);
dC = gpuSetData(C);
d1 = gpuMult(A,B);
d2 = gpuMult(dA,dC);
d3 = gpuMult(d1,d2);
result = gpuGetData(d3); // Get result on host
// Free device memory
dA = gpuFree(dA);
dC = gpuFree(dC);
d1 = gpuFree(d1);
d2 = gpuFree(d2);
d3 = gpuFree(d3);
Parallel C Code Generation
Par4All and Wild Cruncher are two products by Silkan (formerly HPC Project) based on the automatic generation of parallel C code. Wild Cruncher is built on top of Par4All, which is why we present Par4All first.
Par4All is an automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs. The purpose of this source-to-source compiler is to adapt existing applications to various hardware targets such as multicore systems, high performance computers and GPUs. It generates new OpenMP, CUDA or OpenCL source code, which allows the original source code of the application to remain mostly unchanged for well-formed programs. Par4All is an open source project that merges various open source developments.
More details on Par4All can be found on the Par4All web site (par4all.org).
With Wild Cruncher from Silkan (formerly HPC Project), it is also possible to compile and parallelize Scilab programs to speed up computations. Wild Cruncher is a combination of hardware and software. The Scilab-to-C translation software is a specific development from HPC Project. Parallelization, code generation and optimized compilation for the associated hardware are performed with Par4All, an open source software supported and promoted by HPC Project. The hardware is built with the highest-performing technologies from Intel and NVIDIA, embedded in a device compatible with an office environment.
http://www.silkan.com/2012/01/13/wild-cruncher-breakfast-february-14th/
http://www.par4all.org/wp-content/uploads/2011/06/CP-CR-en.pdf
PVM in Scilab
The PVM module for Scilab is a toolbox that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
The individual computers may be shared- or local-memory multiprocessors, vector supercomputers, specialized graphics engines, or scalar workstations, interconnected by a variety of networks such as Ethernet or FDDI.
Daemon programs (pvmd3) provide communication and process control between computers.
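As a rough sketch of how the toolbox is used (the function names pvm_start, pvm_mytid, pvm_spawn, pvm_send, pvm_recv and pvm_halt come from the pvm_* help pages; "worker.sce" is a hypothetical worker script, and the exact signatures may differ in your version):
pvm_start();         // start the local pvmd3 daemon
mytid = pvm_mytid(); // identifier of this Scilab instance
// Spawn two Scilab tasks running the hypothetical script "worker.sce"
tids = pvm_spawn("worker.sce", 2);
// Send the same matrix to both tasks, with message tag 1
pvm_send(tids, rand(10, 10), 1);
// Collect one result back from each task, with message tag 2
for k = 1:2
    res = pvm_recv(tids(k), 2);
end
pvm_halt();          // stop the daemon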
More details can be found in the PVM help page: