Distribution functions in Scilab
Introduction
Scilab provides several probability and statistics features built around distribution functions. The following kinds of functions are considered:
- PDF: probability density function,
- CDF: cumulative distribution function,
- iCDF: inverse cumulative distribution function,
- RNG: random number generator.
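To make these four kinds of functions concrete, here is a minimal sketch in Python for the exponential distribution. The function names (exp_pdf, exp_cdf, exp_icdf, exp_rng) are hypothetical and chosen for illustration only; they are not the Scilab or distfun API.

```python
import math
import random


def exp_pdf(x, rate):
    """PDF: density of the exponential distribution at x."""
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0


def exp_cdf(x, rate):
    """CDF: probability P(X <= x)."""
    # -expm1(-t) evaluates 1 - exp(-t) without cancellation for small t.
    return -math.expm1(-rate * x) if x >= 0.0 else 0.0


def exp_icdf(p, rate):
    """iCDF: the x such that exp_cdf(x, rate) equals p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # log1p(-p) evaluates log(1 - p) accurately for small p.
    return -math.log1p(-p) / rate


def exp_rng(rate, rng=random):
    """RNG: one random variate, obtained by inverting a uniform variate."""
    return exp_icdf(rng.random(), rate)


if __name__ == "__main__":
    print(exp_pdf(0.5, 2.0), exp_cdf(0.5, 2.0), exp_icdf(0.3, 2.0))
    print(exp_rng(2.0, random.Random(0)))
```

The CDF and the inverse CDF are inverses of each other, and the random number generator is here simply the inverse CDF applied to a uniform variate. Other generation methods exist; inversion is only the simplest to show.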
A detailed analysis shows that the existing features could easily be improved on the following points:
- accuracy of the CDFs: currently, there is no accuracy test,
- creation of accurate PDFs: currently, there is no PDF in Scilab,
- creation of the missing PDFs, CDFs and inverse CDFs: currently, a large number of distribution functions are missing from Scilab.
The goal is to reach a level of quality that cannot easily be proved wrong. In the current state, it would be easy to investigate the accuracy of Scilab in the same way that the accuracy of Excel was investigated [1,2,3]. We notice that both Matlab and R provide a wide variety of accurate distribution functions. The small number of distribution functions in Scilab was noticed in [4], in which Scilab receives a score of 35% on this topic (section 3.5), compared with 47% for Matlab (fortunately, this author did not investigate the numerical accuracy).
More tests of accuracy of distribution functions
The accuracy of distribution functions is a central point in the assessment of the quality of Scilab. This particular point led several researchers to investigate this topic in Excel, but also in Gnumeric, R and others [1,2,3]. However, Scilab does not have tools to assess the quality of its distribution functions. Worse, we have evidence that the function cdfbeta only provides 8 accurate digits instead of roughly 16. In fact, it is extremely easy, by using symbolic computation systems such as Mathematica or Maple, to get the required number of significant digits and to compare them with the output of Scilab.
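The kind of accuracy test described above can be sketched as follows, in Python. This is an illustration under stated assumptions, not the Scilab test suite: the helper accurate_digits and the beta_cdf_a1_* functions are hypothetical names, exact rational arithmetic (the fractions module) plays the role of the symbolic system, and the example is restricted to the Beta distribution with first shape parameter a = 1 and integer b, whose CDF has the closed form 1 - (1 - x)^b.

```python
import math
from fractions import Fraction

# A double carries 53 binary digits, i.e. about 15.95 decimal digits.
DOUBLE_DIGITS = 53 * math.log10(2.0)


def accurate_digits(computed, expected):
    """Number of significant decimal digits shared by computed and expected."""
    if computed == expected:
        return DOUBLE_DIGITS
    if expected == 0.0:
        return 0.0
    rel_err = abs(computed - expected) / abs(expected)
    return min(DOUBLE_DIGITS, max(0.0, -math.log10(rel_err)))


def beta_cdf_a1_naive(x, b):
    """Direct formula: suffers from cancellation when x is tiny."""
    return 1.0 - (1.0 - x) ** b


def beta_cdf_a1_accurate(x, b):
    """Same quantity written with log1p and expm1 to avoid cancellation."""
    return -math.expm1(b * math.log1p(-x))


def beta_cdf_a1_reference(x, b):
    """Exact rational evaluation, rounded once to double (integer b only)."""
    return float(1 - (1 - Fraction(x)) ** b)


x, b = 1e-10, 3
reference = beta_cdf_a1_reference(x, b)
print("naive   :", accurate_digits(beta_cdf_a1_naive(x, b), reference))
print("accurate:", accurate_digits(beta_cdf_a1_accurate(x, b), reference))
```

On a typical IEEE 754 platform, the naive form should agree with the reference to only about 7 digits here, while the rewritten form stays close to full double precision. A test suite would apply the same digit count over a table of reference values for each distribution.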
Update (03/2012). Bug #7569 (http://bugzilla.scilab.org/show_bug.cgi?id=7569) has been fixed, which increases the accuracy of many cdf* functions. Still, there is a need for improved tests, as shown by at least two known bugs.
More accurate probability distribution functions
Scilab only provides a limited number of CDFs, and a large number of very common PDFs are not provided. For example, the hypergeometric distribution function is not provided. Worse, if the user relies on toolboxes such as Stixbox, we have evidence that the hypergeometric distribution function provided in this package is numerically inaccurate, i.e. it does not provide a single significant digit for moderate input arguments.
This corresponds to the bug report: http://forge.scilab.org/index.php/p/stixbox/issues/98/
The actual problem is not to fix this particular bug. The real problem is to test all the distributions in Stixbox, so that we can be sure that all functions are accurate. Since this requires a lot of work, it is more efficient to design a new set of functions.
More PDFs and CDFs
Scilab provides some cumulative distribution functions (CDF) but does not provide any probability density function (PDF). Practical experience shows that it is not trivial to implement an accurate PDF. For example, it is very easy to write an extremely inaccurate Poisson distribution function (see, for example, Excel). An accurate PDF can nevertheless be implemented, provided that we are aware of the limitations of floating point arithmetic.
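The Poisson case can be illustrated by the following Python sketch, which compares the textbook formula with an evaluation in the log domain. The function names are hypothetical, and this is not the algorithm used by distfun or by any particular package; the log-domain form is only one simple way to avoid overflow.

```python
import math


def poisson_pdf_naive(x, lam):
    """Textbook formula exp(-lam) * lam^x / x!, evaluated term by term."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def poisson_pdf_log(x, lam):
    """Same quantity assembled in the log domain, with one final exp."""
    if x < 0:
        return 0.0
    if lam == 0.0:
        return 1.0 if x == 0 else 0.0
    # lgamma(x + 1) is log(x!), so no intermediate value can overflow.
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1.0))


if __name__ == "__main__":
    # Moderate arguments: both forms agree.
    print(poisson_pdf_naive(3, 2.0), poisson_pdf_log(3, 2.0))
    # Large arguments: lam ** x exceeds the double range in the naive form.
    try:
        print(poisson_pdf_naive(1000, 1000.0))
    except OverflowError as err:
        print("naive form failed:", err)
    print(poisson_pdf_log(1000, 1000.0))
```

The log-domain form removes the overflow, but it is not fully accurate either: the three large terms nearly cancel, so a few digits are lost for large arguments. Reaching full accuracy requires a more careful algorithm, which is precisely why an accuracy test is needed for each function.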
The progress during 2012-2013
The distfun project has improved a lot since its creation in 2012, when it provided only 5 distributions. Part of this success is based on the GSOC 2012 (see Contributor-stats-GSOC2012). During that project, we added several basic distributions not included in previous releases: Binomial, Poisson, Chi-Square, Hypergeometric, F, Geometric.
A further boost came after the completion of the GSOC, when most of the work was translated into C source code for increased performance and consistency. The T distribution was also added. A lot of accuracy bugs were fixed, typically for large or small input parameters. The nonlinear equation solver was also updated, leading to improved robustness, speed and portability. The uniform random number generator was updated, with a clarified API (and a clarified implementation). Distfun now provides 13 documented, tested, robust distributions.
Ideas
In this section, we provide a list of potential tasks related to this topic. We detail the expected outputs of each potential task. We also analyse the tools which might be required. In each case, a small scientific report at the end of the task will be welcome. We emphasize the benefits of the task for the student. We also detail the software management of the produced source code.
The expected output of these tasks is a collection of Scilab macros (.sci), unit tests (.tst) and help pages (.xml). If possible, a set of C source files (.c) may be provided.
Task #1: Add the distributions available in the cdf* library of Scilab. These are the following ones: non-central F, non-central Chi-Square, negative Binomial. This will make Scilab's cdf* functions obsolete.
Task #2: Add the distributions available in the grand function of Scilab: multivariate Gaussian, multinomial, permutations, and Markov chains. This will make Scilab's grand function obsolete.
Task #3: Add intermediate distributions available in R/Matlab. This includes: Gumbel, Cauchy, Weibull, discrete uniform. Drafts of implementations in Scilab are already available.
Task #4: Recode the argument expansion routine in C and implement all the gateways in pure C. This will make the coding simpler and will increase performance.
Task #5: Add advanced features: log-PDF, fitting procedures (based on the maximum likelihood estimator), generic truncated distribution, kernel smoothing density (ksdensity in Matlab). Drafts of implementations in Scilab are already available.
Other projects related to the same issues are welcome.
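As an illustration of two of the advanced features listed in Task #5, the following Python sketch shows a log-PDF and a maximum likelihood fit for the normal distribution. The function names are hypothetical and the normal distribution is chosen only because its maximum likelihood estimator has a closed form; a generic fitting procedure would instead maximize the log-likelihood numerically.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Logarithm of the normal PDF, computed without calling exp."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_likelihood(data, mu, sigma):
    """Sum of log-PDF values; the quantity maximized by the fit."""
    return math.fsum(normal_logpdf(x, mu, sigma) for x in data)


def normal_fit_mle(data):
    """Closed-form maximum likelihood estimates (mu, sigma)."""
    n = len(data)
    mu = math.fsum(data) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma = math.sqrt(math.fsum((x - mu) ** 2 for x in data) / n)
    return mu, sigma


data = [4.9, 5.1, 5.0, 4.7, 5.3, 5.2, 4.8]
mu_hat, sigma_hat = normal_fit_mle(data)
print(mu_hat, sigma_hat, log_likelihood(data, mu_hat, sigma_hat))
```

The log-PDF is useful in its own right: far in the tails the PDF underflows to zero in double precision, while its logarithm remains finite, so a log-likelihood built from log-PDF values stays usable.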
Sources of inspiration
- The page http://cran.r-project.org/web/views/Distributions.html gives an overview of what distributions are available in R.
Bibliography
- [1] The accuracy of statistical distributions in Microsoft Excel 2007 - A. Talha Yalta, Computational Statistics and Data Analysis 52 (2008) 4579–4586
- [2] Fixing Statistical Errors in Spreadsheet Software: The Cases of Gnumeric and Excel - B. D. McCullough, CSDA Statistical Software Newsletter - 2004
- [3] On the Accuracy of Statistical Distributions in Microsoft Excel 97 - Leo Knüsel, SSNinCSDA 26, 375-379, January 1998
- [4] Comparison of mathematical programs for data analysis - Edition 5.04, Stefan Steinhaus, http://www.scientificweb.com/ncrunch/
- [5] http://forge.scilab.org/index.php/p/distfun/, Distribution Functions
- [6] http://www.iecn.u-nancy.fr/~pincon/nsp/nsp_manual/manualindex.html and look for "cdf", "pdf", "icdf" and "grand"
Author
2013 - Michaël Baudin