Daily Reports for Machine Learning Toolbox - GSOC 2017
- 1st July
- 2nd July
- 3rd July
- 4th July
- 5th July
- 6th July
- 7th July
- 8th July
- 10th July
- 12th July
- 13th July
- 15th July
- 17th July
- 19th July
- 21st July
- 22nd July
- 25th July
- 31st July
- 1st August
- 2nd - 3rd August
- 4th - 5th August
- 6th August
- 7th - 9th August
- 10th August
- 13th August
- 14th August
- 15th August
- 17th - 19th August
- 21st August
- 23rd August
- 25th August
- Project Info
Student Name - Mandar Deshpande
Mentors:
- Yann Debray
- Philippe Saadé
- Dhruv Khattar
- Caio Souza
- Cloned the latest version of the PIMS toolbox from the forge. Tried to build it with VS 2013, but ran into some build errors. Have contacted Simon Marchetto regarding it.
- Created a workaround for the pyExecFile function, using the pyEvalStr function for linear regression example from scikit-learn.
- Committed PIMS_direct_run.sce to forge
- Read documentation for execfile function in python environment
- Tested a python script for running entire scikit-learn scripts on a Jupyter server, using the jupyter client approach
- Able to retrieve the linear regression learned parameters - coef_ and intercept_
- Committed jupyter_lr.py to forge, demonstrating the work.
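The parameter retrieval above can be sketched as follows. This is a minimal, hypothetical sketch of the scikit-learn side only; the toy data is made up for illustration and is not from jupyter_lr.py:

```python
# Sketch: fit a linear regression and read back the learned parameters
# (coef_ and intercept_) that get transferred to Scilab.
# The toy data below is illustrative, not from the project.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])          # exactly y = 2*x + 1

model = LinearRegression().fit(X, y)
coef, intercept = model.coef_, model.intercept_
```

Once these two values are on the Scilab side, prediction is just `coef * x + intercept`.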
- Tested ridge_regression through PIMS; was able to retrieve the coefficient and intercept numpy objects for use in Scilab
- The predict method for ridge regression gives the same results as Python's prediction.
- Similarly, RidgeCV's learned parameter alpha was transferred to the Scilab console through PIMS, and also tested using pyEvalStr
- Studied KNN classification's predict function, and identified the learned parameters as neigh_dist and neigh_ind. These will be transferred to Scilab and the predict function written tomorrow.
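The planned predict reconstruction from neigh_dist/neigh_ind can be sketched like this. A hypothetical sketch with toy data; the real version would run over the arrays transferred to Scilab:

```python
# Sketch: rebuild KNN classification's predict from the neigh_dist /
# neigh_ind values returned by kneighbors(), via a majority vote.
# Toy data for illustration only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

X_test = np.array([[0.05], [5.05]])
neigh_dist, neigh_ind = clf.kneighbors(X_test)   # the attributes to transfer

# Majority vote over the labels of the nearest neighbours
manual_pred = np.array([np.bincount(y[ind]).argmax() for ind in neigh_ind])
```

For uniform weights this vote matches what `clf.predict` computes internally.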
- python_local.py handles all the communication with the jupyter_server from the local machine.
- python_server.py, is the python script on the jupyter server or remote machine which handles all the ml training operations using python libraries like scikit-learn, pandas, scipy etc. The dataset used by the python_server.py file exists on the remote machine itself.
- I have tested running python_server.py on the Jupyter server through the python_local.py script, and the models are trained successfully.
- Only need to use the pyExecFile() function from the scilab console now.
- Tracked scikit-learn source code for kernel ridge model's attributes and prediction function design
- Tested kernel_ridge python script through the jupyter approach, and was able to retrieve model attributes for predicting test set results
- Also wrote kernel_ridge.sce for Scilab using PIMS, along with a prediction function using the 'dual_coef_' and 'X_fit_' attributes, through the 'get_kernel' method in the kernel_ridge class. Committed the code to the forge repo.
- While executing the predict function on the Scilab side, there is a stack error caused by the numpy ndarray returned from the pairwise_kernels function. Needs to be resolved.
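The manual kernel ridge prediction being ported to Scilab can be sketched in Python as follows (a hedged sketch with toy data; kernel and gamma values here are illustrative):

```python
# Sketch: reproduce kernel ridge prediction from the learned attributes
# (dual_coef_ and X_fit_) via pairwise_kernels, mirroring what
# kernel_ridge.sce attempts on the Scilab side. Toy data only.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.RandomState(0)
X = rng.rand(20, 2)
y = X[:, 0] + 2.0 * X[:, 1]

model = KernelRidge(kernel="rbf", gamma=0.5).fit(X, y)

# predict(X_new) is K(X_new, X_fit_) . dual_coef_
K = pairwise_kernels(X[:5], model.X_fit_, metric="rbf", gamma=0.5)
manual = K.dot(model.dual_coef_)
```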
- Also have identified (followed the scikit-learn source) how the attributes are being used for prediction using SVR, SVC and Decision-Tree Classifier. These attributes are first being modified before usage in the actual 'predict' function, using methods specific to scikit-learn classes. There is a need to discuss this with my mentor, as to how to go forward.
- I was able to transfer the model attributes of SVR like support_, support_vectors_, dual_coef_, intercept_ to Scilab console
- But the attributes that can be retrieved from the SVR model depend on the type of kernel used. For example, coef_ and intercept_ are only available when a 'linear' kernel is being used.
- Checking other methods and classes whose attributes can be retrieved with the Jupyter approach.
- Committed the code for SVR.py and SVR.sce, with all trained model attributes.
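For the linear-kernel case mentioned above, the prediction reduces to a plain dot product that could be redone on the Scilab side. A hypothetical sketch with toy data:

```python
# Sketch: with kernel='linear', SVR exposes coef_ and intercept_, so
# prediction is X . coef_ + intercept_ and can be rebuilt from the
# transferred attributes. Toy data for illustration only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
y = X[:, 0] - X[:, 1]

model = SVR(kernel="linear").fit(X, y)

# Manual prediction from the two transferred attributes
manual = X[:5].dot(model.coef_.ravel()) + model.intercept_
```

For non-linear kernels there is no coef_, which is why those models need the dual_coef_/support-vector route instead.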
- Searched for possible methods to transfer ml model learned attributes from one python instance to another. Major highlights were fabric (for uploading/downloading files) and paramiko (for transferring files between two servers).
- Also, these attributes need to be saved in a file format on the Jupyter server which can be easily extracted on the jupyter client side. For this, Python's built-in 'pickle' library would be very useful.
- Read the pickle usage documentation for saving and retrieving python objects.
- Was able to store the model attributes for kernel_ridge through the 'dump' method in a 'data.p' file. This file was later used in the 'load' method to extract all model attributes in a new python instance. Committed the code demonstrating its usage.
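The dump/load round trip described above looks roughly like this (the file name and attribute values here are illustrative, and a temp directory is used instead of the working directory):

```python
# Sketch of the pickle dump/load round trip used for model attributes.
import os
import pickle
import tempfile

# Hypothetical attribute bundle, standing in for the kernel_ridge values
attributes = {"dual_coef_": [0.5, -1.2], "X_fit_": [[1.0], [2.0]]}
path = os.path.join(tempfile.gettempdir(), "data.p")

with open(path, "wb") as f:
    pickle.dump(attributes, f)        # saved by the training script

with open(path, "rb") as f:           # later, in a fresh python instance
    restored = pickle.load(f)
```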
- Had a troubleshooting session with Simon Marchetto regarding the PIMS loader.sce issues. He is working out the errors and helping me use it as soon as possible.
- Investigated Kmeans clustering model source code on scikit-learn repository. Learned attributes cluster_centers_, labels_ and inertia_ of kmeans extracted from model
- Considering usage of an SSH client for sharing files between the server and local machine. Philippe suggested Samba would be suitable for a Windows to Linux connection.
- Reading the Samba documentation for file transfer protocol.
- Tried out many examples for usage of Samba and paramiko. Wasn't able to test it, since I don't have access to any SFTP server.
- Wrote a file_server.py script for starting a server and sending the attributes.p file to the file_client.py script requesting it. But this implementation was done using basic python networking libraries.
- Was able to use the transferred pickle file in a new python instance and extract all the learned attributes.
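The file_server.py / file_client.py idea can be sketched in one self-contained script. This is a hypothetical single-process demo (server thread plus client) with an illustrative payload, not the actual forge code:

```python
# Sketch: ship pickled attributes over a plain socket, then unpickle
# them on the client side, as in file_server.py / file_client.py.
import pickle
import socket
import threading

attributes = {"coef_": [2.0], "intercept_": 1.0}   # illustrative payload
payload = pickle.dumps(attributes)

server = socket.socket()
server.bind(("127.0.0.1", 0))          # ephemeral port for the demo
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()
    conn.sendall(payload)              # server side: send attributes.p bytes
    conn.close()
    server.close()

threading.Thread(target=serve).start()

# Client side: read until EOF, then restore the attributes
client = socket.create_connection(("127.0.0.1", port))
chunks = []
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
client.close()
restored = pickle.loads(b"".join(chunks))
```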
- PIMS/scipython was successfully built with the help of Simon Marchetto. There was an error with the availability of the python27.dll file; have resolved it, and now scipython is working as desired.
- Used scipython for creating a single ml_scilab.sce file, which imports necessary libraries and runs the pyExecFile function from scipython, which executes a local_ml python script.
- This local_ml script trains the machine learning model and saves the attributes in a pickle file - 'attributes.p'
- After the creation of the pickle file, it is loaded in Scilab using the load method from the pickle lib, and all attributes are stored in a variable, e.g. 'abc'.
- Now we can use index addressing on the object 'abc' to extract individual attributes, e.g. coef_ = abc(0); intercept_ = abc(1)
- So now we can train any machine learning model through Scilab and get the learned attributes back to the Scilab Console for further usage
- Searching for possible use cases of the learned attributes within Scilab
- Had a discussion with Harsh Tomar regarding possible changes in the current implementation
- Able to retrieve prediction results from linear regression, SVR, SVC, Decision Tree Classifier, using the pickle file storage and retrieval
- Covered multiple regression, logistic regression, k-means clustering through local machine learning in Scilab.
- Was able to retrieve model attributes and prediction results in the Scilab console using the existing PIMS approach.
- Read through the scikit-learn documentation for hierarchical clustering and naive bayes classification.
- Tried out both python and Scilab scripts to run the above two algorithms.
- As expected, I was able to transfer the predictions on the test set back to Scilab. But for complex ml models like these, I haven't yet found a way to do prediction on the Scilab side, as these models don't expose learned attributes, only direct functions which manipulate the dataset without generating any attributes.
- Created a walk-through to give a demo of the machine learning in Scilab through this approach, along with software requirements.
- Used linear regression for the ease of understanding the steps involved in returning the learned attributes.
- More compact packaging possible at a later stage, once end goal of running ml through Jupyter is achieved.
- Using the steps given, any ml model from any ml library can be used through Scilab without any loss of performance metrics (confusion matrix)
- Communicated with Harsh for checking out current working status of the code produced and reproducing the errors in Jupyter approach.
- The scilab to python connection and ml training, pickle storage and retrieval has been verified to work as reported.
- The Jupyter connection is now working correctly as Harsh reported; only there's an error in the pickle file generation.
- Will be working over the weekend to fix the pickle error.
- The pickle file generation error has been now fixed for the python implementation on Jupyter side.
- Until now, the python_local.py script was only responsible for handling the Jupyter communication and sending execution commands to the server.
- The python_server.py was instead generating the pickle file in the root directory where it was present.
- There needs to be a 2nd level of abstraction (through a common script running python_local.py and importing the pickle file) which the user isn't aware of, so that the entire process is simplified.
- Jupyter server approach is working on the local machine.
- Created a network shared drive from another windows machine, to save and load the pickle file.
- Successfully working to run the script present on the shared drive, saving the pickle file and then importing/extracting the attributes inside scilab console.
- Had a detailed discussion with Philippe and Harsh regarding the current status of project and the deliverables in August.
- The major focus has been on creating the POC following the discussed approach, once the Jupyter approach illustration has been approved.
- Sent out a crude form of workflow diagram for Jupyter approach to Philippe and Harsh, will be converting it into a proper SVG tomorrow with required changes.
- Have tried to connect to a remote jupyter server through ssh, but haven't yet succeeded.
- Created a SVG to demonstrate the workflow, with all the requirements and steps involved in making use of ML in scilab.
- Shared the illustration with all people involved in the ML project.
- Started working on the POC, now with the assumption that a Jupyter server is already running on the remote machine.
- Currently trying OpenSSH for creating a connection with the Jupyter server.
- Was able to configure the ssh-host-config and ssh-user-config file on the Scilab machine and the Jupyter server.
- The host is visible to the local machine, but there is still an error in making a connection, due to ssh-key sharing and the pass-phrase not being accepted.
2nd - 3rd August
Using the steps mentioned here, I was able to change the configuration settings of the Jupyter notebook through the jupyter_notebook_config.py and jupyter_notebook_config.json files, to always start at a pre-specified port (e.g. 9999) and be available on all IP addresses on the system; in our case, the IPv4 address. The Jupyter server requires token authentication if a password has not been enabled, so I have disabled both for now.
Now the Jupyter server is accessible on any other machine using the same IP & port specified in the configuration file. I connected to this already running Jupyter server from the browser of my Scilab machine, and was able to run the machine learning scripts (python_server_lr.py) on the Jupyter server. These scripts run successfully on the Jupyter server, and the pickle file is also being saved in the shared folder Z:/. This shared folder Z:/ is accessible to the Scilab console for importing and extracting the learned attributes from the attributes.p file, as planned.
But the first step, which connects to the Jupyter server, is through a browser using the ip_address:port specified (attached below is a snippet showing the same). We are looking for a way to connect to the Jupyter server through the python_local.py script, which doesn't involve the browser. All other steps in the workflow are working.
I am configuring the Jupyter client connection_file present in C:\Users\Mandar\AppData\Roaming\jupyter\runtime to always initiate the connection with the IP and port specified beforehand, so that this connection file can be used to extract the necessary connection details every time a script needs to be run on the Jupyter server. A connection file is simply a JSON dictionary specifying the ports and ip; I read this here in the connection file section.
This connection_file is given as an argument to the KernelManager(connection_file) constructor. Once this is done, we can simply use the python_local.py script to initiate the connection with the server and set up the kernel.
The python_local.py script is responsible only for starting the kernel and managing the communication with it. I want this script to start the kernel on the Jupyter server configured earlier, but I haven't yet found a method to do so.
I have read through the Jupyter documentation for methods to connect to a remote Jupyter notebook, but all those options are for browser-based interaction with the Jupyter server.
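For reference, the shape of a kernel connection_file can be sketched as below. The field names are the standard Jupyter connection-file keys, but the IP, port numbers and key here are illustrative, not the real values used in the project:

```python
# Sketch: a kernel connection_file is a JSON dict giving the kernel's
# IP plus the ZeroMQ ports (shell, iopub, stdin, hb, control).
import json

connection_file = {
    "ip": "192.168.1.10",            # illustrative server IP
    "transport": "tcp",
    "shell_port": 9001,
    "iopub_port": 9002,
    "stdin_port": 9003,
    "hb_port": 9004,
    "control_port": 9005,
    "key": "",                       # empty when auth is disabled
    "signature_scheme": "hmac-sha256",
}

# Round-trip through JSON, as a client reading the file would do
parsed = json.loads(json.dumps(connection_file))
shell_endpoint = "%s://%s:%d" % (parsed["transport"], parsed["ip"],
                                 parsed["shell_port"])
```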
4th - 5th August
I have found a way to connect to an already running Jupyter kernel on a remote server.
As I reported yesterday, the connection_file needs to be configured for connecting to a specific IP address and the 4 port numbers (shell, iopub, hb, stdin) of the jupyter kernel. So it is essential that the same connection_file be used on both the server and client side.
As of now, I am able to start an IPython kernel on the remote machine/server and configure it for connection from the local Scilab machine. The approach is related to this post: https://github.com/ipython/ipython/wiki/Cookbook%3a-Connecting-to-a-remote-kernel-via-ssh
I am not using ssh; it's a direct connection using the IP address in the connection_file of the IPython kernel.
Here are the steps I took for making the connection:
1. Server side : Run in console/cmd
- ipython kernel --ip='local_machine_ip'
2. Copy the connection_file (kernel_name.json) -
Located in "C:\Users\UserName\AppData\Roaming\jupyter\runtime" to the client side, using the shared folder
3. Client Side: Run in console/cmd
- jupyter console --existing ./kernel_name.json
4. Once Jupyter/Ipython console opens up in cmd, type the following in a cell
- %run python_server_lr.py
Now we no longer need to access the jupyter/ipython kernel from the notebook interface within a browser; we can directly run all these steps through a script, using subprocess in python. I just finished testing this approach on the 2 available Windows machines, and it's working. The attributes file is being generated after the script has been run on the remote IPython kernel.
Now, using the subprocess module (built into python), we simply need to run the command line programs on the client side as mentioned in step 3 above. I am also searching for how to execute python scripts through the IPython/Jupyter console non-interactively, so that step 4 can be run directly by passing the command along with step 3. Something like: jupyter console --existing ./kernel_name.json --execute python_server_lr.py
I was able to use the subprocess module to execute the cmd command for connecting to an existing ipython kernel directly through a python script.
    import subprocess
    subprocess.call('jupyter console --existing ./kernel-name.json')
This successfully starts an IPython console on the client side, which uses the already running ipython kernel on the server side, initiated in step 1.
I am still working on executing the python script directly with the subprocess module, such that once the ipython console is started on the client, it automatically runs the prespecified script.py
Or at least run the magic command %run script.py
I have found a few methods to run the above command through a python script, and Harsh has also shared a few references to check out.
7th - 9th August
The good news is that I have got the proof of concept working now. The only thing hindering python_local.py from connecting to an existing kernel was the absence of the connection file (kernel_name.json) from the existing server instance of the kernel.
Now we don't need to start the kernel using the python_local.py script. Instead the following steps need to be followed (I am including steps mentioned earlier also, with changes) :
1. Server side : Run in console/cmd for starting the server side kernel
- ipython kernel --ip='local_machine_ip'
2. Copy the connection_file (kernel_name.json):
Located in "C:\Users\UserName\AppData\Roaming\jupyter\runtime" to the client side, using the shared folder
3. Client Side: Run ml_scilab.sce
- It executes pyExecFile('python_local.py') which now uses the shared json file to initiate connection with the running ipython kernel on the server side. Once the connection has been made, any command can be sent for execution to the server through the python_local.py script.
4. Import the generated pickle file into Scilab:
- Once the execution completes, the attributes.p file is saved in the shared folder, which can easily be imported from the Scilab console for further usage, as considered earlier.
Note: The main flaw in the approach followed earlier was that the jupyter console and jupyter notebook do not have a non-interactive code execution mechanism. Most of our needs were also satisfied by the notebook method, but it is only accessible through a web browser. Also, in the earlier version of python_local.py, whenever a new kernel was started using the script, it changed the existing json file. The changed json file was of no use after this; that's why it wasn't connecting as expected.
I have tested the steps mentioned above using 2 separate Windows machines, and they are working as expected on all the example scripts I have pushed to the forge earlier. The pickle file is being generated and imported by Scilab too. I have pushed the working python_local.py script as well.
After we (Philippe, Harsh and me) have a discussion today about this, I will also create a walkthrough/demo considering 3-4 different scikit-learn examples.
Also, the scripts and data.csv files are now being transferred to the shared drive using the python_local.py script. I have pushed the changes to the forge repo.
    import jupyter_client
    from shutil import copyfile

    copyfile('F:/Internship/SCILAB ML/Machine learning toolbox/Integration Approach/Jupyter_Approach/python_server_lr.py', 'X:/python_server_lr.py')
    copyfile('F:/Internship/SCILAB ML/Machine learning toolbox/Integration Approach/Jupyter_Approach/Salary_Data.csv', 'X:/Salary_Data.csv')

    cf = "X:/kernel-6048.json"
    km = jupyter_client.BlockingKernelClient(connection_file=cf)
    km.load_connection_file()
I have checked running the script even with these changes, and it's working. We still need to figure out how to automatically transfer the kernel.json file from the remote server to the shared drive, so that it's available to the client machine.
I have shared the screen-cast with Philippe and Harsh yesterday, along with a minimal commentary. Also the python_scripts required to be run can now be transferred to the shared drive using python_local.py as reported yesterday.
As discussed in last week's call, Harsh also communicated to me the work to be done on the neural network side.
Considering that the neural network is already trained, we only need to transfer the trained model to the server side and use it for making predictions based on images or any other related data type involved. For this, I have been reading up on model persistence, for reusing the same trained model again and again. Scikit-learn already offers this functionality through the joblib module.
While considering the case of neural networks, I realized it's even possible to save other trained models using the joblib dump method and reuse them as needed. I have tested this with logistic regression, and it is able to make predictions from the Scilab console using default python syntax.
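The joblib round trip can be sketched as below. Note the report used sklearn.externals.joblib (since removed from scikit-learn); the standalone joblib package works the same way. The data, file name and temp-directory location are illustrative:

```python
# Sketch: persist a trained logistic regression with joblib.dump and
# reuse it via joblib.load, as done for the server-side models.
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, well-separated data for illustration only
X = np.array([[0.0], [0.5], [2.5], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.gettempdir(), "model.pkl")
joblib.dump(model, path)              # saved on the server side

restored = joblib.load(path)          # reloaded later for prediction
pred = restored.predict(X)
```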
The major changes/improvements which need to be made according to me are :
1. The kernel.json file generated on the server side, needs to be automatically transferred to the shared folder or the client machine
2. Replacing the shared folder method, with a more sophisticated mechanism (someone from your team who knows more about http or sftp methods can assist in this)
3. Packaging all the python scripts and other files involved into a toolbox, for ease of usage. Simon Marchetto could help me with the packaging.
4. As Harsh also suggested, I need to make all the code neater, including comments, error handling and fewer steps on the user's side. By error handling, I mean that when a script is remotely executed on the kernel, it doesn't display errors (if any) on the Scilab client side; only the msg_id is returned.
5. Creating documentation and tutorials for using the ML toolbox in Scilab. Making it open for other developers to understand each and every step/ mechanism used to enable ml functionality in Scilab.
I was able to use the joblib module from sklearn.externals to successfully save trained ml models like linear regression, logistic regression, DT, SVM and K-means. But joblib cannot be used to save trained neural network models having a large number of layers; it raises an error: RuntimeError: maximum recursion depth exceeded
I had a brief discussion about this with Harsh. As Philippe has suggested, we are first considering a trained neural network model to already be present on the remote server. By saved, we mean the "weights" of the network will be saved in .h5 format. These weights will be loaded directly when a classification task is to be executed on the remote machine, on images/data sent from the local Scilab machine.
Currently, I am working on making neural networks usable through our toolbox, using the pre-trained vgg16 cnn model for carrying out tests.
Today I was able to use a pre-trained vgg16 cnn model for loading weights. Once the weights were loaded, a sample image was used to check prediction.
The vgg16 model has been written using Keras (with a Theano backend in my case). Also, if we are considering usage of neural networks or deep learning through Scilab, we can only use libs available for python 2.7, since scipython/PIMS is currently based on python 2.7:
1. Theano
2. Keras with Theano
3. Lasagne (based on Theano)
4. pylearn2
TensorFlow is compatible with python 2.7 only on a Linux machine, so we can't use it as of now.
Using the vgg16 model, the weights have been loaded and the prediction results stored in pickle file. The entire procedure has been achieved through the ipython kernel method on the remote machine. Right now the test-image has been sent to the remote machine through a simple file transfer command in the python_local.py script. The extracted features have been transferred on the Scilab side too.
Though this approach is working as we want, I think it would be of much more utility to the user if we provide a set of pre-trained dl models like vgg16, ResNet50 and MobileNet, available in 'keras.applications', for direct usage. Otherwise, the user can modify and use any other dl model he/she prefers based on the type of problem to be solved, using their own python script.
@harshtomar thanks for guiding me today in the vgg16 loading.
Major highlights of this month's work:
1. Testing out the Jupyter notebook approach and closing the issue of its usability (since it's only available through a browser)
2. Finalizing the IPython kernel usage on the server side, for non-interactive script execution
3. Making corrections in python_local.py, for connecting to an existing kernel on a remote server
4. Automating the transfer of necessary python scripts and datasets (if required) to the shared folder of the remote server
5. Sharing a screen-cast showing the basic steps required for ML in Scilab
6. Using joblib to store trained ml models (other than nn), which can be reused on the client side
7. Using the pre-trained vgg16 cnn model (using keras) for loading weights stored on the server and returning extracted features from a test image (shared from the client machine)
If any more code is needed to be added before 21st August, then I can start working on it post our discussion. Currently I am reading how to package the work done into a Scilab toolbox.
17th - 19th August
As discussed earlier, I have created the User Guide for the machine learning Toolbox. I have tried to keep it as simple as possible. Still more information can be added, if that would be fine from the user's point of view. Will make any changes or improvements required.
Apart from the document, I am adding individual .sce scripts for all algorithms covered, to further reduce the steps on the user's side, e.g. linear_regression.sce, decision_tree_classification.sce, etc. The user can directly run the required ml_name.sce script, which loads all learned attributes of that model by default, without any changes from the user's end.
Also I am adding joblib commands to the ml scripts in which the trained model itself can be stored for later use.
Changes in User guide arising from these additions will reflect in the final version.
I have made changes to the document as Harsh had suggested on Friday.
Have committed the individual Scilab scripts (.sce) for the ml algorithms covered in our toolbox. These scripts are directly functional for the users, without manually coding the pickle importing part. Essentially, the set of 12-13 ml algorithms covered would be usable out of the box, if the user simply sets up the folder sharing and kernel. Also, the final submission/evaluation is open now, so I have until 29th August to submit it.
I have removed the fixed time delay for pickle file generation and replaced it with a file_path check method for confirming file existence.
Now the Scilab script waits for the attributes.p file generation in proportion to the ml algorithm's training time, so it's generalized now. Have committed the code for all algorithms covered with these changes.
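The file-existence check that replaced the fixed delay can be sketched in Python like this (the function name and default timings are hypothetical; the committed scripts implement the same idea on the Scilab side):

```python
# Sketch: poll for the attributes.p file instead of sleeping for a
# fixed worst-case training time.
import os
import time

def wait_for_file(path, timeout=60.0, poll=0.25):
    """Return True once `path` exists, or False after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll)
    return False
```

This way a fast linear regression returns almost immediately, while a slow model simply polls a little longer, up to the timeout.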
Harsh told me that there are some changes/tweaks to be made in the Jupyter kernel approach. I know you are busy, but if you could give me at least a few thinking points, I can work on and read about them before our discussion later this week.
Apart from this I am also refining the user_guide I sent today morning.
I have made the changes in the user document as per the discussion with Harsh earlier this week. Currently I am reading as much as I can about a JupyterHub implementation for ML in Scilab. Though making it work by Tuesday isn't possible, I am still covering as much info as possible on my side, so that it can be used at a later stage after GSoC. I am willing to work beyond the GSoC timeline to make the JupyterHub approach work, if we finalise on that.
The final code submission/evaluation deadline is 29th August on my side. So if you could provide me any suggestions/improvements possible before Monday, I can work on them and make the submission on Tuesday.