Using the scikit-learn machine learning library in Ruby with PyCall

Scikit-learn is a set of simple and efficient tools for machine learning and artificial intelligence, built with NumPy and SciPy in Python. This tutorial walks you through using these tools from Ruby with a gem called PyCall. PyCall lets you use the power of scikit-learn in your Rails, Sinatra, Hanami or any other Ruby-based application.

For this example we will use the digits sample dataset from scikit-learn to build an SVM handwriting classifier that can read images and output digits. We will show how to load the libraries in Ruby and highlight some of the differences you’ll need to know to use the scikit-learn toolbox from Ruby. The code used in this tutorial can be found in this repository.

The PyCall gem

It’s no secret that Python has more machine learning libraries than almost any other programming language, and that it is often the preferred language for data scientists. For practical applications, however, you may want to write the application itself in Ruby and have it interface with your machine learning algorithm.

To interface between Python’s machine learning libraries and a Ruby application we can use the PyCall gem. PyCall allows us to call and partially interoperate with Python from the Ruby language. We can import arbitrary Python modules into Ruby modules and call Python functions with automatic type conversion from Ruby to Python.

Building a handwriting classifier in Ruby using scikit-learn

For our example we are building a simple SVM-based classifier that takes an 8×8 image of a number and outputs the digit it thinks is in the image. For training and testing we use the scikit-learn digits sample dataset. For more information about the dataset see here.

First we need to install Python, if you don’t already have it installed on your computer. Next we need to upgrade pip, the Python package manager. On Linux you can upgrade pip with:

$ pip install -U pip

And on Windows:

$ python -m pip install -U pip

Next, install NumPy, SciPy and scikit-learn, in this order, using pip (Windows users: read this for installing SciPy):

$ pip install -U numpy
$ pip install -U scipy
$ pip install -U scikit-learn

With scikit-learn installed, install the PyCall Ruby gem:

$ gem install pycall

Now that we have Python, scikit-learn and PyCall installed, we can start building our application. Our first step is requiring the PyCall gem and including the PyCall::Import module in our application to get access to the Python import functions.

require 'pycall/import'
include PyCall::Import

Next we use these functions to import the Python modules we’ll need: in our case the datasets and svm modules, as well as the train_test_split function that will help us divide up our dataset.

pyfrom :sklearn, import: :datasets
pyfrom :sklearn, import: :svm
pyfrom :'sklearn.model_selection', import: :train_test_split

We can now grab the digits sample dataset from scikit-learn.

digits = datasets.load_digits()

When we load our image dataset, each image is given to us as an 8×8 matrix of intensity values. Rendered as an image, each matrix shows a small grayscale handwritten digit.

This is a problem, since our classifier takes a vector of input values and not an 8×8 matrix. To transform each matrix into a vector we use the NumPy reshape function like this:

# Each digit is stored as a 2-dimensional array, so flatten each one before training the model
# Get the number of samples
samples = digits.images.shape[0]
# Reshape to one row per sample; -1 lets NumPy infer the row length (64)
X = digits.images.reshape([samples, -1])
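
To make the reshape step concrete, here is a minimal pure-Ruby sketch of the same idea, without NumPy or PyCall. The nested array `image` and its values are made up for illustration. `Array#flatten` plays the role that reshape plays for each 8×8 image, and `each_slice(8)` reverses it.

```ruby
# A hypothetical 8x8 "image": row r, column c holds the value r * 8 + c.
image = Array.new(8) { |r| Array.new(8) { |c| r * 8 + c } }

# Flatten the 8x8 matrix into one 64-element feature vector, row by row
# (the same row-major order NumPy's reshape uses by default).
vector = image.flatten

# Rebuild the 8x8 matrix from the vector, as reshape(8, 8) does later on.
restored = vector.each_slice(8).to_a

puts vector.length
puts restored == image
```

The classifier only ever sees the 64-element vectors, so the 2-dimensional layout has to be restored by hand whenever we want to look at a digit again.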

Next we use the train_test_split function to split our dataset into a training set (80%) and a test set (20%). Notice how we use Ruby’s keyword argument syntax here instead of the Python syntax.

# Split set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, digits.target, test_size: 0.2, random_state: Time.now.to_i)
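
As a rough illustration of what a shuffled 80/20 split does, here is a pure-Ruby sketch. The helper `simple_train_test_split` and the sample data are hypothetical and are not part of scikit-learn or PyCall; the sketch only mirrors the idea of shuffling with a seed and cutting the data in two.

```ruby
# Hypothetical helper: shuffle the sample indices with a seeded generator,
# then cut the shuffled list into a test part and a training part.
def simple_train_test_split(samples, labels, test_size:, seed:)
  indices = (0...samples.length).to_a.shuffle(random: Random.new(seed))
  test_count = (samples.length * test_size).ceil
  test_idx = indices.first(test_count)
  train_idx = indices.drop(test_count)
  [
    samples.values_at(*train_idx), samples.values_at(*test_idx),
    labels.values_at(*train_idx),  labels.values_at(*test_idx)
  ]
end

samples = (0...10).map { |i| [i, i * 2] } # ten made-up feature vectors
labels  = (0...10).map { |i| i % 2 }      # ten made-up labels

x_train, x_test, y_train, y_test =
  simple_train_test_split(samples, labels, test_size: 0.2, seed: 42)

puts x_train.length
puts x_test.length
```

A fixed seed such as 42 reproduces the same split on every run, whereas seeding with Time.now.to_i, as the tutorial does, gives a different split (and so a slightly different score) each time.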

At this point it is time to set up our SVM classifier. We call the svm.SVC.new method to create a new instance of the Support Vector Classifier with a gamma value of 0.001.

Note how we use the Ruby new method to instantiate the class. This is one of the key differences you’ll need to be aware of if you’re porting code from Python to Ruby.

# Initialize an SVM with gamma=0.001
clf = svm.SVC.new(gamma: 0.001)
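
To give a feel for what gamma controls, here is a small pure-Ruby sketch of the RBF kernel, which SVC uses when no other kernel is specified. The helper `rbf_kernel` and the two vectors are made up for illustration; scikit-learn computes this internally.

```ruby
# RBF kernel: K(a, b) = exp(-gamma * squared Euclidean distance between a and b)
def rbf_kernel(a, b, gamma)
  squared_distance = a.zip(b).sum { |x, y| (x - y)**2 }
  Math.exp(-gamma * squared_distance)
end

a = [0.0, 4.0, 16.0] # made-up intensity values
b = [0.0, 8.0, 10.0]

puts rbf_kernel(a, a, 0.001) # identical vectors: similarity is exactly 1.0
puts rbf_kernel(a, b, 0.001) # small gamma: similarity decays slowly with distance
puts rbf_kernel(a, b, 0.1)   # larger gamma: similarity decays much faster
```

With pixel intensities between 0 and 16 spread over 64 features, squared distances between images get large quickly, which is one reason a small gamma such as 0.001 is a sensible choice here.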

With the SVM set up, we can fit the model using our training data:

# Fit with training data
clf.fit(X_train, y_train)

And lastly we can score the model we have created using the test data:

# Score our fit using the test data
classification_score = clf.score(X_test, y_test)
puts "Prediction score for our SVM #{(classification_score*100).round(2)}%"
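
For a classifier, score returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal pure-Ruby sketch of that calculation, with a hypothetical `accuracy` helper and made-up labels, looks like this:

```ruby
# Accuracy: the fraction of predictions that match the true labels.
def accuracy(predicted, actual)
  matches = predicted.zip(actual).count { |p, a| p == a }
  matches.to_f / actual.length
end

predicted = [0, 1, 2, 3, 4, 5, 6, 7, 8, 8] # made-up predictions
actual    = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # made-up true labels

score = accuracy(predicted, actual)
puts "Prediction score #{(score * 100).round(2)}%"
```

Nine of the ten made-up predictions are correct, so the score is 0.9, which is formatted as a percentage the same way as in the tutorial code.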

If we want to run a single prediction we can do this as follows. In this example we will also print out the image to verify the prediction. Before printing we reshape the input back to an 8×8 image matrix.

# Do a prediction for one sample
puts clf.predict([X_test[0]])
# Reshape back to 2 dimensions and print
puts X_test[0].reshape(8, 8)

Putting it all together we get the following (you can find the full source here):

require 'pycall/import'
include PyCall::Import

pyfrom :sklearn, import: :datasets
pyfrom :sklearn, import: :svm
pyfrom :'sklearn.model_selection', import: :train_test_split

digits = datasets.load_digits()

# Each digit is stored as a 2-dimensional array, so flatten each one before training the model
# Get the number of samples
samples = digits.images.shape[0]
# Reshape to one row per sample; -1 lets NumPy infer the row length (64)
X = digits.images.reshape([samples, -1])

# Split set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, digits.target, test_size: 0.2, random_state: Time.now.to_i)

# Initialize an SVM with gamma=0.001
clf = svm.SVC.new(gamma: 0.001)

# Fit with training data
clf.fit(X_train, y_train)

# Score our fit using the test data
classification_score = clf.score(X_test, y_test)
puts "Prediction score for our SVM #{(classification_score*100).round(2)}%"

# Do a prediction for one sample
puts clf.predict([X_test[0]])
# Reshape back to 2 dimensions and print
puts X_test[0].reshape(8, 8)

Running this file gives us the following:

$ ruby tutorial.rb
Prediction score for our SVM 99.44%
[0]
[[  0.   0.   2.   9.  14.  12.   0.   0.]
 [  0.   0.  12.  16.  10.  15.   1.   0.]
 [  0.   4.  14.   3.   2.   6.   6.   0.]
 [  0.   5.   7.   0.   0.   3.   8.   0.]
 [  0.   4.   7.   0.   0.   1.   8.   0.]
 [  0.   3.  12.   1.   0.   5.   8.   0.]
 [  0.   0.  10.  12.   7.  14.   3.   0.]
 [  0.   0.   1.  12.  16.   8.   0.   0.]]
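
The printed matrix is hard to read as a digit at a glance. As a sketch, we can map each intensity to a character in plain Ruby to get a rough picture of it. The matrix below is copied from the output above; the `render` helper and the five shade characters are arbitrary choices made for this illustration.

```ruby
# The 8x8 matrix printed by the tutorial script (intensities run from 0 to 16).
digit = [
  [0, 0,  2,  9, 14, 12, 0, 0],
  [0, 0, 12, 16, 10, 15, 1, 0],
  [0, 4, 14,  3,  2,  6, 6, 0],
  [0, 5,  7,  0,  0,  3, 8, 0],
  [0, 4,  7,  0,  0,  1, 8, 0],
  [0, 3, 12,  1,  0,  5, 8, 0],
  [0, 0, 10, 12,  7, 14, 3, 0],
  [0, 0,  1, 12, 16,  8, 0, 0]
]

# Hypothetical helper: map each intensity (0..16) to one of the shade
# characters, ordered from light to dark, and return one string per row.
def render(matrix, shades: [' ', '.', ':', 'o', '#'])
  matrix.map do |row|
    row.map { |value| shades[(value * (shades.length - 1) / 16.0).round] }.join
  end
end

puts render(digit)
```

The darker characters trace out a closed loop, which matches the predicted digit 0.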

This tutorial showed how to harness the power of machine learning in Python inside your Ruby application. This method can be used not only for scikit-learn, but for any of your favorite machine learning libraries in Python such as TensorFlow or Keras.

5 comments

Andrei Beliankou says:

September 19, 2017 at 10:33 am

One of the first articles on PyCall! Keep going!

tay.eb says:

December 11, 2017 at 9:46 am

Thanks for this post,

However I have this error when trying to import sklearn.

/usr/local/lib/ruby/gems/2.4.0/gems/pycall-1.0.3/lib/pycall/import.rb:46:in `import_module': : dlopen(/Users/tay/anaconda3/lib/python3.6/site-packages/scipy/sparse/linalg/isolve/_iterative.cpython-36m-darwin.so, 2): Symbol not found: _main (PyCall::PyError)

Do you have any idea about the origin of this error??

ruby on rails online training says:

January 21, 2018 at 10:28 am

Hi,

Thanks for this article, I didn’t know about PyCall before.
Very useful for doing ML with Python scikit-learn.

Regards

hh says:

May 9, 2018 at 3:09 am

Very interesting.
Is it possible to use TensorFlow in Ruby?