7 Binary Classifier Libraries in Julia

I wished to do some machine learning for binary classification. Binary classification is perhaps the most basic of all supervised learning problems. Unsurprisingly, Julia has many libraries for it. Today we are looking at: LIBLINEAR (linear SVMs), LIBSVM (kernel SVMs), XGBoost (extreme gradient boosting), DecisionTree (random forests), Flux (neural networks), and TensorFlow (also neural networks).

In this post we are only concentrating on their ability to be used for binary classification. Most (if not all) of these do other things as well. We'll also not really be exploring all their options (e.g. different types of kernels).

Furthermore, I'm not rigorously tuning the hyperparameters, so this can't be considered a fair performance comparison. I'm also not performing any preprocessing (e.g. many classifiers like it if you standardise your features to zero mean and unit variance). You can look at this post more as a tour of what code for each package looks like, roughly how long it takes, and how well it does out of the box.
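
(If you did want to standardise, a minimal sketch in this post's conventions, observations in columns and Julia 0.6-era syntax, might look like the code below. The standardise helper is mine, not from any of the packages covered.)

# Hypothetical helper, not used in this post: rescale each feature (row) to zero mean, unit variance
function standardise(features)
    mu = mean(features, 2)      # per-feature means, taken across observations (columns)
    sigma = std(features, 2)    # per-feature standard deviations
    (features .- mu) ./ sigma
end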

It's more of a showcase of what packages exist.
For TensorFlow and Flux, you could also treat this as a bit of a demo of how to use them to define binary classifiers, since they don't do it out of the box.

This post, like most of my posts, is backed by a Jupyter notebook.
Feel free, encouraged even, to download and run it, view it on GitHub, etc., and to raise issues on that repository.

The Task: Predict if that part of the Australian Flag is Blue

This is on the mildly gnarly side of binary classification problems. The classifying regions are non-convex, not linearly separable, and not even connected (the blue patches inside the Union Jack are cut off from the main field by the crosses).

So it seems like a good, difficult problem.

Australian Flag

Data Generation

An image of the flag gives us one datum per pixel. We’re going to sample that, just so that plotting is easier.

Input:

using Images, FileIO

Input:

img = load(download("https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/Flag_of_Australia.svg/320px-Flag_of_Australia.svg.png"));

Output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3333  100  3333    0     0   2494      0  0:00:01  0:00:01 --:--:--  2494

Input:

isblue(pixel) = pixel.b > pixel.r && pixel.b > pixel.g

Output:

isblue (generic function with 1 method)

Input:

colorview(Gray, .!(isblue.(img)))

Output:

(image: grayscale mask of the flag; non-blue pixels shown in white)

Input:

const all_feature1 = Vector{Float64}()
const all_feature2 = Vector{Float64}()
const all_labels = Vector{Bool}()

# record the label and the (row, column) position of every pixel
@inbounds for ind in eachindex(IndexCartesian(), img)
    pixel = img[ind]
    push!(all_labels, isblue(pixel))
    push!(all_feature1, ind.I[1])
    push!(all_feature2, ind.I[2])
end

const all_features = [all_feature1'; all_feature2']
# standard Julia convention: observations along the final index (i.e. columns of matrices)
Any[all_features; all_labels']

Output:

3×51200 Array{Any,2}:
     1.0      2.0      3.0      4.0  …   157.0   158.0   159.0   160.0
     1.0      1.0      1.0      1.0      320.0   320.0   320.0   320.0
 false    false    false    false       true    true    true    true  

Normally I would do this data munging using MLDataUtils.jl, which I have blogged about before (though it might be nice to write a few more posts about it; it is a great package, and I don't know that I've fully covered its capabilities).

But since I am already about to introduce 6 packages, I thought I would minimize talking about other ones.
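
(For the curious, the shuffle-and-split below would look roughly like this with MLDataUtils. Note this sketch does a plain 80/20 split, whereas below I keep only 20% for testing and a mere 5% for training.)

using MLDataUtils

# shuffle, then split the (features, labels) pair; these respect the
# observations-in-columns convention (ObsDim.Last()) by default
(train_X, train_y), (test_X, test_y) =
    splitobs(shuffleobs((all_features, all_labels)); at=0.8)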

Input:

const all_inds = shuffle(1:length(all_labels))
const test_inds = all_inds[1:end÷5] # first 20%
const train_inds = all_inds[19end÷20:end] # last 5%

const test_features = all_features[:, test_inds]
const test_labels = all_labels[test_inds]

const train_features = all_features[:, train_inds]
const train_labels = all_labels[train_inds];

Input:

using Plots
pyplot() # Using PyPlot, because the SVGs that GR makes kill browsers with too many paths at this scale 

function plotflag(xs, ys; title="")
    # column index as x, negated row index as y, so the flag plots the right way up
    scatter(xs[2,:], -xs[1,:]; zcolor=ys,
            markersize=2, markerstrokealpha=0,
            bg=colorant"gray", seriescolor=:blues, title=title)
end

Input:

plotflag(train_features, train_labels, title="Training Data")

Output:

Input:

plotflag(test_features, test_labels, title="Test Data")

Output:

Interface

As was discussed on the Julia Slack yesterday, there is a real problem with a lack of consistency in our ML packages right now.

So I am going to take a leaf from XKCD #927, and define one.

Standards

Something like this is actually in use in a bunch of places already, just not in these packages, it seems. Some packages (LIBSVM.jl, DecisionTree.jl) use the same names, via ScikitLearnBase, but they go sideways (i.e. observations in rows, Python style). I think the real solution to a good interface needs to think more like (or use) MLDataUtils.jl, which is observation-dimension agnostic, defaulting to normal Julia practice (ObsDim.Last()).
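
Spelled out, the contract I'll hold every package's wrapper to is the following (my convention for this post, not an official API):

# fit(::Type{SomeModel}, features, labels) -> model
#   features: d×n matrix, observations in columns (ObsDim.Last())
#   labels:   length-n Vector{Bool}
#
# predict(model, features) -> length-n vector of estimated P(label == true)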

Using these we can define our metrics, etc. It might be nicer to use MLMetrics.jl to do this for us, but I'll just do it simply here.

Input:

import StatsBase: fit!, fit, predict

classify(model, features) = predict(model, features) .> 0.5  # threshold probabilities into hard labels
accuracy(model, features, ground_truth_labels) = mean(classify(model, features) .== ground_truth_labels)

Output:

accuracy (generic function with 1 method)

Evaluation function

Given a common interface, we can write one function to evaluate them all.
It accesses our training and test data as globals (obviously not a good idea normally).

Input:

percent(x) = @sprintf("%0.2f%%", 100*x)

function evaluate(modeltype)
    @time model = fit(modeltype, train_features, train_labels)

    println("$modeltype Train accuracy: ", percent(accuracy(model, train_features, train_labels)))
    println("$modeltype Test accuracy: ", percent(accuracy(model, test_features, test_labels)))

    # this recomputes predict (we already ran it to report accuracy), but predict is cheap
    plotflag(test_features, predict(model, test_features); title=string(modeltype))
end

Output:

evaluate (generic function with 1 method)

LIBLINEAR.jl

The linear SVM: possibly the weakest classifier in modern use. It actually works OK for a lot of higher-dimensional problems, since in high dimensions it is easier for things to be linearly separable.

It surprises me that the C backend was only created in 2008.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear

Because we are interested in getting probabilities back from predict, we are restricted to the L2R_LR and L1R_LR solver types, which are logistic regression. This could probably be relaxed for most applications (though it might break the metric definitions above).

Input:

using LIBLINEAR

# default to L2-regularised logistic regression, so that predict can return probabilities
function fit(::Type{LinearModel}, features, labels; solver_type=LIBLINEAR.L2R_LR, kwargs...)
    linear_train(labels, features; solver_type=solver_type, kwargs...)
end

function predict(model::LinearModel, features)
    classes, probs = linear_predict(model, features; probability_estimates=true)
    vec(probs)  # flatten the 1×n probability matrix to a vector
end

Output:

predict (generic function with 2 methods)
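
Since fit forwards its keyword arguments to linear_train, swapping in the L1-regularised solver should just be a matter of the following (an illustration, not run here):

# hypothetical usage: the other probability-capable solver type
model_l1 = fit(LinearModel, train_features, train_labels; solver_type=LIBLINEAR.L1R_LR)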

Input:

evaluate(LinearModel)

Output:

  0.735264 seconds (158.51 k allocations: 8.306 MiB)
LIBLINEAR.LinearModel Train accuracy: 79.85%
LIBLINEAR.LinearModel Test accuracy: 80.88%

We can see from the plot that the prediction is basically a gradient of how much blue is in each area. This is as expected.
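
If you wanted the linear model to do better on a problem like this, the standard trick is to lift the features into a higher-dimensional space first, where they are closer to linearly separable. A quadratic lift might look like the sketch below (polylift is a hypothetical helper of mine; I haven't evaluated it here):

# hypothetical: lift (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)
# a linear boundary in the lifted space is a quadratic boundary in the original space
function polylift(features)
    x1 = features[1:1, :]   # keep as 1×n rows so the vcat below works
    x2 = features[2:2, :]
    [x1; x2; x1.^2; x2.^2; x1.*x2]
end

# one would then run: fit(LinearModel, polylift(train_features), train_labels), etc.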

LIBSVM.jl

The more general SVM package; we're here for its kernel SVM classifiers. Again, I am surprised that the backend was created so recently: 2005.

Since version 2.8, it implements an SMO-type algorithm proposed in this paper: R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research 6, 1889-1918, 2005. https://www.csie.ntu.edu.tw/~cjlin/libsvm/

We're looking at SVC in this example. The other types of interest here would be NuSVC and LinearSVC (but we have the linear case covered by LIBLINEAR).

Input:

import LIBSVM: svmtrain, SVM, svmpredict

function fit(::Type{SVM{Bool}}, features, labels; kwargs...)
    # could use ScikitLearnBase.fit!(SVC, features, Float64.(labels)), but it doesn't take extra args the same way
    svmtrain(features, labels; probability=true, kwargs...)
end

function predict(model::SVM{Bool}, features)
    classes, probs = svmpredict(model, features)
    probs[1,:]  # row 1 holds the probabilities for the first class
end

Output:

predict (generic function with 3 methods)

Input:

evaluate(SVM{Bool})

Output:

  5.028115 seconds (128.57 k allocations: 6.933 MiB)
LIBSVM.SVM{Bool} Train accuracy: 99.96%
LIBSVM.SVM{Bool} Test accuracy: 91.69%
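
By default svmtrain uses an RBF kernel. Since our fit wrapper forwards keyword arguments, the kernel and its hyperparameters could be set explicitly with something like the following (an illustration with made-up hyperparameter values, not run here):

# hypothetical usage: make the RBF kernel and its hyperparameters explicit
model = fit(SVM{Bool}, train_features, train_labels;
            kernel=LIBSVM.Kernel.RadialBasis, cost=1.0, gamma=0.1)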