I wished to do some machine learning for binary classification. Binary classification is perhaps the most basic of all supervised learning problems. Unsurprisingly, Julia has many libraries for it. Today we are looking at: LIBLINEAR (linear SVMs), LIBSVM (kernel SVMs), XGBoost (Extreme Gradient Boosting), DecisionTree (random forests), Flux (neural networks), and TensorFlow (also neural networks).
In this post we are only concentrating on their ability to be used for binary classification. Most (if not all) of these do other things as well. We'll also not really be going into all their options (e.g. different types of kernels).
Furthermore, I'm not rigorously tuning the hyperparameters, so this can't be considered a fair test of performance. I'm also not performing any preprocessing (e.g. many classifiers like it if you standardise your features to zero mean and unit variance). You can look at this post more as a discussion of what code for each package looks like, roughly how long it takes, and roughly how well it does out of the box.
It’s more of a showcase of what packages exist.
For TensorFlow and Flux, you could also treat this as a bit of a demo in how to use them to define binary classifiers, since they don't do it out of the box.
This post, like most of my posts, is backed by a Jupyter notebook. Feel free, encouraged even, to download and run it, view it on GitHub, etc., and to raise issues on that repository.
The Task: Predict if that part of the Australian Flag is Blue
This is on the mildly gnarly side of binary classification problems. The classifying regions are:
- Not linearly separable
  - You can't draw a line such that on one side is all the blue parts and on the other is all the non-blue parts.
- Not connected
  - The stars, for example, are entirely separated by blue background regions.
- Not convex
  - Within a section of one color, you can draw a line between two points in the same colored region and have it exit that section, then reenter.
- Unbalanced classes
  - Most of the image is blue.
So it seems like a good, difficult problem.
Data Generation
An image of the flag gives us one datum per pixel. We’re going to sample that, just so that plotting is easier.
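The notebook cells here load the flag image and define the labelling function. A minimal sketch of what they might look like, assuming Images.jl; the filename and the exact test for "blue" are illustrative assumptions:

```julia
using Images, FileIO

# Load the flag image; the filename is an assumption.
const flag = load("australian_flag.png")

# Call a pixel blue if its blue channel dominates red and green.
# These thresholds are illustrative assumptions, not the post's actual test.
isblue(pixel) = blue(pixel) > 0.5 && red(pixel) < 0.5 && green(pixel) < 0.5
```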
Output:
isblue (generic function with 1 method)
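The features are each sampled pixel's (x, y) coordinates, and the label is whether that pixel is blue. A sketch of building the data matrix; the stride-2 sampling and the image orientation are assumptions, while the 160×320 grid matches the output below:

```julia
# One column per sampled pixel: [x, y, label]. A 160×320 grid of samples
# gives the 3×51200 matrix shown below. Stride-2 sampling and indexing the
# image as [row, col] = [x, y] are assumptions.
const data = reduce(hcat,
    (Any[x, y, isblue(flag[2x, 2y])] for x in 1:160, y in 1:320))
```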
Output:
3×51200 Array{Any,2}:
1.0 2.0 3.0 4.0 … 157.0 158.0 159.0 160.0
1.0 1.0 1.0 1.0 320.0 320.0 320.0 320.0
false false false false true true true true
Normally I would do this data munging using MLDataUtils.jl, which I have blogged about before (though it might be nice to write a few more posts about it; it is a great package, and I don't know that I've fully covered its capabilities). But since I am already about to introduce 6 packages, I thought I would minimize talking about other ones.
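A sketch of what a manual shuffle-and-split might look like; the 80/20 ratio and the global variable names (which the later sketches assume) are assumptions:

```julia
using Random

Random.seed!(1)  # fixed seed so the split is reproducible
order = shuffle(1:size(data, 2))
ntest = length(order) ÷ 5  # hold out 20% for testing (ratio is an assumption)

test_inds, train_inds = order[1:ntest], order[ntest+1:end]

train_features = Float64.(data[1:2, train_inds])
train_labels   = Vector{Bool}(data[3, train_inds])
test_features  = Float64.(data[1:2, test_inds])
test_labels    = Vector{Bool}(data[3, test_inds])
```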
Interface
As was discussed on the Julia Slack yesterday, there is a real problem with a lack of consistency in our ML packages right now.
So I am going to take a leaf from XKCD #927, and define one.
- `StatsBase.fit(modeltype, features, labels)`
  - Returns a model of that type, trained on those features and labels.
  - Since we are only interested in binary classification, labels will be an `AbstractVector{Bool}`, with one entry per column of the feature matrix.
- `StatsBase.fit!(model, features, labels)`
  - Mutating form of the above (`fit` is basically a constructor). Useful for allowing retraining/hot-starting.
- `StatsBase.predict(model, features)`
  - Returns a vector of estimated probabilities of the classification being true, with one entry per column of features.
Something like this is actually in use in a bunch of places already, just not these packages, it seems. Some packages (LIBSVM.jl, DecisionTree.jl) use the same names, from ScikitLearnBase, but they go sideways (i.e. observations in rows, Python style). I think the real solution to a good interface needs to think more like (or use) MLDataUtils.jl, which is observation-dimension agnostic, defaulting to normal Julia practice (`ObsDim.Last()`).
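Concretely, each package wrapper below extends the StatsBase generics. The shared declarations are just:

```julia
import StatsBase: fit, fit!, predict

# Each package section adds methods to these generics:
#   fit(ModelType, features, labels) -> trained model
#   predict(model, features)         -> Vector of P(label == true),
#                                       one entry per column of features
```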
Using these we can define our metrics, etc. It might be nicer to use MLMetrics.jl to do this for us, but I'll just do it simply here.
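For instance, a minimal sketch of accuracy over predicted probabilities (thresholding at 0.5 is an assumption):

```julia
using Statistics: mean

# Proportion of thresholded predictions that match the true labels.
accuracy(truths::AbstractVector{Bool}, predicted_probs) =
    mean(truths .== (predicted_probs .> 0.5))
```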
Output:
accuracy (generic function with 1 method)
Evaluation function
Given a common interface, we can write one function to evaluate them all. It accesses our training and test data as global variables (obviously not a good idea normally).
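A minimal sketch of what `evaluate` might look like, assuming the global names from the split sketch above:

```julia
# Train via the common interface on the global training data, then report
# accuracy on the global test data. Extra arguments are passed through to
# `fit` so each package's options can be tweaked.
function evaluate(modeltype, args...; kwargs...)
    model = fit(modeltype, train_features, train_labels, args...; kwargs...)
    predicted_probs = predict(model, test_features)
    @show accuracy(test_labels, predicted_probs)
    model
end
```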
Output:
evaluate (generic function with 1 method)
LIBLINEAR.jl
The linear SVM. Possibly the weakest classifier in modern use. It actually works OK for a lot of higher-dimensional problems. In high dimensions it is easier for things to be linearly separable.
It surprises me that the C backend was only created in 2008.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
Because we are interested in getting probabilities back from `predict`, we are restricted to using the `L2R_LR` and `L1R_LR` solver types, which are logistic regression. This could probably be relaxed for most applications (but might break those metric definitions above).
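A sketch of a wrapper conforming to the interface above, built on LIBLINEAR.jl's `linear_train` and `linear_predict`; how the probability matrix's rows map to classes is an assumption here:

```julia
using LIBLINEAR
import StatsBase: fit, predict

# LIBLINEAR takes observations as columns, matching our data layout.
function fit(::Type{<:LIBLINEAR.LinearModel}, features, labels; kwargs...)
    linear_train(labels, features; solver_type=LIBLINEAR.L2R_LR, kwargs...)
end

function predict(model::LIBLINEAR.LinearModel, features)
    classes, probs = linear_predict(model, features; probability_estimates=true)
    # With a logistic-regression solver the second return value holds
    # probability estimates; we assume row 1 corresponds to the model's
    # first label.
    probs[1, :]
end
```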
Output:
predict (generic function with 2 methods)
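Then training and evaluating it is a single call (hypothetical usage, following the `evaluate` sketch above):

```julia
model = evaluate(LIBLINEAR.LinearModel)
```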
(Plot: the model's predicted probability of blue at each sampled pixel.)
We can see from the plot that it is basically a gradient of how much blue is in each area. This is as expected.
LIBSVM.jl
The more general SVM package. We're here for its kernel SVM classifiers. Again I am surprised that the backend was created so recently: 2005.
Since version 2.8, it implements an SMO-type algorithm proposed in this paper: R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research 6, 1889-1918, 2005. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
We're looking at `SVC` in this example. The other types of interest here would be `NuSVC` and `LinearSVC` (but we have that covered by LIBLINEAR).
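The corresponding sketch over LIBSVM.jl's `svmtrain` and `svmpredict`; again, the probability matrix's row-to-class mapping is an assumption:

```julia
using LIBSVM
import StatsBase: fit, predict

# LIBSVM also takes observations as columns. The default svmtype is SVC
# with an RBF kernel. Training with probability=true makes svmpredict
# return probability estimates alongside the classes.
function fit(::Type{<:LIBSVM.SVM}, features, labels; kwargs...)
    svmtrain(features, labels; probability=true, kwargs...)
end

function predict(model::LIBSVM.SVM, features)
    classes, probs = svmpredict(model, features)
    # Assume row 1 of the probability matrix corresponds to the model's
    # first label.
    probs[1, :]
end
```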
Output:
predict (generic function with 3 methods)