TensorFlow's SVD is significantly worse than LAPACK's, but still very good

TensorFlow’s SVD is significantly less accurate than LAPACK’s (i.e. julia’s and numpy/SciPy’s backing library for linear algebra). But still incredibly accurate, so probably don’t panic. Unless your matrices have very large ($>10^6$) values, then the accuracy difference might be relevant for you (but probably isn’t). However, both LAPACK and TensorFlow are not great then – LAPACK is still much better.

TensorFlow.jl recently gained bindings for the SVD operation. This came as a result of @malmaud’s great work to automatically creates bindings for all the ops defined by the TensorFlow backend (It is pretty awesome). Some may be surprised to find that TensorFlow supports single value decomposition(SVD) at all. After all, isn’t it a neural network library? My response to that, which I have said before and will say again, is that TensorFlow is (effectively) a linear algebra library with automatic differentiation and GPU support. And having those features, makes it great for implementing neural networks. But it has more general functionality that you would every expect. SVD is one of those features; though I am sure it does have a collection of uses for neural networks – using it to implement PCA for dimensionality reduction as preprocessing comes to mind.

After I had added the binding, to match julia’s return value ordering for Base.svd, I wanted to add a test to make sure it was working correctly. As there are multiple different correct SVD solutions for a given input $M$ I can’t just directly check the returned $U,S,V$ against those returned by julia’s svd. So instead we want to use $U,S,V$ to reconstruct $M$ and test that that reconstruction is close-enough $M\approx U\;\mathrm{diagm}(S)\;V'$ Then what is close enough?
Being as close as julia’s SVD gets makes sense. But when I tested that, it was failing, so I thought I would give it some slack: allowing 2 times the error. But on testing that, it wasn’t enough slack and the tests failed, so I gave it more (after checking the results did at least make sense). I ended up allowing 100 times as much reconstruction error, though this may have been a bit much. Bases on this, I thought I would investigate closer.

These observations are based on TensorFlow.jl, and Julia, but they really apply to any TensorFlow library,and almost any scientific computing library. All the language specific TensorFlow libraries delegate their operations to the same C/C++ backend. Most scientific computing software delegates their linear algrebra routines to some varient of LAPACK; not just julia and SciPy/numpy, but also commerial products like MatLab, and Mathematica. I’m using TensorFlow.jl and julia because that is what I am most familiar with.

There are of-course a variety of algorithms and variations to those algoriths for calculating SVD. It will become obvious that TensorFlow and LAPACK are using different ones. I’ll also point out that there is another implementation in IterativeSolves.jl. I am not going to go into any detail on the differences – I am no serious numerical computation linear-algebraist; I go and bug applied mathematicians when I need to know that kind of stuff.

Here we are just looking at the implementations from the outside.

I am not looking at speed here at all. I don’t know if TensorFlow is faster or slower than LAPACK. In general this depends significantly on your system setup, and how TensorFlow was compiled. It has been reported that it is hugely faster than numpy’s, but I’ve only seen the one report and few details.

If you want to look into TensorFlow’s accuracy checks, I am aware some of the tests for it can be found on their github. It is checking 32bit floats with a tolerance of $10^{-5}$ and 64 bit floats with a tolerance of $10^{-14}$, I think that is with sum of errors.

LAPACK tests are here. However, LAPACK has its own Domain Specific Language for testing, and I don’t speak it at all.

On to our own tests:

Input:

using TensorFlow
using Plots
using DataFrames

To be clear, since these can change with different LAPACKs, and different TensorFlow releases, this is what I am running on:

Input:

versioninfo()

Output:

Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, nehalem)

Input:

TensorFlow.tf_version()

Output:

v"1.0.1"

Also, it is worth looking at these errors in the context of the machine epsilon. Most of these errors are far below that; and so don’t matter at all.

Input:

eps(Float32)

Output:

1.1920929f-7

Input:

eps(Float64)

Output:

2.220446049250313e-16

First we define a function to conveniently call the TensorFlow SVD, on a julia matrix. This works by adding a constant to the graph. This leaks memory like crazy, since it adds a new node every time the test is run but that does not matter for purposes of our test. (But it probably should have been done with a placeholder and a feed dictionary)

Input:

sess=Session(Graph())
svd_tf(m) = run(sess, svd(constant(m)))

Now we define the reconstruction error, how we will evaluate it. We are using squared error: $err(M,U,S,V)=\sum{(M-U\;\mathrm{diagm}(S)\;V')^2}$ We get one error result per matrix per method. (So not mean squared error, since we want to look at the error distribution).
Note that the evaluation happened entirely in julia, except for the SVD itself.

The choice of sum of square error here, rather than sum of error, is perhaps not ideal.
I’m honestly not sure.
Sum of error would give a much larger result – in fact almost all the errors would be above the machine epsilon. The few papers I have seen evaluating SVD seem to mostly use sum of squared error; but this is not my field.

Input:

recon_err(m, u,s,v) = sum(abs2, m-u*diagm(s)*v')
recon_err_jl(m) = recon_err(m, svd(m)...)
recon_err_tf(m) = recon_err(m, svd_tf(m)...)

We define a function to run our trials, and collect the results. Note that this takes a function matrix_dist_fun(T, size) that is used to generate the data. By changing this function we can change the distribution of values in the trial matricies.

Input:

function generate_data(n_samples, matrix_dist_fun, T, size)
    df = DataFrame(err_jl=T[], err_tf=T[])
    for ii in 1:n_samples
        m = matrix_dist_fun(T, size)
        push!(df, Dict(:err_jl => recon_err_jl(m), :err_tf => recon_err_tf(m)))
    end
    df
end

Here we define the functions to perform our analytics/visualisation. I think a histogram showing the distribution of $err_{tf}/err_{jl}$ is informative. an absolute value histogram would also be informative, but when the values are so low, it become hard to read. As well the quartile values, that is minimum, Q1, median, Q3, maximum, are informative on the absolute values of the error; since they tell us that that say three quarters of all trials showed error less than the given value.

Input:

function plot_relative_error_hist(df)
    histogram(df[:err_tf]./df[:err_jl];
        xlabel="factor by which Tensorflow error is greater than Julia (LAPACK) error",
        ylabel="number of trials with this error",
        title="Histogram of relative error values for SVD reconstruction"
    )
end

Input:

function  quartile_summary(df, field)
    q0 = minimum(df[field])
    q1 = quantile(df[field], 0.25)
    q2 = median(df[field])
    q3 = quantile(df[field], 0.75)
    q4 = maximum(df[field])
    print("$field:\t")
    @printf("Q0=%0.2e\t Q1=%0.2e\t Q2=%0.2e\t Q3=%0.2e\t Q4=%0.2e", q0, q1, q2, q3, q4)
    println()
    (q0, q1, q2, q3, q4)
end

Input:

function display_evaluation_figures(df)
    quartile_summary(df, :err_jl)
    quartile_summary(df, :err_tf)
    plot_relative_error_hist(df)
end

So now onward to the results. In the results that follow it can been seen that all the absolute errors (even for maximum/Q4) are well below the machine epsilon for the type evaluated. (But see close to the bottom where this does not hold). It can be seen that it is very rare for TensorFlow to have a lower error than Julia. Such results would show up as bar in the histogram at $x<1$. Of which there are some, but vanishingly few.

Input:

normal100double = generate_data(1000, randn, Float64, (100,100))
display_evaluation_figures(normal100double)

Output:

err_jl:	Q0=3.99e-26	 Q1=4.84e-26	 Q2=5.33e-26	 Q3=6.22e-26	 Q4=1.27e-25
err_tf:	Q0=7.73e-26	 Q1=1.16e-25	 Q2=1.30e-25	 Q3=1.46e-25	 Q4=5.47e-25

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

normal100float = generate_data(1000, randn, Float32, (100,100))
display_evaluation_figures(normal100float)

Output:

err_jl:	Q0=9.65e-09	 Q1=1.13e-08	 Q2=1.19e-08	 Q3=1.25e-08	 Q4=1.62e-08
err_tf:	Q0=2.38e-08	 Q1=3.63e-08	 Q2=4.02e-08	 Q3=4.49e-08	 Q4=7.15e-08

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

uniform100double = generate_data(1000, rand, Float64, (100,100))
display_evaluation_figures(uniform100double)

Output:

err_jl:	Q0=4.57e-27	 Q1=6.39e-27	 Q2=7.46e-27	 Q3=8.99e-27	 Q4=2.23e-26
err_tf:	Q0=1.27e-26	 Q1=3.95e-26	 Q2=6.08e-26	 Q3=8.84e-26	 Q4=2.10e-25

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

uniform100float = generate_data(1000, rand, Float32, (100,100))
display_evaluation_figures(uniform100float)

Output:

err_jl:	Q0=1.07e-09	 Q1=1.31e-09	 Q2=1.47e-09	 Q3=1.69e-09	 Q4=2.95e-09
err_tf:	Q0=2.98e-09	 Q1=4.29e-09	 Q2=4.66e-09	 Q3=5.18e-09	 Q4=7.58e-09

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

normal10double = generate_data(1000, randn, Float64, (10,10))
display_evaluation_figures(normal10double)

Output:

err_jl:	Q0=3.69e-29	 Q1=9.58e-29	 Q2=1.38e-28	 Q3=2.24e-28	 Q4=3.18e-27
err_tf:	Q0=1.42e-28	 Q1=4.83e-28	 Q2=7.33e-28	 Q3=1.10e-27	 Q4=5.29e-27

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

normal10float = generate_data(1000, randn, Float32, (10,10))
display_evaluation_figures(normal10float)

Output:

err_jl:	Q0=8.95e-12	 Q1=2.14e-11	 Q2=2.80e-11	 Q3=3.74e-11	 Q4=1.11e-10
err_tf:	Q0=3.56e-11	 Q1=1.52e-10	 Q2=2.36e-10	 Q3=3.52e-10	 Q4=1.19e-09

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

In the prior tests all the matrix elements have been small. Either normally distributes, mean 0 and variance 1, or uniformly distributed between 0 and 1. But what happens when we look at matrices with element larger values? To do this, we crank up the variance on the randn. That is to say we generate our trial matrices using variance*randn(T,size). Results follow for variance 10 thousand, 10 million and 10 billion.

Input:

var10Knormal100double = generate_data(1000, (args...)->10_000*randn(args...), Float64, (100,100))
display_evaluation_figures(var10Knormal100double)

Output:

err_jl:	Q0=3.83e-18	 Q1=4.83e-18	 Q2=5.32e-18	 Q3=6.06e-18	 Q4=1.18e-17
err_tf:	Q0=7.46e-18	 Q1=1.16e-17	 Q2=1.29e-17	 Q3=1.46e-17	 Q4=2.15e-17

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

var10Mnormal100double = generate_data(1000, (args...)->10_000_000*randn(args...), Float64, (100,100))
display_evaluation_figures(var10Mnormal100double)

Output:

err_jl:	Q0=3.74e-12	 Q1=4.85e-12	 Q2=5.37e-12	 Q3=6.15e-12	 Q4=1.10e-11
err_tf:	Q0=7.98e-12	 Q1=1.17e-11	 Q2=1.32e-11	 Q3=1.48e-11	 Q4=2.38e-11

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

Input:

var10Gnormal100double = generate_data(1000, (args...)->10_000_000_000*randn(args...), Float64, (100,100))
display_evaluation_figures(var10Gnormal100double)

Output:

err_jl:	Q0=3.80e-06	 Q1=4.91e-06	 Q2=5.40e-06	 Q3=6.22e-06	 Q4=1.07e-05
err_tf:	Q0=7.85e-06	 Q1=1.16e-05	 Q2=1.30e-05	 Q3=1.46e-05	 Q4=2.20e-05

INFO: binning = auto

<?xml version=”1.0” encoding=”utf-8”?>

What we see here, is that the distribution of relative errors remains the same, but the absolute errors increase. i.e. TensorFlow is still generally has around 2.5 times the error of Julia. Further for both TensorFlow and Julia, those absolute errors are increasing quadratically with the variance. This is due to the use of sum of squared error, if we did sum of error, it would be linear increase. So at high variance, this difference in accuracy could matter. Since we are now looking at differences of $10^{-6}$ for example. However, these differences remain small compared to the values in the matrix eg $10^7$.

In the end, the differences are not relevant to most people (Potentially not relevant to anyone). It is merely a curiosity. LAPACK is consistently better at SVD than TensorFlow. Really, one should not be too surprised given that having excellent matrix factorisation is what LAPACK is all about.

Input: