The Azimuth Project
Blog - Exploring regression on the El Niño data

This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.

$$A^{(e)} \mathbf{x} + \mathbf{\epsilon}^{(e)} = \mathbf{y}^{(e)} \qquad (1)$$

Not only do we want to look at the basic fit of this regression model, but also investigate the effects of searching for sparse solutions (as discussed a little here).

Regression using an $L_q$ prior

But before we look at some results on the actual data, let’s look in a bit more detail at how various kinds of sparsity-promoting regularization work. Recall that in estimating parameters we work with an error function that measures the total difference between the model predictions and the actual outputs, which we’ll refer to as the cost. This takes the form

$$\mathrm{cost} = \frac{1}{E}\sum_{e=1}^E \left( \sum_{i=1}^P M^{(e)}_i x_i - y^{(e)}\right)^2 \qquad (2)$$

The minimum of this kind of optimization will involve every $x_i$ being non-zero, with all kinds of crazily large coefficients being used to magnify tiny random fluctuations in the input data to get the best possible match with the output. Since we think we’ve got a simple-ish, noisy system we don’t want those kinds of solutions, so we add an extra term to the cost we’re minimizing. This is generally some function of the $x_i$s, weighted by some scaling $\lambda$. In an ideal world we’d want a simple function counting how many of the $x_i$s have non-zero values. Unfortunately this is a very combinatorial problem which is computationally intractable (technically NP-hard, as discussed here) for high-dimensional tasks. Faced with such a problem, we take the numerical analyst’s standard dodge and come up with a continuous problem which is simpler to solve and where we think the answer will have similar properties.
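To see why the “count the non-zeros” approach blows up, here is a minimal sketch (toy data and variable names are my own, not from the article) of exhaustive best-subset search: for each candidate support we solve a least-squares fit and keep the cheapest. The number of supports grows combinatorially in $P$, which is exactly the intractability mentioned above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Toy problem: only columns 1 and 4 actually drive the output.
E, P = 30, 8
M = rng.normal(size=(E, P))
y = M[:, [1, 4]] @ np.array([1.0, -2.0]) + 0.05 * rng.normal(size=E)

def best_subset(k):
    """Exhaustive search over all C(P, k) supports -- feasible only for tiny P."""
    best = (np.inf, None)
    for S in combinations(range(P), k):
        idx = list(S)
        x_S = np.linalg.lstsq(M[:, idx], y, rcond=None)[0]
        cost = np.mean((M[:, idx] @ x_S - y) ** 2)
        best = min(best, (cost, S))
    return best

print(best_subset(2))  # recovers the true support (1, 4) on this toy problem
```

For $P$ in the thousands even a single sparsity level is hopeless, which is what motivates the continuous relaxations below.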

The functions we’ll use are the square, the value itself, and the square root of the absolute values of the $x_i$s. (The absolute value is needed simply to weight the magnitude of each $x_i$ whether it’s positive or negative.) This gives us the three costs below.

We can write the regression cost function, adding an $L_q$ prior, using explicit summations as

$$L_2:\qquad \mathrm{cost}_2 = \mathrm{cost} + \lambda \sum_{i=1}^P |x_i|^2$$
$$L_1:\qquad \mathrm{cost}_1 = \mathrm{cost} + \lambda \sum_{i=1}^P |x_i|$$
$$L_{1/2}:\qquad \mathrm{cost}_{1/2} = \mathrm{cost} + \lambda \sum_{i=1}^P \sqrt{|x_i|}$$
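These three costs are straightforward to evaluate. Here is a small numpy sketch (the toy data, sizes and seed are illustrative assumptions) writing them as one function of the exponent $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: E examples, P predictors, a sparse-ish true coefficient vector.
E, P = 50, 5
M = rng.normal(size=(E, P))
x_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = M @ x_true + 0.1 * rng.normal(size=E)

def base_cost(x):
    """Mean squared data error, Eq (2)."""
    r = M @ x - y
    return np.mean(r ** 2)

def cost_q(x, lam, q):
    """Data cost plus an L_q penalty on the coefficients (q = 2, 1 or 1/2)."""
    return base_cost(x) + lam * np.sum(np.abs(x) ** q)

for q in (2, 1, 0.5):
    print(q, cost_q(x_true, lam=0.1, q=q))
```

Note that for $q = 1/2$ the penalty `np.abs(x) ** 0.5` is exactly the $\sqrt{|x_i|}$ term above, and at $\lambda = 0$ all three costs reduce to the plain data cost.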

(You can see the $q$ in $L_q$ corresponds to the exponent the $|x_i|$ is raised to.) To gain some understanding of how these work, let’s start with an artificial example. Starting with a single $(\sum_i a_i x_i - y)^2$, when we multiply that out we’ll get a big sum of weighted terms with various $x_i^2$s, $x_i x_j$s and $x_i$s. This means that it’s a multi-dimensional analogue of the quadratic function you remember from high school. When we add lots of $(\sum_i a_i^{(e)} x_i - y^{(e)})^2$ together as in Eq 2, we still get the same kind of multidimensional quadratic. I’ll just tell you that for this kind of function, if you consider all the points in $x_1, \dots, x_P$ space that have the same $\mathrm{cost} = C$ you get a higher dimensional ellipsoid, with the ellipsoids for different $C$ values all centred on a common point and growing outwards as $C$ increases. (Energetic readers can figure out how this is a consequence of the quadratic nature of the cost.) So the unregularized cost function is like a hyperdimensional squashed onion where each layer is an “equal-cost contour line”, with the minimum at the centre of the onion.
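The “multiply it out” step can be checked numerically. The sketch below (toy matrices and a random test point are assumptions) expands the summed squared errors into the quadratic form $\mathbf{x}^\top H \mathbf{x} - 2\mathbf{g}^\top \mathbf{x} + c$ and confirms it matches direct evaluation; since $H$ is positive semi-definite, the equal-cost sets really are concentric ellipsoids.

```python
import numpy as np

rng = np.random.default_rng(1)
E, P = 40, 3
M = rng.normal(size=(E, P))   # row e holds the coefficients a^{(e)}
y = rng.normal(size=E)

# Expanding (1/E) * sum_e (a^{(e)} . x - y^{(e)})^2 gives a quadratic form:
H = M.T @ M / E               # coefficients of the x_i * x_j terms
g = M.T @ y / E               # coefficients of the x_i terms
c = y @ y / E                 # constant term

x = rng.normal(size=P)
direct = np.mean((M @ x - y) ** 2)
expanded = x @ H @ x - 2 * g @ x + c
print(abs(direct - expanded))  # agrees up to floating-point rounding
```

The common centre of the onion is the unregularized minimum, obtained by setting the gradient $2(H\mathbf{x} - \mathbf{g})$ to zero.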

Now we can investigate how the various kinds of regularization interact with this portion of the cost. Looking at the equal-value contours for the $L_2$, $L_1$ and $L_{1/2}$ regularizers in 2-D:

Regularization weights

  • $L_2$: the circular contours. These get more closely spaced going outwards because the weight rises faster than linearly.
  • $L_1$: the straight-diamond contours. These stay equally spaced going outwards as the weight rises linearly.
  • $L_{1/2}$: the “curvy” diamond contours. Note these become further spaced going outwards because the weight rises slower than linearly.

They behave analogously in higher dimensions.

With this we can plot how the minimum of $\mathrm{cost}_2$, $\mathrm{cost}_1$ and $\mathrm{cost}_{1/2}$ varies as $\lambda$ is progressively increased. In the image below the ellipsoid shows one of the constant $\mathrm{cost}$ contours for the data error, with the curves showing the sparsity-regularized minimums, starting off from the centre of the ellipsoid when $\lambda=0$.

Regression regularization paths

The purple path is for $\mathrm{cost}_2$, the blue is for $\mathrm{cost}_1$ and the green is for $\mathrm{cost}_{1/2}$. We stop increasing $\lambda$ once we’ve got only one non-zero $x_i$, since the final progression to all the $x_i$ being zero is quite dull! (The blue and green curves both hit the “floor” where they have a sharp bend in the middle.) We can also plot the $x_i$ values on the vertical axis against $\lambda$ on the horizontal, starting again with the $\lambda=0$ minimum:

1-D regression trajectories

Again we start with the minimum $x_i$ values for $\mathrm{cost}_2$, then $\mathrm{cost}_1$ and finally $\mathrm{cost}_{1/2}$.
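Plots like these can be reproduced by tracing the minimizer numerically as $\lambda$ grows. The sketch below is one possible implementation, not the article’s own code: it uses `scipy.optimize.minimize` with Nelder–Mead (an assumption; a derivative-free method sidesteps the non-smooth $|x_i|$ and $\sqrt{|x_i|}$ terms), warm-starting each solve from the previous $\lambda$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
E, P = 60, 3
M = rng.normal(size=(E, P))
y = M @ np.array([1.5, -1.0, 0.4]) + 0.05 * rng.normal(size=E)

def cost_q(x, lam, q):
    return np.mean((M @ x - y) ** 2) + lam * np.sum(np.abs(x) ** q)

def path(q, lams):
    """Follow the regularized minimum as lambda increases, warm-starting each solve."""
    x = np.linalg.lstsq(M, y, rcond=None)[0]   # the lambda = 0 minimum
    out = []
    for lam in lams:
        res = minimize(cost_q, x, args=(lam, q), method="Nelder-Mead",
                       options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20000})
        x = res.x
        out.append(x.copy())
    return np.array(out)

lams = np.linspace(0.0, 1.0, 21)
for q in (2, 1, 0.5):
    xs = path(q, lams)
    print(q, np.round(xs[-1], 3))
```

Plotting each column of `xs` against `lams` gives the 1-D trajectory picture: for $q=2$ all coefficients shrink together, while for $q=1$ and $q=1/2$ individual coefficients hit exactly (or numerically) zero at finite $\lambda$.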

Puzzle: can you relate the shape of the contours for the regularization above to the $x_i$ trajectories shown here? (Note the answer is in the paragraphs below.)

So suppose we’ve got the data-term portion of the cost, and we want a sparse representation with only one non-zero $x_i$. We can see the “mathematical” answer is that we should set the two $x_i$s corresponding to the two slowest-increasing axes to 0 and leave the other alone, to get the least change in the cost from the natural minimum. That’s easy for an axis-aligned ellipsoid in only 3 dimensions, but when we’ve got a non-axis-aligned ellipsoid in thousands of dimensions or more we can’t find this exact answer. So let’s look at how using the regularizer to “push” values to 0 affects things.

Because the $L_2$ regularizer increases quadratically, it penalizes even slightly larger values much more heavily, so it tends to “push” the coefficients to have similar magnitudes. You can see that in the way that all the $x_i$s move towards the origin at roughly equal speeds, with each $x_i$ progressively getting smaller but only becoming zero at roughly the same time as the others. Thus the $L_2$ regularizer is useless for sparsifying the solution to the regression.
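The $L_2$ case is special in that it keeps the whole problem quadratic, so the regularized minimum has a closed form (this is classical ridge regression; the toy data below is an assumption). Setting the gradient of $\mathrm{cost}_2$ to zero gives $(M^\top M/E + \lambda I)\,\mathbf{x} = M^\top \mathbf{y}/E$, and one can watch the coefficients shrink smoothly without ever hitting exactly zero:

```python
import numpy as np

rng = np.random.default_rng(3)
E, P = 80, 4
M = rng.normal(size=(E, P))
y = M @ np.array([2.0, -1.0, 0.5, 0.1]) + 0.05 * rng.normal(size=E)

def ridge(lam):
    """Minimizer of mean((Mx - y)^2) + lam * sum(x_i^2).
    Zero gradient means (M'M/E + lam*I) x = M'y/E, a plain linear solve."""
    return np.linalg.solve(M.T @ M / E + lam * np.eye(P), M.T @ y / E)

for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, np.round(ridge(lam), 3))
```

Even at large $\lambda$ every coefficient is merely small, not zero, which is the numerical face of the “all shrink together” behaviour described above.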

For the $L_1$ regularizer, a non-axis-aligned ellipsoid will typically first touch an equal-value contour at one of the diamond’s corners, and the corners lie on the co-ordinate axes where some of the $x_i$ are zero. Thus we can see that while this has pulled all the $x_i$ towards zero, there are still two distinct points where first one and then two variables have become zero, while the remaining values are perturbed but close to their original values.
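In one dimension this corner-touching behaviour is the well-known soft-thresholding operator, which is also the building block of co-ordinate descent for $L_1$ problems. A minimal sketch (numpy; the sample values are illustrative): minimizing $(x - z)^2 + 2t|x|$ over $x$ shifts $z$ towards zero by $t$ and clamps anything smaller than $t$ to exactly zero.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of (x - z)^2 + 2*t*|x| over x:
    shrink z towards 0 by t, and set it to exactly 0 once |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, 0.4, -1.5, -0.2])
print(soft_threshold(z, 0.5))   # the small entries become exactly 0
```

This is why the blue trajectories above reach the axes at finite $\lambda$: once a coefficient’s unpenalized value drops below the threshold, the $L_1$ term wins outright and pins it at zero.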

For the $L_{1/2}$ regularizer the contours bow in towards the co-ordinate axes even more strongly, so the pull towards sparse solutions is stronger still. Looking at the 1-D plots we can see that, like the $L_1$ regularizer, it is pulling all the non-zero coefficients away from their values, but it’s happening along a slower curve up until one $x_i$ gets sufficiently close to 0 that it undergoes a very rapid drop to 0. This has the effect that, although still perturbed, the non-zero $x_i$ values are affected less than with the $L_1$ regularizer.


Practical aspects of regression using an $L_{1/2}$ prior

If we restrict attention to one variable from the numerous vectors and denote this variable by $x$, we get

$$\frac{\partial\, \mathrm{cost}}{\partial x} = A x + B + \frac{\lambda\, \mathrm{sgn}(x)}{2\sqrt{|x|}} = 0$$

where $A$ and $B$ don’t depend on $x$. If we denote $\lambda\, \mathrm{sgn}(x)/2$ by $C$, we can multiply through by $y = \sqrt{|x|}$ to find the minimum (along this co-ordinate) is

$$\pm A y^3 + B y + C = 0$$

where the $\pm$ depends on whether $x$ is positive or negative, all subject to needing to ensure the solutions in $y$ are also consistent with the original equation. Since this is a cubic equation we have a simple closed form for the solutions, and hence can efficiently solve the original equation.
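One way to realize this co-ordinate update in practice (a sketch, not the article’s code; it uses `numpy.roots` for the cubic rather than the closed form, and the $A$, $B$, $\lambda$ values below are illustrative) is to solve the cubic on each sign branch, keep only real positive roots $y = \sqrt{|x|}$, and compare the resulting candidates against $x = 0$:

```python
import numpy as np

def coord_update_lhalf(A, B, lam):
    """Minimize f(x) = 0.5*A*x**2 + B*x + lam*sqrt(|x|) over x (A > 0).
    Stationary points satisfy A*x + B + lam*sgn(x)/(2*sqrt(|x|)) = 0;
    substituting y = sqrt(|x|) turns each sign branch into a cubic in y."""
    candidates = [0.0]                       # x = 0 is always a candidate minimum
    for sign in (+1.0, -1.0):
        # sign = +1 branch: A*y^3 + B*y + lam/2 = 0 (x = y^2)
        # sign = -1 branch: -A*y^3 + B*y - lam/2 = 0 (x = -y^2)
        roots = np.roots([sign * A, 0.0, B, sign * lam / 2.0])
        for y in roots:
            if abs(y.imag) < 1e-12 and y.real > 0:   # need a real y = sqrt(|x|) > 0
                candidates.append(sign * y.real ** 2)
    f = lambda x: 0.5 * A * x ** 2 + B * x + lam * np.sqrt(abs(x))
    return min(candidates, key=f)

print(coord_update_lhalf(A=1.0, B=-2.0, lam=0.5))   # stays near the unpenalized minimum x = 2
print(coord_update_lhalf(A=1.0, B=-0.1, lam=0.5))   # thresholded to exactly 0
```

The explicit check against $x = 0$ matters: the cubic only finds stationary points, and for small unpenalized minima the $\sqrt{|x|}$ term makes $x = 0$ the true minimizer, which is precisely the rapid drop to zero seen in the trajectories earlier.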

Puzzle: If we restrict ourselves to regularizers where the co-ordinate descent update can be computed analytically (and there’s no good reason not to do this), what regularization functions can we use?

Back to the experimental results