Blog - Exploring regression on the El Niño data

This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.

$A^{(e)} \mathbf{x} + \mathbf{\epsilon}^{(e)} = \mathbf{y}^{(e)} \qquad(1)$

Not only do we want to look at the basic fit of this regression model, but also investigate the effects of searching for *sparse solutions* (as discussed a little here).

But before we look at some results on the actual data, let’s look in a bit more detail at how various kinds of sparsity promoting regularization work. Recall that in estimating parameters we work with an error function that measures the total differences between the model predictions and the actual outputs, which we’ll refer to as the *cost*. This takes the form

$cost = \frac{1}{E}\sum_{e=1}^E \left( \sum_{i=1}^P M^{(e)}_i x_i - y^{(e)}\right)^2 \qquad(2)$

The minimum of this kind of optimization will involve every $x_i$ being non-zero, with all kinds of crazily large coefficients being used to magnify tiny random fluctuations in the input data to get the best possible match with the output. Since we think we’ve got a simple-ish, noisy system we don’t want those kind of solutions, so we add an extra term to the cost we’re minimizing. This is generally some function of the $x_i$s, weighted by some scaling $\lambda$. In an ideal world we’d want a simple function counting how many of the $x_i$s have non-zero values. Unfortunately this is a very combinatorial problem which is computationally intractable (technically NP-hard, as discussed here.) for large dimensional tasks. Faced with such a problem, we take the numerical analyst’s standard dodge and come up with a continuous problem which is simpler to solve and where we think the answer will have similar properties.

The functions we’ll use are the square, value and square root of the absolute values of the $x_i$s. (The absolute value is needed just to weight the magnitude of the $x_i$ whether it’s positive or negative), giving us the three costs below.

We can write the regression cost function—adding an $L_{q}$ prior—using explicit summations as

$L_2:\qquad cost_2 = cost + \lambda \sum_{i=1}^P |x_i|^2$

$L_1:\qquad cost_1 = cost + \lambda \sum_{i=1}^P |x_i|$

$L_{1/2}:\qquad cost_{1/2} = cost + \lambda \sum_{i=1}^P \sqrt{|x_i|}$

(You can see the $q$ in $L_q$ corresponds to the exponent the $|x_i|$ is raised to.) To gain some understanding of how these work, let’s start with an artificial example. Starting with a single $(\sum_i a_i x_i - y)^2$, when we multiply that out we’ll get a big sum of weighted terms with various $x_i^2$s, $x_i x_j$s and $x_i$s. This means that it’s a multi-dimensional analogue of the quadratic function you remember from high-school. When we add lots of $(\sum_i a_i^{(e)} x_i - y^{(e)})^2$ together as in Eq 2, we still get the same kind of multidimensional quadratic. I’ll just tell you that for this kind of thing if you consider all the points in $x_1, \dots, x_P$ space that have the *same* $cost = C$ you get a higher dimensional ellipsoid, with the ellipsoids for different $C$ values all centred on a common point and growing outwards as $C$ increases. (Energetic readers can figure out how this is a consequence of the quadratic nature of the cost.) So the unregularized cost function is like a hyperdimensional squashed onion where each layer is an “equal-cost contour line”, with the minimum at the centre of the onion.

Now we can investigate how the various kinds of regularization interact with this portion of the cost. Looking at the equal value contours for the $L_2$, $L_1$ and $L_{1/2}$ in 2-D:

- $L_2$: the circular contours. They get more closely spaced going outwards because the weight rises faster than linearly.
- $L_1$: the straight-diamond contours. These stay equally spaced going outwards as the weight rises linearly.
- $L_{1/2}$: the “curvy” diamond contours. Note these become further spaced going outwards because the weight rises slower than linearly.

They behave analogously in higher dimensions.

With this we can plot how the minimum of $cost_2$, $cost_1$ and $cost_{1/2}$ varies as $\lambda$ is progressively increased. In the image below the ellipsoid shows one of the constant $cost$ contours for the data error with the curves showing the sparsity regularized minimums, starting off from the centre of the ellipsoid when $\lambda=0$.

The purple path is the for $cost_2$, the blue is for $cost_1$ and the green is $cost_{1/2}$. We stop increasing $\lambda$ once we’ve got only one non-zero $x_i$ since the final progression to all the $x_i$ being zero is quite dull! (The blue and green curves both hit the “floor” where they have a sharp bend in the middle). We can also plot the values on the same axes vertical axis against $\lambda$ on the horizontal,starting again with the $\lambda=0$ minimum:

Again we start with the minimum $x_i$ values for $cost_2$, then $cost_1$ and finally $cost_{1/2}$.

**Puzzle**: can you relate the shape of the contours for the regularization above to the $x_i$ trajectories shown here? (**Note the answer is in the paragraphs below.**)

So suppose we’ve got the data-term proportion of the cost, and we want a sparse representation with only *one* non-zero $x_i$. We can see that “mathematical” answer is that we should set the two $x_i$s corresponding to the two slowest increasing axes to 0 and leave the other alone to get the least change in the cost from the natural minimum. For an axis aligned ellipsoid in only got 3 dimensions but when we’ve got an non-axis aligned ellipsoid in the thousands of dimensions or more we can’t find this exact answer. So lets look at how using the regularizer to “push” values to 0 affects things.

Because the $L_2$ regularizer increases quadratically, it penalizes even slightly larger values much more highly, so it tends to “push” the coefficients to have similar magnitudes. You can see that in the way that all $x_i$s move towards the origin at roughly equal speeds, with the each $x_i$s progressively getting smaller but only becoming zero at roughly the same time as the others. Thus the $L_2$ regularizer is useless for sparsifying the solution to the regression.

For the $L_1$ regularizer a non-axis aligned ellipsoid will start to intersect… Thus we can see that while this has pulled all the $x_i$ towards zero, there are still two distinct points where one and two variables have become zero where the remaining values are perturbed but close to their original values.

For the $L_{1/2}$ regularizer… Looking at the 1-D plots we can see that like the $L_1$ regularizer it is pulling all the non-zero coefficients from their values, but its happening along a slower curve up until one $x_i$ gets sufficiently close to 0 that it undergoes a very rapid drop to 0. This has the effect that, although still perturbed, the non-zero $x_i$ values are affected less than with the $L_1$ regularizer.

((more))

If we restrict to one variable from the numerous vectors and denote this variable by $x$, we get

$\frac{\partial cost}{\partial x} = A x + B + \frac{\lambda sgn(x)}{2\sqrt{|x|}} = 0$

where $A$ and $B$ don’t depend on $x$. If we denote $\lambda sgn(x)/2$ by $C$, we can multiply through by $y =\sqrt{|x|}$ to find the minimum (along this co-ordinate) is

$\pm A y^3 + B y + C = 0$

where the $\pm$ depends whether $x$ is positive/negative and all subject to needing to ensure the solutions in $y$ are also consistent with the original equation. Since this is a cubic equation we have a simple closed form for the solutions to this equation and hence can efficiently solve the original equation.

**Puzzle**: If we restrict ourselves to regularizers where the co-ordinate descent update can be computed analytically (and there’s *no* good reason to do this), what regularization functions can we use?