# The Azimuth Project Blog - Exploring regression on the El Ni&ntilde;o data (Rev #5, changes)

Showing changes from revision #4 to #5: Added | Removed | Changed

## Regression using an $l_{1/2}$ prior

This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.

We can write the regression cost function—adding an $l_{1/2}$ prior—using explicit summations as

# Blog - Exploring regression on the El Niño data

$cost_2 = \sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|^2$
$cost_1 = \sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|$
$cost_{1/2} = \sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P \sqrt{|x_i|}$

## Regression using an $L_{q}$ prior

But before we look at some results on the actual data, let’s look in a bit more detail at how various kinds of sparsity promoting regularization work. Recall that in estimating parameters we work with an error function that measures the total differences between the model predictions and the actual outputs, which we’ll refer to as the cost. This takes the form

$cost = \frac{1}{E}\sum_{e=1}^E \left( \sum_{i=0}^P M^{(e)}_i x_i - y^{(e)}\right)^2$

The “natural” minimum of this kind of optimization will involve every $x_i$ being non-zero, with all kinds of crazily large coefficients being used to magnify tiny random fluctuations in the input data to get the best possible match with the output. Since we think we’ve got a simple-ish, noisy system we don’t want those kind of solutions, so we add an extra term to the cost we’re minimizing. This is generally some function of the $x_i$s, weighted by some scaling $\lambda$. In an ideal world we’d want a simple function counting how many of the $x_i$s have non-zero values. Unfortunately this is a very combinatorial problem which is computationally intractable (technically NP-hard) for large problems. Faced with such a problem, we take the numerical analyst’s standard dodge and come up with a continuous problem which is simpler to solve and where we think the answer will have similar properties.

The functions we’ll use are the square, value and square root of the absolute values of the $x_i$s. (The absolute value is needed just to weight the magnitude of the $x_i$ whether it’s positive or negative), giving us the three costs below.

We can write the regression cost function—adding an $L_{q}$ prior—using explicit summations as

$L_2:\qquad cost_2 = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|^2$
$L_1:\qquad cost_1 = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|$
$L_{1/2}:\qquad cost_{1/2} = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P \sqrt{|x_i|}$

(You can see the $q$ in $L_q$ corresponds to the exponent the $|x_i|$ is raised to.) To gain some understanding of how these work, let’s start with an artificial example. Let’s start with a single $(\sum_i a_i x_i - y)^2$: when we expand that out we’ll get a big set of weighted terms with $x_i^2$s, $x_i x_j$s and $x_i$s. This means that it’s a multi-dimensional analogoue of the quadratic function you remember from high-school. When we add lots of these together as in Eq xxx, we still get the same kind of thing. I’ll just tell you that for this kind of thing if you consider all the points in $x_1, \dots, x_p$ space that have the same $cost = C$ you get a higher dimensional ellipsoid, with the ellipsoids for different $C$ values all centred on a common point and growing outwards as $C$ increases. (Energetic readers can figure out how this is a consequence of the quadratic nature of the cost.) So the unregularized cost function is like a hyperdimensional squashed onion where each layer is an “equal-cost contour line”. Clearly the natural minimum is at the centre of this onion.

Now we can investigate how the various kinds of regularization interact with this portion of the cost. Looking at the equal value contours for the $L_2$, $L_1$ and $L_{1/2}$ in 2-D:

• $L_2$: the circular contours. They get more closely spaced going outwards because the weight rises faster than linearly.
• $L_1$: the straight-diamond contours. These stay equally spaced going outwards as the weight rises linearly.
• $L_{1/2}$: the “curvy” diamond contours. Note these become further spaced going outwards because the weight rises slower than linearly.

They behave analogously in higher dimensions.

Puzzle: can you relate the shape of the contours for the regularization above to the curves shown here? (Answer below)

So suppose we’ve got the data-term proportion of the cost, and we want a sparse representation with only one non-zero $x_i$. We can see that “mathematical” answer is that we should set the two $x_i$s corresponding to the two slowest increasing axes to 0 and leave the other alone to get the least change in the cost from the natural minimum. For an axis aligned ellipsoid in only got 3 dimensions but when we’ve got an non-axis aligned ellipsoid in the thousands of dimensions or more we can’t find this exact answer. So lets look at how using the regularizer to “push” values to 0 affects things.

Because the $L_2$ regularizer increases quadratically, it penalizes even slightly larger varlues much more highly, so it tends to “push” the coefficients to have similar magnitudes. You can see that in the way that all $x_i$s move towards the origin at roughly equal speeds, with the each $x_i$s progressively getting smaller but only becoming zero at roughly the same time as the others.

For the $L_1$ regularizer a non-axis aligned ellipse will start to intersect

((more))

## Practical aspects of regression using an $l_{1/2}$ prior

If we restrict to one variable from the numerous vectors and denote this variable by $x$, we get

$\frac{\partial cost}{\partial x} = A x + B + \frac{\lambda sgn(x)}{2\sqrt{|x|}} = 0$

where $A$ and $B$ don’t depend on $x$. If we denote $\lambda sgn(x)/2$ by $C$, we can multiply through by $y =\sqrt{|x|}$ to find the minimum (along this co-ordinate) is

$\pm A y^3 + B y + C = 0$

where the $\pm$ depends whether $x$ is positive/negative and all subject to needing to ensure the solutions in $y$ are also consistent with the original equation. Since this is a cubic equation we have a simple closed form for the solutions to this equation and hence can efficiently solve the original equation.

Puzzle: If we restrict ourselves to regularizers where the co-ordinate descent update can be computed analytically (and there’s no good reason to do this), what regularization functions can we use?