This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.
But before we look at some results on the actual data, let’s look in a bit more detail at how various kinds of sparsity-promoting regularization work. Recall that in estimating parameters we work with an error function that measures the total difference between the model predictions and the actual outputs, which we’ll refer to as the cost. This takes the form

$$E(\mathbf{c}) = \sum_j \Bigl( y_j - \sum_i c_i x_{ji} \Bigr)^2$$

where the $y_j$ are the observed outputs, the $x_{ji}$ are the corresponding inputs and the $c_i$ are the coefficients we’re estimating.
The “natural” minimum of this kind of optimization will involve every $c_i$ being non-zero, with all kinds of crazily large coefficients being used to magnify tiny random fluctuations in the input data to get the best possible match with the output. Since we think we’ve got a simple-ish, noisy system we don’t want those kinds of solutions, so we add an extra term to the cost we’re minimizing. This is generally some function of the $c_i$s, weighted by some scaling factor $\lambda$. In an ideal world we’d want a simple function counting how many of the $c_i$s have non-zero values. Unfortunately that is a very combinatorial problem which is computationally intractable (technically NP-hard) for large problems. Faced with such a problem, we take the numerical analyst’s standard dodge and come up with a continuous problem which is simpler to solve and where we think the answer will have similar properties.
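To see concretely why counting non-zero $c_i$s is combinatorial, here is a rough sketch (my illustration, not code from this article) of what an exact search would involve: fitting an ordinary least-squares model on every possible subset of inputs of a given size. The function name and the data layout are just assumptions made for the example.

```python
import itertools
import numpy as np

def best_subset(x, y, num_nonzero):
    """Brute-force search over every support of size num_nonzero:
    fit ordinary least squares on each subset of columns, keep the best."""
    n_features = x.shape[1]
    best_cost, best_c = np.inf, None
    for support in itertools.combinations(range(n_features), num_nonzero):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(x[:, cols], y, rcond=None)
        cost = np.sum((y - x[:, cols] @ coef) ** 2)
        if cost < best_cost:
            best_cost = cost
            best_c = np.zeros(n_features)
            best_c[cols] = coef
    return best_c, best_cost
```

The loop runs over every subset of inputs, so with thousands of candidate inputs the number of fits blows up combinatorially; that is the problem the continuous penalties below are dodging.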
The functions we’ll use are the square, the value itself, and the square root of the absolute values of the $c_i$s. (The absolute value is needed just to weight the magnitude of the $c_i$, whether it’s positive or negative.) This gives us the three costs below.

$$E_2(\mathbf{c}) = E(\mathbf{c}) + \lambda \sum_i |c_i|^2, \qquad E_1(\mathbf{c}) = E(\mathbf{c}) + \lambda \sum_i |c_i|, \qquad E_{1/2}(\mathbf{c}) = E(\mathbf{c}) + \lambda \sum_i |c_i|^{1/2}$$
We can write the regression cost function—adding an $L_p$ prior—using explicit summations as

$$E_p(\mathbf{c}) = \sum_j \Bigl( y_j - \sum_i c_i x_{ji} \Bigr)^2 + \lambda \sum_i |c_i|^p$$
(You can see the $p$ in $L_p$ corresponds to the exponent the $|c_i|$ is raised to.) To gain some understanding of how these work, let’s start with an artificial example. Consider a single squared term $\bigl( y_j - \sum_i c_i x_{ji} \bigr)^2$: when we expand that out we’ll get a big set of weighted terms with $c_i^2$s, $c_i c_{i'}$s and $c_i$s. This means that it’s a multi-dimensional analogue of the quadratic function you remember from high school. When we add lots of these together as in Eq xxx, we still get the same kind of thing. I’ll just tell you that for this kind of function, if you consider all the points in coefficient space that have the same cost you get a higher-dimensional ellipsoid, with the ellipsoids for different cost values all centred on a common point and growing outwards as the cost increases. (Energetic readers can figure out how this is a consequence of the quadratic nature of the cost.) So the unregularized cost function is like a hyperdimensional squashed onion where each layer is an “equal-cost contour line”. Clearly the natural minimum is at the centre of this onion.
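As an illustrative aside (my example, not code from the article), the $L_p$-regularized cost above is straightforward to evaluate; the array names x, y, c and the weight lam are assumptions made for this sketch.

```python
import numpy as np

def regularized_cost(c, x, y, lam, p):
    """Sum of squared residuals plus lam * sum_i |c_i|**p."""
    residuals = y - x @ c                     # predictions minus actual outputs
    data_term = np.sum(residuals ** 2)
    penalty = lam * np.sum(np.abs(c) ** p)
    return data_term + penalty

# The three regularizers above correspond to p = 2, 1 and 0.5.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 5))                              # 50 samples, 5 inputs
y = x @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + 0.1 * rng.normal(size=50)
c = np.array([0.9, 0.1, 0.0, 1.8, 0.0])
for p in (2, 1, 0.5):
    print(p, regularized_cost(c, x, y, lam=0.1, p=p))
```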
Now we can investigate how the various kinds of regularization interact with this portion of the cost. Looking at the equal-value contours for the $L_2$, $L_1$ and $L_{1/2}$ regularizers in 2-D:
They behave analogously in higher dimensions.
Puzzle: can you relate the shape of the contours for the regularizers above to the curves shown here? (Answer below)
So suppose we’ve got the data-term portion of the cost, and we want a sparse representation with only one non-zero $c_i$. We can see that the “mathematical” answer is that we should set the two $c_i$s corresponding to the two slowest-increasing axes to 0 and leave the other alone, to get the least change in the cost from the natural minimum. That’s easy for an axis-aligned ellipsoid in only 3 dimensions, but when we’ve got a non-axis-aligned ellipsoid in thousands of dimensions or more we can’t find this exact answer. So let’s look at how using the regularizer to “push” values to 0 affects things.
Because the $L_2$ regularizer increases quadratically, it penalizes even slightly larger values much more highly, so it tends to “push” the coefficients to have similar magnitudes. You can see that in the way that all the $c_i$s move towards the origin at roughly equal speeds, with each $c_i$ progressively getting smaller but only becoming zero at roughly the same time as the others.
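You can see this shrinking-together behaviour numerically. Here’s a small sketch (my own example, not from the article) using the closed-form minimizer of the $L_2$-regularized cost, $(X^T X + \lambda I)^{-1} X^T y$, and printing the coefficients as $\lambda$ grows.

```python
import numpy as np

def ridge_coefficients(x, y, lam):
    """Closed-form minimizer of sum_j (y_j - x_j . c)^2 + lam * sum_i c_i^2."""
    n_features = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(n_features), x.T @ y)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4))
y = x @ np.array([3.0, 1.5, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(lam, np.round(ridge_coefficients(x, y, lam), 3))
# All the coefficients shrink towards zero together as lam increases;
# none of them is driven exactly to zero.
```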
For the $L_1$ regularizer, a non-axis-aligned ellipse will start to intersect the diamond-shaped penalty contours at one of the diamond’s corners, and since those corners lie on the co-ordinate axes, some of the $c_i$s get pushed exactly to zero.
((more))
If we restrict attention to just one variable from these long coefficient vectors and denote this variable by $c$, then for the $L_{1/2}$ regularizer we get

$$a c^2 + b c + \lambda |c|^{1/2} + \mathrm{const}$$
where $a$ and $b$ don’t depend on $c$. If we denote $|c|^{1/2}$ by $s$, we can set the derivative with respect to $c$ to zero and multiply through by $s$ to find the minimum (along this co-ordinate) satisfies

$$2 a s^3 \pm b s + \tfrac{\lambda}{2} = 0$$
where the $\pm$ depends on whether $c$ is positive or negative, all subject to needing to ensure the solutions in $s$ are also consistent with the original equation (in particular that $s \ge 0$). Since this is a cubic equation we have a simple closed form for the solutions to this equation and hence can efficiently solve the original one-variable minimization.
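Here’s a rough sketch of that co-ordinate update (my own illustration; it leans on numpy.roots rather than writing out the cubic formula, and the function name and test values are just assumptions). It solves both sign branches of the cubic, keeps only the real, positive roots, maps them back via $c = \pm s^2$, and also compares against the candidate $c = 0$ where the penalty isn’t differentiable.

```python
import numpy as np

def l_half_coordinate_update(a, b, lam):
    """Minimize a*c**2 + b*c + lam*|c|**0.5 over c (assuming a > 0).
    Non-zero candidates solve 2*a*s**3 (+/-) b*s + lam/2 = 0 with s = |c|**0.5."""
    def one_d_cost(c):
        return a * c**2 + b * c + lam * abs(c) ** 0.5

    candidates = [0.0]                        # c = 0 is always a candidate
    for sign in (+1, -1):                     # +1 handles c > 0, -1 handles c < 0
        roots = np.roots([2 * a, 0.0, sign * b, lam / 2])
        for s in roots:
            if abs(s.imag) < 1e-12 and s.real > 0:     # consistent solutions only
                candidates.append(sign * s.real ** 2)  # map back: c = sign * s**2
    return min(candidates, key=one_d_cost)

# Example: the quadratic part alone would put c at 1.5; the penalty pulls it in.
print(l_half_coordinate_update(a=1.0, b=-3.0, lam=0.5))
```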
Puzzle: If we restrict ourselves to regularizers where the co-ordinate descent update can be computed analytically (and there’s no good reason to do this), what regularization functions can we use?