+-- {: .standout}
This page is a [[blog article in progress]], written by [[David Tweed]]. To see discussions of this article while it was being written, visit the [Azimuth Forum](http://forum.azimuthproject.org/discussion/1476/blog-exploring-regression-on-the-el-nintildeo-data/).
=--

#Blog - Exploring regression on the El Niño data #

<div align = "center">
<a href = "http://i3.kym-cdn.com/photos/images/masonry/000/698/708/df1.jpg">
<img width = "250" src = "http://i3.kym-cdn.com/photos/images/masonry/000/698/708/df1.jpg">
</a>
</div>

## Regression using an $L_{q}$ prior ##

But before we look at some results on the actual data, let's look in a bit more detail at how various kinds of sparsity-promoting regularization work. Recall that in estimating parameters we work with an error function that measures the total difference between the model predictions and the actual outputs, which we'll refer to as the _cost_. This takes the form

$$
cost = \frac{1}{E}\sum_{e=1}^E \left( \sum_{i=1}^P M^{(e)}_i x_i - y^{(e)}\right)^2
$$

The minimum of this kind of optimization will involve every $x_i$ being non-zero, with all kinds of crazily large coefficients being used to magnify tiny random fluctuations in the input data to get the best possible match with the output. Since we think we've got a simple-ish, noisy system we don't want those kinds of solutions, so we add an extra term to the cost we're minimizing. This is generally some function of the $x_i$s, weighted by some scaling $\lambda$. In an ideal world we'd want a simple function counting how many of the $x_i$s have non-zero values. Unfortunately this is a very combinatorial problem which is computationally intractable (<a href="http://en.wikipedia.org/wiki/NP-hard">technically NP-hard</a>) for large problems. Faced with such a problem, we take the numerical analyst's standard dodge and come up with a continuous problem which is simpler to solve and where we think the answer will have similar properties.

<div align = "center">
<a href = "http://i0.kym-cdn.com/photos/images/newsfeed/000/181/367/closeenough.png?1317606898">
<img width = "300" src = "http://i0.kym-cdn.com/photos/images/newsfeed/000/181/367/closeenough.png?1317606898">
</a>
</div>

The functions we'll use are the square, the value and the square root of the absolute values of the $x_i$s. (The absolute value is needed just to weight the magnitude of each $x_i$, whether it's positive or negative.) Writing the regression cost function with an $L_{q}$ prior added, using explicit summations, gives the three costs below:

$$
L_2:\qquad cost_2 = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|^2
$$

$$
L_1:\qquad cost_1 = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P |x_i|
$$

$$
L_{1/2}:\qquad cost_{1/2} = \frac{1}{E}\sum_{e=1}^{E} \left( \sum_{i=1}^P M^{(e)}_{i} x_{i} - y^{(e)}\right)^2 + \lambda \sum_{i=1}^P \sqrt{|x_i|}
$$

(You can see the $q$ in $L_q$ corresponds to the exponent the $|x_i|$ is raised to.)

To gain some understanding of how these work, let's start with an artificial example. Starting with a single $(\sum_i a_i x_i - y)^2$, when we expand that out we'll get a big set of weighted terms with various $x_i^2$s, $x_i x_j$s and $x_i$s. This means that it's a multi-dimensional analogue of the quadratic function you remember from high school. When we add lots of these together, as in the cost function above, we still get the same kind of thing.
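Just to pin the notation down, here's a tiny numerical sketch of these three costs in Python with numpy. It's purely illustrative: the matrix `M`, outputs `y`, trial coefficients `x_guess` and the weight `lam` are made up on the spot, and this isn't the code used for the actual El Niño analysis.

```python
import numpy as np

def data_cost(M, y, x):
    """Mean squared error between the model predictions M @ x and the outputs y."""
    residuals = M @ x - y
    return np.mean(residuals ** 2)

def regularized_cost(M, y, x, lam, q):
    """Data cost plus an L_q penalty, lam * sum(|x_i|^q), for q = 2, 1 or 1/2."""
    return data_cost(M, y, x) + lam * np.sum(np.abs(x) ** q)

# Made-up example: E = 50 noisy observations of P = 4 parameters,
# where only the first two parameters actually matter.
rng = np.random.default_rng(0)
E, P = 50, 4
M = rng.normal(size=(E, P))
x_true = np.array([1.5, -0.7, 0.0, 0.0])
y = M @ x_true + 0.1 * rng.normal(size=E)

x_guess = np.array([1.0, -0.5, 0.2, -0.1])
for q in (2, 1, 0.5):
    print(q, regularized_cost(M, y, x_guess, lam=0.1, q=q))
```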
I'll just tell you that for this kind of quadratic function, if you consider all the points in $x_1, \dots, x_P$ space that have the _same_ $cost = C$ you get a higher-dimensional ellipsoid, with the ellipsoids for different $C$ values all centred on a common point and growing outwards as $C$ increases. (Energetic readers can figure out how this is a consequence of the quadratic nature of the cost.) So the unregularized cost function is like a hyperdimensional squashed onion where each layer is an "equal-cost contour line", with the minimum at the centre of the onion.

Now we can investigate how the various kinds of regularization interact with this portion of the cost. Looking at the equal-value contours for $L_2$, $L_1$ and $L_{1/2}$ in 2-D:

[[regularizaton.png:pic]]

* $L_2$: the circular contours. They get more closely spaced going outwards because the weight rises faster than linearly.

* $L_1$: the straight-diamond contours. These stay equally spaced going outwards as the weight rises linearly.

* $L_{1/2}$: the "curvy" diamond contours. Note these become further spaced going outwards because the weight rises slower than linearly.

They behave analogously in higher dimensions.

With this we can plot how the minimum of $cost_2$, $cost_1$ and $cost_{1/2}$ varies as $\lambda$ is progressively increased. In the image below the ellipsoid shows one of the constant $cost$ contours for the data error, with the curves showing the sparsity-regularized minimums, starting off from the centre of the ellipsoid when $\lambda=0$.

[[regressionPaths.png:pic]]

The purple path is for $cost_2$, the blue is for $cost_1$ and the green is for $cost_{1/2}$. We stop increasing $\lambda$ once we've got only one non-zero $x_i$, since the final progression to all the $x_i$ being zero is quite dull! (The blue and green curves both hit the "floor" where they have a sharp bend in the middle.)

We can also plot the $x_i$ values on the vertical axis against $\lambda$ on the horizontal, starting again from the $\lambda=0$ minimum:

[[trajectory1D.png:pic]]

Again the plots show the minimizing $x_i$ values for $cost_2$ first, then $cost_1$ and finally $cost_{1/2}$.

**Puzzle**: can you relate the shape of the contours for the regularization above to the $x_i$ trajectories shown here? (**Note the answer is in the paragraphs below.**)

So suppose we've got the data-term portion of the cost, and we want a sparse representation with only *one* non-zero $x_i$. We can see that the "mathematical" answer is to set to 0 the two $x_i$s corresponding to the two axes along which the cost increases most slowly, and leave the other alone, to get the least change in the cost from the natural minimum. For an axis-aligned ellipsoid in only 3 dimensions we could find this directly, but when we've got a non-axis-aligned ellipsoid in thousands of dimensions or more we can't find this exact answer. So let's look at how using the regularizer to "push" values to 0 affects things.

Because the $L_2$ regularizer increases quadratically, it penalizes even slightly larger values much more highly, so it tends to "push" the coefficients to have similar magnitudes. You can see that in the way that all $x_i$s move towards the origin at roughly equal speeds, with each $x_i$ progressively getting smaller but only becoming zero at roughly the same time as the others.
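For the curious, here's a rough Python/numpy sketch of the behaviour just described, together with the contrasting $L_1$ behaviour we turn to next. It uses made-up data rather than anything from the El Niño analysis, and the algorithms (the closed-form $L_2$ solve and a plain coordinate descent with soft thresholding for $L_1$) are just the simplest things that work, not necessarily what produced the plots above. Running it shows the $L_2$ coefficients all shrinking together without ever quite reaching zero, while the $L_1$ coefficients hit exactly zero one after another as $\lambda$ grows.

```python
import numpy as np

def l2_path(M, y, lambdas):
    """L_2 minimizers: closed form (M'M/E + lam*I) x = M'y/E for each lambda."""
    E, P = M.shape
    return np.array([np.linalg.solve(M.T @ M / E + lam * np.eye(P), M.T @ y / E)
                     for lam in lambdas])

def soft_threshold(b, t):
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def l1_path(M, y, lambdas, n_sweeps=200):
    """L_1 minimizers via plain coordinate descent (soft thresholding each coordinate)."""
    E, P = M.shape
    x = np.zeros(P)
    col_sq = np.sum(M ** 2, axis=0) / E          # (M_j . M_j)/E for each column j
    path = []
    for lam in lambdas:
        for _ in range(n_sweeps):
            for j in range(P):
                r = y - M @ x + M[:, j] * x[j]   # residual with coordinate j removed
                b = M[:, j] @ r / E
                x[j] = soft_threshold(b, lam / 2) / col_sq[j]
        path.append(x.copy())
    return np.array(path)

# Made-up data: three coefficients of rather different sizes plus noise.
rng = np.random.default_rng(1)
E, P = 100, 3
M = rng.normal(size=(E, P))
y = M @ np.array([2.0, 1.0, 0.3]) + 0.2 * rng.normal(size=E)

lambdas = np.linspace(0.0, 3.0, 13)
print("L_2: all x_i shrink together, never exactly zero")
print(np.round(l2_path(M, y, lambdas), 3))
print("L_1: the x_i hit exactly zero one after another")
print(np.round(l1_path(M, y, lambdas), 3))
```

(The $L_{1/2}$ case is left out here because its coordinate update needs the cubic-root trick described in the next subsection.)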
For the $L_1$ regularizer, a non-axis-aligned ellipse will generally first touch one of the straight-diamond contours at a corner, and the corners sit on the axes where some of the $x_i$s are exactly zero. That's why, as $\lambda$ increases, the $L_1$ minimum drives coefficients to exactly zero one at a time rather than shrinking them all uniformly.

## Practical aspects of regression using an $L_{1/2}$ prior ##

If we restrict to one variable from the numerous vectors and denote this variable by $x$, we get

$$
\frac{\partial cost}{\partial x} = A x + B + \frac{\lambda \, sgn(x)}{2\sqrt{|x|}} = 0
$$

where $A$ and $B$ don't depend on $x$. If we denote $\lambda \, sgn(x)/2$ by $C$, we can multiply through by $y =\sqrt{|x|}$ to find the minimum (along this co-ordinate) is at

$$
\pm A y^3 + B y + C = 0
$$

where the $\pm$ depends on whether $x$ is positive or negative, all subject to needing to ensure the solutions in $y$ are also consistent with the original equation. Since this is a cubic equation we have a simple closed form for its solutions, and hence can efficiently solve the original equation.

**Puzzle**: If we restrict ourselves to regularizers where the co-ordinate descent update can be computed analytically (and there's *no* good reason to do this), what regularization functions can we use?

## Back to the experimental results ##