Blog - Science, models and machine learning

This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.

The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles. So far we've mostly talked about the physics and data of El Niño, and looked at one method of actually trying to predict El Niño events. Since there's going to be more data exploration using methods more typical of machine learning, it's a good time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models. Here we'll concentrate on the concepts rather than the fine details and particular techniques.

We also stress that there's no fundamental distinction between **machine learning (ML)** and **statistical modelling and inference**. There are certainly differences in culture, background and terminology, but in terms of the actual algorithms and mathematics used there's a great deal of commonality. Throughout the rest of the article we'll talk about "machine learning models", but we could equally well have said "statistical models".

For our purposes, a **model** is any systematic procedure for
taking some input data and providing a prediction of some output. There's a spectrum of models, ranging from physically based
models at one end to purely data driven models at the other. As a very
simple example, suppose you commute by car from your place of
work to your home and you want to leave work in order to arrive home
at 6.30pm. You can tackle this by building a model which takes as
input the day of the week and gives you back a time to leave.

There's the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6.30pm. You might find that the traffic is lighter on weekend days so you can leave at 6.10pm, while on weekdays you have to leave at 5.45pm, except on Wednesdays when you have to leave at 5.30pm. Since you've just crunched the data, you have no idea why this works, but it's a very reliable rule for predicting when you need to leave.
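
To make this concrete, here's a minimal sketch of the purely data-driven approach; the observation records are entirely made up for illustration:

```python
# A minimal sketch of the data-driven commute "model": it simply
# memorises the latest leaving time that worked on each day, with no
# idea why those times work. All the records below are made up.

observations = [
    ("Mon", "17:45", True),   # (day, leaving time, home by 6.30pm?)
    ("Wed", "17:45", False),
    ("Wed", "17:30", True),
    ("Sat", "18:10", True),
]

def fit(observations):
    """For each day, keep the latest leaving time known to succeed."""
    model = {}
    for day, leave, on_time in observations:
        if on_time and leave > model.get(day, "00:00"):
            model[day] = leave   # "HH:MM" strings compare correctly
    return model

model = fit(observations)
print(model["Wed"])  # -> 17:30, a reliable rule with no explanation
```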

There's the physical model approach, where you attempt to infer how many people are doing what on any given day and then figure out what that implies for the traffic levels and thus what time you need to leave. (Of course this is just an illustrative example; in climate modelling a physical model would be based upon actual physical laws such as conservation of energy, conservation of momentum, Boyle's law, etc.)

In this case you find out that there's a mid-week sports game on Wednesday evenings which leads to even higher traffic. This not only predicts that you've got to leave at 5.30pm on Wednesdays but also lets you understand why.

The situation with data driven techniques is analogous to one of those American medication adverts: there's the big message about how "using a purely data driven technique can change your life for the better" while the voiceover gabbles out all sorts of small print. The remainder of this post will try to cover some of the basic principles in that "small print".

There's a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields quite a poor model. Almost all successful machine learning applications are preceded by some form of data preprocessing. Sometimes this is simply rescaling, so that different variables have similar magnitudes, are zero-centred, etc.
However, there are often steps that are more involved. For example,
many machine learning techniques have what are called *kernel
variants*, which involve (in a way whose details don't matter
here) choosing a nonlinear mapping from the original data to a new
space which is more amenable to the core algorithm. There are various
kernels with the right mathematical properties, and frequently the choice of a
good kernel is made either by experimentation or knowledge
of the physical principles. Wikipedia's entry on the support vector
machine has a nice illustration of how a good choice of kernel can
convert a dataset that is not linearly separable into a linearly
separable one; the sketch below shows the same idea with an explicit
feature map.
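
This is a rough illustration of the underlying idea, via an explicit feature map rather than a real kernelised algorithm, on synthetic data:

```python
import numpy as np

# Points inside vs outside a circle are not linearly separable in the
# plane, but adding the feature x^2 + y^2 makes them so.

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
labels = (pts[:, 0] ** 2 + pts[:, 1] ** 2 < 0.5).astype(int)

# Map (x, y) -> (x, y, x^2 + y^2).
mapped = np.column_stack([pts, (pts ** 2).sum(axis=1)])

# In the mapped space the flat plane z = 0.5 separates the classes.
predictions = (mapped[:, 2] < 0.5).astype(int)
print((predictions == labels).mean())  # -> 1.0
```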

An extreme example of preprocessing is explicitly forming new features in the data. For example, in the work by Ludescher *et al* that we've been looking at, the correlations between different points are taken as the basic features to consider. While these could theoretically be learned by the ML algorithm, correlation is quite a complicated function to learn. By explicitly choosing to represent the data using this feature, the amount the algorithm has to discover is reduced, and hence the likelihood of it finding an excellent model is dramatically increased.

Some of the problems that we describe below would vanish if we had unlimited amounts of data to use for model development. However, in real cases we often have a strictly limited amount of data we can obtain. Consequently we need methodologies to address the issues that arise when data is limited.

The most common way to work with collected data is to split it into a **training
set** and a **test set**.
(Sometimes there is a division into three sets, the third being
a **validation
set**.) The training and validation sets are used in the process
of determining the best model parameters, while the test set –
which is not used in *any* way in determining the best model
parameters – is then used to see how effective the model is
likely to be on new, unseen data. This division of data into multiple
sets acts to further reduce the effective amount of data used in
setting model parameters.

After we've made this split we have to be careful how much of the test data we scrutinise in any detail since once it has been investigated it can't meaningfully be used for testing again, although it can still be used for future training. (Examining the test data is often informally known as **burning data**.) That only applies to detailed inspection however; one common way to develop a model is to look at some training data and then **train the model** (also known as **fitting the model**) on that training data. It can then be evaluated on the test data to see how well it does. It's also then OK to purely mechanically train the model on the test data and evaluate it on the training data to see how "stable" the performance is. (If you get dramatically different scores then your model is probably flaky!) However, once we start to look at precisely *why* the model failed on the test data — in order to change the form of the model — the test data has now become training data and can't be used as test data for future variants of that model. (Remember, the real goal is to accurately predict the outputs for *new, unseen* inputs!)
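
Here's a hedged sketch of that workflow on synthetic data, using ordinary least squares as a stand-in model; the 80/20 split and all numbers are illustrative:

```python
import numpy as np

# Synthetic regression problem: 100 examples, 3 input variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Randomly split into a training set and a test set.
idx = rng.permutation(len(X))
train, test = idx[:80], idx[80:]

def fit(X, y):
    """Least-squares coefficients: the 'training' step."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(c, X, y):
    return np.mean((X @ c - y) ** 2)

c = fit(X[train], y[train])
print("test error:", mse(c, X[test], y[test]))

# The purely mechanical stability check: train on the test set and
# evaluate on the training set. Wildly different scores would suggest
# the model is flaky.
c_swapped = fit(X[test], y[test])
print("swapped error:", mse(c_swapped, X[train], y[train]))
```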

Suppose we're modelling a system where some aspect has a
*true* probability distribution $ P() $. We can't directly
observe this, but we have some samples $ S $ obtained from observation
of the system, and hence drawn from $ P() $. Clearly there are problems if we generate this sample in a way that biases which parts of the distribution we sample from: it wouldn't be a good idea to try to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities.
But if we take care to avoid (as much as possible) any sampling bias, then we
can make various kinds of estimates of the distribution that we think
$ S $ comes from; let's consider the estimate $ P'() $ implied for $ S $ by some
particular technique. It would be nice if $ P = P' $, wouldn't
it? And indeed many good estimators have the property that as the size
of $ S $ tends to infinity $ P' $ will tend to $ P $. However, for finite
sizes of $ S $, and especially for *small sizes*, $ P' $ may have
some spurious detail that's not present in $ P $.

As a simple illustration of this, my computer has a pseudo-random number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Notice that the sample contains no values at all between 6553 and 18909 – over a third of the range – even though the underlying distribution is genuinely uniform: with only 8 samples, that kind of spurious structure is inevitable.
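
A quick experiment makes the point; this is a sketch where the seed and bin count are arbitrary:

```python
import random

# Draw 8 numbers from a genuinely uniform distribution on [0, 32767]
# and count how many land in each of 4 equal-width bins. The true
# distribution puts 25% in each bin, but 8 samples rarely come close.

random.seed(42)
samples = [random.randint(0, 32767) for _ in range(8)]
counts = [0, 0, 0, 0]
for s in samples:
    counts[s // 8192] += 1
print(counts)  # e.g. something like [4, 1, 0, 3]: "structure" that is noise
```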

Almost all modelling techniques, while not necessarily estimating an
explicit probability distribution from the training samples,
can be seen as building functions that are related to that probability
distribution: for example, a thresholding classifier for dividing
input into two output classes will place the threshold at the optimal
point for the distribution implied by the samples. As a consequence,
one important aim in building machine learning models is to
capture the features that are genuinely present in the true probability
distribution while not learning details so fine that they are likely
to be spurious artefacts of the small sample size. If you think about
this, it's a bit counter-intuitive: you *deliberately don't want to
perfectly reflect every single pattern in the training
data*. Indeed, specialising a model too closely to the training data is
given the name **over-fitting**.
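
Here's a small sketch of over-fitting using polynomial regression on synthetic data; all parameters are illustrative:

```python
import numpy as np

# Fit degree-1 and degree-7 polynomials to 8 noisy samples of the line
# y = 2x, then compare their errors on fresh inputs from the same line.

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=8)   # truth plus a little noise

x_new = np.linspace(0.05, 0.95, 50)           # unseen inputs
for degree in (1, 7):
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - 2.0 * x_new) ** 2)
    print(degree, train_err, new_err)
# The degree-7 fit has near-zero training error but typically does
# worse on the unseen inputs: it has learned the noise, not the line.
```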

This brings us to **generalization**. Strictly speaking, generalization is the ability of a model to work well on unseen instances of the problem (which may be difficult for a variety of reasons). In practice, however, one tries hard to get representative training data, so that the main issue in generalization is preventing overfitting, and the main way to do that is – as discussed above – to split the data into a set for training and a set *only* used for testing.

One factor that's often related to generalization is **regularization**, and in particular **sparsity**. Sparsity refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It's often possible to incorporate a *prior* into the modelling procedure which will encourage the model to be sparse. (Recall that in Bayesian modelling the **prior** represents our initial ideas of how likely various different parameter values are.) There are some cases where we have detailed Bayesian priors about sparsity for problem-specific reasons. However, the more common case is a "general modelling" belief, based upon general experience in modelling, that sparser models have better generalization performance.

As an example of using sparsity-promoting priors, we can look at linear regression. For standard regression we're trying to optimise $$ \sum_{i=1}^E (y_i - \sum_{j=1}^P c_j x_{i,j})^2 $$ while with the $l_1$ prior we're optimising $$ \sum_{i=1}^E (y_i - \sum_{j=1}^P c_j x_{i,j})^2 + \lambda \sum_{j=1}^P |c_j| $$ where $ \lambda $ is the prior weight: the larger $ \lambda $ is, the more of the coefficients $ c_j $ are driven to exactly zero, and hence the sparser the model.
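
As a sketch of this effect on synthetic data, using scikit-learn's `Lasso` (whose `alpha` parameter plays the role of the prior weight $ \lambda $, up to a rescaling by the number of examples):

```python
import numpy as np
from sklearn.linear_model import Lasso   # linear regression with an l1 prior

# Synthetic data in which only 2 of the 10 input variables matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
true_c = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_c + rng.normal(scale=0.1, size=50)

# Count nonzero coefficients as the prior weight increases.
for alpha in (0.01, 0.1, 1.0):
    coeffs = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, int(np.sum(coeffs != 0)))   # heavier prior -> sparser model
```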

There are a couple of other reasons for wanting sparse models. The
obvious one is speed of model evaluation, although this is much less
significant with modern computing power. A less obvious reason is that
one can often only *effectively utilise* a sparse model, eg, if you're attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case one might want a good sparse model rather than an excellent dense model.

It is certainly possible to take a predictive model obtained by
machine learning and use it to figure out a physically based model;
this is one way of performing what's known as **data mining**. However, in practice there are several reasons why it's necessary to take some care when doing this:

* The variables in the training set may be related by some non-observed **latent variables** which may be difficult to reconstruct without knowledge of the physical laws that are in play. (There are machine learning techniques which attempt to reconstruct unknown latent variables, but this is a much more difficult problem than estimating known but unobserved latent variables.)

* Machine learning models have a maddening ability to find variables that are predictive purely because of the way the data was gathered. For example, in a vision system aimed at finding tanks, all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren't tanks were taken on other days. A neural net cunningly learned that to decide if it was being shown a tank it should look for the shadow from the grime.

* It's common to have groups of *very highly correlated* input variables. In that case a model will generally learn a function which utilises an arbitrary linear combination of the correlated variables, and an equally good model would result from using any other linear combination. (This is an example of the statistical problem of **identifiability**; see the sketch after this list.) Certain sparsity-encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it's still important not to assign too much significance to the particular division of model parameters within groups of correlated variables.

* One can often come up with good machine learning models even when physically important variables haven't been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren't important there won't be found. For example, if all the data about a collision system to be modelled is collected at low speeds, the machine learning model won't learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.
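
Here's a small sketch of the identifiability problem with two nearly identical input variables; the data is synthetic and the exact coefficients you get are essentially arbitrary:

```python
import numpy as np

# Two almost perfectly correlated inputs: only the *sum* of their
# coefficients is pinned down by the data; the individual split is
# essentially arbitrary.

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-3, size=200)     # near-duplicate of x1
y = x1 + x2 + rng.normal(scale=0.1, size=200)  # truth: coefficients sum to 2

X = np.column_stack([x1, x2])
c, *_ = np.linalg.lstsq(X, y, rcond=None)
print(c, c.sum())  # e.g. something like [ 8.3, -6.3 ], summing to ~2.0
```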

While there are some situations where a model is sought purely to
develop knowledge of the universe, in many cases we are interested in
models in order to direct actions. For example, having forewarning of
El Niño events would enable all sorts of mitigation
actions. However, these actions are costly so they shouldn't be
undertaken when
there *isn't* an upcoming El Niño. When presented with
an unseen input the model can either match the actual output (ie, be
right) or differ from the actual output (ie, be wrong). While it's
impossible to know in advance if a single output will be right or wrong – if we could tell that we'd be better off using *that* in our model – from the
training data it's generally possible to estimate the fractions of
predictions that will be right and will be wrong in a large number of
uses. So we want to link these probabilities with the effects of
actions taken in response to model predictions.

We can do this using a **utility function** and a **loss function**. The utility function maps each possible output to a numerical value proportional to the benefit of taking actions when that output was correctly anticipated, while the loss function maps outputs to a number proportional to the cost of the actions when the output was incorrectly predicted by the model. (There is evidence that human beings often have inconsistent utility/loss functions, but that's a story for another day...)

There are three common ways the utility and loss functions are used:

1. Maximising the expected value of the utility minus the loss.

2. Minimising the expected loss while ensuring that the expected utility is at least some value.

3. Maximising the expected utility while ensuring that the expected loss is at most some value.
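
As a sketch of the first of these criteria, with utility and loss numbers invented purely for illustration (they are not real figures for any application):

```python
# Decide whether to act on a model prediction by maximising expected
# utility minus expected loss. All numbers are illustrative.

UTILITY = 100.0   # benefit of acting when an event really does occur
LOSS = 30.0       # cost of acting on a false alarm

def expected_net_benefit(p_event):
    """Expected utility minus expected loss if we act on the prediction."""
    return p_event * UTILITY - (1.0 - p_event) * LOSS

# Acting is worthwhile exactly when the expectation is positive, i.e.
# when p_event > LOSS / (UTILITY + LOSS), about 0.23 with these numbers.
for p_event in (0.1, 0.3, 0.5):
    print(p_event, expected_net_benefit(p_event) > 0)
```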

Of course sometimes when building a model we don't know enough details of how it will be used to get accurate utility and loss functions (or indeed know how it will be used at all).

All of the ideas discussed above are really just ways of making sure that work developing statistical/machine learning models for a real problem is producing meaningful results. As Bob Dylan (almost) sang, "to live outside the physical law, you must be honest; I know you always say that you agree".

category: blog