# The Azimuth Project Blog - Science, models and machine learning (Rev #10, changes)


This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.


# Science, models and machine learning

The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles such as this one. So far we've mostly talked about the physics and data of El Niño, along with looking at one method of actually trying to predict El Niño events. Since there's going to be more data exploration using methods more typical of machine learning, it's an opportune time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models.

It should also be pointed out that there's not a fundamental distinction between machine learning and statistics. There are certainly differences in culture, background, terminology and typical real-world tasks considered, but in terms of the actual algorithms and mathematics used there's a great commonality. Throughout the rest of the article we'll use the expression "machine learning models", but could equally have used "statistical models".

For our purposes here, a model is any systematic procedure for taking some input data and providing a prediction of some output data. There's a spectrum of models, ranging from physically based models at one end to purely data driven models at the other. As a very simple example, suppose you have a commute by car from your place of work to your home and you want to leave work in order to arrive home at 6.30pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.

• There's the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6.30pm. You might find that the traffic is lighter on weekend days so you can leave at 6.10pm while on weekdays you have to leave at 5.45pm, except on Wednesdays when you have to leave at 5.30pm. Since you've just crunched on the data you have no idea why it works, but it's a very reliable rule when you use it to predict when you need to leave.

• There's the physical model approach, where you figure out what people are doing on any given day, then figure out what that implies for the traffic levels and thence what time you need to leave. In this case you find out that there's a mid-week sports game on Wednesday evenings which leads to even higher traffic. By proceeding from first principles you've got a more detailed framework which is equally predictive, but at the cost of having to investigate a lot of complicated underlying principles.

This is like one of those American medication adverts which presents a big message about "Using purely data driven techniques is wonderful" while the voiceover lays out all sorts of small print. The remainder of this post will try to cover some of the basic principles in that small print.
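To make the data driven approach to the commute concrete, here is a minimal sketch. The trip log, the helper names and the rule used (take the latest leaving time that was on time and earlier than every late trip on that day) are all assumptions made for illustration; the article itself doesn't prescribe a particular procedure.

```python
from collections import defaultdict


def minutes(hour, minute):
    """Convert a 24-hour clock time to minutes after midnight."""
    return hour * 60 + minute


# Hypothetical trip log: (day, leaving time, got home by 6.30pm?)
TRIPS = [
    ("Mon", minutes(17, 45), True),
    ("Mon", minutes(17, 55), False),
    ("Wed", minutes(17, 30), True),
    ("Wed", minutes(17, 45), False),
    ("Sat", minutes(18, 10), True),
    ("Sat", minutes(18, 20), False),
]


def fit_leaving_times(trips):
    """Build a lookup table: day -> latest leaving time that looks safe.

    A leaving time counts as safe if that trip was on time and it is
    earlier than every late trip recorded for the same day.
    """
    by_day = defaultdict(list)
    for day, leave, on_time in trips:
        by_day[day].append((leave, on_time))

    model = {}
    for day, records in by_day.items():
        late = [leave for leave, on_time in records if not on_time]
        earliest_late = min(late) if late else float("inf")
        safe = [leave for leave, on_time in records
                if on_time and leave < earliest_late]
        if safe:  # days with no safe trip get no prediction at all
            model[day] = max(safe)
    return model


model = fit_leaving_times(TRIPS)
```

Note that the resulting table says nothing about why Wednesdays are different, and it has no entry at all for a day that never appeared in the log.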

## How much data do we have?

Some of the problems that we describe below would vanish if we had unlimited amounts of training data. However, in real cases we often have a strictly limited amount of data we can use for training. Consequently we have to address the issues that arise when data is limited.

In addition, we have to be careful how much of the test data we scrutinise in detail, since once it has been investigated it can't meaningfully be used for testing again (although it can still be used for future training). Informally this is often known as burning data. That only applies to detailed scrutiny, however; one common way to develop a model is to look at some training data and then train the model (also known as fitting the model) to that training data. It can then be evaluated on the test data to see how well it does. It's also OK to purely mechanically train the model on the test data and then evaluate it on the training data to see how "stable" the performance is. However, once we start to look at precisely why the model failed on the test data (in order to change the form of the model), the test data has become training data and can't be used as test data for future variants of the model.
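The basic train-then-evaluate workflow can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the 25% test fraction is arbitrary, and the "model" is deliberately trivial (it predicts the mean of the training data).

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean(train):
    """'Train' the simplest possible model: always predict the training mean."""
    return sum(train) / len(train)


def mean_squared_error(prediction, data):
    """Average squared difference between a constant prediction and the data."""
    return sum((x - prediction) ** 2 for x in data) / len(data)


# Synthetic observations standing in for real measurements.
rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(40)]

train, test = split_train_test(data)
model = fit_mean(train)                        # fitted using training data only
train_error = mean_squared_error(model, train)
test_error = mean_squared_error(model, test)   # mechanical evaluation, no peeking
```

Computing `test_error` mechanically like this does not burn the test data; inspecting individual test points to decide how to change the model would.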

## Random patterns in small samples

So suppose we're trying to model a system where some aspect has a true probability distribution $P()$. We can't directly observe that, but we have some samples $S$ which were obtained by observing the system and hence come from $P()$. Clearly there are problems if we generate this sample in a way that will bias the area of the distribution we sample from: it wouldn't be a good idea to try to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities. But if we take care to avoid (as much as possible) any bias, then we can make various kinds of estimates of the distribution that we think $S$ comes from; let's call the estimate implied for $S$ by some particular procedure $P'()$. It would be nice if $P = P'$, wouldn't it? And indeed many good estimators have the property that as the size of $S$ tends to infinity $P'$ will tend to $P$. However, for finite sizes of $S$, and especially for small sizes, $P'$ may have some spurious detail that's not present in $P$.

As a simple illustration of this, my computer has a pseudo-random number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

• 2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Note that we've got one subset of 4 values (2928, 6552, 1672, 3937) within the interval of length 5012 between 1540 and 6552 and another subset of 3 values (23440, 23979 and 28451) in the interval of length 5011 between 23440 and 28451. For this uniform distribution the expected number of values falling within a range of that length is about 1.2. Readers will be familiar with how a random quantity computed from a small sample will show a large amount of variation around its expected value, which only reduces as the sample size increases, so this isn't a surprise. However, it does highlight that even completely unbiased sampling from the full distribution will typically give rise to extra "structure" within the distribution implied by the samples.
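The figure of about 1.2, and how often clusters like these turn up by chance, can be checked with a short simulation. This is a sketch under assumptions: it uses Python's `random` module rather than the generator mentioned above, and the threshold of "at least 4 values in some window" is chosen simply to match the example.

```python
import random

N_VALUES = 32768   # the generator returns integers from 0 to 32767
SAMPLE_SIZE = 8
WINDOW = 5012      # interval length used in the example above

# Expected number of samples landing in one fixed interval of this length.
expected_count = SAMPLE_SIZE * WINDOW / N_VALUES   # about 1.22


def max_window_count(sample, window=WINDOW):
    """Largest number of sample points covered by any interval of this length.

    It is enough to try intervals starting at each sample point, since any
    interval can be slid right until its left end touches a point.
    """
    points = sorted(sample)
    best = 0
    for i, low in enumerate(points):
        count = sum(1 for x in points[i:] if x - low <= window)
        best = max(best, count)
    return best


def fraction_with_cluster(min_count=4, trials=2000, seed=0):
    """Estimate how often a uniform sample has min_count points in one window."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.randrange(N_VALUES) for _ in range(SAMPLE_SIZE)]
        if max_window_count(sample) >= min_count:
            hits += 1
    return hits / trials


article_sample = [2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910]
article_cluster = max_window_count(article_sample)   # 4, as noted in the text
estimated_fraction = fraction_with_cluster()
```

Here `estimated_fraction` is only a Monte Carlo estimate, so its value will vary with the seed and the number of trials.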

## Generalisation

Now almost all modelling techniques, while not necessarily estimating a full model of the probability distributions from the training samples, can be seen as building functions that are related to the probability distribution: for example, a thresholding classifier for dividing input into two output classes will place the threshold at the optimal point for the distribution implied by the samples. As a consequence, one important aim in building machine learning models is to try to estimate the features that are present in the full probability distribution while not learning such fine details that they are likely to be spurious features due to the small sample. If you think about this, it's a bit counter-intuitive: you deliberately don't want to perfectly reflect every single pattern in the training data. Indeed, specialising a model too closely to the training data is given the name over-fitting.

This brings us to generalisation. Strictly speaking, generalisation is the ability of a model to work well upon unseen instances of the problem (which may be difficult for a variety of reasons). In practice, however, one tries hard to get representative training data so that the main issue in generalisation is preventing over-fitting.
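Over-fitting can be illustrated with a toy comparison. The sketch below is an assumption-laden example, not a method from this article: the two classes are synthetic overlapping Gaussians, the simple model is a nearest-class-mean rule (which in one dimension amounts to a single threshold at the midpoint of the class means), and the over-fitted model just memorises the training set by copying the label of the nearest training point.

```python
import random


def make_data(n_per_class, rng):
    """Class 0 ~ N(0, 1) and class 1 ~ N(2, 1): they overlap, so no rule is perfect."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data


def fit_class_means(train):
    """Simple model: remember only the mean of each class."""
    xs0 = [x for x, label in train if label == 0]
    xs1 = [x for x, label in train if label == 1]
    return sum(xs0) / len(xs0), sum(xs1) / len(xs1)


def predict_nearest_mean(means, x):
    """Predict the class whose mean is closer to x."""
    mean0, mean1 = means
    return 0 if abs(x - mean0) <= abs(x - mean1) else 1


def predict_memoriser(train, x):
    """Over-fitted model: copy the label of the single nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(predict, data):
    """Fraction of (x, label) pairs that the predictor gets right."""
    return sum(1 for x, label in data if predict(x) == label) / len(data)


rng = random.Random(0)
train = make_data(10, rng)      # small training sample
test = make_data(2000, rng)     # large held-out sample

means = fit_class_means(train)
simple_train = accuracy(lambda x: predict_nearest_mean(means, x), train)
simple_test = accuracy(lambda x: predict_nearest_mean(means, x), test)
memo_train = accuracy(lambda x: predict_memoriser(train, x), train)
memo_test = accuracy(lambda x: predict_memoriser(train, x), test)
```

The memoriser reproduces every training label, so `memo_train` is perfect, yet that says nothing about how it fares on `test`. Comparing `memo_test` with `simple_test` for a few different seeds is a quick way to see how much of the memorised detail was spurious.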

One factor that's often related to generalisation is sparsity. This refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It's often possible to incorporate a prior into the modelling procedure which will encourage the model to be sparse. (Recall that in Bayesian modelling the prior represents our initial ideas of how likely various different parameter values are.) There are some cases where we have various detailed Bayesian priors about sparsity for problem specific reasons. However, the more common case is having a "general modelling" belief, based upon general experience in modelling, that sparser models have a better generalisation performance.

There are a couple of other reasons for wanting sparse models. The obvious one is speed of model evaluation, although that is much less significant with modern computing power. A less obvious reason is that sometimes only a sparse model is usable, e.g., if you're attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case one might want a good sparse model rather than an excellent dense model.
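One widely used way of encouraging sparsity is an L1 penalty on the coefficients of a linear model (the "lasso"), which corresponds to a Laplace prior when the fit is read as a maximum a posteriori estimate. The sketch below is just one possible illustration, not necessarily the kind of prior the article has in mind: the data is synthetic, only features 0 and 2 actually influence the output, and the penalty value 0.3 is an arbitrary choice.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping small values to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n) * sum((y - Xw)^2) + penalty * sum(|w_j|) by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            rho = 0.0    # correlation of feature j with the residual ignoring j
            norm = 0.0   # squared length of feature j
            for row, target in zip(X, y):
                others = sum(row[k] * w[k] for k in range(d) if k != j)
                rho += row[j] * (target - others)
                norm += row[j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return w


# Synthetic data: five candidate inputs, of which only two matter.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0.0, 0.1) for row in X]

dense = lasso_fit(X, y, penalty=0.0)    # ordinary least squares
sparse = lasso_fit(X, y, penalty=0.3)   # L1-penalised fit
```

With no penalty every coefficient generally ends up slightly non-zero, whereas the penalised fit sets the coefficients of the irrelevant inputs to exactly zero, at the cost of shrinking the two genuine coefficients towards zero as well.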

<!-- <div align = "center"> <a href = "http://www.shrimpnews.com/FreeReportsFolder/WeatherFolder/ElNino.html#Monster"> <img width = "445" src = "http://math.ucr.edu/home/baez/ecological/el_nino/ElNinoMap1998.jpg"> </a> </div> -->

## Preprocessing and feature formation

There's a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields quite a poor model. Almost all successful machine learning applications are preceded by some form of preprocessing. Sometimes this is simply rescaling variables so that different variables have similar magnitudes, are zero centred, etc. However, there are often steps that are more involved. For example, many machine learning techniques have what are called kernel variants, which involve (in a way whose details don't matter here) choosing a nonlinear mapping from the original data to a new space which is more amenable to the core algorithm. There are various kernels with the right mathematical properties, and the choice of a good kernel frequently happens either by experimentation or knowledge of the physical principles.
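The simplest preprocessing step mentioned above, zero centring and rescaling, might look like the following sketch. The temperature values and function names are hypothetical; the one design choice worth noting is that the centre and scale are estimated from the training data only and then reused unchanged on the test data.

```python
import statistics


def fit_scaler(train_column):
    """Estimate the centre and spread of a variable from training data only."""
    mean = statistics.fmean(train_column)
    spread = statistics.pstdev(train_column)
    # Guard against a constant column, which would otherwise divide by zero.
    return mean, (spread if spread > 0.0 else 1.0)


def apply_scaler(scaler, column):
    """Zero-centre and rescale a column using previously fitted parameters."""
    mean, spread = scaler
    return [(x - mean) / spread for x in column]


# Hypothetical measurements of one input variable.
train_temperature = [14.0, 15.5, 13.0, 16.5, 15.0]
test_temperature = [14.5, 17.0]

scaler = fit_scaler(train_temperature)
scaled_train = apply_scaler(scaler, train_temperature)
scaled_test = apply_scaler(scaler, test_temperature)   # reuses training parameters
```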

An extreme example of preprocessing is explicitly forming new features in the data. For example, in the work by Ludescher et al. that we've been looking at, the correlations between different points are taken as the basic features to consider. While these could theoretically be learned by the ML algorithm, this is quite a complicated expression. By explicitly choosing that feature, the amount the algorithm has to discover is reduced and hence the likelihood of it finding an excellent model is dramatically increased.
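As a rough sketch of explicit feature formation, the code below turns a set of time series recorded at different sites into one feature per pair of sites. It uses the plain zero-lag Pearson correlation, which is a simplifying assumption and is not claimed to be the exact quantity used by Ludescher et al.; the site names and values are made up.

```python
import math


def pearson_correlation(xs, ys):
    """Zero-lag Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_features(series_by_site):
    """Map each unordered pair of sites to the correlation of their series."""
    names = sorted(series_by_site)
    return {
        (a, b): pearson_correlation(series_by_site[a], series_by_site[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical temperature anomalies at three sites.
series = {
    "site_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "site_b": [3.0, 5.0, 7.0, 9.0, 11.0],   # rises in step with site_a
    "site_c": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as site_a rises
}
features = correlation_features(series)
```

The learning algorithm would then be given `features` as its input rather than the raw series, so it no longer has to discover the correlation structure for itself.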

## Inferring a physical model from a ML model

It is certainly possible to take a predictive model obtained by machine learning and use it to figure out a physically based model; this is one way of performing what's known as data mining. However, in practice there are several reasons why it's necessary to exhibit some care when doing this:

• The variables in the training set may be related by some non-observed latent variables which may be difficult to reconstruct without knowledge of the physical laws that are in play.

• Machine learning models have a maddening ability to find variables that are predictive due to the way the training data was gathered. For example, in an early vision system aimed at finding tanks all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren't tanks were taken on other days. The neural net cunningly learned that to decide if it was being shown a tank it should see if the shadow from the grime was in place or not.

• It's common to have very highly correlated input variables. In that case a model will generally learn a function which utilises an arbitrary combination of the correlated variables, and an equally good model would result from using any other combination. Certain sparsity encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it's still important not to assign too much importance to the particular division of model parameters within groups of correlated variables.

• One can often come up with good machine learning models even when physically important variables haven't been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren't important there won't be found. For example, if in a collision system to be modelled all data is collected at low speeds, the machine learning model won't learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.

## Utility functions and decision theory

using models for decision

how utility fn affects model

## Conclusions

All of the ideas discussed above are really just ways of making sure that work on statistical/machine learning models is producing meaningful results in situations where the training data is scarce. As Bob Dylan (almost) sang, "to work outside the physical law, you must be honest; I know you always say that you agree".
