Blog - Science, models and machine learning (Rev #2)

This page is a blog article in progress, written by David Tweed. To see discussions of this article while it was being written, visit the Azimuth Forum.


The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, such as the xxx. So far we've mostly talked about the physics and data of El Niño, along with looking at one method of actually trying to predict El Niño events. Since there's going to be more data exploration using methods more typical of machine learning, it's an opportune time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models.

For our purposes here, a model is any systematic procedure for taking some input data and providing a prediction of some output data. There's a spectrum of models, ranging from physically based models at one end to purely data based models at the other. As a very simple example, suppose you have a commute by car from your place of work to your home and you want to leave work in order to arrive home at 6.30pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.

There's the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6.30pm. You might find that the traffic is lighter on weekend days so you can leave at 6.10pm while on weekdays you have to leave at 5.45pm, except on Wednesdays when you have to leave at 5.30pm. Since you've just crunched on the data you have no idea why it works, but it's a very reliable rule when you use it to predict when you need to leave.
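As a minimal sketch of what such a purely data driven model looks like in code, the following Python builds a lookup table from a trial log. The log entries, the `TRIALS` name and the `fit_leaving_times` helper are all hypothetical, invented for illustration; the rule used (keep the latest leaving time that ever worked on each day) is just one simple choice and takes no account of how often it worked.

```python
# Hypothetical trial log: (day, leaving time as zero-padded "HH:MM", home by 6.30pm?)
TRIALS = [
    ("Mon", "17:45", True), ("Mon", "18:00", False),
    ("Wed", "17:30", True), ("Wed", "17:45", False),
    ("Sat", "18:10", True), ("Sat", "18:20", False),
]

def fit_leaving_times(trials):
    """For each day, keep the latest leaving time that still got us home on time."""
    latest = {}
    for day, left_at, on_time in trials:
        # Zero-padded "HH:MM" strings sort in the same order as the times they denote.
        if on_time and (day not in latest or left_at > latest[day]):
            latest[day] = left_at
    return latest

# The "model" is nothing more than a table from input (day) to prediction (time).
model = fit_leaving_times(TRIALS)
print(model["Wed"])
```

Note that nothing in the table says why Wednesday differs from Monday; the model only records that it does.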

There's the physical model approach, where you figure out what people are doing on any given day, then figure out what that implies for the traffic levels and thence what time you need to leave. In this case you find out that there's a mid-week sports game on Wednesday evenings which leads to even higher traffic. By proceeding from first principles you've got a more detailed framework which is equally predictive, but at the cost of having to investigate a lot of complications.

It's a bit like one of those American medication adverts which presents a big message, "Using purely data driven techniques is wonderful", while the voiceover lays out all sorts of small print. The remainder of this post will try to cover some of the simplest parts of the

data usage

subsets

My computer has a pseudo-random number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Note that we've got 3 values (23440, 23979 and 28451) within an interval of length 5011. For this uniform distribution the *expected number* of values falling within a fixed interval of that length is 1.2ish (8 × 5011/32768). Readers will be familiar with how a quantity estimated from a small sample varies a lot around its expected value, with that variation only reducing as the sample size increases, so this isn't a surprise. However, it does highlight that even *completely unbiased* sampling from the *full distribution* will typically give rise to extra "structure" within the distribution implied by the samples.
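The arithmetic behind the "1.2ish" figure, and a rough check of how often this kind of clustering turns up by chance, can be sketched in Python. This is an illustrative simulation only: the function names, the default of 10000 trials and the fixed seed are assumptions, and Python's `random` module stands in for whatever generator produced the numbers above.

```python
import random

N_SAMPLES = 8
RANGE_SIZE = 32768   # integers 0..32767 inclusive
WINDOW = 5011        # interval length taken from the text

def expected_in_fixed_window(n=N_SAMPLES, window=WINDOW, size=RANGE_SIZE):
    """Expected number of n uniform samples landing in one fixed interval."""
    return n * window / size

def max_in_any_window(samples, window=WINDOW):
    """Largest number of samples whose spread (max minus min) is at most `window`."""
    xs = sorted(samples)
    best, lo = 0, 0
    for hi in range(len(xs)):
        # Slide the lower end up until the span fits inside the window.
        while xs[hi] - xs[lo] > window:
            lo += 1
        best = max(best, hi - lo + 1)
    return best

def fraction_with_cluster(k=3, trials=10000, seed=0):
    """Fraction of simulated draws containing at least k samples within one window."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        samples = [rng.randrange(RANGE_SIZE) for _ in range(N_SAMPLES)]
        if max_in_any_window(samples) >= k:
            hits += 1
    return hits / trials

observed = [2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910]
print(expected_in_fixed_window())   # roughly 1.22
print(max_in_any_window(observed))  # 4
print(fraction_with_cluster(k=3))
```

On the eight numbers above, `max_in_any_window` reports 4 rather than 3, because the four smallest values (1672, 2928, 3937 and 6552) span only 4880. The distinction that matters is between an interval fixed in advance, which is what the expectation describes, and an interval chosen after looking at the samples, which is what the simulation counts.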

generalisation

sparsity

inference

All of the ideas discussed above are really just ways of making sure that work on statistical/machine learning models is producing meaningful results in situations where the training data is scarce. As Bob Dylan (almost) sang, "to work outside the physical law, you must be honest, I know you always say that you agree".
