# The Azimuth Project Blog - El Niño project (part 4) (Rev #7)

This is a blog article in progress, written by John Baez. To see discussions of the article as it is being written, visit the Azimuth Forum.

As the first big step in our El Niño project, Graham Jones replicated the paper by Ludescher et al that I explained in Part 3. Let’s see how this works!

Graham did this using R, a programming language that’s good for statistics. So, I’ll tell you how everything works with R. If you prefer another language, go ahead and write software for that… and let us know! We can add it to our repository.

Today I’ll explain this stuff to people who know their way around computers. But I’m not one of those people! So, next time I’ll explain the nitty-gritty details in a way that may be helpful to people like me.

### Getting temperature data

Say you want to predict El Niños from 1950 to 1980 using Ludescher et al‘s method. To do this, you need daily average surface air temperatures in this 7 × 23 grid in the Pacific Ocean:

Each square here is 7.5° × 7.5°. To get this data, you have to first download area-averaged temperatures on grid with squares that are 1.5° × 1.5° in size:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

You can get the website to deliver you temperatures in a given rectangle in a given time interval. It gives you this data in a format called NetCDF, meaning Network Common Data Form. We’ll take a different approach. We’ll download all the Earth’s temperatures from 1948 to 2013, and then extract the data we need using R scripts. That way, when we play other games with temperature data later, you’ll already have it.

So, go ahead and download all files from air.sig995.1948.nc to air.sig995.2013.nc. Or if you just want to do this one project, just up to air.sig.995.1980.nc. It will take a while…

### Getting the temperatures you need

Now you have files of daily average temperatures on a 1.5° by 1.5° grid, from 1948 to 2013. Make sure all these files are in your working directory for R, and download this R script from GitHub:

Graham wrote it; I just modified it a bit. You can use this to get the temperatures in any time interval and rectangle of grid points you want. However, the defaults are set to precisely what you need now!

When you run this, you should get a file called Pacific-1948-1980.txt. This has daily average temperatures in the region we care about, from 1948 to 1980.

### Getting the El Niño data

You’ll use this data to predict El Niños, so you also want a file of the Niño 3.4 data. Remember from last time, this says how much hotter than average the surface water is in this patch of the Pacific Ocean:

This is a copy of the Monthly Niño 3.4 index data from the US National Weather Service, which I discussed last time.

Put this file in your working directory.

### Predicting El Niños

Now you’ve got Pacific-1948-1980.txt and nino3.4-anoms.txt in your working directory. Download this R script written by Graham Jones, and run it:

It takes a bit more than half an hour. It computes the average link strength $S$ that I explained last time, with one mathematical nuance I’ll mention later. It plots $S$ in red, and plots the Niño 3.4 index in blue, like this:

(Click to enlarge.) The shaded region is where the Niño 3.4 index is below 0.5°C. When the blue curve escapes this region and then stays above 0.5°C for at least 5 months, Ludescher et al declare that there’s an El Niño!

The horizontal red line shows the threshold $\theta = 2.82$. When $S$ exceeds this, and the Niño 3.4 index is not already over 0.5°C, Ludescher et al predict that there will be an El Niño in the next calendar year.

This graph almost matches the corresponding graph in Ludescher et al:

Here the green arrows show their successful predictions, dashed arrows show false alarms, and a little letter n appears next to each El Niño they failed to predict.

The graphs don’t match perfectly. For the blue curves, we could be using Niño 3.4 from a different source. But the red curves are more interesting, since that’s where all the work is involved, and we are starting with the same data. Beside actual bugs, I can think of various explanations. None of them are extremely interesting, so I’ll stick them in the last section!

If you want to get ahold of our output, you can do so here:

So, you don’t actually have to run all these programs to get our final results. Scientists should never make data hard to get. However, these programs will help you tackle some programming challenges which I’ll list now!

### Programming challenges

There are lots of variations on the Ludescher et al paper which we could explore. Here are a few I’m really interested in. The Azimuth gang hasn’t had time to try these yet, so if you do them we’d be interested! I’ll start with a really easy one, and work on up.

Challenge 1. Repeat the calculation with temperature data from 1980 to 2013. You’ll have to get the relevant temperature data and adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980

should become

firstyear <- 1980
lastyear <- 2013

or whatever range of years you want. You’ll also have to adjust some numbers in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Challenge 1. Repeat the calculation with temperature data on a 2.5° × 2.5° grid instead of the coarser 7.5° × 7.5° grid Ludescher et al use. You’ve got the data you need. Right now, the program ludescher-replication.R averages out the temperatures over little 3 × 3 squares to get the temperature data Ludescher et al want. It starts with 27 × 69 temperatures per day and averages them out to obtain 9 × 23 temperatures. Here’s how:

# the data per day is reduced from e.g. 27x69 to 9x23.

subsample.3x3 <- function(vals) {
stopifnot(dim(vals)[2] %% 3 == 0)
stopifnot(dim(vals)[3] %% 3 == 0)
n.sslats <- dim(vals)[2]/3
n.sslons <- dim(vals)[3]/3
ssvals <- array(0, dim=c(dim(vals)[1], n.sslats, n.sslons))
for (d in 1:dim(vals)[1]) {
for (slat in 1:n.sslats) {
for (slon in 1:n.sslons) {
ssvals[d, slat, slon] <- mean(vals[d, (3*slat-2):(3*slat), (3*slon-2):(3*slon)])
}
}
}
ssvals
}


So, you’d need to eliminate this and change whatever else needs to be changed. What new value of the threshold $\theta$ looks good for predicting El Niños now? Most important: can you get better at predicting El El Niños this way

Running the calculation may take a lot longer, since you’ve got 9 times as many grid points and you’re calculating correlations between pairs. So, if you don’t have a powerful computer, maybe you can go the way and use a coarser grid and see how much (if any) that degrades your ability to predict El Niños.

### Mathematical nuances

I mentioned last time that Ludescher et al claim to normalize their time-delayed cross-covariances in a somewhat peculiar way which involves running averages of (functions of) running averages. For reasons I explained, I don’t think they could have actually used this method.

category: blog, climate