Blog - El Niño project (part 4) (Rev #8, changes)


This is a blog article in progress, written by John Baez. To see discussions of the article as it is being written, visit the Azimuth Forum.

If you want to write your own article, please read the directions on How to blog.

As the first big step in our El Niño prediction project, Graham Jones replicated the paper by Ludescher *et al* that I explained last time. Let’s see how this works!

Graham did this using R, a programming language that’s good for statistics. But if you prefer another language, go ahead and write software for that… and let us know! We can add it to our repository.

Today I’ll explain this stuff to people who know their way around computers. But I’m not one of those people! So, next time I’ll explain the nitty-gritty details in a way that may be helpful to people more like me.

Say you want to predict El Niños from 1950 to 1980 using Ludescher *et al*’s method. To do this, you need daily average surface air temperatures in this grid in the Pacific Ocean:

Each square here is 7.5° × 7.5°. To get this data, you have to first download temperatures on a grid with smaller squares that are 2.5° × 2.5° in size:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

You can get the website to deliver you temperatures in a given rectangle in a given time interval. It gives you this data in a format called **NetCDF**, meaning Network Common Data Form. We’ll take a different approach. We’ll download *all* the Earth’s temperatures from 1948 to 2013, and then extract the data we need using R scripts. That way, when we play other games with temperature data later, we’ll already have it.

So, go ahead and download all files from `air.sig995.1948.nc` to `air.sig995.2013.nc`. It will take a while… but you’ll own the world.

There are different ways to do this. If you have R fired up, just cut-and-paste this into the console:

```
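# Fetch one NetCDF file of daily average temperatures per year.
for (year in 1948:2013) {
  download.file(
    url = paste0(
      "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.",
      year, ".nc"),
    destfile = paste0("air.sig995.", year, ".nc"),
    mode = "wb")  # binary mode, so the files are not mangled on Windows
}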
```
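Since the full download takes a while, it can help to skip files you already have. Here is a minimal sketch along those lines. The helper names `reanalysis.filename`, `reanalysis.url` and `download.missing` are made up for illustration and are not part of Graham’s scripts; the URL pattern is simply the one used above.

```r
# Hypothetical helpers, not part of the Azimuth repository.
reanalysis.filename <- function(year) {
  paste0("air.sig995.", year, ".nc")
}

reanalysis.url <- function(year) {
  paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/",
         reanalysis.filename(year))
}

# Download only the yearly files that are not already in dest.dir.
download.missing <- function(years, dest.dir = ".") {
  for (year in years) {
    dest <- file.path(dest.dir, reanalysis.filename(year))
    if (!file.exists(dest)) {
      download.file(url = reanalysis.url(year), destfile = dest, mode = "wb")
    }
  }
  invisible(NULL)
}

# Example call (commented out so that pasting this block downloads nothing):
# download.missing(1948:2013)
```

If a download is interrupted, a partial file may be left behind; this sketch would treat it as complete, so delete any suspiciously small file before calling `download.missing` again.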

Now you have files of daily average temperatures on a 2.5° by 2.5° grid from 1948 to 2013. Make sure all these files are in your working directory for R, and download this R script from GitHub:

• `netcdf-convertor-ludescher.R`

Graham wrote it; I just modified it a bit. You can use this to get the temperatures in any time interval and any rectangle of grid points you want. The details are explained in the script. But the defaults are set to precisely what you need now!

So, just run this. You should get a file called `Pacific-1948-1980.txt`. This has daily average temperatures in the region we care about, from 1948 to 1980. It starts with a really long line listing locations in a 27 × 69 grid. Then come thousands of lines listing temperatures in kelvin at those locations on successive days. The first of these lines starts with Y1948P001, meaning the first day of 1948.

And I know what you’re dying to ask: yes, leap days are omitted!
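If you want to work with these row labels yourself, here is a small sketch of how they could be decoded. It assumes every label looks like `Y1948P001` (a `Y`, four digits for the year, a `P`, three digits for the day of the year), and that every year has 365 days because leap days are omitted. The function name `parse.day.label` is hypothetical.

```r
# Hypothetical helper: turn row labels such as "Y1948P001" into a year,
# a day of the year, and a running day index (day 1 = 1 January 1948).
# Assumes 365-day years, since leap days are omitted from the file.
parse.day.label <- function(label, first.year = 1948L) {
  pattern <- "^Y([0-9]{4})P([0-9]{3})$"
  stopifnot(all(grepl(pattern, label)))
  year <- as.integer(sub(pattern, "\\1", label))
  day <- as.integer(sub(pattern, "\\2", label))
  data.frame(year = year, day = day,
             index = (year - first.year) * 365L + day)
}

parse.day.label(c("Y1948P001", "Y1948P365", "Y1949P001"))
```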

You’ll use this data to predict El Niños, so you also want a file of the **Niño 3.4** index. Remember from last time, this says how much hotter than average this patch of ocean happens to be at any time:

You can download the file from here:

This is a copy of the Monthly Niño 3.4 index data from the US National Weather Service, which I discussed last time. It has monthly Niño 3.4 data in the column called `ANOM`.

Put this file in your working directory.

Now you’ve got `Pacific-1948-1980.txt` and `nino3.4-anoms.txt` in your working directory. Download this R script written by Graham Jones, and run it:

It takes about 45 minutes on my laptop. It computes the average link strength $S$ that I explained last time. It plots $S$ in red, and plots the Niño 3.4 index in blue, like this:

The shaded region is where the Niño 3.4 index is below 0.5°C. When the blue curve escapes this region and then stays above 0.5°C for at least 5 months, Ludescher *et al* say that there’s an El Niño.
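The El Niño criterion just described can be sketched in a few lines of R. This is only an illustration, not code from `ludescher-replication.R`: the function name `find.el.ninos` is made up, and it assumes the index is a plain numeric vector of monthly values with no missing entries, and that “at least 5 months” means at least 5 consecutive monthly values above the threshold.

```r
# Hypothetical helper: find runs of at least min.months consecutive months
# in which the monthly index is above the threshold.
find.el.ninos <- function(anom, threshold = 0.5, min.months = 5) {
  runs <- rle(anom > threshold)           # run-length encode above/below
  ends <- cumsum(runs$lengths)            # last month of each run
  starts <- ends - runs$lengths + 1       # first month of each run
  keep <- runs$values & runs$lengths >= min.months
  data.frame(start = starts[keep], end = ends[keep])
}

# Toy data: one 5-month episode (months 2 to 6) and one 2-month blip.
toy <- c(0.0, 0.6, 0.7, 0.8, 0.9, 1.0, 0.2, 0.6, 0.6, 0.1)
find.el.ninos(toy)
```

On the toy data only the first run qualifies, since the second one lasts just two months.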

The horizontal red line shows the threshold $\theta = 2.82$. When $S$ exceeds this, and the Niño 3.4 index is not *already* over 0.5°C, Ludescher *et al* predict that there will be an El Niño in the next calendar year!
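Here is a rough sketch of that alarm rule. It assumes `S` and `nino34` are numeric vectors sampled at the same times, which is a simplification, since the real index is monthly while $S$ is computed more often. The function name `alarm.times` is hypothetical, and the sketch flags every sample where the condition holds rather than only the first one in each stretch.

```r
# Hypothetical sketch of the alarm rule: S above theta while the
# Nino 3.4 index is not already above 0.5 degrees C.
alarm.times <- function(S, nino34, theta = 2.82, nino.threshold = 0.5) {
  stopifnot(length(S) == length(nino34))
  which(S > theta & nino34 <= nino.threshold)
}

# Toy data: S is above theta at positions 2, 3 and 5, but at position 3
# the index is already over 0.5, so no alarm is raised there.
S.toy <- c(2.5, 2.9, 3.0, 2.7, 2.9)
nino.toy <- c(0.1, 0.2, 0.8, 0.9, -0.1)
alarm.times(S.toy, nino.toy)
```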

Our graph almost matches the corresponding graph in Ludescher *et al*:

Here the green arrows show their successful predictions, dashed arrows show false alarms, and a little letter n appears next to each El Niño they failed to predict.

The graphs don’t match perfectly. For the blue curves, we could be using Niño 3.4 from different sources. But the red curves are more interesting, since that’s where all the work is involved, and we’re starting with the same data. Besides actual bugs, which are always possible, I can think of various explanations. None of them are extremely interesting, so I’ll stick them in the last section!

If you want to get ahold of our output, you can do so here:

• average-link-strength.txt.

This has the average link strength $S$ at 10-day intervals, starting from day 730 (where the first of January 1948 is day 1) and going until day 12040.
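To line these day numbers up with calendar years, something like the following sketch may help. It assumes the day numbers count 365-day years, as in the temperature file where leap days are omitted; if the output file counts days differently, the conversion would need adjusting.

```r
# Assumption: day 1 is 1 January 1948 and every year has 365 days.
days <- seq(730, 12040, by = 10)
year <- 1948 + (days - 1) %/% 365
day.of.year <- (days - 1) %% 365 + 1

length(days)   # number of values of S in the file
head(data.frame(days, year, day.of.year), 3)
```

Under this assumption, day 730 is the last day of 1949 and day 12040 is day 360 of 1980.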

So, you don’t actually have to run all these programs to get our final result. However, these programs will help you tackle some programming challenges which I’ll list now!

There are lots of variations on the Ludescher *et al* paper which we could explore. Here are a few easy ones to get you started. If you do any of these, or anything else, let me know!

I’ll start with a really easy one, and work on up.

**Challenge 1.** Repeat the calculation with temperature data from 1980 to 2013. You’ll have to get the relevant temperature data and adjust two lines in `netcdf-convertor-ludescher.R`:

```
firstyear <- 1948
lastyear <- 1980
```

should become

```
firstyear <- 1980
lastyear <- 2013
```

or whatever range of years you want. You’ll also have to adjust the years that appear in `ludescher-replication.R`. Search the file for the string `19` and make the necessary changes. Ask me if you get stuck.
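If you’d rather do that search from inside R, here is one possible sketch. The function name `find.year.lines` and the toy lines are made up for illustration. The pattern looks for `19` followed by two more digits, which is slightly narrower than searching for the bare string `19`.

```r
# Hypothetical helper: report which lines mention a year of the form 19xx.
find.year.lines <- function(lines) {
  hits <- grep("19[0-9]{2}", lines)
  data.frame(line = hits, text = lines[hits], stringsAsFactors = FALSE)
}

# Toy input; with the real script you would instead call
# find.year.lines(readLines("ludescher-replication.R"))
toy <- c("x <- 1", "first <- 1948  # made-up line", "y <- 2")
find.year.lines(toy)
```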

**Challenge 2.** Repeat the calculation with temperature data on a 2.5° × 2.5° grid instead of the coarser 7.5° × 7.5° grid Ludescher *et al* use. You’ve got the data you need. Right now, the program `ludescher-replication.R` averages out the temperatures over little 3 × 3 squares: it starts with temperatures on a 27 × 69 grid and averages them out to obtain temperatures on the 9 × 23 grid shown here:

Here’s where that happens:

```
# The data per day is reduced from e.g. 27x69 to 9x23
# by averaging over 3x3 blocks of grid points.
subsample.3x3 <- function(vals) {
  # vals is indexed by (day, latitude, longitude)
  stopifnot(dim(vals)[2] %% 3 == 0)
  stopifnot(dim(vals)[3] %% 3 == 0)
  n.sslats <- dim(vals)[2]/3
  n.sslons <- dim(vals)[3]/3
  ssvals <- array(0, dim=c(dim(vals)[1], n.sslats, n.sslons))
  for (d in 1:dim(vals)[1]) {
    for (slat in 1:n.sslats) {
      for (slon in 1:n.sslons) {
        ssvals[d, slat, slon] <- mean(vals[d, (3*slat-2):(3*slat), (3*slon-2):(3*slon)])
      }
    }
  }
  ssvals
}
```

So, you need to eliminate this and change whatever else needs to be changed. What new value of the threshold $\theta$ looks good for predicting El Niños now? Most importantly: *can you get better at predicting El Niños this way?*

The calculation may take a lot longer, since you’ve got 9 times as many grid points and you’re calculating correlations between pairs. So if this is too tough, you can go the other way: use a coarser grid and see how much that *degrades* your ability to predict El Niños.
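One way to experiment with other grid sizes is to generalize the averaging function so the block size is a parameter. The following is only a sketch of that idea, not Graham’s code: the name `subsample.blocks` and its arguments `k.lat` and `k.lon` are hypothetical, and `vals` is assumed to be a days × latitudes × longitudes array, as in `subsample.3x3`.

```r
# Hypothetical generalization of subsample.3x3: average over blocks of
# k.lat x k.lon grid points. With k.lat = k.lon = 1 nothing is averaged.
subsample.blocks <- function(vals, k.lat, k.lon = k.lat) {
  stopifnot(length(dim(vals)) == 3)
  stopifnot(dim(vals)[2] %% k.lat == 0)
  stopifnot(dim(vals)[3] %% k.lon == 0)
  n.sslats <- dim(vals)[2] / k.lat
  n.sslons <- dim(vals)[3] / k.lon
  ssvals <- array(0, dim = c(dim(vals)[1], n.sslats, n.sslons))
  for (slat in 1:n.sslats) {
    lat.idx <- (k.lat * (slat - 1) + 1):(k.lat * slat)
    for (slon in 1:n.sslons) {
      lon.idx <- (k.lon * (slon - 1) + 1):(k.lon * slon)
      block <- vals[, lat.idx, lon.idx, drop = FALSE]
      # Average each day's block of temperatures.
      ssvals[, slat, slon] <- apply(block, 1, mean)
    }
  }
  ssvals
}

# Toy data: 2 days on a 6 x 6 grid.
toy <- array(1:72, dim = c(2, 6, 6))
dim(subsample.blocks(toy, 3))     # 2 2 2
dim(subsample.blocks(toy, 1))     # 2 6 6, i.e. no averaging
```

Note that 27 and 69 have no common divisor bigger than 3, so a coarser grid of square blocks would need a differently sized region. That is why the sketch allows different block sizes for latitude and longitude.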

**Challenge 3.** Right now we average the link strength over all pairs $(i,j)$ where $i$ is a node in the **El Niño basin** defined by Ludescher *et al*, and $j$ is a node outside this basin. The basin consists of the red dots here:


What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? *Can you get better results if you change the shape of the basin?*

To study these questions you need to rewrite `ludescher-replication.R` a bit. Here’s where Graham defines the El Niño basin:

```
ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}
```

These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin.
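To see what such a change does, it can help to turn the basin into a mask on the 9 × 23 grid and count how many pairs get averaged over. Here is a sketch. The helper `basin.mask` is hypothetical, and the last few lines show one possible way to drop the two southern dots, namely the ones with latitude coordinate 6.

```r
ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}

# Hypothetical helper: logical 9 x 23 matrix that is TRUE at basin nodes.
basin.mask <- function(basin, n.lats = 9, n.lons = 23) {
  mask <- matrix(FALSE, nrow = n.lats, ncol = n.lons)
  mask[cbind(basin$lats, basin$lons)] <- TRUE
  mask
}

basin <- ludescher.basin()
mask <- basin.mask(basin)
n.inside <- sum(mask)             # nodes i in the basin
n.outside <- sum(!mask)           # nodes j outside the basin
n.pairs <- n.inside * n.outside   # pairs (i, j) that get averaged over

# Variant: keep only the main row, dropping the two dots south of it.
keep <- basin$lats == 5
smaller.basin <- list(lats = basin$lats[keep], lons = basin$lons[keep])
```

With the original basin there are 14 nodes inside and 193 outside, so 2702 pairs. The smaller basin has 12 nodes.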

There’s a lot more you can do… the sky’s the limit!

Here are two reasons our result for the average link strength could differ from Ludescher’s.

Last time I mentioned that Ludescher *et al* claim to normalize their time-delayed cross-covariances in a sort of complicated way. I explained why I don’t think they could have actually used this method. In `ludescher-replication.R`, Graham used the simpler normalization described last time: namely, dividing by

$\sqrt{\langle T_i(t)^2 \rangle - \langle T_i(t) \rangle^2} \; \sqrt{\langle T_j(t-\tau)^2 \rangle - \langle T_j(t-\tau) \rangle^2}$

instead of

$\sqrt{ \langle (T_i(t) - \langle T_i(t)\rangle)^2 \rangle} \; \sqrt{ \langle (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \rangle}$

Another reason might be the ‘subsampling’ procedure: how we get from the temperature data on a 27 × 69 grid to temperatures on a 9 × 23 grid. While the original data files give temperatures named after grid points, each is really an area-averaged temperature for a 2.5° × 2.5° square. Is this square *centered* at the grid point, or is the square having that grid point as its north-west corner, or what? I don’t know.

This data is on a grid where the coordinates are the number of steps of 2.5 degrees, counting from 1. So, for latitude, 1 means the North Pole, 73 means the South Pole. For longitude, 1 means the prime meridian, 37 means 90° east, 73 means 180° east, 109 means 270°E or 90°W, and 144 means 2.5° west. It’s an annoying system, as far as I’m concerned.
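If you find this system as annoying as I do, a pair of tiny conversion functions may help. This is just a sketch of the convention described above; the names `index.to.lat` and `index.to.lon` are made up.

```r
# Hypothetical helpers for the 2.5-degree grid: index 1 is the North Pole
# for latitude and the prime meridian for longitude.
index.to.lat <- function(i) 90 - 2.5 * (i - 1)   # degrees north (negative = south)
index.to.lon <- function(j) 2.5 * (j - 1)        # degrees east, from 0 to 357.5

index.to.lat(c(1, 73))                  # 90 -90
index.to.lon(c(1, 37, 73, 109, 144))    # 0 90 180 270 357.5
```

Here 357.5° east is the same as 2.5° west.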

In `ludescher-replication.R` we use this range of coordinates for the region of the Pacific we study, which gives 27 latitudes and 69 longitudes:

```
lat.range <- 24:50
lon.range <- 48:116
```

Maybe Ludescher *et al* used something slightly different!

There are probably lots of other nuances I haven’t noticed. Can you think of some?