The Azimuth Project
Experiments in El Niño analysis and prediction

This is part of the Azimuth Code Project, aimed at studying methods of El Niño prediction and, more generally, at understanding the El Niño–Southern Oscillation (ENSO). There is a series of articles on the Azimuth Blog summarizing our progress, and reading those is the best way to understand what we’ve done so far:

El Niño project (part 1): basic introduction to El Niño and our project here.

El Niño project (part 2): introduction to the physics of El Niño.

El Niño project (part 3): summary of the work of Ludescher et al.

El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

El Niño project (part 5): how to download R and use it to get files of climate data.

El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

El Niño project (part 7): the definition of El Niño.

El Niño project (part 8): Berezin et al on the stability of climate networks.

For what we’re doing next, see Experiments in El Niño analysis and prediction and related threads on the Azimuth Forum. Also see Experiments with varieties of link strength for El Niño prediction.


Background

A short video explains El Niño issues in a simple way:

Tutorial information on climate networks, and their application to El Niño signal processing:

This paper explains how climate networks can be used to recognize when El Niño events are occurring:

This paper on El Niño prediction created a stir:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, Proceedings of the National Academy of Sciences, February 2014. (Click title for free version, journal name for official version.)

The methodology is explained in this earlier paper:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Improved El Niño forecasting by cooperativity detection, Proceedings of the National Academy of Sciences, 30 May 2013. (For more discussion, go to the Azimuth Forum and also below.)

This paper is also relevant:

Ludescher et al on El Niño forecasting by cooperativity detection

This paper:

uses data available from the National Centers for Environmental Prediction and the National Center for Atmospheric Research Reanalysis I Project:

More precisely, there’s a bunch of files here containing worldwide daily average temperatures on a 2.5° latitude × 2.5° longitude grid (144 × 73 grid points), from 1948 to 2010. If you go here the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval. These are “NetCDF files”; an R package for working with these files is here and some information on how they look is here.
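For example, one way to read such a file in R is with the ncdf4 package. This is only a minimal sketch: the file name and the variable name “air” below are assumptions, so check the actual file you downloaded (e.g. with print(nc)).

    # Read one NCEP/NCAR reanalysis NetCDF file (hypothetical file name).
    library(ncdf4)

    nc  <- nc_open("air.sig995.1951.nc")
    air <- ncvar_get(nc, "air")   # daily temperatures: 144 longitudes x 73 latitudes x days
    lon <- ncvar_get(nc, "lon")
    lat <- ncvar_get(nc, "lat")
    nc_close(nc)

    dim(air)                      # e.g. 144 73 365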

The paper uses daily temperature data for “14 grid points in the El Niño basin and 193 grid points outside this domain” from 1981 to 2014, as shown here:

They process this data in a manner explained below and attempt to use the result to predict the start of an El Niño. This is defined using Niño 3.4, which is the area-averaged sea surface temperature anomaly in the yellow region here:

Anomaly means the temperature minus its average over time: how much hotter than usual it is. When the Niño 3.4 index is over 0.5°C for at least 3 months, Ludescher et al say there’s an El Niño. The Niño 3.4 data can be found here:
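As a small illustration of this criterion (not code from the paper), one could flag such episodes in R as follows, assuming nino34 is a vector of monthly Niño 3.4 anomalies in °C:

    # Mark months belonging to an episode where the Niño 3.4 anomaly stays
    # above 0.5 °C for at least 3 consecutive months.
    el_nino_months <- function(nino34) {
      runs <- rle(nino34 > 0.5)
      keep <- runs$values & runs$lengths >= 3
      rep(keep, runs$lengths)    # TRUE for months inside an El Niño episode
    }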

Under what conditions do they predict the start of an El Niño?

For each day $t$ between June 1948 and November 2013, let $\tilde{T}_i(t)$ be the average surface air temperature at the point $i$ on day $t$.

Let $T_i(t)$ be $\tilde{T}_i(t)$ minus its climatological average. For example, if $t$ is June 1st 1970, we average the temperature at location $i$ over all June 1sts from 1948 to 2013, and subtract that from $\tilde{T}_i(t)$ to get $T_i(t)$. They call $T_i(t)$ the temperature anomaly.

(A subtlety here: when we are doing prediction we cannot know future temperatures, so the climatological average should only be taken over past occurrences of that calendar day.)
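As a rough sketch (not the project's actual code), here is one way to form the anomaly for a single grid point in R. It uses the full-record climatology; for genuine prediction the mean would have to be restricted to past days, as noted above.

    # temps: daily temperatures at one grid point; doy: matching day-of-year (1..366).
    # Subtract, for each day-of-year, the mean over all years in the record.
    anomaly <- function(temps, doy) {
      clim <- tapply(temps, doy, mean)      # climatological mean per day-of-year
      temps - clim[as.character(doy)]
    }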

For any function of time, denote its moving average over the last 365 days by:

$$\langle f(t) \rangle = \frac{1}{365} \sum_{d = 0}^{364} f(t - d)$$
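In R, such a trailing 365-day mean could be computed, for instance, with stats::filter (the function name below is just illustrative):

    # Trailing mean over the current day and the 364 preceding days.
    # The first 364 entries are NA, since no full year precedes them.
    trailing_mean_365 <- function(f) {
      as.numeric(stats::filter(f, rep(1/365, 365), sides = 1))
    }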

Let $i$ be a point in the El Niño basin, and $j$ a point outside it. For any time lag $\tau$ between 0 and 200 days, define the time-delayed cross-covariance by:

$$\langle T_i(t) T_j(t - \tau) \rangle - \langle T_i(t) \rangle \langle T_j(t - \tau) \rangle$$

Note that this is a way of studying the linear correlation between the temperature anomaly at node $i$ and the temperature anomaly a time $\tau$ earlier at the node $j$. So, it’s about how temperature anomalies inside the El Niño basin are correlated to temperature anomalies outside this basin at earlier times.

Ludescher et al normalize this in a somewhat funny way, defining the time-delayed cross-correlation $C^{t}_{i,j}(-\tau)$ to be the time-delayed cross-covariance divided by

$$\sqrt{\langle (T_i(t) - \langle T_i(t)\rangle)^2 \rangle} \; \sqrt{\langle (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \rangle}$$

This is something like the standard deviation of $T_i(t)$ times the standard deviation of $T_j(t - \tau)$. Dividing by standard deviations is what people often do to turn covariances into correlations. However, as Nadja Kutz pointed out, it’s unusual to do this when the angle brackets are defined as a moving average over the last 365 days, because then

$$\langle \langle f(t) \rangle \rangle \ne \langle f(t) \rangle$$

and the usual formal properties of expressions involving means of means don’t hold.
So, this is something we should think about.

Anyway, $C^{t}_{i,j}(\tau)$ is defined in a similar way, starting from

$$\langle T_i(t - \tau) T_j(t) \rangle - \langle T_i(t - \tau) \rangle \langle T_j(t) \rangle$$

So, this is about how temperature anomalies outside the El Niño basin are correlated to temperature anomalies inside this basin at earlier times.

Next, for nodes $i$ and $j$, and for each time point $t$, the maximum, the mean and the standard deviation around the mean are determined for $C^{t}_{i,j}(\tau)$, as $\tau$ varies over its whole range from $-200$ to $200$ days.

They define the link strength $S_{ij}(t)$ as the difference between the maximum and the mean value, divided by the standard deviation.

Finally, they let $S(t)$ be the average link strength, obtained by averaging $S_{ij}(t)$ over all pairs where $i$ is a node in the El Niño basin and $j$ is a node outside it.
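Condensed into code, a hedged sketch of this step might look as follows. Here cc is assumed to hold the cross-correlations $C^{t}_{i,j}(\tau)$ for $\tau = -200, \dots, 200$ for one pair $(i,j)$, and (following the note under the second replication attempt below) absolute values are taken before the maximum and mean.

    # Link strength for one pair (i, j) at one time t, from its cross-correlations.
    link_strength <- function(cc) {
      a <- abs(cc)                     # absolute values of the cross-correlations
      (max(a) - mean(a)) / sd(a)
    }
    # S(t) is then the mean of link_strength over all basin/non-basin pairs, and an
    # El Niño is predicted for the following year when S(t) exceeds 2.82.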

So, when $S(t)$ rises above the threshold $\Theta = 2.82$ while the Niño 3.4 index is below 0.5°C, they predict that an El Niño will start in the following calendar year.

Here is what they get:

The blue peaks are El Niños: episodes where the Niño 3.4 index is over 0.5°C for at least 3 months.

The red line is their ‘average link strength’. Whenever this exceeds a certain threshold $\Theta = 2.82$, they predict an El Niño will start in the following calendar year.

The green arrows show their successful predictions. The dashed arrows show their false alarms. A little letter n appears next to each El Niño that they failed to predict.

Actually, chart A here shows the ‘learning phase’ of their calculation. In this phase, they adjusted the threshold $\Theta$ so their procedure would do a good job. Chart B shows the ‘testing phase’. Here they used the value of $\Theta$ chosen in the learning phase, and checked to see how good a job it did.

But what about their prediction now? That’s the green arrow at far right below. On 17 September 2013, the red line went above the threshold! So, their scheme predicts an El Niño sometime in 2014. The chart at right is a zoomed-in version that shows the red line in August, September, October and November of 2014!

Our implementation

Our implementation is at Github.

Note on efficient implementation

The algorithm involves many calculations of time-delayed cross-covariances of the form

$$C(m) = \frac{1}{N} \sum_{i=1}^N x_i y_{i+m} - \bigg( \frac{1}{N} \sum_{i=1}^N x_i \bigg) \bigg( \frac{1}{N} \sum_{i=1}^N y_{i+m} \bigg).$$

Here $N = 365$ is the period over which the covariances are found, and $m$ ranges over 0 to $M = 200$. It can be rearranged as

$$C(m) = \frac{1}{N} \sum_{i=1}^N y_{i+m} d_i$$

where

$$d_i = x_i - \frac{1}{N} \sum_{k=1}^N x_k.$$

The expression for $C(m)$ is now a cross-correlation (a convolution with one sequence reversed), so it can be computed efficiently using the fast Fourier transform.
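A minimal sketch of this trick in R, assuming x has length N and y has length at least N + M (the function name is illustrative, not from our repository):

    # Lagged cross-covariances C(m), m = 0..M, computed with the FFT.
    # Zero-padding to length N + M ensures no wrap-around for the lags we keep.
    lagged_cross_covariance <- function(x, y, M) {
      N <- length(x)
      stopifnot(length(y) >= N + M)
      d    <- x - mean(x)
      L    <- N + M
      dpad <- c(d, rep(0, L - N))
      ypad <- y[1:L]
      # sums[m + 1] = sum_i d[i] * y[i + m], via the cross-correlation theorem;
      # R's inverse fft is unnormalized, hence the division by L.
      sums <- Re(fft(Conj(fft(dpad)) * fft(ypad), inverse = TRUE)) / L
      sums[1:(M + 1)] / N
    }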

Visualizations of the data

A first look at some of the data

This shows surface air temperatures over the Pacific for 1951:

Yearly mean temperatures over the Pacific in 1951

To see how this image was made: R code for pacific1951 image

A second look

To see how this image was made: R code to display 6 years of Pacific temperatures

A third look

Temperatures in the Pacific in early 1957 and 1958

The rectangle is roughly the area where the El Niño index NINO3.4 is defined.

Conversion code

Here is an R script to convert netCDF data to a simple “flat” format which can be read by other programs. The output format is described in the code.

Some correlations and covariances

The images below show local correlations and covariances of temperatures over the Pacific, calculated over the year 1951. Correlations are on the left and covariances on the right. The correlations are shown on a scale where black is zero and white is 1. All of the values were positive; the smallest correlation was 0.26. The cube roots of the covariances are shown on a scale where black is zero and white is the maximum value (a linear mapping of the values just shows a few pale pixels near the corners, with the rest black). Summary values for the set of covariances:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.09237  0.22520  0.40570  1.28000  1.11200 22.77000 
Correlations and covariances in Pacific temperatures

More covariances

The PDF file Covariances near equator shows covariances between different places near the equator in the Pacific and at two different time delays. Most of the details are in the PDF, but some things are not:

  • the first graph shows a 1-day delay and the second a 5-day delay.

  • the graphs show the median of the covariances over the region

  • black means zero geographical displacement, and paler greys show displacements increasing by 2.5 degrees.

The idea is to plot, every 5 days from 1951 through 1979, for a region straddling the equator, and for 0 to 7 eastward steps of 2.5 degrees, the covariances of the temperature over six months (183 days). Here are the results:

Covariance maps 1951-1979

Covariance maps 1951-1979

The image shows one map of the Pacific for each quarter for the years 1951 through 1979. On the right, the NINO3.4 index is shown for the year.

The area is that used by Ludescher et al (2013). The “El Niño basin”, as defined by Ludescher et al (2013), is the black region along the equator towards the east, plus two pixels below it. For every other pixel i, the sum TC(i) of the covariances between i and the 14 pixels in the basin is shown. The covariances are calculated over the previous year. The values are “squashed” by a signed square root (see “More detail” below) before conversion to colours. Negative values of TC(i) are red, positive values green, paler meaning bigger in absolute value. Very big values are shown by bright red and green.

More detail

The climatological seasonal cycle (mean over years for each grid point, each day-in-year) is subtracted. The data is spatially subsampled into 7.5 by 7.5 degree squares. There are 9 by 23 such squares. The covariances are calculated for a day in the middle of each quarter. The covariances are calculated over a period of 365 days. There is no time delay between the periods. The TC(i) values are squashed by sign(TC(i)) * sqrt(abs(TC(i))) before conversion to colours. The range -3,3 is mapped to dull shades; below -3 is bright red, above 3 bright green.
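The squashing and clipping could be written, for example, as:

    # Signed square root, then clipping to [-3, 3] before mapping to colours;
    # values beyond that range get the brightest red or green.
    squash <- function(tc) sign(tc) * sqrt(abs(tc))
    clip   <- function(z, lo = -3, hi = 3) pmin(pmax(z, lo), hi)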

The El Niño index is from http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/detrend.nino34.ascii.txt.

To see how this image was made: R code to make covariance maps 1951-1979

Studying the dependence of covariance on distance

The graphs below are based on David Tanzer’s numbers in this thread in the forum. The top three graphs show the square roots of the absolute values of the covariances. The bottom three are similar, but the values are scaled so that the covariance for the smallest distance is always 1.

Covariances vs distances 1950-1979

Here is the code that generated the underlying data (in csv form): covariance-by-distance.R.

Time-delayed correlations and covariances at different places and times

Each of the following images shows time-delayed cross-correlations and cross-covariances for one grid point versus others in the Pacific. The graphs are arranged geographically, with Australia at bottom left. Adjacent graphs represent points 15 degrees apart.

Correlations, end 1956 (not El Niño), not in basin:

Time-delayed correlations

Correlations, end 1956 (not El Niño), in basin:

Time-delayed correlations

Correlations, end 1957 (El Niño), not in basin:

Time-delayed correlations

Correlations, end 1957 (El Niño), in basin:

Time-delayed correlations

Covariances, end 1957 (El Niño), in basin, same grid point as last:

Time-delayed covariances

Standard deviations for individual grid points, end 1957 (El Niño), in basin. (The point (5,13) is not special here.)

Time-delayed standard deviations

Code at Github for images in “Time-delayed correlations and covariances at different places and times”.

Correlations vs covariances in link strength

The graph shows a variant of the Ludescher et al algorithm. The original algorithm is in black. A variant with no normalisation is shown in red.

Code at Github

First attempt at replicating Ludescher et al

First attempt at replicating Ludescher et al 2013. It uses time delays of up to 295 days (not 200) and computes values every 73rd day (not every 10th). The result appears similar to Fig. 2 in their paper (see the image above in the section “Ludescher et al on El Niño forecasting by cooperativity detection”), but with lower values of S(t).

replicating Ludescher et al 2013

Code at Github.

Second attempt at replicating Ludescher et al

Fixed a couple of bugs: I was taking square roots of standard deviations when calculating cross-correlations, and did not take the absolute value of cross-correlations when calculating link strengths. The plot below computes values every ten days. It is now very close to the original results, though not identical.

replicating Ludescher et al 2013

Code at Github. It took about 35 minutes to run.

References

For the dataset source:

category: climate, experiments