The Azimuth Project
Temporal filtering

Temporal filtering


The goal of temporal filtering is essentially to estimate “optimally” the pdf of the hidden state of a temporally evolving stochastic process. (The case of producing a concrete state estimate is relatively simple once the pdf is estimated.) This page provides an overview of the very generic problem and gives a rough indication how various assumptions lead to different filters (i.e, algorithms for “filtering” the signal from the noisy data).

This page primarily deals with filtering of discrete time probabilistic systems. For continuous time filtering see stochastic filter.


We first outline the basic model and some notations, before giving a “mathematical” expressions for the solution. We then discuss how various modelling assumptions and computational constraints lead to particular filters.

Note: in full generality filtering considers a process into which control signals are fed (often based upon the previous state estimates) and incorporates their expected results into the estimation process. This is beyond the scope of this text.

Process model

Our basic model is that we are tracking a hidden state variable x\mathbf{x} that evolves in discrete time using a model of its observable consequences. (The unobservable state is generally referred to intechangeably as hidden or latent.) There are two components:

  • Probabilistic dynamics: The value of x new\mathbf{x}_{new} given x old\mathbf{x}_{old} is drawn from the dynamics distribution p(x oldx new)p(\mathbf{x}_{old}\mapsto\mathbf{x}_{new}). This distribution typically has a deterministic component (which may be either linear or nonlinear) and a random component. For a given problem this decomposition may or may not be explicitly known.

  • Data support: p(x t|currentobservations)=D t(x t)p(\mathbf{x}_t| current observations) = D_t(\mathbf{x}_t). In the most general formulation D tD_t can be any blackbox function, although sometimes it has a known structure (e.g., a Gaussian centred on a point). We call this data rather than measurements to highlight that this is not restricted to be a single “instrument reading”. (Note that D tD_t often needs to be calculated from p(observations|x t)p(observations | \mathbf{x}_t) using Bayes’ Theorem, but we assume this has already been done.)

Note that the process model is often specified in terms of more explicit equations, but these are not general enough to highlight the similarity between the various filters.


There are some common notations that will make it easier to see what equations denote.

  • The state variable we are estimating is x 1\mathbf{x}_1, …, x T\mathbf{x}_T and each lies in some space Ω\Omega.
  • As in general probability theory, p(x)p(\mathbf{x}) denotes the pdf of x\mathbf{x} viewed as a random variable in the case of continuous Ω\Omega. We also assume the standard use of delta functions to represent probability on discrete spaces using the formalism for continous spaces is assumed.
  • Normalising a weighted set refers to scaling the weights so they sum to 1.
  • a|ba|b denotes “aa conditioned on bb” (or colloquially, “aa given knowledge of bb”).
  • It is common to use the notation v a:bv_{a:b} to denote v av_a, v a+1v_{a+1}, …, v bv_{b} and extend it so that p a|1:b(x)p_{a|1:b}(\mathbf{x}) denotes p(x a|D 1,...,D b)p(\mathbf{x}_a|D_1,...,D_b).
  • Picking elements from a set of elements with weights denotes sampling with replacement from the set, with the probability of selecting a given element being proportional to its weight.


Note the process model assumes that evolution is Markov in the sense that p t+1|1:t+1p_{t+1|1:t+1} is independent of p up_u for ut1u\le t-1 given p t|1:tp_{t|1:t}. (Standard tupling-of-past-values into the state variable can be used to turn an n-th order Markov processes into a first order Markov process). Thus the probabilistic dynamics can be represented by a set of distributions p(xx)p(\mathbf{x}'\mapsto \mathbf{x}) (possibly different for each x\mathbf{x}').

There are two consecutive output stages in filtering:

1. Prediction using only previous timesteps’ data

The first step is to estimate p t+1|1:tp_{t+1|1:t}, namely the current state’s pdf using only the inferences drawn from previous timesteps’ data along with knowledge of the dynamics. Simple probabilty leads to the commonly shown backward formulation:

xΩp t+1|1:t(x)= xΩp(xx)p t|1:t(x) \forall \mathbf{x}\in\Omega \quad\bullet\quad p_{t+1|1:t}(\mathbf{x})=\int_{\mathbf{x}'\in\Omega} p(\mathbf{x}'\mapsto\mathbf{x})p_{t|1:t}(\mathbf{x}')

which can be seen as a convolution of p t|1:tp_{t|1:t} with the “probabilistic dynamics” p(xx)p(\mathbf{x}'\mapsto\mathbf{x}). It will turn out to be clearer to consider the forward formulation:

xΩtransitionsxxp t+1|1:t(x)+=p(xx)p t|1:t(x) \forall \mathbf{x}'\in\Omega\quad\bullet\quad \forall transitions\quad\mathbf{x}' \mapsto \mathbf{x}\quad\bullet\quad p_{t+1|1:t}(\mathbf{x})\mathrel{+=} p(\mathbf{x}'\mapsto\mathbf{x})p_{t|1:t}(\mathbf{x}')

where +=\mathrel{+=} is borrowed from computer languages to denote accumulation. (A series of +=\mathrel{+=} assignments to a “variable” a+=b 1a\mathrel{+=}b_1, …, a+=b na\mathrel{+=}b_n is equivalent to a= 1 Nb ia=\sum_1^N b_i.) Providing the \foralls are over the entirety of Ω\Omega (and interpreted appropriately when Ω\Omega is a continuous space), this forward formulation is exactly correct.

These two equations actually have a much more similar structure than it initially appears because the \int in the backward formulation implicitly contains \forall and +=\mathrel{+=}. The advantage of the forward formulation is that when only a limited number of evaluations is computationally possible, having chosen an x\mathbf{x}' by sampling a “good” – in the sense of being an large contribution to the probability mass making up the new pdf – value of x\mathbf{x} can be found simply by drawing from an element from p(xx)p(\mathbf{x}'\mapsto\mathbf{x}). In contrast, it’s not clear how to use the backward formulation to pick a “good” value of x\mathbf{x} even with the knowledge of p t|1:tp_{t|1:t}.

2. Prediction incorporating the current timestep’s data

The second step in filtering is to compute p t+1|1:t+1p_{t+1|1:t+1}, namely the state pdf estimate using all data including that for the current timestep. This is much simpler:

p t+1|1:t+1(x)D t+1(x)p t+1|1:t(x) p_{t+1|1:t+1}(\mathbf{x})\propto D_{t+1}(\mathbf{x})p_{t+1|1:t}(\mathbf{x})

where the constant of proportionality is independent of the state estimation (it’s essentially p(D t+1|D 1:t)p(D_{t+1}|D_{1:t})).

Bringing in computational concerns

Now consider how to approximate this forward formulation process in the case where we have limited computational resources. Clearly it is in general impractical to perform all the “choices” implied by the \foralls. There are two choices at extreme end of the spectrum:

  1. Using representative particles,
  2. Assuming a parametric form for various elements,
  3. A hybrid between parametric and particles.

1. Using representative particles: particle filters

For state distributions that are not believed to be close to parametric forms or probabilistic dynamics which have nonlinear deterministic parts or non-Gaussian stochastic components, one could choose a pdf represention as a set of NN particles, each consisting of a position x\mathbf{x} and a weight ww. A formal mathematical way of viewing a set of particles is as a sum of weighted delta functions in Ω\Omega, and as with this sum it is convenient to allow multiple particles at the same position to correspond to the pdf having a value equal to the sum of their weights. One big advantage of using particles is that the number of particles NN can be adjusted to trade-off between computational effort and accuracy.

The particle representation clearly does not capture some characteristics of the pdfs such as continuity, but importantly gives good approximations to various statistics (e.g., mean, etc) of the pdf and can be used to generate random samples from the pdf.

This is very useful because, in its simplest form, we can simply approximate the forward formulation by direct random simulation: intially set p t+1|1:t+1p_{t+1|1:t+1} to empty and apply

  1. Pick an element x\mathbf{x}' from p t|1:tp_{t|1:t}.
  2. Pick an element of p(xx)p(\mathbf{x}'\mapsto\mathbf{x}) to determine x\mathbf{x}.
  3. Create a new particle in p t+1|1:t+1p_{t+1|1:t+1} at x\mathbf{x} with weight D t+1(x)D_{t+1}(\mathbf{x}). (The factor of p(xx)p t|1:t(x)p(\mathbf{x}'\mapsto\mathbf{x})p_{t|1:t}(\mathbf{x}') is accounted for by the sampling steps.)

NN times, finally normalising the particle representation of p t+1|1:t+1p_{t+1|1:t+1}. A minor technicality is that the sampling in step 1 can be moved after step 3 so that the representation is actually maintained as a set of constant weight particles containing many duplicates. It is important to note that this resampling, wherever it occurs, means that at each step computational effort is probabilistically concentrated on regions of the pdf according to their likelihood: this is very different, and has much greater accuracy, to following an initial set of particles over multiple timesteps without resampling.

This procedure is very simple to implement and is the basic particle filter.

In terms of the forward formulation, this corresponds to replacing the x\forall \mathbf{x}' with NN samplings from the particle set and replacing the xx\forall \mathbf{x}'\mapsto\mathbf{x} with the choice of one sample drawn from the probabilistic dynamics for each particle.

Schematic of particle filter

It is also interesting that these steps just require the ability to sample from the probabilistic dynamics, not any form of analytic representation. Thus this algorithm is very widely applicable. However it is very inefficient if the problem has some known analytical structure that could be used to replace some of the sampling.

Importance sampling in particle filtering


2. Assuming parametric forms: Kalman filters

We can take the representation (i.e., p t|1:tp_{t|1:t} and p t+1|1:t+1p_{t+1|1:t+1}) to be some parametric probability distribution. Then we can look for restrictions on the dynamical model p(xx)p(\mathbf{x}'\mapsto\mathbf{x}) and data distribution D t(x)D_t(\mathbf{x}) which guarantee that for p t+1|1:t+1p_{t+1|1:t+1} will be in the representation whenever p t|1:tp_{t|1:t} is (reminiscent of the closedness condition for a group). This is very, very computationally efficient and is accurate when the actual process fits the model well.

The case where the representation and data distribution are Gaussian and the dynamics are linear plus Gaussian noise is optimally solved by the Kalman filter?.

In terms of the forward formulation, this corresponds to replacing the x\forall \mathbf{x}' with a single particle at the parametric Gaussian state distribution mean and replacing the xx\forall \mathbf{x}'\mapsto\mathbf{x} with the deterministic linear part of the probabilistic dynamics. However in this case the covariance of the distribution can also be propagated exactly analytically so that a new Gaussian state distribution is obtained.

Kalman filter schematic

The Kalman filter is only “exactly correct” in the case the the deterministic part of the dynamics is linear. The Extended Kalman Filter is the simplest extension to non-linear deterministic dynamics and is obtained by linearising the dynamics around the current mean (using Jacobian matrices) and applying the updates for a linear function. How successful this is depends on the degree of nonlinearity relative to the curreent covariance.

In terms of the forward formulation, this corresponds to replacing the x\forall \mathbf{x}' with a single particle at the parametric Gaussian state distribution mean and replacing the xx\forall \mathbf{x}'\mapsto\mathbf{x} with the deterministic part of the probabilistic dynamics. However, by linearising the covariance of the distribution can also be propagated analytically to give an approximate new Gaussian state distribution.

Kalman filter schematic

3. A hybrid between parametric and particles

In filtering one often thinks that the state distribution ought to be a unimodal distribution which ought to be well approximated by a Gaussian but highly non-linear dynamics causes the Kalman filter to perform poorly. In this case one can attempt to propagate a Gaussian pdf through the non-linear function more accurately using the following procedure:

  1. With a Gaussian distribution in kk-d space Ω\Omega, create a particle at the mean and two particles along each principal axis of the covariance matrix at a distance of ±γ\pm\gamma times the standard deviation along that axis (called sigma points). (Here γ\gamma is a tunable parameter and there are 2k+12k+1 particles in total.)
  2. Apply the deterministic part of the dynamics update to each particle to get a new set of particles.
  3. Compute the mean and covariance of this set of particles to get a new Gaussian distribution.
  4. The analytic incorporation of the random dynamics update effect and data term can now be done exactly as in the Kalman filter.

This process is known as the Unscented Kalman Filter or the Sigma-point Kalman Filter and is generally much more accurate than the Extended Kalman Filter for nonlinear dynamics. (One can show that the symmetry causes certain terms in the deterministic update function’s Taylor expansion to cancel, providing some explanation.)

In terms of the forward formulation, this corresponds to replacing the x\forall \mathbf{x}' with 2k+12k+1 particles placed deterministically from the parametric Gaussian state representation and replacing the xx\forall \mathbf{x}'\mapsto\mathbf{x} with selection of the deterministic component of the probabilistic dynamics for each particle, finally converting the particles into a parametric Gaussian with the same mean and covariance.

Unscented Kalman filter schematic

One can take this idea further by considering the Iterated Sigma-point Filter, which attempts to use a more sophisticated iterative Gaussian fitting technique to determine the new Gaussian from the transformed sigma-points.

In terms of the forward formulation, this corresponds to replacing the x\forall \mathbf{x}' with 2k+12k+1 particles placed deterministically from the parametric Gaussian state representation and replacing the xx\forall \mathbf{x}'\mapsto\mathbf{x} with selection of the deterministic component of the probabilistic dynamics for each particle, finally converting the particles into a parametric Gaussian using an iterative fitting scheme.

Iterative unscented Kalman filter schematic

Schematic diagramatic representations

Here’s a schematic illustration of the accuracy of various “transforms” for propagating the representation of a pdf through non-linear dynamics:

types of pdf propagation in filtering


  • TODO