Experiments in software for approximating non-amenable Markov process transitions

*This page describes work in progress and is included purely as it may be interesting to programmers thinking about Azimuth programming projects. It’s not intended to be a easily understandable yet. It’s also currently slightly out of date with respect to the work described at EuroLLVM.*

The “input” code starts off as a an embedded DSL in Haskell which gets converted to a set of C++ structures. The aim of the DSL is to make it possible to write with *relatively little* “extraneous syntax” and allow the use of higher-order functions *to construct code fragments* (HOFs are not within the scope of the actual code). This is a stop-gap which requires understanding of how the expressions are converted to avoid problems in the parser. The decision to produce a set of C++ datastructures, rather than calling LLVM directly from GHC is for a pragmatic reason. This design allows replacing the DSL with something else: while GHC is a very nice and productive language it’s also a quite heavyweight build dependency given that Haskell is not involved beyond the initial “transliteration”.

The major reason for this additional C++ representation is that it’s easier to automatically specialise the code (eg, set parameter *alpha* always equal to parameter *beta*) than working with higher level representation. The derivative terms (following the naming convention of *dadb* for $\frac{\partial a}{\partial b}$) have been computed automatically in the process of construction, which explains why there are some which appear to be missing: they’ve been determined to be always zero and hence don’t need computing. (That also explains the jumps in the temporary variable *vtN* numbering. These correspond to variables coming from “degenerate instruction” that were determined to be identical to another variable, often arising from adding an exactly known to be 0 quantity to another.)

Also the code fragment below is designed for “shaking out the bugs” in the implementation and is designed to have some of the *features* expected in model code without actually being a remotely “realistic” model of anything at this stage.

Original numerical simulation specification:

```
[
oMatrix <~ [[0*alpha,1,2],[3,4*beta,5],[6,7,8*gamma]]
,rv0 <~ randsample
,rv1 <~ randsample
,rv2 <~ randsample
,outs <~ oMatrix `matMult` [rv0,rv1,rv2]
,outsDerivs <~ derivatives outs [alpha,beta,gamma]
,outs2 <~ sumVec outsDerivs
]
```

(The odd “wiggly arrow” assignment operator is used just because it’s free in standard Haskell.)

Mid level “input code”:

```
vt1 := -1
vt2 := 0
vt3 := 1
orig0 := 0
orig1 := 1
orig2 := 2
orig3 := 3
orig4 := 4
orig5 := 5
orig6 := 6
orig7 := 7
orig8 := 8
vt29 := Mul orig0 alpha
vt30 := Mul orig4 beta
vt31 := Mul orig8 gamma
rv0 := randsample
rv1 := randsample
rv2 := randsample
//compute output one
vt32 := Mul orig2 rv2
vt33 := Mul orig1 rv1
vt34 := Mul vt29 rv0
dvt34_dalpha := Mul rv0 orig0
vt35 := Add vt34 vt33
vt36 := Add vt35 vt32
//compute output two
vt37 := Mul orig5 rv2
vt38 := Mul vt30 rv1
dvt38_dbeta := Mul rv1 orig4
vt39 := Mul orig3 rv0
vt40 := Add vt39 vt38
vt41 := Add vt40 vt37
//compute output three
vt42 := Mul vt31 rv2
dvt42_dgamma := Mul rv2 orig8
vt43 := Mul orig7 rv1
vt44 := Mul orig6 rv0
vt45 := Add vt44 vt43
vt46 := Add vt45 vt42
// sum the partial derivatives of the 3 outputs
accumulator := 0
vt49 := Add accumulator dvt34_dalpha
vt53 := Add vt49 dvt38_dbeta
vt57 := Add vt53 dvt42_dgamma
```

The semantics are mostly obvious, with the exception that a *randsample* denotes an implicit loop trying various samples from the distribution and computing all their results.

Code after simple conversion to LLVM (note appearance of 3 loops corresponding to 3 *randsample*s):

```
define float @simulationFn([32 x float]* %output, [32 x float]* %input, [32 x float]* %params, [32 x float]* %distribs) {
entry:
%0 = load [32 x float]* %output
%1 = load [32 x float]* %input
%2 = load [32 x float]* %params
%3 = load [32 x float]* %distribs
%alpha = extractvalue [32 x float] %1, 0
%beta = extractvalue [32 x float] %1, 1
%gamma = extractvalue [32 x float] %1, 2
%mu = extractvalue [32 x float] %2, 0
%nu = extractvalue [32 x float] %2, 1
%sigma = extractvalue [32 x float] %2, 2
%tau = extractvalue [32 x float] %2, 3
%vt29 = fmul float 0.000000e+00, %alpha
%vt30 = fmul float 4.000000e+00, %beta
%vt31 = fmul float 8.000000e+00, %gamma
br label %loop_rv0
loop_rv0: ; preds = %afterloop_rv1, %entry
%loopCtr_rv0 = phi i32 [ 32, %entry ], [ %4, %afterloop_rv1 ]
%4 = sub i32 %loopCtr_rv0, 1
%5 = getelementptr [32 x float]* %distribs, i32 0, i32 %4
%rv0 = load float* %5
br label %loop_rv1
loop_rv1: ; preds = %afterloop_rv2, %loop_rv0
%loopCtr_rv1 = phi i32 [ 32, %loop_rv0 ], [ %6, %afterloop_rv2 ]
%6 = sub i32 %loopCtr_rv1, 1
%7 = getelementptr [32 x float]* %distribs, i32 0, i32 %6
%rv1 = load float* %7
br label %loop_rv2
loop_rv2: ; preds = %loop_rv2, %loop_rv1
%loopCtr_rv2 = phi i32 [ 32, %loop_rv1 ], [ %8, %loop_rv2 ]
%8 = sub i32 %loopCtr_rv2, 1
%9 = getelementptr [32 x float]* %distribs, i32 0, i32 %8
%rv2 = load float* %9
%vt32 = fmul float 2.000000e+00, %rv2
%vt33 = fmul float 1.000000e+00, %rv1
%vt34 = fmul float %vt29, %rv0
%dvt34_dalpha = fmul float %rv0, 0.000000e+00
%vt35 = fadd float %vt34, %vt33
%vt36 = fadd float %vt35, %vt32
%vt37 = fmul float 5.000000e+00, %rv2
%vt38 = fmul float %vt30, %rv1
%dvt38_dbeta = fmul float %rv1, 4.000000e+00
%vt39 = fmul float 3.000000e+00, %rv0
%vt40 = fadd float %vt39, %vt38
%vt41 = fadd float %vt40, %vt37
%vt42 = fmul float %vt31, %rv2
%dvt42_dgamma = fmul float %rv2, 8.000000e+00
%vt43 = fmul float 7.000000e+00, %rv1
%vt44 = fmul float 6.000000e+00, %rv0
%vt45 = fadd float %vt44, %vt43
%vt46 = fadd float %vt45, %vt42
%vt49 = fadd float 0.000000e+00, %dvt34_dalpha
%vt53 = fadd float %vt49, %dvt38_dbeta
%vt57 = fadd float %vt53, %dvt42_dgamma
%loopcond_rv2 = icmp sgt i32 %8, 0
br i1 %loopcond_rv2, label %loop_rv2, label %afterloop_rv2
afterloop_rv2: ; preds = %loop_rv2
%loopcond_rv1 = icmp sgt i32 %6, 0
br i1 %loopcond_rv1, label %loop_rv1, label %afterloop_rv1
afterloop_rv1: ; preds = %afterloop_rv2
%loopcond_rv0 = icmp sgt i32 %4, 0
br i1 %loopcond_rv0, label %loop_rv0, label %afterloop_rv0
afterloop_rv0: ; preds = %afterloop_rv1
ret float %vt57
}
```

Result after optimization at LLVM level:

```
define float @simulationFn([32 x float]* %output, [32 x float]* %input, [32 x float]* %params, [32 x float]* %distribs) {
entry:
br label %loop_rv0
loop_rv0: ; preds = %afterloop_rv1, %entry
%loopCtr_rv0 = phi i32 [ 32, %entry ], [ %0, %afterloop_rv1 ]
%0 = add i32 %loopCtr_rv0, -1
%1 = getelementptr [32 x float]* %distribs, i32 0, i32 %0
%rv0 = load float* %1, align 4
br label %loop_rv1
loop_rv1: ; preds = %afterloop_rv2, %loop_rv0
%loopCtr_rv1 = phi i32 [ 32, %loop_rv0 ], [ %2, %afterloop_rv2 ]
%2 = add i32 %loopCtr_rv1, -1
%3 = getelementptr [32 x float]* %distribs, i32 0, i32 %2
%rv1 = load float* %3, align 4
br label %loop_rv2
loop_rv2: ; preds = %loop_rv2, %loop_rv1
%loopCtr_rv2 = phi i32 [ 32, %loop_rv1 ], [ %4, %loop_rv2 ]
%4 = add i32 %loopCtr_rv2, -1
%5 = getelementptr [32 x float]* %distribs, i32 0, i32 %4
%rv2 = load float* %5, align 4
%dvt34_dalpha = fmul float %rv0, 0.000000e+00
%dvt38_dbeta = fmul float %rv1, 4.000000e+00
%dvt42_dgamma = fmul float %rv2, 8.000000e+00
%vt49 = fadd float %dvt34_dalpha, 0.000000e+00
%vt53 = fadd float %vt49, %dvt38_dbeta
%vt57 = fadd float %vt53, %dvt42_dgamma
%loopcond_rv2 = icmp sgt i32 %4, 0
br i1 %loopcond_rv2, label %loop_rv2, label %afterloop_rv2
afterloop_rv2: ; preds = %loop_rv2
%loopcond_rv1 = icmp sgt i32 %2, 0
br i1 %loopcond_rv1, label %loop_rv1, label %afterloop_rv1
afterloop_rv1: ; preds = %afterloop_rv2
%loopcond_rv0 = icmp sgt i32 %0, 0
br i1 %loopcond_rv0, label %loop_rv0, label %afterloop_rv0
afterloop_rv0: ; preds = %afterloop_rv1
ret float %vt57
}
```

It’s not immediately obvious what some of these optimizations leading to this code were!

However, we see that even after optimization, at the level of the intermediate code there are still some additions and multiplications where one argument is $0.000000$. It’s possible that final optimizations could be performed during the final conversion from machine-independent to specific code for a particular CPU. However, given the subtlety of floating-point it’s unlikely that these operations will be eliminated – partly because the obvious simplifications *aren’t* valid if the values are “special” floating-point values.

Here’s an example of output created using a slightly bigger model using a variant of the above dynamic compilation via LLVM.

The surface height at a point is essentially frequency (so that appropriately scaled they’d become transition probability). This graph clearly shows a distribution after transformation which is likely a piecewise function with some interesting characteristics which are the subject of further investigation.

category: experiments, software