Measuring Causality

We have an intuitive sense of causality, but how can we define it mathematically? In Defining Causality we saw a definition which revolves around intervention. But in that post we assumed that (1) we could observe all the variables in our model, and (2) we had complete access to the exact parameters of our underlying distribution. In reality (1) there are things we can’t observe, and (2) the things we can observe are samples, not populations.

The Addictive Gene (recap)

Consider the causal diagram in Figure 1. In this example we have the binary variables indicating whether someone:

  1. Smokes
  2. Drinks
  3. Dies
  4. Has the (fictional) addictive gene, gene X, which makes an individual more likely to both smoke and drink


Figure 1: Causal diagram for smoking (observational distribution)

For some choice of parameters for this distribution, we arrive at the underlying distribution visualised by the probability bars in Figure 2.


Figure 2: Probability bars for smoking (observational distribution)

Our big causal question here is: does smoking cause death? By eyeballing Figure 2 we can see that under the smoking (orange) bars there is a higher rate of death (red) bars. Is our question as simple as that? Sadly no. The difficulty is that under the smoking bars there is also a higher rate of drinking (green) bars, due to the influence of gene X. How can we correctly adjust for the fact that the drinking variable also contributes to higher mortality?

The key to a correct answer is to examine a set of different distributions, namely the experimental distributions. The importance of these distributions was discussed in the last post, and they are visualised in the probability bars in Figure 3 and Figure 4.

Our task now is to estimate them.


Figure 3: Probability bars for the everyone experimental distribution.


Figure 4: Probability bars for the no-one experimental distribution.

Recovering the experimental distribution

Suppose that we only have access to samples from the observational distribution. It is then easy to estimate the observational distribution through sample averages. So given that we can (to some extent) recover the observational distribution, how do we use this to recover the experimental distributions? To answer this, let us begin with the following interpretation.

Imagine that we are Mother Nature, and it is our job to generate samples from the observational distribution. The causal diagram is important because it provides rules for the order in which we may proceed.

First, we draw the root nodes, which are our first causes, each according to its own sacred distribution, not depending on any other node in the graph. Then we may draw any remaining node at any time, so long as its parents have already been drawn; for it is the outcomes of the parents which determine the distribution from which we draw. In the smoking example, we can proceed with one of two different ordering strategies:

  1. gene X → drinking → smoking → death
  2. gene X → smoking → drinking → death

Okay, job done for the Mother Nature who generates observational samples; we shall now refer to her as Observational Mother Nature.

Now imagine that we are an Experimental Mother Nature, and we wish to generate samples from the experimental distribution. The obvious technique would be the intervention technique. We proceed as before but with one crucial change: at the moment where we are about to draw the experimental variable, we instead forcibly set this variable to our experimental value, then continue as normal, drawing the remaining downstream variables. In our smoking example, just at the point where we would have randomly determined whether an individual was a smoker, we instead force them to be one.
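The two generating procedures described so far can be sketched in a few lines of Python. This is only an illustration: the parameter values, the function names and the seed are assumptions chosen for the sketch, and the 50% death rate is read as applying to people with exactly one of the two habits.

```python
import random

# Illustrative parameters (assumptions for this sketch, not real-world estimates).
P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene X) and P(drink | gene X)
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)


def draw_person(rng, do_smoke=None):
    """Draw one person in the order gene X -> drinking -> smoking -> death.

    With do_smoke=None we play Observational Mother Nature. Otherwise the
    smoking variable is forcibly set to do_smoke instead of being drawn.
    """
    gene = rng.random() < P_GENE
    drink = rng.random() < P_HABIT[gene]
    if do_smoke is None:
        smoke = rng.random() < P_HABIT[gene]
    else:
        smoke = do_smoke
    death = rng.random() < P_DEATH[(smoke, drink)]
    return {"gene_x": gene, "drink": drink, "smoke": smoke, "death": death}


def death_rate(people):
    """Fraction of the given samples in which the person dies."""
    return sum(p["death"] for p in people) / len(people)


rng = random.Random(0)
observed = [draw_person(rng) for _ in range(100_000)]
forced = [draw_person(rng, do_smoke=True) for _ in range(100_000)]
smokers = [p for p in observed if p["smoke"]]

print("death rate among observed smokers:", round(death_rate(smokers), 3))
print("death rate when everyone is forced to smoke:", round(death_rate(forced), 3))
```

Under these assumed parameters the two rates should differ: the observed smokers drink more often than the forced smokers do, which is exactly the contribution of gene X that we want to adjust away.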

But here is another technique: the restriction technique. The key insight here is that from the point of intervention, the life of a person who we force into being a smoker is exactly the same as the life of a person who was determined to be a smoker by chance. Therefore, given a large set of samples generated by Observational Mother Nature, Experimental Mother Nature can piggy back off these in the following way:

  1. Divide up Observational Mother Nature’s samples into groups depending on the history prior to the experimental variable. That is, each separate combination of outcomes occurring in the variables drawn upstream of the experimental variable gets its own group.
  2. At that point, keep only the samples which agree with our experimental value.
  3. Stitch together these remaining samples, weighting each group by the history distribution, that is the original frequencies of each history group.

In our example, suppose Observational Mother Nature is following the first ordering strategy: gene X → drinking → smoking → death. Then there are four history groups before the smoking variable is drawn. The restricting and re-weighting which we describe has been visualised in Figure 5. This is also referred to as adjusting for gene X and drinking.

So here we have it: miraculously we recover the entire experimental distribution without any experimental data!
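To make the restrict and re-weight recipe concrete, here is a minimal sketch that applies it to an exact observational distribution rather than to samples. The parameter values and helper names (`bern`, `prob`) are assumptions for illustration only.

```python
from itertools import product

# Illustrative parameters (assumptions for this sketch).
P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene X) and P(drink | gene X)
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1 - p


# Observational joint distribution, keyed by (gene_x, drink, smoke, death).
joint = {
    (g, d, s, y): bern(P_GENE, g) * bern(P_HABIT[g], d)
                  * bern(P_HABIT[g], s) * bern(P_DEATH[(s, d)], y)
    for g, d, s, y in product([True, False], repeat=4)
}


def prob(event):
    """Total probability of the outcomes for which event(...) is true."""
    return sum(p for outcome, p in joint.items() if event(*outcome))


experimental_death = 0.0
for g, d in product([True, False], repeat=2):
    # Step 1: the history group (gene X = g, drinking = d) and its original frequency.
    history_weight = prob(lambda G, D, S, Y: (G, D) == (g, d))
    # Step 2: within the group, keep only the smokers.
    kept = prob(lambda G, D, S, Y: (G, D, S) == (g, d, True))
    kept_dead = prob(lambda G, D, S, Y: (G, D, S, Y) == (g, d, True, True))
    # Step 3: stitch together, weighting by the history distribution.
    experimental_death += history_weight * kept_dead / kept

print(round(experimental_death, 4))
```

With these assumed parameters the result agrees with what the intervention technique would give, without ever forcing anyone to smoke.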


Figure 5: Adjusting for drinking and gene X, proof-by-picture. (I have switched the ordering of smoke and drink to make the comparison clearer)

Note on how to read these graphs

In this and all the pictures that follow, we have the observational distribution on the top, and the experimental distribution on the bottom (sometimes marginalised to a subset of variables). The blue boxes separate out each history group – i.e. each combination of our adjustment variables – but restricted to the experimental value.

The arrows indicate that after restricting and re-weighting, the distributions are the same.

Mother Nature: the second ordering strategy

What if Mother Nature decided to draw variables according to the other ordering strategy: gene X → smoking → drinking → death? Then only gene X is upstream of the smoking variable, and we have only two possible histories. And indeed adjusting only for gene X still works: see Figure 6.


Figure 6: Adjusting for gene X only

Also, it is overkill to want to recover the entire experimental distribution: often we only care about the causal effect, which is the marginal distribution of the dependent variable (in this case, mortality).

If this is so, then according to the second ordering strategy we don’t even need to see the outcome of the drinking variable in our observational sample; if it were hidden from us, we could still recover the marginal of the remaining three variables and, in particular, the causal effect. See Figure 7 for the visualisation.


Figure 7: Adjusting for gene X, with the drinking variable being unobserved.

You will notice that this graph looks different because we can no longer group by whether the individual is a drinker or not. E.g. for those with gene X, all of the drinkers and non-drinkers are grouped together, but it still works.
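Written as a formula, adjusting for gene X alone looks roughly as follows. The index g (ranging over the two outcomes of gene X) and the subscript notation for the everyone experimental distribution are notational assumptions made here; restrict to smokers within each gene X group, then re-weight by the original prevalence of that group:

\mathbb{P}_\text{everyone}(\text{death}) = \sum_{g} \mathbb{P}(\text{gene X} = g) \, \mathbb{P}(\text{death} \mid \text{smoke}, \text{gene X} = g)

Drinking does not appear anywhere on the right-hand side, which is why it can stay hidden.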

Mother Nature: for two players

Let’s make things more complicated.

Imagine now that Observational Mother Nature is tired of drawing all the variables herself, so she divides the set of variables in two. Observational Mother Nature 1 (MN1) draws all the variables up to and including the experimental variable, then she passes her outcomes over to Observational Mother Nature 2 (MN2), who draws the remaining downstream variables. She only needs to pass along the relevant outcomes: those which have a child in the downstream variables.

Suppose we only care about the causal effect – i.e. the dependent variable – then MN1 doesn’t need to pass on all of her outcomes, only those which are relevant for drawing downstream variables: only those which have children among the downstream variables. For the smoking example in the case of the first ordering, this two-player method is visualised in Figure 8.


Figure 8: Two player generating of the observational distribution

Experimental Mother Nature (EMN) realises that to play the intervention technique, she doesn’t need to interfere with what MN1 is doing, she needs only to intervene with MN2. Her technique involves intervening just after receiving the outcomes from MN1, and her method is as follows:

  1. Take MN1’s outcomes (which include the experimental variable)
  2. Forcibly set the experimental variable to the experimental outcome
  3. Take over the task of MN2 in drawing all of the remaining variables.

For the smoking example, this method is depicted in Figure 9. Note that it is crucial that the experimental variable was the last thing to be drawn by MN1 before hand-off! So long as this is the case, we can be sure that tinkering with this variable would have no influence on the other variables handed-off by MN1.


Figure 9: Two player generating of the experimental distribution

And the restriction technique should still work. EMN might not know the outcomes of all the upstream variables drawn by MN1, but so long as EMN preserves the distribution of groups as given to her by MN1, then restricting and re-weighting ought to work just like before, for the same reason as before.

Returning then for the last time to our example, the restriction technique is the following:

  1. MN1 draws gene X, drinking, smoking
  2. MN1 passes only the drinking and smoking outcome to EMN
  3. EMN restricts only to the handed-off outcomes for which the individual smokes
  4. EMN re-weights according to the distribution of groups handed-off by MN1: i.e. preserving the original prevalence of drinkers and non-drinkers

This has been depicted in Figure 10. What is the corollary? We can recover the causal effect even when gene X is hidden from view!
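As a rough check of this claim, the sketch below hides gene X by summing it out of the observational distribution, and then restricts and re-weights using the drinking variable alone. The parameter values and names are illustrative assumptions.

```python
from itertools import product

# Illustrative parameters (assumptions for this sketch).
P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene X) and P(drink | gene X)
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1 - p


# What we are allowed to see: (drink, smoke, death) with gene X summed out.
visible = {}
for g, d, s, y in product([True, False], repeat=4):
    p = (bern(P_GENE, g) * bern(P_HABIT[g], d)
         * bern(P_HABIT[g], s) * bern(P_DEATH[(s, d)], y))
    visible[(d, s, y)] = visible.get((d, s, y), 0.0) + p


def adjusted_death_rate(smoke):
    """Restrict to the given smoking value, re-weighting by drinking prevalence."""
    total = 0.0
    for d in (True, False):
        p_drink = sum(p for (D, S, Y), p in visible.items() if D == d)
        p_kept = visible[(d, smoke, True)] + visible[(d, smoke, False)]
        total += p_drink * visible[(d, smoke, True)] / p_kept
    return total


print("everyone smokes:", round(adjusted_death_rate(True), 4))
print("no-one smokes:  ", round(adjusted_death_rate(False), 4))
```

Only the three-variable table `visible` is used inside `adjusted_death_rate`, so the gene X column is never consulted after the hand-off.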


Figure 10: Adjusting for drinking, when gene X is unobserved.

Back door paths

This is a very convoluted attempt to make intuitive what the literature refers to as blocking all the back door paths [1]. The main takeaway is that for any group of variables chosen in a way that is consistent with the above fairy-tale, the adjustment we describe here correctly transforms the observational distribution into the experimental distribution. These groups are exactly those which are said to block all the back door paths.

In fact there are other valid groups and different methods of adjustment, so the fun doesn’t stop here. If you have any ideas for crazy ways of explaining any of these other methods, then please get in touch.



[1] For a technical exposition of what it means to block all the back door paths, see Pearl, J. (2009). Causality. Cambridge University Press.

Defining Causality

Causality is a confusing concept. It seems to be something that we understand intuitively, but in neither maths nor science do we have an agreed upon technical definition. Part of the problem, as usual, is that we are using one word to describe more than one thing. Here I will discuss forward and backward causality. And in the case of forward causality, I want to introduce interventions as a good candidate for an agreed upon definition.

Forward and Backward Causality

Example: House Fire

Consider the following collection of events: a short circuit in someone’s house creates a spark which sets on fire some nearby curtains. The local firefighters are nowhere to be found, and the house burns down.

In this example, what was the cause of the fire? Was it the short circuit, the curtain, or the absence of firefighters? How can the absence of something be a cause? If it was the short circuit then naturally we must ask, what was the cause of the short circuit? Was it dodgy manufacturing? Or misuse of the product? Essentially the question is this: who is to blame?

This is a question of backward causality. In general, the question goes “what were the inputs that led to this output? And how much did each one contribute?” Answering this question is tricky business. In its truest interpretation, it feels like the answer is always “the big bang did it”. Indeed some philosophers give up on this type of question altogether, and claim that our intuitive notion of causality in this sense is a fiction; our intuitions deceive us, just like how we are deceived by our intuitions for space and time. Free will, ethics, and the justice system all completely depend on this concept of backward causality.

A different question is that of forward causality. This is not the causality of credit assignment, but the causality of decision making. Here are some examples: What is the impact on my life if I take up smoking? If my dishwasher was built with a dodgy circuit, how does that affect the probability of my house burning down? These questions sound a bit more approachable, but still it’s not as easy as you might think. Two big problems are the inadequacy of correlation, and the complications which arise from confounding variables.

In summary, a backward question fixes a value of the outcome Y = y, and asks about the antecedents X, whereas a forward question specifies or toggles a fixed input X = x, and asks about the consequences for some descendant Y. Backward questions seem very hard but, as far as I can tell, methods of intervention are a great way to deal with forward questions. We will describe the method of interventions, also known as the causal calculus, below. We will see that it goes beyond correlation, and gracefully handles confounders. Furthermore, we might hope that a deeper understanding of forward causality will give us an insight into the confusing world of backward causality.

The Addictive Gene

Let’s look at the simple example given in Figure 1. We suppose the existence of an “addictive gene”, gene X. People with gene X are more susceptible to addictive substances, such as smoking and drinking, both of which have an impact on that person’s mortality. Our aim is to interpret and answer the following question: does smoking cause death?


Figure 1: Causal diagram for smoking (observational distribution)

Without getting into too much detail, the causal diagram restricts the type of interactions we can have between these variables. In particular gene X can influence death only via the means of making someone more likely to smoke or drink. Correspondingly, if we know whether someone drinks and/or smokes, then whether they have gene X or not is no longer relevant in determining their mortality.

There are many distributions with this causal diagram, and we shall choose the one with the following parameters.

\mathbb{P}\left(\text{gene X}\right) =50\%
\mathbb{P}\left(\text{smoke}\mid\text{gene X}\right), \mathbb{P}\left(\text{drink}\mid\text{gene X}\right) =75\%
\mathbb{P}\left(\text{smoke}\mid{\neg\text{{gene X}}}\right), \mathbb{P}\left(\text{drink}\mid\neg\text{{gene X}}\right) =25\%
\mathbb{P}\left(\text{death}\mid \text{{smoke}} \,\&\, \neg\text{{drink}} \right), \mathbb{P}\left(\text{death}\mid \neg\text{{smoke}} \,\&\, \text{{drink}} \right) =50\%
\mathbb{P}\left(\text{death}\mid \text{{smoke}} \,\&\, \text{{drink}} \right) =75\%
\mathbb{P}\left(\text{death}\mid \neg\text{{smoke}} \,\&\, \neg\text{{drink}} \right) =20\%

An intuitive picture of the resulting distribution is given in Figure 2. I call these “probability bars”; they show that the population is divided into sixteen groups, one for each combination of these four binary variables, and the width of each group represents their probability within the distribution.
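The widths of the sixteen bars follow directly from the parameters, by multiplying along the causal diagram. Here is a minimal sketch, assuming that the 50% death rate applies to people with exactly one of the two habits; the names `bern` and `bars` are made up for the example.

```python
from itertools import product

P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene X) and P(drink | gene X)
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1 - p


# Width of each probability bar, keyed by (gene_x, smoke, drink, death).
bars = {
    (g, s, d, y): bern(P_GENE, g) * bern(P_HABIT[g], s)
                  * bern(P_HABIT[g], d) * bern(P_DEATH[(s, d)], y)
    for g, s, d, y in product([True, False], repeat=4)
}

# Print the groups from widest to narrowest.
for group, width in sorted(bars.items(), key=lambda item: -item[1]):
    print(group, f"{width:.4f}")
```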


Figure 2: Probability bars for smoking (observational distribution)


Here is a bad definition of causality:

Bad definition

X is a cause of Y if P(Y|X) > P(Y). That is, observing X increases the likelihood of observing Y relative to the base-rate.

This is a tempting definition because most of the causal relationships we like to imagine do indeed satisfy this relationship. For that reason, I would even argue that it serves as a good proxy for causality. In the above setup, for example, our intuition tells us that smoking is a cause of death, and this definition agrees with that. We can calculate the required probabilities exactly to be:

\mathbb{P}(\text{death}) \approx 55.47\%
\mathbb{P}(\text{death} \mid \text{smoke}) = 65.625\%

And we can sanity check these numbers by eyeballing the probability bars and e.g. seeing that around two thirds of the orange bars are also red.

Thus according to the definition above, smoking is indeed a cause of death. So why is this a bad definition? It might already be clear to you, especially if you are familiar with the phrase correlation does not imply causation: what we have here is an artefact of correlation. With a sillier example we can see that it is clearly insufficient.

Rain and raincoats


Figure 3: Graphical model for rain and raincoats

Suppose that raincoats and umbrellas only appear when it rains, and that when it does rain we sometimes see raincoats and, independently, sometimes see umbrellas. We give the graphical model in Figure 3, the probabilities below, and the probability bars in Figure 4.

\mathbb{P}(\text{rain}) =50\%
\mathbb{P}(\text{raincoats} \mid \text{rain}) =90\%
\mathbb{P}(\text{umbrellas} \mid \text{rain}) =90\%


Figure 4: Probability bars for rain and raincoats

We therefore have that \mathbb{P}(\text{umbrellas} | \text{raincoats}) = 90\%, an increase on the base rate for umbrellas \mathbb{P}(\text{umbrellas}), which stands at 45%. So according to our definition, raincoats cause umbrellas. Suspicious.


It’s clear in the raincoats example that we have a confounding variable, i.e. both raincoats and umbrellas have the shared cause of rain. What a nuisance. In fact, in some contexts such a variable is also known as a nuisance variable. How can we account for this? How can we decide whether raincoats cause umbrellas?

Solution: perform an experiment. First we force raincoats into existence, then we measure the impact of this intervention on our output variable umbrellas. If the intervention gives an increase in the rate of umbrellas then we shall decree that raincoats are a cause of umbrellas.

This type of experiment is called a randomised controlled trial (RCT) and is considered to be the gold standard for causal inference. To carry this out correctly over the course of, say, a month, we would randomly choose some subset of days on which to do nothing, and on the remaining days we would perform our intervention.

We can see that this experiment would reveal that raincoats actually have no impact on umbrellas. But how can we formalise this conclusion?

Formalisation of interventions

The subtlety here is that while what we see in the natural world comes from one distribution, questions of forward causality pertain to a different distribution, what’s called the experimental distribution. This is the distribution we get from performing an experiment, such as the one described above with the umbrellas.

We must introduce a retronym for what we have left behind: the observational distribution. This was our first distribution, with probability bars in Figure 2, and it is to be thought of as the natural distribution governing the business-as-usual relationships of these variables. If we were to sample from the real world, then the data would behave as if it were drawn from this distribution.

Returning to the experimental distribution, first we must decide on an experimental variable, and an experimental value. Then we proceed as if we have intervened on our experimental variable, setting it to our experimental value. I.e., we take our causal diagram for the observational distribution and remove all of the arrows going into our experimental variable. Then for the children of this variable, we proceed as if we had observed a value equal to our experimental value.

For the gene X example, our experimental variable is the smoking variable, and our experimental value is “true”. What we get is the result of hypothetically forcing everyone to smoke, regardless of gene X, regardless of whether they drink, regardless of anything! The causal diagram for the resulting experimental distribution is given in Figure 5. Again we can visualise this distribution by looking at the corresponding probability bars, which are given in Figure 6.
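The arrow-removal step can be sketched on the diagram alone, before any probabilities are involved. The dictionary encoding of the graph and the function name `intervene` are assumptions made for this illustration.

```python
# Parents of each node in the observational causal diagram (Figure 1).
observational = {
    "gene_x": [],
    "smoke": ["gene_x"],
    "drink": ["gene_x"],
    "death": ["smoke", "drink"],
}


def intervene(parents, variable):
    """Return a copy of the diagram with every arrow into `variable` removed."""
    surgered = {node: list(pa) for node, pa in parents.items()}
    surgered[variable] = []
    return surgered


experimental = intervene(observational, "smoke")
for node, pa in experimental.items():
    print(node, "<-", pa)
```

Only the arrows into the smoking variable are cut; its arrow out to death is left untouched, so the children of smoking still respond to whatever value we set.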


Figure 5: Causal diagram for smoking (experimental distribution)


Figure 6: Probability bars for the everyone experimental distribution.

There are in fact two experimental distributions in this case: the first is when we intervene telling everyone that they must smoke, the second is when we tell everyone that they must not smoke. We shall call the case when everyone smokes the “everyone experimental distribution” — probability bars in Figure 6 — and the case when no-one smokes we will call the “no-one experimental distribution” — probability bars in Figure 7.


Figure 7: Probability bars for the no-one experimental distribution.

A better definition of Causality

These new distributions give rise to new probability measures. We could write these as \mathbb{P}_\text{everyone} and \mathbb{P}_\text{no-one}. Then we can calculate for example \mathbb{P}_\text{everyone}(\text{death}) = 62.5\%. But we will opt for some slightly more suggestive notation, making use of the do-operator [1]. The do-operator way to write this expression is as follows: \mathbb{P}(\text{death} \mid \text{\textit{do}(smoke)}) = 62.5\%. The idea being that we are considering the distribution in which we make everyone smoke, or (perhaps more confusingly put) we make everyone do smoking.
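The 62.5% can be checked by hand. In the experimental distribution everyone smokes, while drinking is still governed by gene X, so we first need the overall rate of drinking and then average the two relevant death rates. This worked calculation assumes that the 50% death rate in the parameter list applies to smokers who do not drink:

\mathbb{P}(\text{drink}) = 50\% \times 75\% + 50\% \times 25\% = 50\%
\mathbb{P}(\text{death} \mid \text{\textit{do}(smoke)}) = \mathbb{P}(\text{drink}) \times 75\% + \mathbb{P}(\neg\text{drink}) \times 50\% = 62.5\%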

So what does this mean for our question: does smoking cause death? We now have the tools to give the intervention definition of causality.

Definition: Causality

We say that X is a cause of Y if \mathbb{P}(Y \mid \text{\textit{do}}(X)) > \mathbb{P}(Y).

Applied to our example of smoking and death, we see that \mathbb{P}(\text{death} \mid \text{\textit{do}}(\text{smoke})) = 62.5\% > 55.47\% \approx \mathbb{P}(\text{death}). So we still conclude that smoking is a cause of death. What about the example with the raincoats and the umbrellas? Let’s take a look at the probability bars for the experimental distribution in which we force people to wear raincoats, Figure 8.


Figure 8: Probability bars for the everyone wears raincoats experimental distribution.

We can calculate that \mathbb{P}(\text{umbrellas} \mid \text{\textit{do}}(\text{raincoats})) = 45\% which is identical to the base-rate that we have in the observational. I.e. raincoats do not cause umbrellas.
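The three raincoat numbers can be reproduced with a small enumeration. This is a sketch under the assumption, implied by the 45% base rate, that neither raincoats nor umbrellas ever appear on dry days; the function names are made up for the example.

```python
from itertools import product

P_RAIN = 0.5
P_COAT = {True: 0.9, False: 0.0}       # assumption: no raincoats on dry days
P_UMBRELLA = {True: 0.9, False: 0.0}   # assumption: no umbrellas on dry days


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1 - p


def joint(do_coat=None):
    """Joint distribution over (rain, coat, umbrella), optionally forcing coat."""
    table = {}
    for rain, coat, umbrella in product([True, False], repeat=3):
        if do_coat is None:
            p_coat = bern(P_COAT[rain], coat)        # raincoats listen to the rain
        else:
            p_coat = 1.0 if coat == do_coat else 0.0  # arrow from rain removed
        table[(rain, coat, umbrella)] = (
            bern(P_RAIN, rain) * p_coat * bern(P_UMBRELLA[rain], umbrella)
        )
    return table


def p_umbrella(table, given_coat=None):
    """P(umbrellas), optionally conditioned on the observed raincoat value."""
    rows = [(k, p) for k, p in table.items()
            if given_coat is None or k[1] == given_coat]
    return sum(p for k, p in rows if k[2]) / sum(p for _, p in rows)


observational = joint()
print("P(umbrellas)                 =", round(p_umbrella(observational), 4))
print("P(umbrellas | raincoats)     =", round(p_umbrella(observational, True), 4))
print("P(umbrellas | do(raincoats)) =", round(p_umbrella(joint(do_coat=True)), 4))
```

Conditioning on raincoats tells us that it is raining, whereas forcing raincoats tells us nothing about the weather, which is why only the first of the two moves the umbrella rate.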


The consideration of experimental distributions gives us a working definition of causality. We have shown that this is a clear improvement over correlation, but the many other merits of this definition remain to be shown.

Next time I would like to discuss the case where we have unknown parameters. In the examples so far given we have assumed complete knowledge of the underlying distribution. But in reality we don’t know \mathbb{P}(\text{death} \mid \text{smoke}). These parameters are hidden from view, and they must be estimated from the data. But data is usually drawn from an observational distribution, so how do we estimate the experimental distributions? How do we estimate \mathbb{P}(\text{death} \mid \text{\textit{do}(smoking)})? One method is to perform a randomised controlled trial, where we literally go out there and tell some people that they must smoke, and others that they must not. But surprisingly this is not always necessary, and in some cases it is possible to estimate the experimental distributions from observational data alone.



[1] The ideas behind the do-operator were developed in the mid 90s by Judea Pearl. His textbook on causality covers this material in great detail.

The Likelihood Principle

The correctness or otherwise of different statistical methods can be a difficult and contentious topic. To this end, I want to talk about an interesting trio of principles, and a result which I found surprising when I first heard it. And still I am trying to develop an intuition for what it means!

What is Evidence?

Without wanting to sound too dramatic, I would argue that everything is evidence. But good evidence on the other hand is that which is useful to a scientist; which can be used as ammunition in an argument for or against a certain theory. And often, in search of good evidence, a scientist sets up an experiment.

In this article we will use the following formalisation. An instance of statistical evidence is a pair describing the experiment and the observed outcome:

(E, x)

The description E may contain all sorts of information as to how the experiment is conducted, and the likelihoods of all possible outcomes according to different hypotheses. In our case, the hypotheses relate to the possible values of some unknown parameter \theta, which we assert as belonging to some parameter-space \Theta. Moreover we will be fully describing the experiment with a likelihood function f(x, \theta).

What is the problem that we are trying to solve?

Ideally, if two statisticians perform the same experiment and get the same outcome, then they should draw the same conclusions. But in what other cases should we expect their conclusions to be the same? For example, what if the outcomes were different only in that one was in imperial units, and the other in metric units? These experiments are different, so what exactly is it that they have in common?

We shall call the abstract essential properties of a piece of evidence the evidential meaning, written \text{Ev}(E, x). In the example just given — where we change the units, or perform some other invertible conversion — the two scientists have different experiments and data, but the same evidential meaning.

With this in mind, we are trying to solve two tasks:

  1. In the experimental regime described above, can we mathematically characterise the statistical evidence? To put this in a silly way, can we replace all of the experimental articles in all of the scientific journals with a database of mathematical objects? And as a corollary, when is it correct to say that two bits of evidence are the same; when is \text{Ev}(E, x) = \text{Ev}(E', y)?
  2. Once characterised, how should these objects be interpreted qualitatively?

This article will address the first question. What follows are three principles on evidential meaning which concern exactly the question of when two different experiments should lead to the same conclusions. The first two seem fairly intuitive, and the third is a bit more mysterious.

The Sufficiency Principle (S)

If one has the value of a sufficient statistic, further details are superfluous.

Given a model parameterised by \theta, and some data x from that model, a sufficient statistic t(x) is a function of the data containing everything there is to know about \theta. Often the sample mean is sufficient, which is indeed the case when drawing from a normal distribution with known variance and unknown mean parameter. To be precise, a statistic t(X) is sufficient when, conditional on t(X), the distribution of the data X does not depend on \theta.

The sufficiency principle asserts that no evidential meaning is lost when we alter our experiment to return not the original outcome x, but instead any corresponding sufficient statistic t(x). As a corollary, if we perform an experiment and get two outcomes which have the same sufficient statistic, then our conclusions should be the same. To put this in notation, suppose that E' is the experiment returning the sufficient statistic t, then:

\text{Ev}(E, x) = \text{Ev}(E', t)

As an example, suppose we wish to estimate the bias of a potentially unfair coin. Our experiment is to flip the coin n times and record the results in their exact order. The sufficiency principle states that it would have been just as well to record only the number of heads; the order in which they came up is superfluous.
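To see why the order is superfluous in the coin example, write \theta for the probability of heads and suppose the recorded sequence x contains k heads among the n flips (the symbol k is introduced here for illustration). Assuming independent flips, the likelihood of the exact sequence is:

f(x, \theta) = \theta^{k} (1 - \theta)^{n - k}

The sequence enters only through k, so any two sequences with the same number of heads give the same function of \theta.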

The Conditionality Principle (C)

Experiments which were considered, but not performed, are irrelevant.

Suppose we have a super-experiment which performs one experiment from a possible set of experiments \{E_h : h \in H\}. We select h according to some distribution (a standalone distribution, independent of our parameter) and then perform the corresponding experiment. The statistical evidence in this case is (E_\text{super}, (E_h, x_h)). The conditionality principle asserts that we should come away with the same conclusions as any scientist whose experimental plan was always to perform only E_h, and whose outcome x was the same as x_h. In other words:

\text{Ev}(E_\text{super}, (E_h, x_h)) = \text{Ev}(E_h, x_h)

To add some intuition, suppose a scientist could have used one of two instruments with which to perform an experiment, and the decision will be made by the availability of funding. Sadly the funding is tight, and the scientist has to use the cheap and inferior instrument. The fact that the scientist could have been doing the experiment with the other instrument is surely no longer relevant.

The Likelihood Principle (L)

Experiments with the same likelihood function give the same conclusions.

Consider two scientists, each of whom generates some statistical evidence, (E, x) and (E', y) respectively, each parameterised by the shared parameter \theta. Their experiments and their results are such that, as functions of \theta, the two likelihood functions are the same (up to a constant factor): f(x, \theta) \sim g(y, \theta). The likelihood principle asserts that their conclusions are the same:

\text{Ev}(E, x) = \text{Ev}(E', y)

What is surprising about this principle is not that different data can give the same conclusion, which is already the case for the sufficiency principle, but that the entire experimental design can also be different, so long as the observed data give the same likelihood function. At first glance, this principle is more objectionable than the other two, which seem to be very intuitive.

Example one

Consider a Poisson model for the number of customers visiting a cafe in any single hour: \text{Po}(\lambda). Scientist-A sits in the cafe for an hour and counts the number of customers: only one. Scientist-B comes early the next day, and times the gap between the first and second customer: it takes an hour. Since these two experimental outcomes correspond to the same likelihood function, the likelihood principle asserts that they have the same evidential meaning.
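The two likelihoods can be written out explicitly. This sketch assumes that customers arrive as a Poisson process with rate \lambda per hour, so that the gap between consecutive customers is exponentially distributed with the same \lambda. For Scientist-A, the probability of counting exactly one customer is:

f(1, \lambda) = \frac{\lambda^{1} e^{-\lambda}}{1!} = \lambda e^{-\lambda}

For Scientist-B, the density of a gap of t hours is \lambda e^{-\lambda t}, which at t = 1 gives:

g(1, \lambda) = \lambda e^{-\lambda}

As functions of \lambda the two expressions coincide.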

Example two

Consider again the case of a potentially unfair coin. Scientist-A flips the coin until they get a tail: HHHHT. Scientist-B simply flips five coins and gets four heads and a tail in some unimportant order: HTHHH. Same likelihood, same evidential meaning.
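Writing \theta for the probability of heads and assuming independent flips, the two likelihoods can be sketched as follows. Scientist-A observes four heads followed by the tail that stops the experiment:

f(\text{HHHHT}, \theta) = \theta^{4} (1 - \theta)

Scientist-B, by sufficiency, may record only that four of the five flips were heads:

g(4, \theta) = \binom{5}{4} \theta^{4} (1 - \theta) = 5 \, \theta^{4} (1 - \theta)

The two differ only by the constant factor 5, which does not depend on \theta.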

C + S = L

Although the likelihood principle is less intuitive, it happens to be equivalent to sufficiency plus conditionality (proof is in the appendix). So for the scientist who believes in S and C, we have a solution to our first problem of inference: the substance of any piece of statistical evidence is exactly the resulting likelihood function.

What does this mean? What does this mean for p-values? The observant reader may have noticed that p-values rely not only on what was observed, but also on what could have been observed. Take another look at example two. The experiments have the same likelihood function, but do they have the same p-value? No! Testing the fair coin against a bias towards heads, Scientist-A’s p-value is 1/16 = 6.25%, whereas Scientist-B’s is 6/32 = 18.75%. So under a hypothesis test with a significance level of, say, 10%, in one case we would reject the coin being fair, and in the other we would not.
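The two p-values for example two can be computed exactly. The sketch below assumes a one-sided test of a fair coin against a bias towards heads, with "at least as many heads as observed" as the notion of extremeness; the function names are made up for the illustration.

```python
from math import comb


def p_value_stop_at_first_tail(heads_before_tail, p_heads=0.5):
    """Scientist-A: P(at least this many heads before the first tail)."""
    return p_heads ** heads_before_tail


def p_value_fixed_flips(heads, flips, p_heads=0.5):
    """Scientist-B: P(at least this many heads in a fixed number of flips)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p_a = p_value_stop_at_first_tail(4)   # HHHHT, stopping at the first tail
p_b = p_value_fixed_flips(4, 5)       # four heads in five flips

print("Scientist-A:", p_a)
print("Scientist-B:", p_b)
```

The data and the likelihood function are the same in both cases; only the set of outcomes that could have been observed differs, and that alone changes the p-value by a factor of three.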


Figure 1: “Frequentists vs. Bayesians” strip from xkcd

You may have seen the popular xkcd strip about p-values and the death of the sun (see Figure 1). What is the interpretation under the lens of the Likelihood Principle? For one, it doesn’t matter that the probability of the detector having said yes was 0.027. What matters is how this probability scales as we vary our hypothesis: in this case the sun having exploded or not. In particular, our method should be scale invariant. Whereas for p-values, the fact that 0.027 is 0.027 and not 27, or 0.5, or anything else, is the only thing that matters. It is what we look at to conclude our hypothesis test: “is this number less than 5%?”

Secondly, it doesn’t matter that the detector could have said no. For all we care, the counterfactual could have been the detector running another test, or considering an output of “maybe”. Once we have our observation, all of the counterfactuals go out of the window.


The three principles — of sufficiency, conditionality, and likelihood — help us determine the essential properties of a scientific experiment. And surprisingly the likelihood principle is only as debatable as the other two combined.

The material of this post comes from the paper On The Foundations Of Statistical Inference [†]. The content is theoretical, and it is natural to wonder what the practical applications are. Should we stop using p-values? I expect not: p-values are ubiquitous and extremely useful! However, in my opinion as a likelihoodist, they are incorrect. But then again, so is Newtonian mechanics.


The Likelihood Lemma

We will use the following result: if two outcomes x, x' of the same experiment E admit the same likelihood function, then there exists a sufficient statistic t such that t(x) = t(x'). Then, assuming the principle of sufficiency, we have the corollary:

\text{Ev}(E, x) = \text{Ev}(E, x')

Proof of C + S = L

The proof is short, and goes like this: It is clear that L \implies C + S. Suppose then that (E, x) and (E', y) admit proportional likelihood functions for \theta as described. Consider then the following experiment where we flip a coin, and if heads we perform E, if tails E':

E^* = \frac{1}{2} E + \frac{1}{2} E'

Then by the conditionality principle we have \text{Ev}(E^*, x) = \text{Ev}(E, x), and likewise for (E', y). Since these two outcomes are now from the same experiment and they admit the same likelihood (although each has now changed by a constant factor of \frac{1}{2}), we can use the likelihood lemma to deduce the existence of a sufficient statistic t, for which t(x) = t(y). Then we apply the sufficiency principle to get the following chain:

\text{Ev}(E^*, x) = \text{Ev}(E^*|_t, t) = \text{Ev}(E^*, y)

And therefore \text{Ev}(E, x) = \text{Ev}(E', y).


Optimal Setlist

This is an algorithm I ran for my friend’s wedding to compute the optimal setlist, i.e. the setlist with the minimal number of musician changeovers. I used the Held-Karp algorithm. Link to the code is here.


The algorithm took around 7 seconds. And this time roughly doubles for every song we add (complexity is O(2^n n^2)). It seems I was very lucky with the number of songs chosen. Had we added like five more songs, this would have been intractable! Of course, when the cost is zero, the players can reshuffle their songs to their hearts’ desire, all within their miniset.

Songs Vocals Keys Bass Guitar Drums Cost
First dance JI BS       0
Let’s Stay Together (F)(*) JI BS DJ GW TL 3
Forget You (C) JI BS DJ GW TL 0
Crazy in Love (Dm)* JI JO DJ GW TL 1
Girls Just Wanna Have Fun (E)* JI DA DJ GW TL 1
Backstreets Back (Bbm) JI DA DJ GW TL 0
Master Blaster (Cm)* HN DA DJ GW TL 1
Lady Marmalade (Gm) JI JW DJ GW TL 2
Isn’t She Lovely (E) JI JW DJ GW TL 0
Easy Lover (Fm) JI JW DJ GW AI 1
American Boy (E) JI JO GF DJ JW 2
Señorita (Gb)* JI DA GF GW JW 2
Somebody Else’s Guy (Abm)* HN JW GF GW AI 2
Runaway Baby (Eb) HN JO DH GW JW 2
Superstition (Em)* JI BS DH JS JW 3
Valerie (Eb)* JI BS DH JS AI 1
Respect (C)* JI BS DH JS AI 0
Play That Funky Music (Em)* JI BS DH JS AI 0
I Got You (I Feel Good)* JI BS DH JS AI 0
Ain’t Nobody (Ebm) JI BS DH JS AI 0
Treasure (Ab) JI JO DH JS AI 1
Locked Out Of Heaven (Dm) JI JO DH JS AI 0
Uptown Funk (D)* JI DA DH JS AI 1
Good Times (Em) JI DA DH JS TL 1
End           0
Total           24
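For readers who want to see the shape of the computation, here is a minimal sketch of Held-Karp applied to a setlist. It is not the linked code. It assumes the changeover cost between two songs is the number of musicians who have to come on stage (which is consistent with the Cost column above), and it leaves the first and last song free rather than pinning them. The function names and the three-song toy lineup are invented for illustration.

```python
from itertools import combinations

def changeover_cost(prev_lineup, next_lineup):
    """Assumed cost: musicians in the next song who are not already on stage."""
    return len(set(next_lineup) - set(prev_lineup))

def held_karp_setlist(lineups):
    """Return (total_cost, order) for the cheapest ordering of all songs."""
    n = len(lineups)
    cost = [[changeover_cost(a, b) for b in lineups] for a in lineups]
    # best[(mask, last)] = (cost, previous song) of the cheapest ordering of
    # the songs in bitmask `mask` that finishes on song `last`.
    best = {(1 << i, i): (0, None) for i in range(n)}
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            mask = sum(1 << i for i in subset)
            for last in subset:
                rest = mask ^ (1 << last)
                best[(mask, last)] = min(
                    (best[(rest, k)][0] + cost[k][last], k)
                    for k in subset if k != last
                )
    full = (1 << n) - 1
    total, last = min((best[(full, j)][0], j) for j in range(n))
    order, mask = [], full
    while last is not None:  # follow the stored predecessors backwards
        order.append(last)
        mask, last = mask ^ (1 << last), best[(mask, last)][1]
    return total, order[::-1]

# Toy lineups (hypothetical): song 2 shares a player with each of the others.
lineups = [("JI", "BS", "DJ"), ("HN", "DA", "DJ"), ("JI", "DA", "DJ")]
total, order = held_karp_setlist(lineups)
print(total, order)
```

The dictionary holds one entry per (subset, last song) pair, which is where the 2^n factor in the running time comes from.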

Concluding thoughts

  • The poor trumpet player was not considered in the algorithm. They play on the songs marked with “*”, and this information was simply ignored.
  • I added no cost for a musician who stays on stage but changes instruments.
  • I’m reasonably happy with the way the key changes have organised themselves. Except, of course, the turn of Eb, Em, Eb, which regrettably occurs around the maximal changeover
  • This probably isn’t a complete coincidence

Best Learning Rates

Learning rates are inelegant. As far as I know, the best advice for picking a good one is “start with 0.01, see how it performs, and tweak accordingly.” What if 0.01 is a terrible learning rate; what if it is off by orders of magnitude? In practice we mitigate these concerns by normalising input, and sensibly initialising weights, the intuition being that this keeps the gradient not too many orders of magnitude away from 1.

But still, isn’t there a best learning rate? The learning rate which, when used, will minimise the objective most effectively. Isn’t there a best learning rate for each step? Shouldn’t we use that learning rate? How much better could it be to use that learning rate? For one it would mean one fewer hyperparameter.

A simple algorithm

Here is my proposal for a learning rate schedule: For each step of gradient descent, if increasing the current learning rate by 1% would be better, then do so. Otherwise decrease the learning rate by 1%. The desired behaviour is that regardless of the initial value, the learning rate will random walk its way to the best value.

Sadly this doesn’t eliminate a hyperparameter as was hoped: we must choose an initial learning rate, \alpha_0. However as we will see, the algorithm is more robust to this hyperparameter than to the choice of learning rate in vanilla gradient descent.
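The schedule can be sketched in a few lines. This is a toy version under stated assumptions, not the implementation used in the experiments below: "would be better" is read as "the loss after a step with the increased learning rate is lower than after a step with the current one", the objective is an isotropic bowl chosen because its best step size (0.5) is known, and all names are illustrative.

```python
def loss(theta):
    # Toy objective (an assumption for illustration): an isotropic bowl,
    # for which a learning rate of 0.5 jumps straight to the minimum.
    return sum(t * t for t in theta)

def grad(theta):
    return [2 * t for t in theta]

def adaptive_descent(theta, lr0, n_steps, rate=0.01):
    lr, lr_history = lr0, []
    for _ in range(n_steps):
        g = grad(theta)

        def stepped(a):
            # Parameters after one gradient step with learning rate `a`.
            return [t - a * gi for t, gi in zip(theta, g)]

        # Assumed test for "better": compare the two candidate steps by loss.
        if loss(stepped(lr * (1 + rate))) < loss(stepped(lr)):
            lr *= 1 + rate
        else:
            lr *= 1 - rate
        theta = stepped(lr)
        lr_history.append(lr)
    return theta, lr_history

theta, lr_history = adaptive_descent([3.0, -4.0], lr0=1e-5, n_steps=1500)
print(loss(theta), max(lr_history))
```

Starting five orders of magnitude too low, the learning rate climbs by 1% per step until it reaches the neighbourhood of 0.5, which is the random-walk behaviour described above.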

Example 1

I implemented this optimiser for a single hidden layer MLP on MNIST. One small additional feature I added was to smooth the gradients, so as to avoid crazy updates at gradient spikes. Code can be found here.

Table 1: Hyperparameters
n_epochs 10
hidden_layer 20
batch_size 1000
n_run 10

Table 2: Vanilla gradient descent, with gradient smoothing
lr accuracy_mean accuracy_std loss_mean loss_std
0.000010 21.7% 4.6% 2.42 0.09
0.000022 23.3% 7.3% 2.16 0.11
0.000046 38.8% 10.2% 1.79 0.18
0.000100 51.2% 6.8% 1.42 0.14
0.000215 63.5% 6.9% 1.08 0.10
0.000464 58.0% 12.4% 1.15 0.26
0.001000 41.4% 14.8% 1.55 0.35
0.002154 19.2% 8.0% 2.07 0.18
0.004642 18.9% 12.9% 2.12 0.30
0.010000 11.2% 0.0% 2.30 0.00

Table 3: Proposed new algorithm, with gradient smoothing
lr0 accuracy_mean accuracy_std loss_mean loss_std
0.000010 73.2% 6.1% 0.85 0.15
0.000022 81.5% 5.2% 0.64 0.11
0.000046 80.0% 5.0% 0.67 0.12
0.000100 87.4% 2.3% 0.47 0.08
0.000215 87.5% 2.2% 0.45 0.06
0.000464 82.0% 9.3% 0.58 0.21
0.001000 47.8% 17.7% 1.38 0.38
0.002154 30.5% 19.5% 1.85 0.44
0.004642 12.3% 3.3% 2.27 0.10
0.010000 11.2% 0.0% 2.31 0.03


Figure 1: Log loss for \alpha_0 = 2.15 \times 10^{-3}


Figure 2: Log learning rate for \alpha_0 = 2.15 \times 10^{-3}

We can see that in the best performing case the learning rate doesn’t vary more than a single order of magnitude, and for this reason I find the performance increase surprisingly high.

After increasing to a peak, the learning rate tends to zero. As parameters tend to a minimum, the best learning rate tends to zero. Does this schedule share the property that the learning rate tends to zero as the parameters tend to a minimum?

Example 2

Suppose computing the loss is as computationally expensive as computing the gradient; then this algorithm doubles the computation. We can easily reduce this overhead by updating the learning rate less frequently. I re-ran [Example 1], but updating the learning rate only every ten steps. We still improve on vanilla SGD, but the outperformance is not so stark. However, if we increase the learning rate’s learning rate, lrlr, in this case stepping the learning rate by 10% instead of 1%, then we recover most of the outperformance we had before.
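A sketch of this cheaper variant, again on a toy isotropic bowl rather than MNIST. The names `update_every` and `rate` are hypothetical stand-ins for the update frequency and for lrlr, and the bookkeeping assumes two extra loss evaluations per learning rate update; the real code may count this differently.

```python
def loss(theta):
    return sum(t * t for t in theta)  # toy isotropic bowl (illustrative assumption)

def grad(theta):
    return [2 * t for t in theta]

def descend(theta, lr0, n_steps, rate=0.10, update_every=10):
    lr, extra_loss_evals = lr0, 0
    for step in range(n_steps):
        g = grad(theta)
        if step % update_every == 0:
            # Only on these steps do we pay for the two extra loss evaluations.
            bigger = [t - lr * (1 + rate) * gi for t, gi in zip(theta, g)]
            current = [t - lr * gi for t, gi in zip(theta, g)]
            extra_loss_evals += 2
            lr *= (1 + rate) if loss(bigger) < loss(current) else (1 - rate)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta, lr, extra_loss_evals

theta, final_lr, extra_loss_evals = descend([3.0, -4.0], lr0=1e-3, n_steps=1000)
print(loss(theta), extra_loss_evals)
```

With `update_every=10` the extra loss evaluations drop by a factor of ten, and the larger `rate` lets the learning rate cover the same distance in fewer updates.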

Table 4: Updating the learning rate every ten steps, lrlr = 101%
lr0 accuracy_mean accuracy_std loss_mean loss_std
0.000010 22.1% 7.4% 2.31 0.09
0.000022 32.5% 6.6% 1.94 0.15
0.000046 42.2% 6.4% 1.70 0.18
0.000100 50.9% 8.5% 1.39 0.18
0.000215 68.4% 7.8% 0.98 0.18
0.000464 66.5% 14.6% 0.94 0.28
0.001000 50.4% 20.3% 1.35 0.48
0.002154 30.7% 14.9% 1.81 0.37
0.004642 12.7% 3.2% 2.26 0.09
0.010000 14.6% 5.0% 2.22 0.12

Table 5: Updating the learning rate every ten steps, lrlr = 110%
lr0 accuracy_mean accuracy_std loss_mean loss_std
0.000010 69.5% 8.1% 0.92 0.18
0.000022 79.0% 7.3% 0.70 0.16
0.000046 82.8% 5.6% 0.60 0.11
0.000100 84.5% 3.3% 0.53 0.07
0.000215 84.6% 7.8% 0.53 0.19
0.000464 77.9% 7.5% 0.69 0.20
0.001000 51.5% 17.6% 1.31 0.43
0.002154 25.1% 12.4% 1.95 0.30
0.004642 14.1% 4.6% 2.22 0.13
0.010000 14.7% 8.6% 2.21 0.23


We could see if the performance increase extends to other test beds. However, given how easy this is to implement, I’m already excited to use this going forward.

Optimisers involving some online tuning of hyperparameters can be framed as an RL problem, with the reward at each step being the negative of the loss function. We have presented a simple solution to this specific problem, but we could try other algorithms from the literature. It would then be natural to extend this to multiple hyperparameters; momentum, for example, would be a clear next step.