Defining Causality

Causality is a confusing concept. It seems to be something that we understand intuitively, but in neither maths nor science do we have an agreed-upon technical definition. Part of the problem, as usual, is that we are using one word to describe more than one thing. Here I will discuss forward and backward causality. In the case of forward causality, I want to introduce interventions as a good candidate for an agreed-upon definition.

Forward and Backward Causality

Example: House Fire

Consider the following collection of events: a short circuit in someone’s house creates a spark which sets on fire some nearby curtains. The local firefighters are nowhere to be found, and the house burns down.

In this example, what was the cause of the fire? Was it the short circuit, the curtain, or the absence of firefighters? How can the absence of something be a cause? If it was the short circuit then naturally we must ask, what was the cause of the short circuit? Was it dodgy manufacturing? Or misuse of the product? Essentially the question is this: who is to blame?

This is a question of backward causality. In general, the question goes “what were the inputs that led to this output? And how much did each one contribute?” Answering this question is tricky business. In its truest interpretation, it feels like the answer is always “the big bang did it”. Indeed some philosophers give up on this type of question altogether, and claim that our intuitive notion of causality in this sense is a fiction; our intuitions deceive us, just as they deceive us about space and time. Yet free will, ethics, and the justice system all depend on this concept of backward causality.

A different question is that of forward causality. This is not the causality of credit assignment, but the causality of decision making. Here are some examples: What is the impact on my life if I take up smoking? If my dishwasher was built with a dodgy circuit, how does that affect the probability of my house burning down? These questions sound a bit more approachable, but still it’s not as easy as you might think. Two big problems are the inadequacy of correlation, and the complications which arise from confounding variables.

In summary, a backward question fixes a value of the outcome Y = y and asks about the antecedents X, whereas a forward question specifies or toggles a fixed input X = x and asks about the consequences for some descendant Y. Backward questions seem very hard but, as far as I can tell, the method of interventions is a great way to deal with forward questions. We will describe this method, also known as the causal calculus, below. We will see that it goes beyond correlation and gracefully handles confounders. Furthermore, we might hope that a deeper understanding of forward causality will give us some insight into the confusing world of backward causality.

The Addictive Gene

Let’s look at the simple example given in Figure 1. We suppose the existence of an “addictive gene”, gene X. People with gene X are more susceptible to addictive habits such as smoking and drinking, both of which have an impact on that person’s mortality. Our aim is to interpret and answer the following question: does smoking cause death?


Figure 1: Causal diagram for smoking (observational distribution)

Without getting into too much detail, the causal diagram restricts the type of interactions we can have between these variables. In particular gene X can influence death only via the means of making someone more likely to smoke or drink. Correspondingly, if we know whether someone drinks and/or smokes, then whether they have gene X or not is no longer relevant in determining their mortality.

There are many distributions with this causal diagram, and we shall choose the one with the following parameters.

\mathbb{P}(\text{gene X}) = 50\%
\mathbb{P}(\text{smoke} \mid \text{gene X}), \mathbb{P}(\text{drink} \mid \text{gene X}) = 75\%
\mathbb{P}(\text{smoke} \mid \neg\text{gene X}), \mathbb{P}(\text{drink} \mid \neg\text{gene X}) = 25\%
\mathbb{P}(\text{death} \mid \text{smoke} \,\&\, \neg\text{drink}), \mathbb{P}(\text{death} \mid \neg\text{smoke} \,\&\, \text{drink}) = 50\%
\mathbb{P}(\text{death} \mid \text{smoke} \,\&\, \text{drink}) = 75\%
\mathbb{P}(\text{death} \mid \neg\text{smoke} \,\&\, \neg\text{drink}) = 20\%

An intuitive picture of the resulting distribution is given in Figure 2. I call these “probability bars”; they show that the population is divided into sixteen groups, one for each combination of the four binary variables, and the width of each group represents its probability within the distribution.
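To make the sixteen groups concrete, here is a minimal sketch that enumerates them and computes the width of each bar. It assumes that smoking and drinking are independent of each other once gene X is known, which is how I read the causal diagram; the names `bern` and `observational_joint` are my own and purely illustrative.

```python
from itertools import product

# Parameters from the list above.
P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene) and P(drink | gene)
P_DEATH = {                            # keyed by (smoke, drink)
    (True, True): 0.75,
    (True, False): 0.50,
    (False, True): 0.50,
    (False, False): 0.20,
}

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def observational_joint():
    """Map each group (gene, smoke, drink, death) to its probability."""
    joint = {}
    for gene, smoke, drink, death in product([True, False], repeat=4):
        # Assumption: smoke and drink are independent given the gene.
        joint[(gene, smoke, drink, death)] = (
            bern(P_GENE, gene)
            * bern(P_HABIT[gene], smoke)
            * bern(P_HABIT[gene], drink)
            * bern(P_DEATH[(smoke, drink)], death)
        )
    return joint

# Print the groups from widest to narrowest bar.
for group, width in sorted(observational_joint().items(), key=lambda kv: -kv[1]):
    print(group, f"{width:.4f}")
```

The widths sum to one, so each printed value is the share of the population in that group.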


Figure 2: Probability bars for smoking (observational distribution)


Here is a bad definition of causality:

Bad definition

X is a cause of Y if P(Y|X) > P(Y). That is, observing X increases the likelihood of observing Y relative to the base-rate.

This is a tempting definition because most of the causal relationships we like to imagine do indeed satisfy this relationship. For that reason, I would even argue that it serves as a good proxy for causality. In the above setup, for example, our intuition tells us that smoking is a cause of death, and this definition agrees with that. We can calculate the required probabilities exactly to be:

\mathbb{P}(\text{death}) \approx 48.44\%
\mathbb{P}(\text{death} \mid \text{smoke}) \approx 65.63\%

And we can sanity check these numbers by eyeballing the probability bars and e.g. seeing that around a third of the orange bars are also red.
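For readers who prefer to check by computation rather than by eye, the following sketch sums the relevant group widths. It rests on the same assumption as the parameter list, namely that smoking and drinking are independent given gene X, and its output depends on that assumption. The helper `prob` is a hypothetical name of mine, not part of any library.

```python
from itertools import product

P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(smoke | gene) and P(drink | gene)
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def joint():
    """Observational distribution over (gene, smoke, drink, death)."""
    return {
        (g, s, d, y): bern(P_GENE, g) * bern(P_HABIT[g], s)
        * bern(P_HABIT[g], d) * bern(P_DEATH[(s, d)], y)
        for g, s, d, y in product([True, False], repeat=4)
    }

def prob(event, given=lambda k: True):
    """P(event | given), computed by summing the widths of matching groups."""
    j = joint()
    numerator = sum(p for k, p in j.items() if given(k) and event(k))
    denominator = sum(p for k, p in j.items() if given(k))
    return numerator / denominator

# Group tuples are (gene, smoke, drink, death), so index 1 is smoke, 3 is death.
p_death = prob(lambda k: k[3])
p_death_given_smoke = prob(lambda k: k[3], given=lambda k: k[1])

print(f"P(death)         = {p_death:.4%}")
print(f"P(death | smoke) = {p_death_given_smoke:.4%}")
```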

Thus according to the definition above, smoking is indeed a cause of death. So why is this a bad definition? It might already be clear to you, especially if you are familiar with the phrase “correlation does not imply causation”: what we have here is an artefact of correlation. With a sillier example we can see that the definition is clearly insufficient.

Rain and raincoats


Figure 3: Graphical model for rain and raincoats

Suppose that when it rains we sometimes see raincoats and, independently, sometimes see umbrellas, and that on dry days we see neither. We give the graphical model in Figure 3, the probabilities below, and the probability bars in Figure 4.

\mathbb{P}(\text{rain}) =50\%
\mathbb{P}(\text{raincoats} \mid \text{rain}) =90\%
\mathbb{P}(\text{umbrellas} \mid \text{rain}) =90\%


Figure 4: Probability bars for rain and raincoats

Raincoats only appear when it rains, so we have \mathbb{P}(\text{umbrellas} \mid \text{raincoats}) = 90\%, an increase on the base rate for umbrellas \mathbb{P}(\text{umbrellas}), which stands at 45%. Therefore, according to our definition, raincoats cause umbrellas. Suspicious.
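These two numbers can be reproduced with a few lines of enumeration. The sketch below assumes that neither raincoats nor umbrellas ever appear on a dry day, which is what the 45% base rate implies; the variable names are illustrative only.

```python
from itertools import product

P_RAIN = 0.5
P_COAT = {True: 0.9, False: 0.0}       # assumption: no raincoats on dry days
P_UMBRELLA = {True: 0.9, False: 0.0}   # assumption: no umbrellas on dry days

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

# Observational distribution over (rain, raincoats, umbrellas).
joint = {
    (rain, coat, umb): bern(P_RAIN, rain)
    * bern(P_COAT[rain], coat)
    * bern(P_UMBRELLA[rain], umb)
    for rain, coat, umb in product([True, False], repeat=3)
}

p_umbrellas = sum(p for (_, _, umb), p in joint.items() if umb)
p_raincoats = sum(p for (_, coat, _), p in joint.items() if coat)
p_both = sum(p for (_, coat, umb), p in joint.items() if coat and umb)
p_umbrellas_given_raincoats = p_both / p_raincoats

print(f"P(umbrellas)             = {p_umbrellas:.2%}")
print(f"P(umbrellas | raincoats) = {p_umbrellas_given_raincoats:.2%}")
```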


It’s clear in the raincoats example that we have a confounding variable, i.e. both raincoats and umbrellas have the shared cause of rain. What a nuisance. In fact, in some contexts such a variable is also known as a nuisance variable. How can we account for this? How can we decide whether raincoats cause umbrellas?

Solution: perform an experiment. First we force raincoats into existence, then we measure the impact of this intervention on our output variable, umbrellas. If the intervention gives an increase in the rate of umbrellas then we shall decree that raincoats are a cause of umbrellas.

This type of experiment is called a randomised controlled trial (RCT) and is considered to be the gold standard for causal inference. To carry this out correctly over the course of, say, a month, we would randomly choose some subset of days on which to do nothing, and on the remaining days we would perform our intervention.

We can see that this experiment would reveal that raincoats actually have no impact on umbrellas. But how can we formalise this conclusion?
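Before formalising, it may help to see the experiment run in simulation. The sketch below is only an illustration under the assumptions of the toy model: umbrellas respond to rain alone, nothing appears on dry days, and a fair coin decides which days receive the intervention. The function name `simulate_rct`, the number of days, and the seed are my own choices.

```python
import random

def simulate_rct(n_days=200_000, seed=0):
    """Return {treated: (raincoat rate, umbrella rate)} for both arms."""
    rng = random.Random(seed)
    # For each arm: [days, raincoat days, umbrella days]
    counts = {True: [0, 0, 0], False: [0, 0, 0]}
    for _ in range(n_days):
        rain = rng.random() < 0.5
        treated = rng.random() < 0.5   # coin flip decides the intervention
        # On treated days raincoats are forced; otherwise they follow the weather.
        raincoats = True if treated else (rain and rng.random() < 0.9)
        # Umbrellas respond to rain only, never to raincoats.
        umbrellas = rain and rng.random() < 0.9
        counts[treated][0] += 1
        counts[treated][1] += raincoats
        counts[treated][2] += umbrellas
    return {
        arm: (coats / days, umbs / days)
        for arm, (days, coats, umbs) in counts.items()
    }

rates = simulate_rct()
for arm, (coat_rate, umbrella_rate) in rates.items():
    label = "intervention" if arm else "do nothing  "
    print(f"{label}: raincoats {coat_rate:.3f}, umbrellas {umbrella_rate:.3f}")
```

In this simulation the umbrella rate should come out roughly equal in both arms, even though the raincoat rate differs sharply between them.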

Formalisation of interventions

The subtlety here is that while what we see in the natural world comes from one distribution, questions of forward causality pertain to a different distribution, what’s called the experimental distribution. This is the distribution we get from performing an experiment, such as the one described above with the umbrellas.

We must introduce a retronym for what we have left behind: the observational distribution. This was our first distribution, with probability bars in Figure 2, and it is to be thought of as the natural distribution governing the business-as-usual relationships of these variables. If we were to sample from the real world, then the data would behave as if it were drawn from this distribution.

Returning to the experimental distribution, first we must decide on an experimental variable, and an experimental value. Then we proceed as if we have intervened on our experimental variable, setting it to our experimental value. I.e., we take our causal diagram for the observational distribution and remove all of the arrows going into our experimental variable. Then for the children of this variable, we proceed as if we had observed a value equal to our experimental value.

For the gene X example, our experimental variable is the smoking variable, and our experimental value is “true”. What we get is the result of hypothetically forcing everyone to smoke, regardless of gene X, regardless of whether they drink, regardless of anything! The causal diagram for the resulting experimental distribution is given in Figure 5. Again we can visualise this distribution by looking at the corresponding probability bars, which are given in Figure 6.


Figure 5: Causal diagram for smoking (experimental distribution)


Figure 6: Probability bars for the everyone experimental distribution.

There are in fact two experimental distributions in this case: the first is when we intervene telling everyone that they must smoke, the second is when we tell everyone that they must not smoke. We shall call the case when everyone smokes the “everyone experimental distribution” — probability bars in Figure 6 — and the case when no-one smokes we will call the “no-one experimental distribution” — probability bars in Figure 7.


Figure 7: Probability bars for the no-one experimental distribution.
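The recipe of deleting the arrows into the experimental variable can be sketched directly in code. Here the factor for smoking given gene X is replaced by a constant, while every other factor is left alone. This assumes the same model as before (smoking and drinking independent given gene X), and `experimental_joint` and `marginal` are hypothetical helper names.

```python
from itertools import product

P_GENE = 0.5
P_HABIT = {True: 0.75, False: 0.25}   # P(drink | gene); no longer used for smoke
P_DEATH = {(True, True): 0.75, (True, False): 0.5,
           (False, True): 0.5, (False, False): 0.2}   # keyed by (smoke, drink)

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def experimental_joint(smoke_value):
    """Distribution over (gene, smoke, drink, death) after forcing smoke."""
    joint = {}
    for gene, smoke, drink, death in product([True, False], repeat=4):
        # The arrow gene -> smoke is cut: smoke is a constant, not a draw.
        p_smoke = 1.0 if smoke == smoke_value else 0.0
        joint[(gene, smoke, drink, death)] = (
            bern(P_GENE, gene)
            * p_smoke
            * bern(P_HABIT[gene], drink)
            * bern(P_DEATH[(smoke, drink)], death)
        )
    return joint

def marginal(joint, index):
    """Probability that the variable at the given tuple position is true."""
    return sum(p for key, p in joint.items() if key[index])

everyone = experimental_joint(True)    # the "everyone" experimental distribution
no_one = experimental_joint(False)     # the "no-one" experimental distribution

print("P_everyone(smoke) =", marginal(everyone, 1))
print("P_no-one(smoke)   =", marginal(no_one, 1))
print("P_everyone(gene)  =", marginal(everyone, 0))
print("P_everyone(drink) =", marginal(everyone, 2))
```

Note that the marginals for gene X and drinking are untouched by the intervention; only smoking and its descendants change.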

A better definition of Causality

These new distributions give rise to new probability measures. We could write these as \mathbb{P}_\text{everyone} and \mathbb{P}_\text{no-one}. Then we can calculate for example \mathbb{P}_\text{everyone}(\text{death}) = 62.5\%. But we will opt for some slightly more suggestive notation, making use of the do-operator [1]. The do-operator way to write this expression is as follows: \mathbb{P}(\text{death} \mid \text{\textit{do}}(\text{smoke})) = 62.5\%. The idea is that we are considering the distribution in which we make everyone smoke, or (perhaps more confusingly put) we make everyone do smoking.
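For completeness, here is one way to arrive at the 62.5% figure by hand. It assumes that forcing people to smoke leaves their drinking unchanged, so that drinking keeps its observational marginal; everything else comes straight from the parameter list.

```latex
\begin{aligned}
\mathbb{P}(\text{drink})
  &= 0.5 \times 0.75 + 0.5 \times 0.25 = 0.5 \\
\mathbb{P}(\text{death} \mid \textit{do}(\text{smoke}))
  &= \mathbb{P}(\text{drink})\,\mathbb{P}(\text{death} \mid \text{smoke} \,\&\, \text{drink})
   + \mathbb{P}(\neg\text{drink})\,\mathbb{P}(\text{death} \mid \text{smoke} \,\&\, \neg\text{drink}) \\
  &= 0.5 \times 0.75 + 0.5 \times 0.5 = 0.625
\end{aligned}
```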

So what does this mean for our question: does smoking cause death? We now have the tools to give the intervention definition of causality.

Definition: Causality

We say that X is a cause of Y if \mathbb{P}(Y \mid \text{\textit{do}}(X)) > \mathbb{P}(Y).

Applied to our example of smoking and death, we see that \mathbb{P}(\text{death} \mid \text{\textit{do}}(\text{smoke})) = 62.5\%, which exceeds the base rate \mathbb{P}(\text{death}) calculated earlier. So we still conclude that smoking is a cause of death. What about the example with the raincoats and the umbrellas? Let’s take a look at the probability bars for the experimental distribution in which we force people to wear raincoats, Figure 8.


Figure 8: Probability bars for the everyone wears raincoats experimental distribution.

We can calculate that \mathbb{P}(\text{umbrellas} \mid \text{\textit{do}}(\text{raincoats})) = 45\%, which is identical to the base rate that we have in the observational distribution. That is, raincoats do not cause umbrellas.
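The same check can be written as a short sketch that applies the intervention definition to the raincoat model. As before, it assumes that nothing appears on dry days. The function `is_cause` and the `force_raincoats` parameter are illustrative names of my own, and the small tolerance only guards against floating-point noise.

```python
from itertools import product

P_RAIN = 0.5
P_COAT = {True: 0.9, False: 0.0}       # assumption: no raincoats on dry days
P_UMBRELLA = {True: 0.9, False: 0.0}   # assumption: no umbrellas on dry days

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p

def joint(force_raincoats=None):
    """Distribution over (rain, raincoats, umbrellas).

    With force_raincoats=None this is the observational distribution;
    otherwise the arrow rain -> raincoats is cut and raincoats is fixed.
    """
    out = {}
    for rain, coat, umb in product([True, False], repeat=3):
        if force_raincoats is None:
            p_coat = bern(P_COAT[rain], coat)
        else:
            p_coat = 1.0 if coat == force_raincoats else 0.0
        out[(rain, coat, umb)] = (
            bern(P_RAIN, rain) * p_coat * bern(P_UMBRELLA[rain], umb)
        )
    return out

def p_umbrellas(distribution):
    """Marginal probability of seeing umbrellas."""
    return sum(p for (_, _, umb), p in distribution.items() if umb)

def is_cause(p_do, p_base, tol=1e-12):
    """Intervention definition: X causes Y if P(Y | do(X)) > P(Y)."""
    return p_do > p_base + tol

base_rate = p_umbrellas(joint())
forced_rate = p_umbrellas(joint(force_raincoats=True))

print(f"P(umbrellas)                 = {base_rate:.2%}")
print(f"P(umbrellas | do(raincoats)) = {forced_rate:.2%}")
print("raincoats cause umbrellas?", is_cause(forced_rate, base_rate))
```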


The consideration of experimental distributions gives us a working definition of causality. We have shown that this is a clear improvement over correlation, but the many other merits of this definition remain to be shown.

Next time I would like to discuss the case where we have unknown parameters. In the examples given so far we have assumed complete knowledge of the underlying distribution. But in reality we don’t know \mathbb{P}(\text{death} \mid \text{smoke}). These parameters are hidden from view, and they must be estimated from the data. But data is usually drawn from an observational distribution, so how do we estimate the experimental distributions? How do we estimate \mathbb{P}(\text{death} \mid \text{\textit{do}}(\text{smoke}))? One method is to perform a randomised controlled trial, where we literally go out there and tell some people that they must smoke, and others that they must not. But surprisingly this is not always necessary, and in some cases it is possible to estimate the experimental distributions from observational data alone.



[1] The ideas behind the do-operator were developed in the mid-1990s by Judea Pearl. His textbook on causality covers this material in great detail.
