This post accompanies my article in The Conversation on the numbers R and K as used in epidemiology. When preparing that article, I found that there is no good, understandable account of what K is that contains enough mathematical detail. There are many popular articles which do not explain the maths behind the number, and there are scientific papers (like an excellent 2005 Nature paper) which might not be as approachable.

The number K (more traditionally called ‘k’) is a parameter in the distribution of secondary cases caused by an infected individual. It is a bit unfortunate that under the adopted convention, the smaller the value of K, the higher the dispersion. It is not easy to estimate, as it requires detailed data on the size of an outbreak and the number of primary cases which caused this outbreak.

To start with, we will assume that the outbreak is caused by a single infected individual in a completely susceptible population. This primary case results in a certain number, $Z$, of secondary cases. Each of these new cases in turn produces new cases, and so the outbreak might continue.

While such a procedure of “imagining” outcomes of repeated situations is common in statistics, it is actually not far from reality, particularly at the early stages of the epidemic, called a “stuttering phase”. An infected person might come into the UK from China, and then another person from Italy, and another, and another… If for each such person we manage to track and trace the contacts, we can count how many secondary cases they caused to occur.

In the simplest case, $Z$ is the same for each person. As the reproductive number, R, is the mean number of secondary cases, in this case $Z = R$ for every person. For example, for $R = 3$, each infected primary case will result in 3 new secondary cases. If there is not much overlap between the cases and there are many susceptible individuals, in the next round there will be 9, then 27, etc. secondary cases, as in the diagram below.

In reality, the value of $Z$ will be different for each person, and the differences might be attributed to chance – a person might by chance contact somebody, or not, or might by chance shed more virus on a particular day. In probability theory, we describe $Z$ as a random variable, and the graph below shows an example.

Random variables are dealt with in probability theory and statistics. Fundamental to their description is the concept of a probability distribution, i.e. the information about the probability of a given outcome, $P(Z = z)$. In epidemiology, we most often describe $Z$ in terms of a Poisson distribution so that

$$P(Z = z) = \frac{\nu^{z} e^{-\nu}}{z!}$$

with the mean $\nu$ (using the notation of the 2005 paper by Lloyd-Smith et al.). The Poisson distribution is used to describe the number of events occurring within a fixed period of time if these events occur with a known constant mean rate and independently of the time since the last event. This agrees with the picture of a person who is infected for a certain period of time; events are then associated with passing on the virus. Here $\nu$ represents the mean infection potential of a single infected individual, the ‘individual reproductive number’.

If everybody is the same and has the same value of $\nu = R$, where $R$ is the reproductive number, then the distribution of new cases, $Z$, will simply be Poisson, as shown in the graph below.

In such a case, if $Z$ follows the Poisson distribution, most values of $Z$ are not too far from the mean, which is given by R, as seen in Figure 3. Of course, for a small sample size (like in Figure 2) we could expect some differences, but generally, we do not expect large variability. In fact, the variance of the Poisson distribution is equal to its mean, $R$.
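As a quick numerical check of the statement that the Poisson mean and variance coincide, here is a minimal sketch using only the Python standard library. The value R = 3 is taken from the example earlier in the post; the function name and the truncation point of the sum are my own choices for illustration.

```python
import math

def poisson_pmf(z, mean):
    """Probability of exactly z secondary cases when Z ~ Poisson(mean)."""
    return math.exp(-mean) * mean**z / math.factorial(z)

R = 3.0  # illustrative value, matching the example used in the text

# Truncate the sums at z = 59; the remaining tail is negligible for R = 3.
pmf = [poisson_pmf(z, R) for z in range(60)]
mean = sum(z * p for z, p in enumerate(pmf))
variance = sum((z - mean) ** 2 * p for z, p in enumerate(pmf))

print(f"mean = {mean:.4f}, variance = {variance:.4f}")
print(f"P(Z = 0) = {pmf[0]:.4f}, P(Z >= 10) = {sum(pmf[10:]):.5f}")
```

The last line prints how rarely a Poisson-distributed individual infects nobody or infects ten or more people, which is the "not much variability" picture described above.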

However, when studying real outbreaks, people started noticing that this is often not the case. Instead, many primary cases lead to very few if any secondary cases, so that for most people $Z$ is zero or close to it. In contrast, some individuals manage to spread the disease to many secondary cases, so that the corresponding values of $Z$ are large. How could we still keep the picture underlying the Poisson distribution of individual events popping up over time, but account for much larger variability?

Lloyd-Smith and others proposed that $\nu$ – the ‘individual’ reproduction number, i.e. the expected number of secondary cases *this* primary case produces – is itself a random variable, i.e. it varies from person to person (in probability theory we use capital letters for random variables and small letters for ordinary numbers). They proposed that $\nu$ is distributed according to a gamma distribution with mean R and dispersion parameter K,

$$f(\nu) = \frac{(K/R)^{K}}{\Gamma(K)}\, \nu^{K-1} e^{-K\nu/R},$$

where $\Gamma$ is the gamma function. The plot below shows some examples of the gamma distribution with $R = 3$.

For large values of $K$, the distribution is concentrated around the mean value, $R$, so it represents the case when individuals themselves have very similar potential for infection. For small values of $K$, the distribution is heavily skewed – most individuals do not have the potential to pass on the infection at all ($\nu \approx 0$), whereas some individuals have a huge potential to infect – note that this is all about the *potential* to infect, not the actual number of secondary cases.
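To make the effect of the dispersion parameter on the infection potential concrete, the sketch below draws individual reproductive numbers from a gamma distribution with mean R and shape K (so scale R/K) using the standard library. The values of K, the sample size, the threshold of 0.5 and the "top 20%" summary are all my own illustrative choices, not taken from the figures.

```python
import random

def sample_nu(R, K, n, rng):
    """Draw n individual reproductive numbers from a gamma distribution
    with mean R and shape (dispersion) parameter K, i.e. scale R / K."""
    return [rng.gammavariate(K, R / K) for _ in range(n)]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
R = 3.0                 # illustrative mean, as in the earlier example
summary = {}
for K in (0.1, 1.0, 10.0):
    nu = sorted(sample_nu(R, K, 100_000, rng), reverse=True)
    mean_nu = sum(nu) / len(nu)
    # Share of individuals with almost no infection potential (arbitrary cut-off).
    share_low = sum(v < 0.5 for v in nu) / len(nu)
    # Share of the total infection potential held by the top 20% of individuals.
    top20 = sum(nu[: len(nu) // 5]) / sum(nu)
    summary[K] = (mean_nu, share_low, top20)
    print(f"K = {K:>4}: mean = {mean_nu:.2f}, "
          f"share with nu < 0.5 = {share_low:.2f}, "
          f"potential held by top 20% = {top20:.2f}")
```

The mean stays near R for every K; what changes is how unevenly the potential is shared out between individuals.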

To get the number of secondary cases we now need to combine the gamma distribution and the Poisson distribution, the latter corresponding to the process of actually creating secondary cases. Fortunately, probability theory has a helping hand here – the combination of the gamma and the Poisson distribution has a closed formula and is known as the negative binomial distribution,

$$P(Z = z) = \frac{\Gamma(K + z)}{z!\,\Gamma(K)}\, p^{K} (1 - p)^{z},$$

with

$$p = \frac{K}{K + R},$$

mean and variance

$$E[Z] = R, \qquad \mathrm{Var}(Z) = R\left(1 + \frac{R}{K}\right).$$

The following plot gives some examples for different values of K (as in Figure 4).
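The gamma-Poisson combination can be checked by simulation. The following sketch draws an infection potential from a gamma distribution with mean R and shape K, then a Poisson number of secondary cases with that mean, and compares the observed frequencies with the negative binomial formula in its standard mean-and-dispersion form. The parameter values, function names and the Poisson sampler (counting unit-rate exponential waiting times) are assumptions made for this illustration.

```python
import math
import random

def neg_binomial_pmf(z, R, K):
    """P(Z = z) for a negative binomial with mean R and dispersion K."""
    p = 1.0 / (1.0 + R / K)
    log_coeff = math.lgamma(K + z) - math.lgamma(K) - math.lgamma(z + 1)
    return math.exp(log_coeff + K * math.log(p) + z * math.log(1.0 - p))

def sample_poisson(mean, rng):
    """Poisson draw: count unit-rate exponential gaps that fit into 'mean'."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def sample_secondary_cases(R, K, rng):
    """Two-step draw: nu ~ gamma(mean R, shape K), then Z ~ Poisson(nu)."""
    nu = rng.gammavariate(K, R / K)
    return sample_poisson(nu, rng)

rng = random.Random(2)          # fixed seed for reproducibility
R, K, n = 3.0, 0.5, 200_000     # illustrative values only
draws = [sample_secondary_cases(R, K, rng) for _ in range(n)]
for z in range(5):
    simulated = draws.count(z) / n
    print(f"z = {z}: simulated {simulated:.4f}, "
          f"formula {neg_binomial_pmf(z, R, K):.4f}")
```

The two columns should agree to within sampling noise, which is all the "closed formula" claim amounts to in practice.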

A couple of observations. Firstly, the variance-to-mean ratio for the negative binomial distribution is

$$\frac{\mathrm{Var}(Z)}{E[Z]} = 1 + \frac{R}{K},$$

so that it approaches 1 for very large $K$ (as for the largest value of $K$ in the figures). This is the case for the Poisson distribution shown above. The low overdispersion (a variance-to-mean ratio equal or close to 1) means that in most outbreaks the number of secondary cases resulting from one primary case is close to R.

Secondly, as $K$ becomes small, the overdispersion grows. For $K = 0.1$ and $R = 3$, the variance is more than 30 times larger than the mean and so there is a huge variation between outbreak sizes. Most people do not pass on the infection and so most primary cases cause no subsequent chains. But there are some who will cause many secondary cases and so some outbreaks will be massive. This is better visible if we plot the negative binomial distribution on the log-linear scale, as in Figure 6.

The black points (joined by a line to ‘guide the eye’) represent the Poisson case. With $R = 3$, the most likely values are 2 and 3, each with a probability of about 0.22; the probability drops rapidly as $Z$ increases. In contrast, for both red points ($K = 1$, a geometric distribution) and blue points ($K = 0.1$), 0 is the most likely outcome (i.e. no secondary cases), but the probability does not decline as fast as in the Poisson case for high values of $Z$. The decline for large values of $Z$ is slower for low values of $K$ (blue points).
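The two observations above can be put into numbers with a few lines of Python. This sketch assumes R = 3 and a handful of illustrative K values; it uses the standard negative binomial results that the variance-to-mean ratio is 1 + R/K and that the probability of no secondary cases is (1 + R/K)^(-K).

```python
def variance_to_mean(R, K):
    """Variance-to-mean ratio of a negative binomial with mean R, dispersion K."""
    return 1.0 + R / K

def prob_no_secondary_cases(R, K):
    """P(Z = 0) for the negative binomial: (1 + R/K)^(-K)."""
    return (1.0 + R / K) ** (-K)

R = 3.0  # illustrative value
for K in (0.1, 1.0, 10.0, 1000.0):
    print(f"K = {K:>6}: variance/mean = {variance_to_mean(R, K):7.3f}, "
          f"P(Z = 0) = {prob_no_secondary_cases(R, K):.3f}")
```

As K grows, the ratio falls towards 1 and the probability of zero secondary cases falls towards the Poisson value exp(-R).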

The case with low K can be illustrated as below:

I actually made up the numbers in Figure 2 and Figure 7, aiming to show ‘nice’ examples of superspreaders; although the mean number of cases is lower than 3 in Figure 7, this is to be expected – overdispersed distributions have a large variation in numbers for small sample sizes. For some real-life examples, you can see here and here.

How can we measure K? It is actually quite difficult to estimate, as it is necessary to attribute the secondary cases to the actual number of the primary ones (a large outbreak can be an effect of one superspreader, or many ordinary infected people).

Moreover, the negative binomial distribution simply tells us about the distribution of one-step-ahead transmission, i.e. the transmission from the first case to the second generation – but of course, the second-generation cases could transmit the disease further, introducing their own variability in the number of new cases. This can be seen in Figure 7, where the ‘red’ index case produces 16 ‘yellow’ cases, of which 8 fail to produce any next-generation case, 4 produce 1, 3 produce 2 and 1 produces 3 new cases. But one of the ‘third-generation cases’ (green) caused another large outbreak, with 11 ‘fourth-generation cases’ (blue). Thus, the negative binomial distribution on its own is not a good way to predict how many infected individuals are in a cluster after many generations.
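The generation-by-generation picture can be sketched as a simple branching simulation. The code below is an illustration under my own assumptions (R = 3, K = 0.1, four generations, 2000 repeated outbreaks, unlimited susceptibles); every case independently draws its number of offspring from the gamma-Poisson mixture.

```python
import random

def sample_offspring(R, K, rng):
    """Secondary cases of one individual: nu ~ gamma, then Z ~ Poisson(nu)."""
    nu = rng.gammavariate(K, R / K)
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < nu:  # count unit-rate arrivals in the interval [0, nu]
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_generations(R, K, n_generations, rng):
    """Return case counts per generation, starting from one index case."""
    sizes = [1]
    for _ in range(n_generations):
        new_cases = sum(sample_offspring(R, K, rng) for _ in range(sizes[-1]))
        sizes.append(new_cases)
        if new_cases == 0:  # the chain of transmission has died out
            break
    return sizes

rng = random.Random(3)  # fixed seed for reproducibility
R, K = 3.0, 0.1         # illustrative values only
totals = sorted(sum(simulate_generations(R, K, 4, rng)) for _ in range(2000))
extinct_at_once = sum(t == 1 for t in totals) / len(totals)

print(f"outbreaks that stop at the index case: {extinct_at_once:.2f}")
print(f"median total = {totals[len(totals) // 2]}, largest total = {totals[-1]}")
```

Most simulated outbreaks consist of the index case alone, while a few grow large, which is the pattern the text attributes to low K.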

However, there is a bit of magic, again provided by probability theory, which says that the probability that an index case generates a cluster of size $j$ is given by

$$P(\text{cluster size} = j) = \frac{\Gamma(Kj + j - 1)}{\Gamma(Kj)\,\Gamma(j + 1)} \frac{(R/K)^{j-1}}{(1 + R/K)^{Kj + j - 1}}.$$

While a mouthful, this formula can be used to estimate the value of $K$, given $R$, from the available cluster sizes. Alternatively, simulations are used to compare the data with models with different values of $K$ to find the best-fitting one.
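As a rough sketch of how such an estimate might work, the code below uses the standard final-size formula for a branching process with negative binomial offspring, P(j) = Γ(Kj + j − 1) / (Γ(Kj) Γ(j + 1)) · (R/K)^(j−1) / (1 + R/K)^(Kj + j − 1), and a simple grid search for the K that maximises the likelihood. The cluster sizes are synthetic, simulated with made-up values R = 0.8 and K = 0.3 (R below 1 so that every cluster ends); none of this is real outbreak data, and a real analysis would also need to deal with unobserved cases and uncertainty in R.

```python
import math
import random

def cluster_size_log_pmf(j, R, K):
    """log P(cluster size = j) for offspring with mean R and dispersion K."""
    return (math.lgamma(K * j + j - 1) - math.lgamma(K * j) - math.lgamma(j + 1)
            + (j - 1) * math.log(R / K)
            - (K * j + j - 1) * math.log(1.0 + R / K))

def simulate_cluster_size(R, K, rng):
    """Total size of one outbreak; guaranteed to end only when R < 1."""
    total = active = 1
    while active:
        new_cases = 0
        for _ in range(active):
            nu = rng.gammavariate(K, R / K)
            elapsed = rng.expovariate(1.0)
            while elapsed < nu:  # Poisson(nu) via unit-rate arrivals
                new_cases += 1
                elapsed += rng.expovariate(1.0)
        total += new_cases
        active = new_cases
    return total

def estimate_K(sizes, R, grid):
    """Grid-search maximum likelihood estimate of K for a known R."""
    return max(grid, key=lambda K: sum(cluster_size_log_pmf(j, R, K) for j in sizes))

rng = random.Random(4)        # fixed seed for reproducibility
R_true, K_true = 0.8, 0.3     # synthetic, illustrative values
sizes = [simulate_cluster_size(R_true, K_true, rng) for _ in range(3000)]
grid = [0.05 * 1.05 ** i for i in range(100)]  # K from 0.05 to about 6
K_hat = estimate_K(sizes, R_true, grid)

print(f"largest simulated cluster: {max(sizes)}")
print(f"true K = {K_true}, estimated K = {K_hat:.3f}")
```

With a few thousand simulated clusters the estimate should land near the value used to generate them; with the handful of clusters usually available in practice, the uncertainty is far larger.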

However, it should be stressed that while the estimation of R is not trivial but has been done in different ways and quite reliably, the estimation of K for COVID-19 is difficult and different groups come up with different results. All of them, still, point to a highly over-dispersed character of the transmission, the point discussed in more detail in The Conversation.

**Post-scriptum:** The Atlantic published an excellent overview of the number K (k) in October 2020.