Estimating the fraction of recurring events

Published on

Sometimes you have a collection of events and the number of times each of them occurred. Each event is either unique, i.e. occurring only once, or recurring. You don’t know which events are unique or recurring, but you want to estimate the fraction of recurring events.

I’ll give two examples, one from everyday life and one from bioinformatics.

Everyday life example. Let’s say you’re a small business (a barbershop, a community theater, an electrical contractor etc.) People have to call you in order to reserve seats or make appointments, so you know the names of all your past customers and how many times they ordered. Naturally, some clients enjoyed the experience and will become your recurring customers, while others didn’t like it and will never return. You’d like to know how many recurring customers you have.

Why not do the obvious thing and calculate the fraction of people who ordered twice or more? Because some fraction of the customers who have not reordered are also recurring customers—they just haven’t had a chance or need to reorder yet. Of course, based on our data, we cannot possibly tell which of those customers will reorder; but we can nevertheless estimate their fraction.

Bioinformatics example. (Feel free to skip this paragraph if you’re not into bioinformatics!) You are looking at ChIP-seq peaks in \(n\) replicates. For each peak, you can calculate how many replicates the peak was found in. You want to estimate the number of replicable peaks. Again, while you could just count the peaks that were found in 2 or more replicates, some of the replicable peaks are expected to have only occurred once, and it would be a shame to miss them.

The key to this problem is the following observation. The set of unique events (customers who ordered once or peaks present in only one replicate) is “contaminated”: it consists of a mixture of recurring and non-recurring events. But the set of events that occured two or more times must only consist of recurring events. We can use these definitely recurring events to estimate the rate with which events recur, and from there find the expected number of recurring events that only occurred once.

To be more specific, let’s denote the total number of recurring events as \(m\), and the number of recurring events that occurred \(k\) times as \(m_k\) \((k=1,2,\ldots)\). Then we don’t know \(m_1\) (because we don’t know how many truly unique events there are), but we do know \(m_2, m_3, \ldots\) What we would like to estimate is \(m = m_1 + m_2 + \ldots\) (Notice that I didn’t include \(m_0\) here; see below on why.)

The natural models for \(m_k\) are the binomial and Poisson distributions. In the ChIP-seq example, where there is a clear upper bound on \(m_k\)—the number of replicates, \(n\)—the binomial model is more appropriate. The binomial model could also be applied in the case of a community theater that has only given a small number of performances to date. In other cases, when the upper bound is absent or large, the Poisson model is a better choice. There are other models one could apply here (including a negative binomial model or even nonparametric models), but I won’t cover them here; see Bunge et al. (2014) for inspiration.

Regardless of the model, we expect \(m_k\) for high \(k\) to be less informative because their chances of being non-zero become very low. So it is natural to use \(m_2\) and \(m_3\) for estimation. (Because the negative binomial distribution has an extra parameter, one would also need an extra data point, \(m_4\). This is why the negative binomial model should only be used when the total number of observations, \(\sum k\cdot m_k\), is large.)

The binomial model

In the binomial model with parameters \(n\) and \(\theta\), the probabilities of \(k=2\) and \(k=3\) repetitions for a particular event are:

\[\begin{align*} p\{k=2\} = {n\choose 2} \theta^2(1-\theta)^{n-2}; \\ p\{k=3\} = {n\choose 3} \theta^3(1-\theta)^{n-3}. \end{align*}\]

Dividing the second equation by the first gives

\[ \frac{p\{k=3\}}{p\{k=2\}}=\frac{n-2}3\frac{\theta}{1-\theta}, \]

from which we get

\[ \theta = \frac{1}{1+\frac{n-2}3\frac{p\{k=2\}}{p\{k=3\}}}. \]

Next, we observe that \(p\{k=2\}/p\{k=3\}\) can be estimated as \(m_2/m_3\), so

\[ \theta \approx \frac{1}{1+\frac{n-2}3\frac{m_2}{m_3}}. \]

Then our quantity of interest is

\[ \begin{split} m & = m_1+m_2+\ldots+m_n \\ & = \frac{m_2}{p\{k=2\}}\cdot p\{k\geq 1\}\\ & = \frac{2m_2(1-\theta^n)}{n(n-1)\theta^2(1-\theta)^{n-2}}. \end{split} \]

Let’s test this using a simulation.

set.seed(2018)
# number of unique events
u <- 40000
# number of recurring events (to be re-calculated later)
m <- 30000
# n, the maximum occurrences of a single event
n <- 15
# rate of occurrence of an event
theta <- 0.23
# occurrence data
occs <- rbinom(m, n, theta)
# remove zero occurrences, as they are unobservable
occs <- occs[occs != 0]
(m <- length(occs))
## [1] 29406
# mix in the non-recurring events
occs <- sample(c(occs, rep(1, u)))
# m_k, number of events that occurred k times
(m_k <- table(occs))
## occs
##     1     2     3     4     5     6     7     8     9    10    11 
## 42689  5581  7310  6367  4199  2139   800   235    66    17     3
# now, apply our estimation algorithm:
(theta_est <- 1/(1+(n-2)/3 * m_k[["2"]]/m_k[["3"]]))
## [1] 0.2321052
(m_est <- (2*m_k[["2"]]*(1-theta_est**n)) /
        (n*(n-1)*theta_est**2*(1-theta_est)**(n-2)))
## [1] 30565.44

So we estimated the occurrence rate \(\theta\) as 0.232 (true value 0.23) and the number of observed recurring events as 30565 (true value 29406). Pretty good. For smaller sample sizes, the estimates will be not as accurate but hopefully still useful.

The Poisson model

The derivation is analogous for the Poisson model.

\[\begin{align*} p\{k=2\} = \frac{\lambda^2e^{-\lambda}}2; \\ p\{k=3\} = \frac{\lambda^3e^{-\lambda}}6; \end{align*}\] \[ \frac{p\{k=3\}}{p\{k=2\}}=\frac{\lambda}3; \] \[ \begin{split} \lambda & = 3\cdot \frac{p\{k=3\}}{p\{k=2\}} \\ & \approx 3\cdot\frac{m_3}{m_2}; \end{split} \] \[ \begin{split} m & = m_1+m_2+\ldots+m_n \\ & = \frac{m_2}{p\{k=2\}}\cdot p\{k\geq 1\}\\ & = \frac{2m_2(e^{\lambda}-1)}{\lambda^2}. \end{split} \]

Simulation:

set.seed(2018)
# number of unique events
u <- 40000
# number of recurring events (to be re-calculated later)
m <- 30000
# rate of occurrence of an event
lambda <- 1.6
# occurrence data
occs <- rpois(m, lambda)
# remove zero occurrences, as they are unobservable
occs <- occs[occs != 0]
(m <- length(occs))
## [1] 23882
# mix in the non-recurring events
occs <- sample(c(occs, rep(1, u)))
# m_k, number of events that occurred k times
(m_k <- table(occs))
## occs
##     1     2     3     4     5     6     7     8     9 
## 49744  7669  4090  1667   521   145    31    12     3
# now, apply our estimation algorithm:
(lambda_est <- 3 * m_k[["3"]]/m_k[["2"]])
## [1] 1.599948
(m_est <- (2*m_k[["2"]]*(exp(lambda_est)-1)) / lambda**2)
## [1] 23682.68

Again, extremely accurate for large sample sizes: we estimated the occurrence rate as 1.599948 (true value 1.6) and the number of observed recurring events as 23682.68 (true value 23882).

What about \(m_0\)?

In the above calculations, I did not include \(m_0\) (the numer of unobserved but recurring events) in the definition of \(m\). While it is trivial to include \(m_0\), there are good reasons not to.

The first reason is practical: \(m\) as defined can be used to estimate the fraction of recurring events (vs. unique ones); just divide \(m\) by the total number of observed events.

The other reason is theoretical: it is not even clear what \(m_0\) means. I like how Bunge et al. (2014) put it:

Consider the following metaphor or allegory (Böhning & Schön 2005). We go to a freeway interchange and record the numbers of occupants, including the drivers, of the passing cars, for a fixed period of time, e.g., during the daylight hours of a given day. In principle, we could then extract frequency count data: \(f_1\) = the number of cars with one occupant, \(f_2\) = the number with two, and so forth. From this data, we could in principle compute a statistical estimate of \(f_0\) , the number of cars with zero occupants, i.e., cars not presently being driven, and hence the total number of cars in existence, but (apart from many other potential objections to this procedure) cars in existence where? That is, what is the target population? Is it all cars that might pass by our interchange, which could be all cars in North America, or just some local subset thereof? The answer is unclear. The reader may object that no one would carry out such a study, but it is not too different from the ocean-sampling example discussed above. In that setting, what is the microbial population under study? All microbial organisms that may enter the 1-liter collection bottle, or some more restricted subpopulation?

References

A very similar problem arises in microbial ecology; an excellent review on this is:

John Bunge, Amy Willis, Fiona Walsh. (2014) Estimating the Number of Species in Microbial Diversity Studies.