Probability of a regex occurrence
Published on ; updated on
Given a regular expression \(R\) and a random string \(L\) of length \(n\), what is the probability that \(R\) matches somewhere in \(L\)?
This seems like an intimidating problem at first, but I was surprised how easy it becomes after a couple of insights.
The first insight is to turn a regex into a DFA. This moves us from “how can I possibly know whether a regex matches?” to “feed the string to the automaton and see if it ends up in an accepting state”.
Still, the DFA can be anything at all — how do we know the probability of ending up in an accepting state? The second insight is to ascribe probabilities to the edges (transitions) of the DFA and observe that this turns it into a Markov chain1.
Why is that helpful? Because a Markov chain can be represented by its transition matrix \(P\), and the probability of getting from any state \(a\) to any state \(b\) can be computed in logarithmic time by computing \(P^n_{ab}\), the \(a,b\)-th element of the \(n\)-th power of the matrix \(P\).
So how do we assign the probabilities to DFA transitions? If all characters are equally likely, \(P_{ab}\) is simply the number of characters that move the DFA from state \(a\) to state \(b\) divided by the number of all possible characters. More generally, if every letter \(l\) has a fixed probability \(p_l\), we compute \[ P_{ab}=\sum_{a\overset{l}{\rightarrow} b}p_l, \] where the summation is over all characters \(l\) that lead from \(a\) to \(b\).
And to get the probability of the regex \(R\) matching anywhere inside the string as opposed to matching the whole string, replace \(R\) with \({.}{*}R{.}{*}\) before compiling it to a DFA. Then calculate
\[ \sum_{f\in F} P^n_{sf}, \] where \(s\) is the start state of the DFA, and \(F\) is the set of accepting states.
Here’s a simple implementation of this algorithm. (The library code supports full regular expressions over arbitrary finite alphabets, but the command line tool only accepts simple IUPAC patterns over the DNA alphabet.)
References
There is nothing new or original here — specialists know how to solve this and even more involved problems. For these more involved problems, generating functions are typically used. It’s nice, however, that for this specific case simple matrix exponentiation suffices.
For example, Nicodème, Salvy and Flajolet (2002) show how to use generating functions to calculate the probability distribution of the number of occurrences of a regex, both overlapping and non-overlapping, as well as the mean and variance of that number. They also derive the asymptotics and handle (first-order) Markov dependencies between the consequtive positions in \(L\). But for our task — the probability of a regex occurring at least once — their method essentially reduces to the one described here.
Régnier (2000) also uses generating functions but explicitly models possible overlaps between patterns, which allows her to analyze regular expressions more efficiently than Nicodème et al. (see page 20).
Filion 2017 is a very accessible introduction to the paradigm of transfer graphs, which are a bridge between DFAs, transition matrices, and generating functions.