Published on November 9, 2017

Programs used in bioinformatics vary a lot in how efficient their algorithms and implementations are. Different programs may take from minutes to years on the same data set.

As a bioinformatician, you must have some idea about how long the program will run. If the estimated time is too high, you may request more resources (more and faster computers) or even decide not to run it at all, rather than keep it running for two weeks and then kill it without getting any results.

You can estimate the run time of a program in three steps:

- Collect the data
- Build a model
- Extrapolate

In this article, I will demonstrate this technique on a simple example: estimate how long it takes bowtie2 to index the third human chromosome.

First, we download and unpack the chromosome’s nucleotide sequence:

```
% wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_GRCh38.p7_chr3.fa.gz
% gunzip hs_ref_GRCh38.p7_chr3.fa.gz
```

In order to collect some data, we need to run the program. But if we run the program on the full data set, it may take too long to complete — the very problem we are trying to avoid.

Therefore, we run the program on small subsets of the whole data set. For instance, if we are working with many short sequences of similar lengths, we may generate files consisting of the first \(n\) sequences, as I explain here.

In our example, however, we have a few very long sequences, so we will use the following awk script (which we’ll put in the file `head_fasta.awk`

) to extract the first `limit`

nucleotides:

```
/^>/ {
print;
next;
}
{
if ((l=length($0)) < limit) {
print;
limit -= l;
} else {
print(substr($0, 0, limit));
exit;
}
}
```

Next, we write a shell script (`bench.sh`

) that will try several numbers of nucleotides, generate a file of that size, run `bowtie2-build`

, and record the time it took to complete.

```
#!/bin/bash
# We rely on bash to get custom-formatted 'time' output.
# Otherwise, this script should be portable across shells.
# change the format of 'time' output to be the number of wall-clock seconds
TIMEFORMAT='%R'
# Print the csv header. 'n' stands for the number of nucleotides, 't' stands for
# run time in seconds.
printf "n,t\n"
# iterate over 1M, 2M, ..., 10M nucleotides
for n in `seq 1000000 1000000 10000000` # start, step, end
do
# extract the first n nucleotides
awk -v limit="$n" -f head_fasta.awk hs_ref_GRCh38.p7_chr3.fa > sample.fa
# run bowtie2-build, noting the time
# ignore the bowtie's output, and capture time's output from stderr
t=$(time (bowtie2-build --noauto -f sample.fa sample >/dev/null 2>&1) 2>&1)
# print the csv row
printf "%d,%f\n" "$n" "$t"
done
```

There are a few things you should take into account when choosing the range of values for \(n\):

- Obviously, the values shouldn’t be too high, so that data gathering doesn’t take too long.
- The values also shouldn’t be too low; otherwise, the measurements will be noisy and dominated by the fixed costs like program start time. I usually try to come up with a range for which the first (fastest) iteration completes in about a second.
- The range of values shouldn’t be too narrow compared to the values themselves. This is so that any non-linear behavior can show itself. For instance, the width of range \(101, 102, \ldots, 110\) is \(110-101=9\), which is less than \(10\%\) compared to the values themselves. Instead, I usually pick the range \(n, 2n, \ldots, 10n\), whose width is nine times bigger than the smallest value.
- The number of values shouldn’t be too low, so that a regression model can be fit. Ten is a good starting value.

The script will produce the output in the csv format; we use `tee`

to look at it and write it to the file:

`% ./bench.sh | tee time.csv`

Here’s the output I got on my laptop:

```
n,t
1000000,0.718000
2000000,1.439000
3000000,1.827000
4000000,3.115000
5000000,3.235000
6000000,4.390000
7000000,5.248000
8000000,7.153000
9000000,7.065000
10000000,8.664000
```

Let’s plot our data points. We do this to see how the run time grows with \(n\). Most tools designed to work with big data sets will run in a linear or almost linear time, but it’s better to check.

To load and plot the data, we run the following R code:

```
library(readr)
library(ggplot2)
time <- read_csv("time.csv")
ggplot(time, aes(n,t)) + geom_point() + geom_smooth(method="lm",se=F)
```

The blue line is the linear regression line. The fact that the middle points are below the regression line is a sign of a faster-than-linear growth. But the non-linearity is too subtle to be quadratic; so it’s probably \(O(n\log n)\).

To build a model with correlated predictors (\(n\) and \(n\log n\) in this case), it’s better to have more data. Let’s rerun our benchmark script, `bench.sh`

, with a smaller step size of 100000 and tee the output to `time2.csv`

.

Now we fit the model

\[t = a + b\cdot n + c\cdot n\log n\]

and plot it along the data points. As a rule of thumb, if you include higher-order terms (like \(n\log n\) or \(n^2\)), include all lower-order terms (like \(n\)) as well. Note that `lm`

adds the intercept term \(a\) automatically.

```
library(dplyr)
time <- read_csv("time2.csv")
nlm <- lm(t ~ n + I(n*log(n)), data=time)
time <- mutate(time, model=predict(nlm))
ggplot(time) + geom_point(aes(n,t)) + geom_line(aes(n,model), color="blue")
```

The model with an \(n\log n\) term gives a more realistic fit, so we’ll stick with it.

Let’s find out how many nucleotides are in the whole chromosome:

```
% awk '/^[^>]/{n+=length($0)} END{print(n)}' hs_ref_GRCh38.p7_chr3.fa
202786747
```

Now we simply plug this number into our R model:

`predict(nlm, newdata=list(n=202786747))`

`332.2153`

So, our prediction is 332 seconds, or 5.5 minutes.

To check this prediction, I actually left `bowtie2-build`

running for a while. It completed in 350 seconds — within 5% of our prediction!

On the other hand, if we went with the linear model, our prediction would have been 176 seconds — a severe underestimation by a factor of 2. The model with \(n \log n\) but without the linear term would have predicted 206 seconds — only somewhat better than the linear model.