Homology, similarity, and alignment

Although homology, similarity, and alignment are all relatively well defined notions in biology and computer science, scientists sometimes get confused about their meaning or the relationships between them. The results range from annoyed colleagues to very annoyed colleagues.

So let’s sort these things out.

Definitions

To keep things simple and concrete, I’ll be mostly talking about DNA sequences, but everything here applies equally to RNA or amino acid sequences.

Homology

Homology is a concept from evolutionary biology. It means that two DNA sequences have a common evolutionary origin (Reeck et al., 1987). For example, the mouse gene Hba-a1 is homologous with the human gene HBA1 because both species inherited it from their last common ancestor. Homologous sequences can be divided into paralogous and orthologous sequences, although that distinction won’t be important here.

Sequence similarity

Sequence similarity is a concept from computational biology and computer science. It is a number that quantifies how similar two sequences are. Sequence similarity is sometimes, but not always, defined via sequence distance: the smaller the distance, the more similar the sequences. For instance, the following two DNA sequences are highly similar (they differ by one substitution and by one base that is missing from the second sequence, shown here as a blank):

GTCCTCATAACTCTCTCTAG
GTCGTCATAAC CTCTCTAG

Their similarity is witnessed by their low Levenshtein distance of 2: it takes only two edit operations (one substitution and one deletion) to transform the first sequence into the second one. But there are other distance metrics on sequences, and there are also similarity measures not expressible as distances. Therefore, while sequence similarity is always a number determined based on two sequences, the specifics of how that number is calculated may vary.
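
The distance of 2 can be checked with a short dynamic-programming sketch. This is a minimal illustration of the standard Levenshtein recurrence, not a reference to any particular library; the function name `levenshtein` is made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


first = "GTCCTCATAACTCTCTCTAG"
second = "GTCGTCATAACCTCTCTAG"
print(levenshtein(first, second))  # 2
```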

Sometimes the similarity score is expressed as a percentage, namely “percent similarity” or “percent identity”. Percent identity usually refers to the ratio of the number of matching residues to the total length of the alignment (see below), e.g. 18/20 = 90% in the example above, although other definitions are in use (Li, 2018). Percent similarity counts “similar” residues (usually amino acids) in addition to the identical ones. The similarity between amino acids can be defined either by their chemical properties or based on a substitution matrix such as PAM.
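
As a rough sketch, the 18/20 figure can be reproduced from the two aligned rows. The function name `percent_identity` and the convention that gaps are written as hyphens are assumptions made for this example; real tools may use a different denominator.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matching columns divided by alignment length, as a percentage."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    # A column counts as a match only if both rows hold the same residue
    # and that residue is not a gap.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * matches / len(row_a)


print(percent_identity("GTCCTCATAACTCTCTCTAG",
                       "GTCGTCATAAC-CTCTCTAG"))  # 90.0
```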

Alignment

Alignment is another concept from computational biology. It is simply any way to write two sequences one below the other, possibly with gaps (Gusfield, 1997). I already showed an example alignment above. Here it is again, this time with the gaps (deletions) indicated by a hyphen (-), as is customary for sequence alignments:

GTCCTCATAACTCTCTCTAG
GTCGTCATAAC-CTCTCTAG

You might think that this is the only alignment of these two strings, but according to the definition, the following is also a valid alignment:

GTCCTCAT-AACTCTCTCTAG
GTCGTCATAA-C-CTCTCTAG

And this one, too:

GTCCTCATAACTCTCTCTAG-------------------
--------------------GTCGTCATAACCTCTCTAG

An alignment algorithm usually has a scoring function, which assigns every alignment a numeric score indicating how good an alignment it is, and tries to find the best alignment according to its scoring function.

Relationships

To understand these terms better, it helps to see how they are related to one another.

Homology vs. similarity

Neither homology nor high similarity implies the other. Two sequences can be similar or even identical either by chance (if they are short) or by design (if synthesized). And two homologous loci can quickly diverge in their DNA sequence if they are not constrained by natural selection.

Whereas similarity is a property of the sequences (strings) themselves and does not depend on where they came from, homology is the opposite—a property of how two things (e.g. genes) came to be, regardless of their nucleotide content.

Thus, it does not even make sense to ask whether these two (similar!) sequences are homologous without putting them in the context of the genomes and organisms in which they evolved:

GTCCTCATAACTCTCTCTAG
GTCGTCATAAC CTCTCTAG

In fact, I generated these sequences randomly using R, so I can confidently say that they are not homologous—they simply aren’t objects of evolution.

That said, there are two valid connections between homology and similarity, and they are the source of confusion between these notions:

  1. Similarity can be used to infer homology. Specifically, if two similar genomic sequences are long and complex enough that they are unlikely to have arisen independently, even under similar selective pressures, then this is strong evidence for their homology.
  2. Most of the homologous loci that we observe have high similarity. I emphasized “that we observe” because there is an obvious circularity and bias here: we only know that two regions are homologous when they are similar. If we could somehow sample homologous regions in different organisms in an unbiased way, perhaps we would find much less similarity than we are accustomed to.

Similarity vs. alignment

An alignment can be converted to a similarity score using a scoring function. The simplest scoring function could be “add 1 for every pair of matching characters in the alignment, 0 for every mismatch or space (gap)”. But there are many different scoring functions, and so there isn’t one canonical similarity corresponding to an alignment.
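
To make this concrete, here is a small sketch that scores the three example alignments from the Alignment section under two scoring functions. The first is the “simplest” one described above; the second (+1 for a match, −1 for a mismatch or gap) is a hypothetical alternative chosen only for illustration. The function name `score_alignment` is also made up.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Sum per-column scores of an alignment whose gaps are written as '-'."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


alignments = [
    ("GTCCTCATAACTCTCTCTAG", "GTCGTCATAAC-CTCTCTAG"),
    ("GTCCTCAT-AACTCTCTCTAG", "GTCGTCATAA-C-CTCTCTAG"),
    ("GTCCTCATAACTCTCTCTAG" + "-" * 19, "-" * 20 + "GTCGTCATAACCTCTCTAG"),
]
for row_a, row_b in alignments:
    simple = score_alignment(row_a, row_b)  # matches only
    penalized = score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1)
    print(simple, penalized)
# 18 16
# 17 13
# 0 -39
```

The same alignment yields different numbers under the two schemes, so a score is only meaningful together with the scoring function that produced it.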

Usually when people talk about an alignment, they mean the optimal (best) alignment relative to a particular scoring function, but we need to remember that an alignment does not have to be optimal.

Similarity could be defined through an alignment—specifically, as the score of the best (highest-scoring) alignment between the two sequences.
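
A minimal sketch of this definition follows, using the textbook global-alignment recurrence (in the style of Needleman and Wunsch) to find the best score without enumerating alignments. The function name `best_alignment_score` and the default parameter values are assumptions for this example, not part of any specific tool.

```python
def best_alignment_score(a: str, b: str,
                         match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Score of the highest-scoring global alignment of a and b."""
    # prev[j] is the best score for aligning the previous prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # a[:i] aligned against nothing: i gap columns
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(
                diagonal,            # align ca with cb
                prev[j] + gap,       # align ca with a gap
                curr[j - 1] + gap,   # align cb with a gap
            ))
        prev = curr
    return prev[-1]


first = "GTCCTCATAACTCTCTCTAG"
second = "GTCGTCATAACCTCTCTAG"
print(best_alignment_score(first, second))  # 18
```

With the default “count the matches” scheme, the result for the two example sequences is 18, the number of matching columns in the first alignment shown earlier.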

Similarity could also be defined in many other ways—e.g. through a locality-sensitive hash (Ondov et al., 2016).
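
As one alignment-free illustration, the sketch below computes the exact Jaccard index of the two sequences’ k-mer sets. This is a simplified stand-in rather than a locality-sensitive hash: Mash estimates this kind of quantity from MinHash sketches instead of computing it exactly. The function names and the choice of k = 4 are assumptions for this example.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    union = kmers_a | kmers_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(kmers_a & kmers_b) / len(union)


first = "GTCCTCATAACTCTCTCTAG"
second = "GTCGTCATAACCTCTCTAG"
print(jaccard_similarity(first, second))
```

No alignment is computed at any point, yet the result is still a number determined by the two sequences.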

Alignment vs. homology

If two regions are homologous, they can be aligned to reflect their evolutionary history. Let’s call an alignment in which aligned positions have a common origin an evolutionary alignment.

The evolutionary alignment does not have to be the optimal alignment with respect to any given scoring function. And vice versa, the optimal alignment (such as one calculated by an alignment program) does not necessarily reflect the evolutionary relationship between the sequences.

Let’s consider an example to illustrate this. In the ancestral genome, there is a sequence ATATATG. In one of the lineages, the second and third bases got deleted, resulting in ATATG. The evolutionary pairwise alignment of the ancestor and the lineage is


ATATATG
A--TATG

Note that, although this is an optimal alignment, it is not the only optimal alignment. Assuming a simple scoring function that assigns 0 to matches and −1 to mismatches, insertions, and deletions, the alignments


ATATATG
AT--ATG

and


ATATATG
--ATATG

have the same score of −2, even though they do not represent the evolutionary relationship between the sequences.

Also note that the evolutionary alignment does not necessarily tell us the exact path by which ATATATG became ATATG. What if there were two mutational events: first, AT got duplicated into ATAT, and then that ATAT got deleted? The result would still be the same alignment.

Now let’s consider another lineage, where the fifth base mutated to G, resulting in ATATGTG. The optimal alignment between the two lineages is


ATATG--
ATATGTG

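with a score of −2. But this alignment does not reflect the evolutionary history, because the two aligned Gs do not have the same origin: the G in the first lineage descends from the seventh ancestral base, whereas the G aligned with it in the second lineage arose by mutation of the fifth ancestral base. Instead, the evolutionarily correct alignment is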


A--TATG
ATATGTG

having a suboptimal score of −3. This shows that the optimal alignment and the evolutionary alignment do not have to coincide.
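
Both examples in this section can be checked with a sketch that enumerates every optimal global alignment under the scoring function used above (0 for a match, −1 for a mismatch, insertion, or deletion). It fills a standard dynamic-programming table and then follows every traceback path that preserves the optimal score. The function name `optimal_alignments` is made up for this example.

```python
def optimal_alignments(a: str, b: str,
                       match: int = 0, mismatch: int = -1, gap: int = -1):
    """Return (best score, sorted list of all optimal (row_a, row_b) alignments)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    def trace(i: int, j: int):
        """Yield every optimal alignment of a[:i] with b[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                for row_a, row_b in trace(i - 1, j - 1):
                    yield row_a + a[i - 1], row_b + b[j - 1]
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            for row_a, row_b in trace(i - 1, j):
                yield row_a + a[i - 1], row_b + "-"
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            for row_a, row_b in trace(i, j - 1):
                yield row_a + "-", row_b + b[j - 1]

    return score[n][m], sorted(trace(n, m))


# Ancestor versus the first lineage: several alignments share the best score.
best, found = optimal_alignments("ATATATG", "ATATG")
print(best, len(found))
for row_a, row_b in found:
    print(row_a, row_b)

# The two lineages: the evolutionary alignment is not among the optimal ones.
best2, found2 = optimal_alignments("ATATG", "ATATGTG")
print(best2, ("A--TATG", "ATATGTG") in found2)
```

For the first pair, the evolutionary alignment appears in the list alongside other alignments with the same score; for the second pair, it does not appear at all.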

To summarize, an alignment by itself does not imply any evolutionary relationship between the sequences. The two aligned sequences do not have to be homologous, and even if they are, the alignment does not necessarily show which bases have a common origin, although an alignment can be used for this purpose. An alignment is just a way to visualize the differences and similarities between two sequences; if there is any extra meaning in a particular alignment, it has to be given explicitly.

Acknowledgments

Thanks to Nika Gurianova for the feedback on an earlier draft of this article.

References

Reeck, GR, de Haën, C, Teller, DC, Doolittle, RF, Fitch, WM, Dickerson, RE, Chambon, P, McLachlan, AD, Margoliash, E, Jukes, TH (1987). “Homology” in proteins and nucleic acids: a terminology muddle and a way out of it. Cell, 50, 5:667.

Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press.

Ondov, BD, Treangen, TJ, Melsted, P, Mallonee, AB, Bergman, NH, Koren, S, Phillippy, AM (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol., 17, 1:132.

Li, H. (2018). On the definition of sequence identity.