Homology, similarity, and alignment
Although homology, similarity, and alignment are all relatively well-defined notions in biology and computer science, scientists sometimes confuse their meanings or the relationships between them. The results range from annoyed colleagues to very annoyed colleagues.
So let’s sort these things out.
Definitions
To keep things simple and concrete, I’ll be mostly talking about DNA sequences, but everything here applies equally to RNA or amino acid sequences.
Homology
Homology is a concept from evolutionary biology. It means that two DNA sequences have a common evolutionary origin (Reeck et al., 1987). For example, the mouse gene Hba-a1 is homologous to the human gene HBA1 because both species inherited it from their last common ancestor. Homologous sequences can be divided into paralogous and orthologous sequences, although that distinction won’t be important here.
Sequence similarity
Sequence similarity is a concept from computational biology and computer science. It is a number that quantifies how similar two sequences are. Sequence similarity is sometimes, but not always, defined via sequence distance: the smaller the distance, the more similar the sequences. For instance, the following two DNA sequences are highly similar (they differ only at position 4, where C is replaced by G, and at position 12, where the T is missing from the second sequence):
GTCCTCATAACTCTCTCTAG
GTCGTCATAAC CTCTCTAG
Their similarity is witnessed by their low Levenshtein distance of 2: it only takes two edit operations (one substitution and one deletion) to transform the first sequence into the second one. But there are other distance metrics on sequences, and there are also similarity metrics not expressible as distances. Therefore, while sequence similarity is always a number determined based on two sequences, the specifics of how that number is calculated may vary.
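As a rough illustration, the Levenshtein distance can be computed with a standard dynamic-programming recurrence. The sketch below is a minimal Python version; the function name levenshtein is made up for this example, and all three edit operations are assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # keep a match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("GTCCTCATAACTCTCTCTAG", "GTCGTCATAACCTCTCTAG"))  # 2
```

Only two rows of the table are kept in memory, which is enough when the distance itself is needed and not the edit path.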
Sometimes the similarity score is expressed as a percentage, namely “percent similarity” or “percent identity”. Percent identity usually refers to the ratio of the number of matching residues to the total length of the alignment (see below), e.g. 18/20 = 90% in the example above. See also Li, 2018. Percent similarity counts “similar” residues (usually amino acids) in addition to the identical ones. The similarity between amino acids can be defined either by their chemical properties or by a substitution matrix such as PAM.
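The percent identity calculation can be sketched in a few lines of Python. This assumes the alignment is given as two rows of equal length with gaps written as hyphens, and it uses the "matches divided by alignment length" definition; other definitions use a different denominator. The function name percent_identity is hypothetical.

```python
def percent_identity(top: str, bottom: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both rows hold the same residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    if not top:
        raise ValueError("the alignment is empty")
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != gap)
    return 100.0 * matches / len(top)


print(percent_identity("GTCCTCATAACTCTCTCTAG", "GTCGTCATAAC-CTCTCTAG"))  # 90.0
```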
Alignment
Alignment is another concept from computational biology. It is simply any way to align two sequences one below another, possibly with gaps (Gusfield, 1997). I already showed an example alignment above. Here it is again, this time with the gaps (deletions) indicated by a hyphen (-), as is customary for sequence alignments:
GTCCTCATAACTCTCTCTAG
GTCGTCATAAC-CTCTCTAG
You would think that this is the alignment for these two strings, but according to the definition, the following is also a valid alignment:
GTCCTCAT-AACTCTCTCTAG
GTCGTCATAA-C-CTCTCTAG
And this one, too:
GTCCTCATAACTCTCTCTAG-------------------
--------------------GTCGTCATAACCTCTCTAG
An alignment algorithm usually has a scoring function, which assigns every alignment a numeric score indicating how good an alignment it is, and tries to find the best alignment according to its scoring function.
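To make the idea of a scoring function concrete, here is a minimal Python sketch that scores a given alignment column by column. The function name score_alignment and the particular values (+1 for a match, −1 for a mismatch, −1 for a gap) are assumptions chosen for illustration; real tools use a variety of schemes.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum per-column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # one row has a gap in this column
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


a = "GTCCTCATAACTCTCTCTAG"
b = "GTCGTCATAACCTCTCTAG"
alignments = [
    ("GTCCTCATAACTCTCTCTAG", "GTCGTCATAAC-CTCTCTAG"),
    ("GTCCTCAT-AACTCTCTCTAG", "GTCGTCATAA-C-CTCTCTAG"),
    (a + "-" * len(b), "-" * len(a) + b),  # no residue aligned to any other
]
for top, bottom in alignments:
    print(score_alignment(top, bottom))  # 16, then 13, then -39
```

Under this particular scheme, the three alignments shown above are ranked in the order one would intuitively expect, with the first one scoring highest.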
Relationships
To understand these terms better, it helps to see how they are related to one another.
Homology vs. similarity
Neither homology nor high similarity implies the other. Two sequences can be similar or even identical either by chance (if they are short) or by design (if synthesized). And two homologous loci can quickly diverge in their DNA sequence if they are not constrained by natural selection.
Whereas similarity is a property of the sequences (strings) themselves and does not depend on where they came from, homology is the opposite—a property of how two things (e.g. genes) came to be, regardless of their nucleotide content.
Thus, it does not even make sense to ask whether these two (similar!) sequences are homologous without putting them in the context of the genomes and organisms in which they evolved:
GTCCTCATAACTCTCTCTAG
GTCGTCATAAC CTCTCTAG
In fact, I generated these sequences randomly using R, so I can confidently say that they are not homologous—they simply aren’t objects of evolution.
That said, there are two valid connections between homology and similarity, and they are the source of confusion between these notions:
- Similarity can be used to infer homology. Specifically, if the two similar genomic sequences are long and complex enough that they are unlikely to have arisen independently, even under similar selective pressures, then this is strong evidence for their homology.
- Most of the homologous loci that we observe have high similarity. I emphasized “that we observe” because there is an obvious circularity and bias here: we only know that two regions are homologous when they are similar. If we could somehow sample homologous regions in different organisms in an unbiased way, perhaps we would find much less similarity than we are accustomed to.
Similarity vs. alignment
An alignment can be converted to a similarity score using a scoring function. The simplest scoring function could be “add 1 for every pair of matching characters in the alignment, 0 for every mismatch or space (gap)”. But there are many different scoring functions, and so there isn’t one canonical similarity corresponding to an alignment.
Usually when people talk about an alignment, they mean the optimal (best) alignment relative to a particular scoring function, but we need to remember that an alignment does not have to be optimal.
Similarity could be defined through an alignment—specifically, as the score of the best (highest-scoring) alignment between the two sequences.
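One way to compute such a score without enumerating every alignment is the Needleman-Wunsch dynamic-programming recurrence for global alignment. The Python sketch below assumes a linear gap penalty and illustrative default values (+1 match, −1 mismatch, −1 gap); the function name best_alignment_score is made up for this example.

```python
def best_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the highest-scoring global alignment of a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # a[:i] aligned against nothing but gaps
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(
                diag,               # ca aligned with cb
                prev[j] + gap,      # ca aligned with a gap
                curr[j - 1] + gap,  # cb aligned with a gap
            ))
        prev = curr
    return prev[-1]


print(best_alignment_score("GTCCTCATAACTCTCTCTAG", "GTCGTCATAACCTCTCTAG"))  # 16
```

With match=0, mismatch=-1, and gap=-1, the result is the negative of the Levenshtein distance, which shows how a distance-based and an alignment-based similarity can coincide.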
Similarity could also be defined in many other ways—e.g. through a locality-sensitive hash (Ondov et al., 2016).
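As a sketch of what an alignment-free similarity can look like: the method of Ondov et al. uses MinHash to estimate the Jaccard index between the sets of k-mers of two sequences. The Python code below computes that Jaccard index exactly, without any hashing, so it illustrates the quantity being estimated and not the MinHash technique itself. The function names and the default k are assumptions.

```python
def kmers(seq, k):
    """Set of all substrings of length k (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a, b, k=4):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        raise ValueError("both sequences are shorter than k")
    return len(ka & kb) / len(ka | kb)


print(jaccard_similarity("ATATATG", "ATATG", k=2))  # 1.0
```

Note that the two sequences in the usage line are not identical, yet their 2-mer sets are, so this measure reports the maximum similarity. Different definitions of similarity can disagree.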
Alignment vs. homology
If two regions are homologous, they can be aligned in a way that reflects their evolutionary history. Let’s call an alignment in which aligned positions have a common origin an evolutionary alignment.
The evolutionary alignment does not have to be the optimal alignment with respect to any given scoring function. And vice versa, the optimal alignment (such as one calculated by an alignment program) does not necessarily reflect the evolutionary relationship between the sequences.
Let’s consider an example to illustrate this. In the ancestral genome, there is a sequence ATATATG. In one of the lineages, the second and third bases got deleted, resulting in ATATG. The evolutionary pairwise alignment of the ancestor and the lineage is
ATATATG
A--TATG
Note that, although this is an optimal alignment, it is not a unique optimal alignment. Assuming a simple scoring function that assigns 0 to matches and −1 to mismatches, insertions, and deletions, the alignments
ATATATG
AT--ATG
and
ATATATG
--ATATG
have the same score of −2, even though they do not represent the evolutionary relationship between the sequences.
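The non-uniqueness can be checked by enumerating every optimal alignment. The Python sketch below does this with a memoized recursion over suffixes followed by a traceback that follows all optimal moves. It uses the scoring function from the text (0 for a match, −1 for a mismatch or gap); the function name optimal_alignments is hypothetical, and since the number of optimal alignments can grow very quickly, the approach is only meant for short sequences.

```python
from functools import lru_cache


def optimal_alignments(a, b, match=0, mismatch=-1, gap=-1):
    """Return (best score, sorted list of all optimal global alignments)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for aligning the suffixes a[i:] and b[j:].
        if i == len(a):
            return (len(b) - j) * gap
        if j == len(b):
            return (len(a) - i) * gap
        pair = match if a[i] == b[j] else mismatch
        return max(best(i + 1, j + 1) + pair,
                   best(i + 1, j) + gap,
                   best(i, j + 1) + gap)

    def walk(i, j, top, bottom):
        # Extend the partial alignment along every move that stays optimal.
        if i == len(a) and j == len(b):
            yield top, bottom
            return
        if i < len(a) and j < len(b):
            pair = match if a[i] == b[j] else mismatch
            if best(i + 1, j + 1) + pair == best(i, j):
                yield from walk(i + 1, j + 1, top + a[i], bottom + b[j])
        if i < len(a) and best(i + 1, j) + gap == best(i, j):
            yield from walk(i + 1, j, top + a[i], bottom + "-")
        if j < len(b) and best(i, j + 1) + gap == best(i, j):
            yield from walk(i, j + 1, top + "-", bottom + b[j])

    return best(0, 0), sorted(walk(0, 0, "", ""))


score, found = optimal_alignments("ATATATG", "ATATG")
print(score)  # -2
for top, bottom in found:
    print(top, bottom)
```

For this pair the enumeration finds five optimal alignments, including the three shown above; the remaining two place the gap at ATA--TG and ATAT--G.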
Also note that this alignment does not necessarily tell us about the exact path by which we arrived from ATATATG to ATATG. What if there were two mutational events: first, AT got duplicated into ATAT, and then that ATAT got deleted? It would still result in the same alignment.
Now let’s consider another lineage, where the fifth base mutated to G, resulting in ATATGTG. The optimal alignment between the two lineages is
ATATG--
ATATGTG
with the score of −2. But this alignment does not reflect the evolutionary history, because the two aligned Gs do not have the same origin. Instead, the evolutionarily correct alignment is
A--TATG
ATATGTG
having a suboptimal score of −3. This shows that the optimal alignment and the evolutionary alignment do not have to coincide.
To summarize, an alignment by itself does not imply any evolutionary relationship between the sequences. The two aligned sequences do not have to be homologous, and even if they are, the alignment does not necessarily show which bases have a common origin, although an alignment can be used for this purpose. An alignment is just a way to visualize the differences and similarities between two sequences; if there is any extra meaning in a particular alignment, it has to be given explicitly.
Acknowledgments
Thanks to Nika Gurianova for the feedback on an earlier draft of this article.