Roman Cheplyaka

Matching country names by local alignment

Published on December 11, 2016; tags: rstats

The other day, I was looking at the Global Terrorism Database published on Kaggle. I wanted to see if the number of terrorist attacks across countries correlated with the population of those countries. The dataset itself didn’t contain the population figures, so I downloaded The World Bank’s population data and tried to merge the two datasets together.

Integrating datasets from different providers is rarely easy. In this case, there were two issues:

  1. The terrorism database gives data about historic countries, while the World Bank maps everything to modern countries.

    The terrorism database tells you about the terror acts performed in the Soviet Union in 1978 or Czechoslovakia in 1972, but the World Bank won’t tell you how many people lived in those countries. Instead, it will tell you how many people lived in what today are Russia, Ukraine, Czech Republic, or Slovakia.

    It might be possible to match modern countries and historic ones (e.g. calculate the population of the Soviet Union by adding up the population of all Soviet republics), but I was more interested in modern trends, so I decided to ignore the historic countries.

  2. Some of the modern countries have different spellings. A country may have a common name (Russia, Macedonia) and a formal compound name (the Russian Federation, the former Yugoslav Republic of Macedonia). The terrorism database uses the common names, whereas the World Bank uses the formal names, often with idiosyncratic abbreviations, such as “Macedonia, FYR”.

In this article, I compare several methods of matching alternative country spellings. The winner turns out to be local alignment — a method well known in bioinformatics but for some reason rarely used outside of it.

Fuzzy join

Merging datasets based on ill-defined textual values is such a common task that there is a special R package for it, fuzzyjoin. The fuzzyjoin package treats strings as equal if the distance between them is within a user-specified limit. The string distance can be calculated in ten different ways by the stringdist package.

For the task at hand, we will not use fuzzyjoin. fuzzyjoin’s mode of operation is to have a fixed similarity threshold and treat everything similar enough as equal. Whatever distance limit we pick, for some names it will be too tight and there won’t be a single match, and for other names it will be too lax and we will get more than one match.

It makes sense to expect exactly one matching country in the World Bank data for each country from the Global Terrorism Database. So instead of picking an arbitrary threshold, for every country in the first dataset we will pick the best matching name from the second one.

Distance-based methods

Which of the 10 string distance metrics should we use to match country names? First, I tried the classic Levenshtein distance and a couple of others metrics offered by stringdist. They could guess some of the countries but not that many. In order to compare the performance of different metrics, I had to compile the reference table of true matches. (I used the algorithm’s output as a starting point, so it wasn’t too much work.)

Global Terrorism Database World Bank
Venezuela Venezuela, RB
Egypt Egypt, Arab Rep.
Iran Iran, Islamic Rep.
West Bank and Gaza Strip West Bank and Gaza
Syria Syrian Arab Republic
South Korea Korea, Rep.
Bahamas Bahamas, The
Hong Kong Hong Kong SAR, China
Laos Lao PDR
Republic of the Congo Congo, Rep.
Yemen Yemen, Rep.
Russia Russian Federation
Ivory Coast Cote d’Ivoire
Bosnia-Herzegovina Bosnia and Herzegovina
Brunei Brunei Darussalam
Macedonia Macedonia, FYR
Gambia Gambia, The
North Korea Korea, Dem. People’s Rep.
Macau Macao SAR, China
Kyrgyzstan Kyrgyz Republic
Democratic Republic of the Congo Congo, Dem. Rep.
East Timor Timor-Leste

Now let’s see how well each metric from stringdist can reconstruct the truth.

methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
distMatches <- data_frame(method=methods) %>% group_by(method) %>% do({
  guess <- names2[apply(stringdistmatrix(true_matches$name1, names2, .$method), 1, which.min)]
  mutate(true_matches, name2.guess=guess)
}) %>% ungroup

distMatches %>% group_by(method) %>%
  summarise(score=mean(name2==name2.guess)) %>%
method score
1 jw 0.50
2 cosine 0.36
3 jaccard 0.32
4 soundex 0.27
5 lcs 0.23
6 qgram 0.23
7 dl 0.14
8 lv 0.14
9 osa 0.14
10 hamming 0.00

Here, score is the fraction of countries successfully matched by a given method. The highest-scoring method, Jaro–Winkler distance, got only half of the countries right. Let’s see what mistakes it made:

distMatches %>% filter(method=="jw",name2!=name2.guess)
Global Terrorism Database World Bank jw’s guess
Iran Iran, Islamic Rep. Ireland
Syria Syrian Arab Republic Serbia
South Korea Korea, Rep. South Africa
Republic of the Congo Congo, Rep. Puerto Rico
Ivory Coast Cote d’Ivoire Honduras
Gambia Gambia, The Zambia
North Korea Korea, Dem. People’s Rep. North America
Macau Macao SAR, China Malta
Kyrgyzstan Kyrgyz Republic Kazakhstan
Democratic Republic of the Congo Congo, Dem. Rep. Dominican Republic
East Timor Timor-Leste Ecuador

Why do these distance metrics perform so poorly? In order to transform “East Timor” to “Timor-Leste”, they need to remove 5 characters on the left and append 6 on the right, which makes a total of 11 edits. On the other hand, to transform “East Timor” to “Ecuador”, they keep the leading “E” and trailing “or” and replace only 7 letters in between. These metrics ignore the similarities and focus on the differences.

Local alignment

A similar problem arises in bioinformatics. Let’s say we are interested in the similarities between the human and bee genomes. Naturally, these genomes are quite different, and if we perform global alignment (as in Levenshtein and most other distance metrics), the relatively few conserved genes will be lost in the mismatch noise.

For this reason, bioinformaticians developed algorithms for local alignment, a similarity metric that rewards similarity without punishing for irrelevant differences. In R, a simple local alignment algorithm is included in Biostring’s pairwiseAlignment function.

localMatches <- true_matches %>%
  mutate(name1 %>%
      pairwiseAlignment(names2, n, type="local", scoreOnly=T) %>% which.max) %>%
localMatches %>% summarize(score=mean(name2==name2.guess))

The local alignment method guessed 77% of the countries, compared to only 50% countries guessed by the best distance metric. Here are the countries it didn’t guess:

Global Terrorism Database World Bank Local alignment’s guess
South Korea Korea, Rep. South Asia
Republic of the Congo Congo, Rep. Central African Republic
North Korea Korea, Dem. People’s Rep. Middle East & North Africa
Democratic Republic of the Congo Congo, Dem. Rep. Central African Republic
East Timor Timor-Leste East Asia & Pacific (excluding high income)

Every mistake is caused by a lengthy generic qualifier such as South, North, East, or Republic.

Why isn’t local alignment included in the stringdist package? Well, here’s the thing: strictly speaking, it is not a distance. Any distance metric must obey the triangle inequality:

\[d(x,z)\leq d(x,y)+d(y,z).\]

The local alignment score does not satisfy it: the phrase “local alignment” is highly (locally) similar to both “local” and “alignment”, but “local” and “alignment” are not similar at all.

Does this matter? In some applications, perhaps. But for country names matching, it sure looks like local alignment beats any of the “proper” distance metrics.

And yet local alignment didn’t guess everything it could. I experimented briefly with a multi-alignment algorithm, but, at least in this case, it performed only a little bit better.