Matching country names by local alignment

Published on December 11, 2016

The other day, I was looking at the Global Terrorism Database published on Kaggle. I wanted to see if the number of terrorist attacks across countries correlated with the population of those countries. The dataset itself didn’t contain the population figures, so I downloaded The World Bank’s population data and tried to merge the two datasets together.

Integrating datasets from different providers is rarely easy. In this case, there were two issues:

The terrorism database gives data about historic countries, while the World Bank maps everything to modern countries.

The terrorism database tells you about the terror acts performed in the Soviet Union in 1978 or Czechoslovakia in 1972, but the World Bank won’t tell you how many people lived in those countries. Instead, it will tell you how many people lived in what today are Russia, Ukraine, Czech Republic, or Slovakia.

It might be possible to match modern countries and historic ones (e.g. calculate the population of the Soviet Union by adding up the population of all Soviet republics), but I was more interested in modern trends, so I decided to ignore the historic countries.
Some of the modern countries have different spellings. A country may have a common name (Russia, Macedonia) and a formal compound name (the Russian Federation, the former Yugoslav Republic of Macedonia). The terrorism database uses the common names, whereas the World Bank uses the formal names, often with idiosyncratic abbreviations, such as “Macedonia, FYR”.

In this article, I compare several methods of matching alternative country spellings. The winner turns out to be local alignment — a method well known in bioinformatics but for some reason rarely used outside of it.

Fuzzy join

Merging datasets based on ill-defined textual values is such a common task that there is a special R package for it, fuzzyjoin. The fuzzyjoin package treats strings as equal if the distance between them is within a user-specified limit. The string distance can be calculated in ten different ways by the stringdist package.

For the task at hand, we will not use fuzzyjoin. fuzzyjoin’s mode of operation is to have a fixed similarity threshold and treat everything similar enough as equal. Whatever distance limit we pick, for some names it will be too tight and there won’t be a single match, and for other names it will be too lax and we will get more than one match.

It makes sense to expect exactly one matching country in the World Bank data for each country from the Global Terrorism Database. So instead of picking an arbitrary threshold, for every country in the first dataset we will pick the best matching name from the second one.

Distance-based methods

Which of the 10 string distance metrics should we use to match country names? First, I tried the classic Levenshtein distance and a couple of others metrics offered by stringdist. They could guess some of the countries but not that many. In order to compare the performance of different metrics, I had to compile the reference table of true matches. (I used the algorithm’s output as a starting point, so it wasn’t too much work.)

Global Terrorism Database	World Bank
Venezuela	Venezuela, RB
Egypt	Egypt, Arab Rep.
Iran	Iran, Islamic Rep.
West Bank and Gaza Strip	West Bank and Gaza
Syria	Syrian Arab Republic
South Korea	Korea, Rep.
Bahamas	Bahamas, The
Hong Kong	Hong Kong SAR, China
Laos	Lao PDR
Republic of the Congo	Congo, Rep.
Yemen	Yemen, Rep.
Russia	Russian Federation
Ivory Coast	Cote d’Ivoire
Bosnia-Herzegovina	Bosnia and Herzegovina
Brunei	Brunei Darussalam
Macedonia	Macedonia, FYR
Gambia	Gambia, The
North Korea	Korea, Dem. People’s Rep.
Macau	Macao SAR, China
Kyrgyzstan	Kyrgyz Republic
Democratic Republic of the Congo	Congo, Dem. Rep.
East Timor	Timor-Leste

Now let’s see how well each metric from stringdist can reconstruct the truth.

methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
distMatches <- data_frame(method=methods) %>% group_by(method) %>% do({
  guess <- names2[apply(stringdistmatrix(true_matches$name1, names2, .$method), 1, which.min)]
  mutate(true_matches, name2.guess=guess)
}) %>% ungroup

distMatches %>% group_by(method) %>%
  summarise(score=mean(name2==name2.guess)) %>%
  arrange(desc(score))

	method	score
1	jw	0.50
2	cosine	0.36
3	jaccard	0.32
4	soundex	0.27
5	lcs	0.23
6	qgram	0.23
7	dl	0.14
8	lv	0.14
9	osa	0.14
10	hamming	0.00

Here, score is the fraction of countries successfully matched by a given method. The highest-scoring method, Jaro–Winkler distance, got only half of the countries right. Let’s see what mistakes it made:

distMatches %>% filter(method=="jw",name2!=name2.guess)

Global Terrorism Database	World Bank	jw’s guess
Iran	Iran, Islamic Rep.	Ireland
Syria	Syrian Arab Republic	Serbia
South Korea	Korea, Rep.	South Africa
Republic of the Congo	Congo, Rep.	Puerto Rico
Ivory Coast	Cote d’Ivoire	Honduras
Gambia	Gambia, The	Zambia
North Korea	Korea, Dem. People’s Rep.	North America
Macau	Macao SAR, China	Malta
Kyrgyzstan	Kyrgyz Republic	Kazakhstan
Democratic Republic of the Congo	Congo, Dem. Rep.	Dominican Republic
East Timor	Timor-Leste	Ecuador

Why do these distance metrics perform so poorly? In order to transform “East Timor” to “Timor-Leste”, they need to remove 5 characters on the left and append 6 on the right, which makes a total of 11 edits. On the other hand, to transform “East Timor” to “Ecuador”, they keep the leading “E” and trailing “or” and replace only 7 letters in between. These metrics ignore the similarities and focus on the differences.

Local alignment

A similar problem arises in bioinformatics. Let’s say we are interested in the similarities between the human and bee genomes. Naturally, these genomes are quite different, and if we perform global alignment (as in Levenshtein and most other distance metrics), the relatively few conserved genes will be lost in the mismatch noise.

For this reason, bioinformaticians developed algorithms for local alignment, a similarity metric that rewards similarity without punishing for irrelevant differences. In R, a simple local alignment algorithm is included in Biostring’s pairwiseAlignment function.

localMatches <- true_matches %>%
  mutate(name1 %>%
    sapply(function(n)
      pairwiseAlignment(names2, n, type="local", scoreOnly=T) %>% which.max) %>%
    names2[.])
localMatches %>% summarize(score=mean(name2==name2.guess))

The local alignment method guessed 77% of the countries, compared to only 50% countries guessed by the best distance metric. Here are the countries it didn’t guess:

Global Terrorism Database	World Bank	Local alignment’s guess
South Korea	Korea, Rep.	South Asia
Republic of the Congo	Congo, Rep.	Central African Republic
North Korea	Korea, Dem. People’s Rep.	Middle East & North Africa
Democratic Republic of the Congo	Congo, Dem. Rep.	Central African Republic
East Timor	Timor-Leste	East Asia & Pacific (excluding high income)

Every mistake is caused by a lengthy generic qualifier such as South, North, East, or Republic.

Why isn’t local alignment included in the stringdist package? Well, here’s the thing: strictly speaking, it is not a distance. Any distance metric must obey the triangle inequality:

\[d(x,z)\leq d(x,y)+d(y,z).\]

The local alignment score does not satisfy it: the phrase “local alignment” is highly (locally) similar to both “local” and “alignment”, but “local” and “alignment” are not similar at all.

Does this matter? In some applications, perhaps. But for country names matching, it sure looks like local alignment beats any of the “proper” distance metrics.

And yet local alignment didn’t guess everything it could. I experimented briefly with a multi-alignment algorithm, but, at least in this case, it performed only a little bit better.