Matching country names by local alignment
The other day, I was looking at the Global Terrorism Database published on Kaggle. I wanted to see if the number of terrorist attacks across countries correlated with the population of those countries. The dataset itself didn’t contain population figures, so I downloaded the World Bank’s population data and tried to merge the two datasets.
Integrating datasets from different providers is rarely easy. In this case, there were two issues:
1. The terrorism database gives data about historic countries, while the World Bank maps everything to modern countries.

   The terrorism database tells you about the terror acts performed in the Soviet Union in 1978 or Czechoslovakia in 1972, but the World Bank won’t tell you how many people lived in those countries. Instead, it will tell you how many people lived in what today are Russia, Ukraine, the Czech Republic, or Slovakia.

   It might be possible to match modern countries with historic ones (e.g. calculate the population of the Soviet Union by adding up the populations of all Soviet republics), but I was more interested in modern trends, so I decided to ignore the historic countries.

2. Some of the modern countries have different spellings. A country may have a common name (Russia, Macedonia) and a formal compound name (the Russian Federation, the former Yugoslav Republic of Macedonia). The terrorism database uses the common names, whereas the World Bank uses the formal names, often with idiosyncratic abbreviations, such as “Macedonia, FYR”.
In this article, I compare several methods of matching alternative country spellings. The winner turns out to be local alignment — a method well known in bioinformatics but for some reason rarely used outside of it.
Fuzzy join
Merging datasets based on ill-defined textual values is such a common task that there is a special R package for it, fuzzyjoin. The fuzzyjoin package treats strings as equal if the distance between them is within a user-specified limit. The string distance can be calculated in ten different ways by the stringdist package.
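For illustration, a threshold-based fuzzy join might look like this (a sketch: the data frames gtd and wb and their column names are hypothetical):

```r
library(fuzzyjoin)

# Hypothetical sketch: join the two datasets, treating country names
# within Levenshtein distance 2 of each other as equal.
matched <- stringdist_inner_join(
  gtd, wb,
  by = c("country" = "country_name"),
  method = "lv", max_dist = 2
)
```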
For the task at hand, we will not use fuzzyjoin. fuzzyjoin’s mode of operation is to have a fixed similarity threshold and treat everything similar enough as equal. Whatever distance limit we pick, for some names it will be too tight and there won’t be a single match, and for other names it will be too lax and we will get more than one match.
It makes sense to expect exactly one matching country in the World Bank data for each country from the Global Terrorism Database. So instead of picking an arbitrary threshold, for every country in the first dataset we will pick the best matching name from the second one.
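The best-match idea itself fits in a few lines; here is a minimal sketch (best_match is a hypothetical helper, and names2 stands for the vector of World Bank country names):

```r
library(stringdist)

# Return the candidate with the smallest distance to the query name.
best_match <- function(name, candidates, method = "lv") {
  candidates[which.min(stringdist(name, candidates, method = method))]
}

best_match("Venezuela", names2)  # the World Bank name closest to "Venezuela"
```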
Distance-based methods
Which of the ten string distance metrics should we use to match country names? First, I tried the classic Levenshtein distance and a couple of other metrics offered by stringdist. They could guess some of the countries, but not that many. In order to compare the performance of different metrics, I had to compile a reference table of true matches. (I used the algorithm’s output as a starting point, so it wasn’t too much work.)
Global Terrorism Database | World Bank |
---|---|
Venezuela | Venezuela, RB |
Egypt | Egypt, Arab Rep. |
Iran | Iran, Islamic Rep. |
West Bank and Gaza Strip | West Bank and Gaza |
Syria | Syrian Arab Republic |
South Korea | Korea, Rep. |
Bahamas | Bahamas, The |
Hong Kong | Hong Kong SAR, China |
Laos | Lao PDR |
Republic of the Congo | Congo, Rep. |
Yemen | Yemen, Rep. |
Russia | Russian Federation |
Ivory Coast | Cote d’Ivoire |
Bosnia-Herzegovina | Bosnia and Herzegovina |
Brunei | Brunei Darussalam |
Macedonia | Macedonia, FYR |
Gambia | Gambia, The |
North Korea | Korea, Dem. People’s Rep. |
Macau | Macao SAR, China |
Kyrgyzstan | Kyrgyz Republic |
Democratic Republic of the Congo | Congo, Dem. Rep. |
East Timor | Timor-Leste |
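In the code below, this reference table lives in a data frame called true_matches, with columns name1 (Global Terrorism Database) and name2 (World Bank). A sketch of its assumed shape, showing only the first rows:

```r
library(dplyr)

# Assumed shape of the reference table (first three rows shown).
true_matches <- data_frame(
  name1 = c("Venezuela", "Egypt", "Iran"),
  name2 = c("Venezuela, RB", "Egypt, Arab Rep.", "Iran, Islamic Rep.")
)

# names2 is assumed to be the character vector of all World Bank country
# names; for a self-contained demo, the matched names can stand in for it.
names2 <- true_matches$name2
```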
Now let’s see how well each metric from stringdist can reconstruct the truth.
<- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
methods <- data_frame(method=methods) %>% group_by(method) %>% do({
distMatches <- names2[apply(stringdistmatrix(true_matches$name1, names2, .$method), 1, which.min)]
guess mutate(true_matches, name2.guess=guess)
%>% ungroup
})
%>% group_by(method) %>%
distMatches summarise(score=mean(name2==name2.guess)) %>%
arrange(desc(score))
rank | method | score |
---|---|---|
1 | jw | 0.50 |
2 | cosine | 0.36 |
3 | jaccard | 0.32 |
4 | soundex | 0.27 |
5 | lcs | 0.23 |
6 | qgram | 0.23 |
7 | dl | 0.14 |
8 | lv | 0.14 |
9 | osa | 0.14 |
10 | hamming | 0.00 |
Here, score is the fraction of countries successfully matched by a given method. The highest-scoring method, Jaro–Winkler distance, got only half of the countries right. Let’s see what mistakes it made:
%>% filter(method=="jw",name2!=name2.guess) distMatches
Global Terrorism Database | World Bank | jw’s guess |
---|---|---|
Iran | Iran, Islamic Rep. | Ireland |
Syria | Syrian Arab Republic | Serbia |
South Korea | Korea, Rep. | South Africa |
Republic of the Congo | Congo, Rep. | Puerto Rico |
Ivory Coast | Cote d’Ivoire | Honduras |
Gambia | Gambia, The | Zambia |
North Korea | Korea, Dem. People’s Rep. | North America |
Macau | Macao SAR, China | Malta |
Kyrgyzstan | Kyrgyz Republic | Kazakhstan |
Democratic Republic of the Congo | Congo, Dem. Rep. | Dominican Republic |
East Timor | Timor-Leste | Ecuador |
Why do these distance metrics perform so poorly? In order to transform “East Timor” to “Timor-Leste”, they need to remove 5 characters on the left and append 6 on the right, which makes a total of 11 edits. On the other hand, to transform “East Timor” to “Ecuador”, they keep the leading “E” and trailing “or” and replace only 7 letters in between. These metrics ignore the similarities and focus on the differences.
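We can verify this arithmetic with stringdist (the values agree with the edit counts above):

```r
library(stringdist)

stringdist("East Timor", "Timor-Leste", method = "lv")  # 11 edits
stringdist("East Timor", "Ecuador", method = "lv")      # 7 edits
```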
Local alignment
A similar problem arises in bioinformatics. Let’s say we are interested in the similarities between the human and bee genomes. Naturally, these genomes are quite different, and if we perform global alignment (as in Levenshtein and most other distance metrics), the relatively few conserved genes will be lost in the mismatch noise.
For this reason, bioinformaticians developed algorithms for local alignment: a similarity metric that rewards similarity without punishing for irrelevant differences. In R, a simple local alignment algorithm is available as the pairwiseAlignment function in the Biostrings package.
```r
library(Biostrings)

# For each Global Terrorism Database name, pick the World Bank name
# with the highest local alignment score.
localMatches <- true_matches %>%
  mutate(name2.guess = name1 %>%
    sapply(function(n)
      pairwiseAlignment(names2, n, type = "local", scoreOnly = TRUE) %>%
        which.max) %>%
    names2[.])

localMatches %>% summarize(score = mean(name2 == name2.guess))
```
The local alignment method guessed 77% of the countries, compared to only 50% for the best distance metric. Here are the countries it didn’t guess:
Global Terrorism Database | World Bank | Local alignment’s guess |
---|---|---|
South Korea | Korea, Rep. | South Asia |
Republic of the Congo | Congo, Rep. | Central African Republic |
North Korea | Korea, Dem. People’s Rep. | Middle East & North Africa |
Democratic Republic of the Congo | Congo, Dem. Rep. | Central African Republic |
East Timor | Timor-Leste | East Asia & Pacific (excluding high income) |
Every mistake is caused by a lengthy generic qualifier, such as “South”, “North”, “East”, or “Republic”, aligning better than the country name itself.
Why isn’t local alignment included in the stringdist package? Well, here’s the thing: strictly speaking, it is not a distance. Any distance metric must obey the triangle inequality:
\[d(x,z)\leq d(x,y)+d(y,z).\]
The local alignment score does not satisfy it: the phrase “local alignment” is highly (locally) similar to both “local” and “alignment”, but “local” and “alignment” are not similar at all.
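A quick sketch with pairwiseAlignment illustrates this (the exact scores depend on the scoring scheme; the relative magnitudes are what matters):

```r
library(Biostrings)

# High local similarity to both parts...
pairwiseAlignment("local alignment", "local", type = "local", scoreOnly = TRUE)
pairwiseAlignment("local alignment", "alignment", type = "local", scoreOnly = TRUE)
# ...but the two parts barely align with each other.
pairwiseAlignment("local", "alignment", type = "local", scoreOnly = TRUE)
```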
Does this matter? In some applications, perhaps. But for matching country names, it sure looks like local alignment beats any of the “proper” distance metrics.
And yet local alignment didn’t guess everything it could. I experimented briefly with a multi-alignment algorithm, but, at least in this case, it performed only a little bit better.