Sunday, April 22, 2007

Phonetic coding algorithms


Unfortunately, English spelling can be hard. And when it comes to spelling
names, it's even more difficult. For instance, in English, the name /ˈeɪdriən/
could be spelled in at least the following ways:

  • adrienne

  • adriene

  • adrien

  • adrianne

  • adriane

  • adrian

In human circumstances, this isn't a problem since people recognize all the
spellings and translate the written text into a phonetic representation in
their mind.

Now imagine that you want a computer user to enter a name into a search box and
have the computer convert that entry into an identifier in a database. If the
database has "adrienne" and the user types "adrian", there's no match. It's
too time-consuming to enter every possible name variant into the database.
Phonetic coding algorithms address this problem. For the list of names above,
the Soundex algorithm generates the code A365. By storing a single
phonetic code in the database and converting the user's entry into A365 one
can find the person the user intended.

Caron or Hendricks vs Hendrick

There are many phonetic algorithms. Soundex is one of the
first and most widely used, but it doesn't always perform well. For example,
my sister-in-law is named Caron. People inevitably spell her name as "Karen."
Soundex doesn't help because the codes C650 and K650 are different. In
this case, we could switch to the NYSIIS (CARAN), Metaphone (KRN),
Double Metaphone (KRN) or Phonix (K6500000) algorithms.

I sometimes encounter the misspelling "Hendricks" vs "Hendrick." By ommitting
the final "s", one confuses the Metaphone and Phonix algorithms. NYSIIS
(HANDRAC) and Double Metaphone (HNTR) still work in this case though.

Kinsey vs. McKinsey

For my particular application, I settled on NYSIIS because it
can be implemented in about 30 lines of Perl code and works quite well for
the names I'm interested in. However, it doesn't entirely solve the
use case I described above. There's one person in the database whose proper
name is Elizabeth but she goes by the name Betsy and another whose name is Amy
but she goes by the name Teddy.

In both cases, the problem lies with nicknames and not phonetics. It would be
interesting to develop a phonetic code which accounted for nicknames.
Perhaps by converting names into canonical names (Betsy -> Elizabeth) before
applying the phonetic code. Or maybe using the edit distance between two
phonetic codes to determine similarity. With a phonetic code like NYSIIS, this
hybrid approach could be fruitful. Some research by Zobel and Dart
indicate success with similar hybrid techniques.

No comments: