Sunday, October 23, 2005

Name Equality Classes

It's late and I'm not thinking clearly, but this notion has been running through my head all day. This is related to my earlier thoughts about matching people. Let's look at the name "Michael", its nicknames and their Metaphone encodings (I switched from NYSIIS to Metaphone since my last post, mostly because Metaphone gives shorter encodings and my testing indicated that they perform about the same).

  • Michael - MXL

  • Mike - MK

  • Mick - MK

  • Mickey - MK

We can see that a person named "Michael" could plausibly have two Metaphone encodings: MXL and MK. In some sense, names with either of those two encodings are equivalent. If I were searching a census for records about "Michael", I should consider any name encoded MK as a possible match (even if the written name is "Mack").
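To make the idea concrete, here is a rough Python sketch of the equivalence class for "Michael". The codes are hard-coded from the list above rather than computed, and the nickname table and the function names (code_class, is_candidate) are placeholders I made up for illustration.

```python
# Metaphone codes copied from the list above; "Mack" is the census example.
METAPHONE = {
    "Michael": "MXL",
    "Mike": "MK",
    "Mick": "MK",
    "Mickey": "MK",
    "Mack": "MK",
}

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"Michael": ["Mike", "Mick", "Mickey"]}


def code_class(name):
    """Return the set of codes for a name together with its known nicknames."""
    variants = [name] + NICKNAMES.get(name, [])
    return {METAPHONE[v] for v in variants}


def is_candidate(sought, recorded):
    """True if the recorded name's code falls in the sought name's class."""
    return METAPHONE[recorded] in code_class(sought)


print(sorted(code_class("Michael")))
print(is_candidate("Michael", "Mack"))
```

The point is only that the search key becomes a set of codes instead of a single code, so "Mack" gets flagged as worth a look when I am searching for "Michael".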

This is the part that my mind won't wrap around at this late hour. How can I involve these equivalence classes in the calculation of probabilities? Somehow any name with a code in the equivalence class needs to be incorporated in the calculation, but I'm not sure exactly how.

As my wife and I discussed this idea, I described to her some of my notions for how genealogy software and genealogy databases (like Ancestry) should interact. I imagined that as I search Ancestry, I find a record that I think might be a match. I drag that entry from Ancestry's webpage and drop it on the person in my genealogy program that I think matches. After I drop the entry, a menu appears asking whether I want to merge the entry's information with the entry in my file (complete with source referencing, of course) or compare my entry with the Ancestry entry. I choose "Comparison" and my genealogy program fetches the relevant record from Ancestry, compares it with all the data I have on file and returns a short report. The report gives the probability that this is a match, followed by a description of how it arrived at that number. I may then uncheck any factors in the calculation that seem irrelevant or incorrect and instantly see the newly calculated probability. Which reminds me that for the good of researchers the world over, Family Search and Ancestry should drink deeply of the web services Kool-Aid ©.
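A report like that could be sketched roughly as follows. This is only one possible approach, not a worked-out design: I am assuming each factor can be expressed as a likelihood ratio, that the factors are independent, and that some prior probability of a match is available. The Factor class, the function names and every number in the usage lines are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    description: str
    likelihood_ratio: float  # above 1 supports a match, below 1 argues against
    enabled: bool = True     # the checkbox in the report


def match_probability(prior, factors):
    """Combine the enabled factors with the prior, assuming independence."""
    odds = prior / (1.0 - prior)
    for f in factors:
        if f.enabled:
            odds *= f.likelihood_ratio
    return odds / (1.0 + odds)


def report(prior, factors):
    """Render the probability followed by the factors that produced it."""
    lines = [f"Match probability: {match_probability(prior, factors):.1%}"]
    for f in factors:
        box = "x" if f.enabled else " "
        lines.append(f"[{box}] {f.description} (x{f.likelihood_ratio:g})")
    return "\n".join(lines)


# Illustrative values only, not real data.
factors = [
    Factor("Given name agrees", 10.0),
    Factor("Birthplace differs", 0.5),
]
print(report(0.1, factors))

factors[1].enabled = False  # uncheck a factor that seems irrelevant
print(report(0.1, factors))
```

Unchecking a factor just drops it from the product, which is why the recalculated number can appear instantly.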

One benefit I see in being able to objectively calculate the probability that a given record matches the person I seek is that it lets a researcher account for numerous factors outside his knowledge. For example, if I'm looking for a person named "John", it matters greatly whether I'm looking for a John born in 1880 (8.2% of male births) or one born in 2004 (0.78% of male births).
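One hedged way to put numbers on the "John" example: if the person I seek really is named John, then a record that is not him will still say "John" about as often as the name occurs among births that year. Under that assumption (and ignoring spelling and transcription errors), the evidence from a name match scales roughly as one over the name's share. The function name below is made up, and the calculation is a sketch of that assumption, not an established formula from this post.

```python
# Share of male births named "John", taken from the paragraph above.
JOHN_SHARE = {1880: 0.082, 2004: 0.0078}


def name_match_weight(share):
    """How many times more likely a name match is for the right person
    than for a random record, assuming the right person always matches."""
    return 1.0 / share


for year, share in JOHN_SHARE.items():
    print(year, round(name_match_weight(share), 1))
```

So a matching "John" born in 2004 is roughly ten times stronger evidence than a matching "John" born in 1880, which is exactly the kind of thing I would never weigh correctly by eye.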

Of course, a tool like this is bound to cause trouble when people think that the computer can do all their genealogy. All they have to do is drop a bunch of links on the genealogy program and merge the high-scoring entries. Regardless, I think the technique could offer valuable help to honest family historians.

