Wednesday, October 26, 2005

Broken konversation audio notifications


For quite some time, I have been trying to get the konversation IRC client to play sounds when certain events happen in IRC. KDE has a slick, unified notification system, but for some reason, I could never get the sound notifications to work.




Well, I finally found the answer. This follow-up to a bug report about failing KDE audio notifications provides the solution. In brief:




  1. rm ~/.kde/share/config/knotifyrc

  2. killall knotify




That's it. When you try your next audio notification through knotify, everything should work.

Sunday, October 23, 2005

Name Equality Classes


It's late and I'm not thinking clearly, but this notion has been running through my head all day. This is related to my earlier thoughts about matching people. Let's look at the name "Michael", its nicknames, and their Metaphone encodings (I switched from NYSIIS to Metaphone since my last post, mostly because Metaphone gives shorter encodings and my testing indicated that the two perform about the same).




  • Michael - MXL

  • Mike - MK

  • Mick - MK

  • Mickey - MK




We can see that a person named "Michael" could plausibly have two Metaphone encodings: MXL and MK. In some sense, names with one of those two encodings are equivalent. If I were searching a census for records about "Michael", I should consider a name encoded MK as a possible match (even if it's the written name "Mack").
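
Here's a minimal sketch of that test, using the Text::Metaphone module from CPAN (the candidate names are my own examples):

use strict;
use warnings;
use Text::Metaphone;

# Build the equivalence class for "Michael" from its nickname variants.
my @variants = qw( Michael Mike Mick Mickey );
my %class;
$class{ Metaphone($_) } = 1 for @variants;

print "codes for Michael: ", join( ', ', sort keys %class ), "\n";
# codes for Michael: MK, MXL

# A candidate is a possible match if its code falls in the class.
for my $candidate (qw( Mack Nick )) {
    my $code = Metaphone($candidate);
    printf "%-5s => %-3s %s\n", $candidate, $code,
        $class{$code} ? 'possible match' : 'no match';
}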




This is the part that my mind won't wrap around at this late hour. How can I involve these equivalence classes in the calculation of probabilities? Somehow any name with a code in the equivalence class needs to be incorporated in the calculation, but I'm not sure exactly how.




As my wife and I discussed this idea, I described to her some of my notions for how genealogy software and genealogy databases (like Ancestry) should interact. I imagined that as I search Ancestry, I find a record that I think might be a match. I drag that entry from Ancestry's webpage and drop it on the person in my genealogy program that I think matches. After I drop the entry, a menu asks whether I want to merge the entry's information with the entry in my file (complete with source referencing, of course) or whether I want to compare my entry with the Ancestry entry. I choose "Comparison" and my genealogy program fetches the relevant record from Ancestry, compares it with all the data I have on file, and returns a short report. The report indicates the probability that this is a match, followed by a description of how it arrived at that number. I may then uncheck any factors in the calculation that seem irrelevant or incorrect and instantly see the newly calculated probability. Which reminds me that, for the good of researchers the world over, Family Search and Ancestry should drink deeply of the web services Kool-Aid®.




One of the benefits I see from being able to objectively calculate the probability that a given record matches the person I seek is that it allows a researcher to account for numerous factors outside his knowledge. For example, if I'm looking for a person named "John", it matters greatly whether I'm looking for a John born in 1880 (8.2% of male births) or one born in 2004 (0.78% of male births).
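
To make that concrete, here's a toy sketch of how a birth-year-dependent name frequency could feed the calculation; the two figures are the ones quoted above, and the table is otherwise hypothetical:

use strict;
use warnings;

# Prior probability that a male born in a given year is named "John".
my %john_frequency = (
    1880 => 0.082,     # 8.2% of male births
    2004 => 0.0078,    # 0.78% of male births
);

# A first-name match on a common name is much weaker evidence than a
# match on a rare one, so the prior belongs in the probability calculation.
for my $year ( sort keys %john_frequency ) {
    printf "John born in %d: prior %.2f%%\n",
        $year, 100 * $john_frequency{$year};
}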




Of course, a tool like this is bound to cause trouble when people think that the computer can do all their genealogy. All they have to do is drop a bunch of links on the genealogy program and merge the high-scoring entries. Regardless, I think the technique could offer valuable help to honest family historians.





Thursday, October 20, 2005

Identifying People

Negative Example



Background




I've been developing a program to interface with a government-designed database (I know, that's my first problem). The data specification is poorly designed in several respects, but my current focus is its method of generating "unique identifiers" for persons entered into the database.




When this database was originally designed, the design team decided to use the combination of the person's first name, last name, gender, and birthdate as the "unique identifier." To make things worse, they used only the first letter of the first name and the last letter of the last name. The system went live with this scheme. Of course, the "unique identifier" wouldn't be unique for twins John and Jacob Smith: both reduce to JH for the name part, and their genders and birthdates are identical. Several months later, realizing the design failure, the data committee patched the system by adding the person's first service date to the "unique identifier."
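
To see the collision concretely, here's a sketch of the original scheme; the exact field layout and the example birthdate are my guesses, not the real specification:

use strict;
use warnings;

# My reading of the original scheme: first letter of the first name,
# last letter of the last name, gender, and birthdate.
sub identifier {
    my (%p) = @_;
    return uc( substr( $p{first}, 0, 1 ) . substr( $p{last}, -1 ) )
         . $p{gender}
         . $p{birthdate};
}

# Twins with alliterative names defeat the scheme.
my $john  = identifier( first => 'John',  last => 'Smith',
                        gender => 'M', birthdate => '1995-03-14' );
my $jacob = identifier( first => 'Jacob', last => 'Smith',
                        gender => 'M', birthdate => '1995-03-14' );

print "John:  $john\n";    # JHM1995-03-14
print "Jacob: $jacob\n";   # JHM1995-03-14 -- identical, so not unique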




This year, the data committee decided to redesign the data system. Some of the changes were positive, but they apparently didn't see the silliness in trying to patch the existing identifier scheme. The latest specification creates a unique identifier based on the following information about the person:




  • first letter of the first name

  • last letter of the last name

  • gender

  • birthdate

  • mother's maiden name

  • birth city

  • birth state

  • birth country




The motivation for using all these fields as the "unique identifier" is that an individual may enter the database multiple times through independent sources. The committee wants to combine the information from these sources so that everything known about the person is available at once.



Problems



Here are a few of the problems with this "unique identifier" system.


  • The birthday paradox promises us that this method of generating a "unique identifier" will fail. We cannot create a truly unique identifier from data that is not itself unique. The best we can do is reduce the probability of a collision, and as the number of persons in the database increases, even that becomes increasingly difficult (a rough calculation follows this list).


  • Neither the first letter of the first name nor the last letter of the last name will always be the same for a particular individual. For example, when spoken, the names "Aaron" and "Erin" are easily confused, resulting in different first letters (this problem could be partly remedied with a good phonetic coding system such as NYSIIS). Likewise, people often use middle names or nicknames at different times in their lives, which could produce different first letters (and phonetic codes).


  • As designed, this system requires every piece of information. If anything is missing, the person cannot be entered into the system. Anyone accustomed to working with data about people knows that information can always be missing.


  • The lengthy identifier is prone to data entry errors causing false duplicates.
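
To put a rough number on the first problem, here's a standard birthday-paradox approximation; the size of the identifier space is an arbitrary round number, not anything from the real specification:

use strict;
use warnings;

# With n people drawing from N possible identifier values,
# P(at least one collision) is approximately 1 - exp( -n(n-1) / 2N ).
sub collision_probability {
    my ( $n, $N ) = @_;
    return 1 - exp( -$n * ( $n - 1 ) / ( 2 * $N ) );
}

my $N = 1_000_000;    # assumed number of distinct identifier values
for my $n ( 100, 1_000, 10_000 ) {
    printf "%6d people: P(collision) = %.3f\n",
        $n, collision_probability( $n, $N );
}

# Even with a million possible identifiers, ten thousand people make a
# collision a near certainty.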




What Should Have Happened




So what should this design committee have done? Their system is bound to fail, and the "unique identifier" is bound to grow inexorably longer as the database matures. In my opinion, they should have separated the problem of unique identifiers from the problem of matching duplicate individuals in the database.




Creating truly unique identifiers using a centralized system is an easy problem. We have examples such as IP addresses, domain names, MAC addresses, Social Security Numbers, credit card numbers, bank account numbers, etc. The design committee should have assigned each person a unique identifier when the individual was added to the database. They could also have distributed a bundle of identifiers to each data entry location so that a connection to the central database is not always necessary during initial data entry.


Under this proposal, the data entry locations would not have to bother with the harder problem of matching duplicate individuals. Furthermore, a checksum scheme such as the Luhn algorithm (Perl implementation) lets users of the unique identifiers verify that an identifier is well formed. This can catch data entry errors before they cause trouble.
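
As a sketch of how central assignment plus a check digit might look, here's the Luhn algorithm in Perl; the sequential-counter scheme and the starting value are my own illustration:

use strict;
use warnings;

# Compute the Luhn check digit for a numeric payload.
sub luhn_check_digit {
    my ($payload) = @_;
    my @digits = reverse split //, $payload;
    my $sum = 0;
    for my $i ( 0 .. $#digits ) {
        my $d = $digits[$i];
        if ( $i % 2 == 0 ) {    # these positions double once the check digit is appended
            $d *= 2;
            $d -= 9 if $d > 9;
        }
        $sum += $d;
    }
    return ( 10 - $sum % 10 ) % 10;
}

# Validate a full identifier (payload plus trailing check digit).
sub luhn_valid {
    my ($number) = @_;
    my @digits = reverse split //, $number;
    my $sum = 0;
    for my $i ( 0 .. $#digits ) {
        my $d = $digits[$i];
        if ( $i % 2 == 1 ) {
            $d *= 2;
            $d -= 9 if $d > 9;
        }
        $sum += $d;
    }
    return $sum % 10 == 0;
}

# Central assignment: a simple counter plus the check digit.
my $payload = 7992739871;    # arbitrary example payload
my $id      = $payload . luhn_check_digit($payload);
print "assigned: $id, valid: ", ( luhn_valid($id) ? 'yes' : 'no' ), "\n";
# assigned: 79927398713, valid: yes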



Matching People




Using only demographic characteristics, it can be quite difficult to determine whether two candidates are actually the same person. Anyone who has done genealogy research knows exactly how difficult and time-consuming the problem can be. Essentially, we can never be certain that the two candidates are the same person; we can only increase our confidence that they are. If you doubt that locating duplicate individuals is difficult, peruse a copy of Ancestry magazine for examples.



My Ideal




The difficulty of this task brings me to the real point of this article. Several times, I've thought it would be useful to have a Perl module which could look at the data for two individuals and provide a reliable estimate of the probability that they are the same person. The estimate would be based on statistical results and probability tables for various demographics.




I envision code something like this:




my $a = Person->new(
    given_name  => 'John',
    surname     => 'Smith',
    gender      => 'male',
    birth       => '1802',
    birth_place => 'Lexington, Kentucky',
    source      => 'handwritten',
);
my $b = Person->new(
    given_name  => 'John',
    surname     => 'Smyth',
    gender      => 'male',
    death       => '1864',
    death_place => 'Kentucky',
    source      => 'handwritten | typed',
);

printf "match probability %.1f\n", match( $a, $b );



The above interface is unimportant; the important part is that you give it the information you know and the code does all the hard work of calculating the probability that the two persons are the same individual.




Here's my rough idea of how the insides of match() would work.





  1. Notice the years 1802 and 1864 and only use data from approximately that time range.


  2. Notice the birth and death locations of Kentucky and only use data from that state or that region of the country.


  3. Calculate the probability that a male in Kentucky during the 1800s would have the first name John.


  4. Calculate the probability that a handwritten "Smith" is the same as a handwritten-then-typed "Smyth".


  5. Calculate the probability that a male born in 1802 would die in 1864.


  6. Combine the foregoing probabilities into a single probability (sketched below).
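
Here's a minimal sketch of how those steps could hang together. Every helper and every number below is a hypothetical placeholder for the statistical tables discussed in the next section, and plain hash references stand in for the Person objects:

use strict;
use warnings;

sub match {
    my ( $a, $b ) = @_;

    # Steps 1-3: frequency of the given name among males in that
    # region and era.
    my $p_name = name_frequency( $a->{given_name}, 'Kentucky', 1802, 1864 );

    # Step 4: chance that a handwritten "Smith" and a
    # handwritten-then-typed "Smyth" record the same name.
    my $p_surname = transcription_match( $a->{surname}, $b->{surname} );

    # Step 5: chance that a male born in 1802 survives to 1864.
    my $p_lifespan = survival_probability( 1802, 1864 );

    # Step 6: naive combination that assumes the factors are
    # independent; a real implementation would need something subtler.
    return $p_name * $p_surname * $p_lifespan;
}

# Placeholder tables; these numbers are invented for illustration only.
sub name_frequency       { return 0.08 }
sub transcription_match  { return 0.90 }
sub survival_probability { return 0.45 }

my $p = match(
    { given_name => 'John', surname => 'Smith' },
    { given_name => 'John', surname => 'Smyth' },
);
printf "match probability %.3f\n", $p;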



Hard Part




The hard part of implementing something like this would be acquiring the numerous statistical tables the algorithm requires. The simple case above requires at least the following tables:





  • Distribution of male names in Kentucky, 1800–1870


  • Distribution of male life expectancies in Kentucky, 1800–1870


  • Analysis of errors that occur during the manual and keyed transcription of names




I think the compilation of these tables is feasible, if tedious and monumental in scope. As more and more genealogical data is placed in computer systems around the world, compiling these tables becomes easier. For example, the United States Social Security Administration has name data going back to 1880.




A tool such as I describe here would be enormously valuable for family history researchers. However, it should also give you an idea of why the design committee at the beginning of this article was foolish to try to reduce such a difficult task to a simple database identifier.

Contextual::Return confusion


I had a short snippet of code using Contextual::Return, something like this:




use strict;
use warnings;

my $a = foo();
print "$a\n";

sub foo {
    return
        NUM { return 1234 }
        STR { return "foo" }
    ;
}



but when I ran it, the output showed 1234 instead of foo. No matter what I did, $a always took the value from the first context block I specified.




But, aha, I hadn't actually used Contextual::Return. Adding the appropriate use Contextual::Return; line to the top of my snippet produced the correct behavior.
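
For reference, the working version just adds that one line:

use strict;
use warnings;
use Contextual::Return;    # the line I had forgotten

my $a = foo();
print "$a\n";              # string context, so this prints "foo"

sub foo {
    return
        NUM { return 1234 }
        STR { return "foo" }
    ;
}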




You would laugh if you knew how much time I spent debugging that confusion. But shouldn't there be a warning message or something, since NUM and STR aren't defined? It seems like there should be, but there wasn't.

Wednesday, October 19, 2005

blogs

Hmm, they have blogs on computers now.