Thursday, October 20, 2005

Identifying People

Negative Example



Background




I've been developing a program to interface to a government-designed database (I know, that's my first problem). The data specification is poorly designed in several respects, but my current focus is the method of "unique identifiers" for persons entered into the database.




When this database was originally designed, the design team decided to use the combination of the person's first name, last name, gender and birthdate as the "unique identifier." To make things worse, they used only the first letter of the first name and the last letter of the last name. The system went live with this scheme. Of course, the "unique identifier" wouldn't be unique for twins John and Jacob Smith (JH being the name part of the identifier with genders and birthdates identical). Several months later, realizing the design failure, the data committee patched the system by adding the person's first service date to the "unique identifier."
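To make the collision concrete, here is a minimal sketch of how such a key might be built. The function name legacy_key, the field order, and the date format are my assumptions for illustration; the actual specification may encode things differently.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the original key. The field order and
# formatting are assumptions, not taken from the real specification.
sub legacy_key {
    my (%person) = @_;
    my $first_initial = uc( substr( $person{first_name}, 0, 1 ) );
    my $last_letter   = uc( substr( $person{last_name}, -1 ) );
    return join '', $first_initial, $last_letter,
        $person{gender}, $person{birthdate};
}

# Twins share gender and birthdate, and here the name letters match too.
my %shared = ( last_name => 'Smith', gender => 'M', birthdate => '19700102' );
my $john  = legacy_key( %shared, first_name => 'John' );
my $jacob = legacy_key( %shared, first_name => 'Jacob' );

print "John:  $john\n";
print "Jacob: $jacob\n";
print "Collision: two people, one identifier\n" if $john eq $jacob;
```

Adding the first service date only helps if the twins happen to be served on different days.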




This year, the data committee decided to redesign the data system. Some of the changes were positive, but they apparently didn't see the silliness in trying to patch the existing identifier scheme. The latest specification creates a unique identifier based on the following information about the person.




  • first letter of the first name

  • last letter of the last name

  • gender

  • birthdate

  • mother's maiden name

  • birth city

  • birth state

  • birth country




The motivation for using all these fields as the "unique identifier" is that an individual may enter the database multiple times through independent sources. The committee wants to combine the information from these sources so that all the information about the person is available at once.



Problems



Here are a few of the problems with this "unique identifier" system.


  • The birthday paradox promises us that this method of generating a "unique identifier" will fail. We cannot create a truly unique identifier from data which is non-unique. The best we can do is attempt to reduce the probability of a collision. However, as the number of persons in the database increases, this becomes increasingly difficult.


  • Neither the first letter of the first name nor the last letter of the last name will always be the same for a particular individual. For example, when spoken, the names "Aaron" and "Erin" are easily confused, resulting in different first letters (this problem could be partly remedied with a good phonetic coding system such as NYSIIS). Likewise, people often use middle names or nicknames at different times in their lives, which could cause different first letters (and phonetic codes).


  • As designed, this system requires every piece of information. If anything is missing, the person cannot be entered into the system. Anyone accustomed to working with data about people knows that information can always be missing.


  • The lengthy identifier is prone to data entry errors causing false duplicates.
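To put a rough number on the birthday paradox point, here is a small sketch using the standard approximation 1 - exp(-n(n-1)/(2N)) for n people spread uniformly over N possible identifiers. The sizes below are invented for illustration; I don't know the real size of the identifier space.

```perl
use strict;
use warnings;

# Approximate probability that at least two of $n people share an
# identifier when identifiers are spread uniformly over $space values.
sub collision_probability {
    my ( $n, $space ) = @_;
    return 1 - exp( -$n * ( $n - 1 ) / ( 2 * $space ) );
}

# Illustrative sizes only: a billion possible identifiers is an assumption.
for my $n ( 1_000, 100_000, 1_000_000 ) {
    printf "%9d people: collision probability %.4f\n",
        $n, collision_probability( $n, 1e9 );
}
```

Real demographic data is far from uniform (common names, large birth cities), which makes collisions more likely than this estimate suggests.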




What Should Have Happened




So what should this design committee have done? Their system is bound to fail and the "unique identifier" is bound to grow inexorably longer and longer as the database matures. In my opinion, they should have separated the problem of unique identifiers and the problem of matching duplicate individuals in the database.




Creating truly unique identifiers using a centralized system is an easy problem. We have examples such as IP addresses, domain names, MAC addresses, Social Security Numbers, credit card numbers, bank account numbers, etc. The design committee should have assigned each person in the system a unique identifier when an individual was added into the database. They could have also distributed a bundle of identifiers to each data entry location so that a connection to the central database is not always necessary during initial data entry.
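Handing out bundles of identifiers could look something like the following sketch. The names make_allocator and the block size are hypothetical, and a real system would persist the counter in the central database rather than in memory.

```perl
use strict;
use warnings;

# Hypothetical central allocator: returns a closure that hands out
# consecutive, non-overlapping ranges of identifier numbers.
sub make_allocator {
    my ($next_free) = @_;
    return sub {
        my ($size) = @_;
        my $start = $next_free;
        $next_free += $size;
        return [ $start, $start + $size - 1 ];    # [first, last]
    };
}

my $reserve  = make_allocator(1);
my $clinic_a = $reserve->(1000);    # block for one data entry location
my $clinic_b = $reserve->(1000);    # block for another

printf "clinic A may assign %d through %d\n", @$clinic_a;
printf "clinic B may assign %d through %d\n", @$clinic_b;
```

Each location can then assign identifiers from its own block while offline, and no two locations can ever issue the same number.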


Under this proposal, the data entry locations would not have to bother with the harder problem of matching duplicate individuals. Furthermore, a check digit scheme such as the Luhn algorithm (Perl implementation) lets users of the unique identifiers verify that an identifier was entered correctly. This can catch most data entry errors before they cause trouble.
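For reference, a minimal sketch of the Luhn check in Perl follows. The function names are my own, and I assume identifiers are plain strings of decimal digits with the check digit in the last position.

```perl
use strict;
use warnings;

# Compute the Luhn check digit for a string of digits (the payload).
sub luhn_check_digit {
    my ($payload) = @_;
    my $sum    = 0;
    my $double = 1;    # the rightmost payload digit is doubled first
    for my $digit ( reverse split //, $payload ) {
        my $value = $double ? $digit * 2 : $digit;
        $value -= 9 if $value > 9;    # same as adding the two digits
        $sum += $value;
        $double = !$double;
    }
    return ( 10 - $sum % 10 ) % 10;
}

# True if the last digit of $number is the correct check digit.
sub luhn_is_valid {
    my ($number) = @_;
    return 0 unless $number =~ /\A\d{2,}\z/;
    my $payload = substr( $number, 0, -1 );
    my $check   = substr( $number, -1 );
    return luhn_check_digit($payload) == $check ? 1 : 0;
}

for my $id (qw(79927398713 79927398710 79927398731)) {
    printf "%s %s\n", $id, luhn_is_valid($id) ? 'ok' : 'rejected';
}
```

The check catches every single-digit typo and most swaps of adjacent digits, so a mistyped identifier is rejected at entry time instead of silently pointing at the wrong person.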



Matching People




Using only demographic characteristics, it can be quite difficult to determine whether two candidates are actually the same person. Anyone who has done genealogy research knows exactly how difficult and time-consuming the problem can be. Essentially, we can never be certain that the two candidates are the same person; we can only attempt to increase our certainty that they are. If you doubt that locating duplicate individuals is difficult, peruse a copy of Ancestry magazine for examples.



My Ideal




The difficulty of this task brings me to the real point of this article. Several times, I've thought it would be useful to have a Perl module which could look at the data for two individuals and provide a reliable estimate of the probability that they are the same person. The estimate would be based on statistical results and probability tables for various demographics.




I envision code something like this:




    # Person and match() are the imagined interface; neither exists yet.
    # The variables avoid the names $a and $b, which Perl reserves for sort.
    my $first = Person->new(
        given_name  => 'John',
        surname     => 'Smith',
        gender      => 'male',
        birth       => '1802',
        birth_place => 'Lexington, Kentucky',
        source      => 'handwritten',
    );
    my $second = Person->new(
        given_name  => 'John',
        surname     => 'Smyth',
        gender      => 'male',
        death       => '1864',
        death_place => 'Kentucky',
        source      => 'handwritten | typed',
    );

    printf "match probability %.1f\n", match( $first, $second );



The above interface is unimportant. The important part is that you give it the information you know and the code does all the hard work of calculating the probability that the two persons are the same individual.




Here's my rough idea of how the insides of match() would work.





  1. Notice the years 1802 and 1864 and only use data from approximately that time range.


  2. Notice the birth and death locations of Kentucky and only use data from that state or that region of the country.


  3. Calculate the probability that a male in Kentucky during the 1800s would have the first name John.


  4. Calculate the probability that a handwritten "Smith" is the same as a handwritten-then-typed "Smyth".


  5. Calculate the probability that a male born in 1802 would die in 1864.


  6. Combine the foregoing probabilities into a single probability.
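I haven't settled on how the final combining step would work. One plausible approach, sketched below, is to start from a prior probability and multiply the odds by a likelihood ratio for each piece of evidence, treating the pieces as independent. That independence is a simplifying assumption, the function name combine_evidence is hypothetical, and every number is made up; real values would have to come from the statistical tables.

```perl
use strict;
use warnings;

# Combine a prior match probability with per-field likelihood ratios,
# treating the fields as independent (a simplifying assumption).
# Each ratio is P(evidence | same person) / P(evidence | different people).
sub combine_evidence {
    my ( $prior, @ratios ) = @_;
    my $odds = $prior / ( 1 - $prior );
    $odds *= $_ for @ratios;
    return $odds / ( 1 + $odds );
}

# Made-up numbers for illustration only.
my $probability = combine_evidence(
    0.001,    # prior: two arbitrary records rarely describe one person
    20,       # both named John, if about 1 in 20 males had that name
    8,        # "Smith" vs. "Smyth" is a plausible transcription variant
    1.5,      # born 1802, died 1864 is an unremarkable lifespan
);

printf "match probability %.3f\n", $probability;
```

Working in odds keeps the arithmetic simple: evidence that favors a match has a ratio above 1, evidence against has a ratio below 1, and uninformative evidence has a ratio of exactly 1.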



Hard Part




The hard part of implementing something like this would be acquiring the numerous statistical tables the algorithm requires. The simple case above requires at least the following tables:





  • Distribution of male names in Kentucky between 1800 and 1870


  • Distribution of male life expectancies in Kentucky between 1800 and 1870


  • Analysis of errors that occur during the manual and keyed transcription of names.




I think the compilation of these tables is feasible, just tedious and monumental. As more and more genealogical data is placed in computer systems around the world, compiling these tables becomes easier. For example, the United States Social Security Administration has data on names back to 1880.




A tool such as the one I describe here would be enormously valuable for family history researchers. However, it should also give you an idea of why the design committee at the beginning of this article was foolish in trying to reduce such a difficult task to a simple database identifier.
