Mar 4 2009

Observations in Data Models

Martin Fowler has an excellent article on contradictory observations and data models, which I think should be required reading for everyone who even thinks about writing genealogical software.

I had never thought about the specific examples he brings up from health care, though they make perfect sense. I have thought about these very issues quite a bit in the realm of genealogical data, and it is my firm belief that software that doesn’t allow for building up a “web of belief” from evidence (or observations, as he calls them), including contradictory and rejected evidence, is fundamentally broken. That means almost every piece of genealogical software ever written. Certainly all of the commercial ones.

Thankfully we’re seeing some progress on this front. The new.familysearch.org site gets us part of the way there: you have separate observations that are merged and disputed and together give a view of the data. Unfortunately there are still holes in the conclusions drawn from observations, especially contradictory ones. Hopefully this will get worked out, so that if I have solid evidence that refutes someone else’s entry (which may have been based on no evidence at all, or weak evidence), the view updates to reflect that automatically. Likewise, if I have weak evidence that contradicts someone else’s strong evidence, it should by no means change the view to my new data, but I should be able to record it for posterity. That rejection is important to record, so that when someone else stumbles on the same weak evidence they can see that it was given full consideration.

Also, the new.familysearch.org merging is both not transparent enough and too transparent: you have to really dig to figure out where each bit of data came from, and yet every single alternate spelling or date is right there in your face whether the differences are important or not. But these are issues that can be improved incrementally. The important thing is that they’re fundamentally on the right track.
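To make the idea concrete, here is a minimal sketch in Python of a store that keeps every observation, including contradictory and rejected ones, and derives the current view from them. It is only an illustration under stated assumptions: the names (Observation, EvidenceStore, strength) and the numeric strength score are hypothetical, and none of this is taken from Fowler’s article or from how new.familysearch.org works.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Observation:
    subject: str      # hypothetical person identifier
    attribute: str    # e.g. "birth_date"
    value: str
    source: str       # where the claim came from
    strength: int     # assumed score: higher means stronger evidence
    status: Status = Status.ACCEPTED
    note: str = ""    # why the observation was rejected, if it was


@dataclass
class EvidenceStore:
    observations: list = field(default_factory=list)

    def record(self, obs):
        """Keep every observation, even one that contradicts others."""
        self.observations.append(obs)

    def reject(self, obs, note):
        """Mark an observation as rejected but keep it on file with a reason."""
        obs.status = Status.REJECTED
        obs.note = note

    def history(self, subject, attribute):
        """Everything ever recorded for this fact, rejected or not."""
        return [o for o in self.observations
                if o.subject == subject and o.attribute == attribute]

    def view(self, subject, attribute):
        """Current conclusion: the strongest observation not yet rejected."""
        candidates = [o for o in self.history(subject, attribute)
                      if o.status is Status.ACCEPTED]
        if not candidates:
            return None
        return max(candidates, key=lambda o: o.strength).value
```

In this sketch, recording a weak contradictory observation never changes the view while a stronger one stands, yet the weak observation and the reason for rejecting it remain visible through history. A real system would need something richer than a single integer to weigh evidence.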

I have a book on genealogical evidence (thanks Mom!) that I’m reading. When I finish it I plan to pontificate in depth about data models and genealogy, and maybe even put some code where my mouth is.


Jul 30 2008

gedtag

Have you ever tried to import Aunt Millie’s n-thousand-person GEDCOM into your database? You either ended up with a reeking mess of a database, gave up and restored from a backup, or went insane trying to clean up the mess. Believe me, I know. And my family GEDCOMs are fairly well-behaved. But then there’s always Ancestral File or online-generated GEDCOMs.

This is no laughing matter. In fact, it has been the single most debilitating roadblock to me doing any real genealogy since I got the bug as a teenager.

I think I finally have a way to tame the beast. It’s not a magic bullet—there will still be a lot of mind-numbing match/merge. But it will maintain order and the integrity of the database.

First, start with a clean slate. If you have an existing database, export it to GEDCOM and make a new database. This step isn’t strictly necessary but keeps things ultra clean. If you’re afraid you’ll lose information in the export/import, you need a different genealogy program.

Now, sort the GEDCOMs you plan to import by importance and reliability. Your original database export probably comes near the top of this list, although not necessarily. Write the order down. In fact, write everything down when doing anything in genealogy.

Now, take that first GEDCOM and run it through my magic filter. This will add REFN tags to your GEDCOM that look something like this: hans@fugal.net,2008-07-30:foo.ged/INDI/I1. This tag tells you the submitter’s email (or name), the date in the GEDCOM file (or today’s date), the name of the original GEDCOM file, and the identifying information for this particular record. In short, it keeps track of where that record came from. It will show up in PAF as the custom ID, and likely in other software in a similar manner.
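For readers who want to see the shape of such a filter, here is a rough sketch in Python. It is not the actual gedtag program, just an illustration of the tag format described above. The function names (make_refn, tag_gedcom) are hypothetical, and the sketch assumes that each record starts with a level-0 line such as "0 @I1@ INDI", that only INDI and FAM records are tagged, and that the submitter and date are supplied by the caller rather than read from the file header.

```python
def make_refn(submitter, file_date, filename, record_type, xref):
    """Build a provenance tag like hans@fugal.net,2008-07-30:foo.ged/INDI/I1."""
    return f"{submitter},{file_date}:{filename}/{record_type}/{xref}"


def tag_gedcom(lines, submitter, file_date, filename, record_types=("INDI", "FAM")):
    """Return a copy of the GEDCOM lines with a '1 REFN' line after each record header.

    Assumption: a record header looks like '0 @XREF@ TYPE'. Other level-0
    lines (HEAD, TRLR) and record types not in record_types are left alone.
    """
    tagged = []
    for line in lines:
        tagged.append(line)
        parts = line.split()
        is_record_header = (
            len(parts) >= 3
            and parts[0] == "0"
            and parts[1].startswith("@")
            and parts[1].endswith("@")
            and parts[2] in record_types
        )
        if is_record_header:
            xref = parts[1].strip("@")
            refn = make_refn(submitter, file_date, filename, parts[2], xref)
            tagged.append(f"1 REFN {refn}")
    return tagged
```

Calling tag_gedcom on the lines of foo.ged with the submitter hans@fugal.net and the date 2008-07-30 would add "1 REFN hans@fugal.net,2008-07-30:foo.ged/INDI/I1" under the record for I1.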

Now import the GEDCOM. In PAF, there is an option on import to reuse RINs. Uncheck this option. The import screen tells you the highest RIN currently used. Take note of this RIN. Now every record in the import will have a RIN above this RIN. The RIN is easier to use in match and merge (it’s right there, you don’t have to dig for it), so the tags we added are for posterity’s sake.

Now, do the match and merge. Did you know that PAF has the ability to match and merge based on the _UID tags it spits out in GEDCOMs? That means if this GEDCOM and the GEDCOM(s) you’ve already imported have a common ancestor, the universally unique IDs will match, and you know without a doubt that someone thought they were the same person already. You can breeze through these merges with confidence that you won’t merge people you shouldn’t. Likewise there is an AFN match and merge, which is almost as trustworthy. (I’m a bit paranoid so I always double-check anything coming from Ancestral File. Maybe it’s because there are about 5 versions of me in Ancestral File, most of which can’t even spell my name right.) Finally, go through the other options (name, soundex) and do a thorough match/merge.
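The _UID idea is simple enough to sketch outside of PAF. The Python below is an illustration only, not how PAF implements its merge: it assumes each individual record carries at most one level-1 "_UID" line, and the function names (uids_by_record, uid_matches) are hypothetical.

```python
def uids_by_record(lines):
    """Map each INDI cross-reference (e.g. '@I1@') to its _UID value, if present."""
    result = {}
    current = None  # xref of the INDI record being read, or None
    for line in lines:
        parts = line.split(maxsplit=2)
        if not parts:
            continue
        if parts[0] == "0":
            # A new record starts; remember it only if it is an individual.
            is_indi = len(parts) == 3 and parts[2].strip() == "INDI"
            current = parts[1] if is_indi else None
        elif current and parts[0] == "1" and len(parts) == 3 and parts[1] == "_UID":
            result[current] = parts[2].strip()
    return result


def uid_matches(existing, incoming):
    """Pair up records from two files that share the same _UID.

    Both arguments are dicts as returned by uids_by_record. The result is a
    list of (existing_xref, incoming_xref) pairs that are merge candidates.
    """
    by_uid = {uid: xref for xref, uid in existing.items()}
    return [(by_uid[uid], xref) for xref, uid in incoming.items() if uid in by_uid]
```

A pair reported by uid_matches only says that both files claim the same unique ID for a person. It is still worth a look before merging, for the same reasons given for AFN matches.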

Now, go through all the remaining RINs greater than the RIN you noted earlier. These are the new people in your database. Get to know them. See where they sit in the pedigree. Read the notes. Make sure they meet your quality standards. Add sources if you know of them. Make notes of missing information, questionable stuff to research, etc. You should have a whole truckload of research tasks to do after this import—and some of them you should do before the next import (you’ll recognize these if you take the time to think of them and write them down). Actually you should do that with every person you merge in the previous step as well, since they will merge into lower RINs. Don’t hit that merge button until you’ve done the quality check!

After weeks, months, or years of doing this on Sunday afternoons, you will have a meticulous database that works for you. You will have laid a solid foundation which will empower your future research efforts. You will not be sorry.


Jun 22 2008

Genealogy: Induction or Deduction?

From time to time I think about evidence-based genealogy. All good genealogy is evidence-based, i.e. you have evidence to support all of your conclusions, and a complete stranger would agree with your conclusions because of your evidence. But most amateur genealogists, and computer software, treat evidence as a secondary concern at best. To them, it’s the conclusions that are important, and documenting the evidence is an afterthought and a bother and usually is not done at all. After all, it’s obvious at the moment that you’re recording the marriage of Fred and Wilma that it’s true. Of course later we find that Fred and Wilma never even knew each other, and we’ve forgotten why we thought they got married. Oops.

The problem is exacerbated by the fact that most amateur genealogists (including genealogy software developers) start out by recording the family history stored in their heads. This is information that they are as sure about as they are of gravity. Recording evidence of these “givens” is tedious and ridiculous. And there’s enough of it that by the time you’ve entered it all into the computer (or onto family group sheets) you have developed a solid bad habit of not entering sources.

This is compounded even further by genealogical databases. Go to a family history center or genealogy website and download a few thousand or tens of thousands of names. Who would turn down such a tremendous head start? Who would meticulously verify and document the evidence of every one of those names and the dozen or more conclusions associated with each one?

But I’m getting sidetracked. The question of the day is whether genealogy is an inductive or deductive sport. Let’s review the definitions.

induction |inˈdək sh ən|, noun. The inference of a general law from particular instances.

deduction |diˈdək sh ən|, noun. The inference of particular instances by reference to a general law or principle.

So induction is going from specific to general, i.e. making conclusions based on evidence. Sounds like genealogy, right? But if we replace “general law or principle” with the word “premise,” then deduction also looks a lot like genealogy. The problem is, neither evidence nor genealogical conclusions look an awful lot like “general laws.”

Let me take another crack at defining the terms. Induction is when you take a bunch of observations and induce a probable generality from them. Deduction is when you take premises and deduce a conclusion that must be true, provided the premises are true.

If I have a birth certificate for one Fred Flintstone, then I can deduce that some Fred Flintstone was born on such and such date. The only way to question that conclusion is to question the veracity of the birth certificate. Note that I said some Fred Flintstone. A common pitfall in genealogy is the leap from evidence for someone of the same name to evidence for the particular person being researched.

If I have the birth certificate, and a bunch of other documents, and they all support the notion that there was one Fred Flintstone in Bedrock during this period of time, and all the evidence fits together well, I can construct a probable picture of the person Fred Flintstone. This seems to be induction. Even though my premises are true, I may be taking a leap of faith to conclude that the Fred Flintstone from the birth certificate is the Fred Flintstone that married Wilma (and therefore my ancestor). It’s not deduction because it doesn’t follow directly from the premises.

Well, so it seems like genealogy is both inductive and deductive, and that’s before you even consider the fallibility of evidence. No wonder it can be such a mess. This underlines the need for tools that help us dwell in the realm of evidence, which is relatively stable compared to the realm of conclusions. Very rarely indeed will a primary source be completely false (though it is more common to find inaccurate sources, with bad spelling or slightly-off dates). More often, our conclusions based on the primary sources are completely false. Yet, in the end, it’s the conclusions that we care about. So the software needs to allow us to dwell in the evidence world while providing the context of our current set of conclusions.

Software developers would be tempted to treat evidence-based genealogical software as an exercise in deductive reasoning. They’d program in all kinds of ways for the computer to do the thinking for you. Fuzzy, probable conclusions have no place in this vision. I think that’s the point of this post. We mustn’t fall into that trap or we’ll have another dark age like the conclusion-based age we’re still struggling to get out of. Except this one will be worse because it doesn’t even match the amateur genealogist’s first way of thinking of things.

While I believe there is room for computers to automatically infer things based on evidence, and direct researchers to areas of the family tree that may be influenced by a new bit of information, I think it is vital that we not lose sight of the fact that this is a human enterprise. In the end, a person must interpret the evidence, and she must be able to easily change her mind later. As such, the software must first and foremost be an organizational tool. It must help us make sense of the mass of evidence and conclusions. It must free us from the shackles of disorganization without binding us with the shackles of inflexible deductive logic. And yet, at its best, it will encourage the certainty of deductive reasoning where appropriate.

So what do you think? I’m a computer scientist, not a logician, and I have been known to confuse inductive and deductive reasoning. Is genealogy inductive, deductive, or both?