Mar 4 2009

Observations in Data Models

Martin Fowler has an excellent article on contradictory observations and data models, which I think should be required reading for everyone who even thinks about writing genealogical software.

I had never thought about the specific examples that he brings up in the health care profession, though they make perfect sense. I have thought about these very issues quite a bit in the realm of genealogical data, and it is my firm belief that software that doesn’t allow for building up a “web of belief” from evidence (or observations, as he calls them), including contradictory and rejected evidence, is fundamentally broken. That means almost every piece of genealogical software ever written. Certainly all of the commercial ones.

Thankfully we’re seeing some progress on this front. The new.familysearch.org site gets us part of the way there: you have separate observations that are merged and disputed and give a view of the data. Unfortunately there are still holes in the conclusions drawn from observations, especially contradictory ones. Hopefully this will get worked out, so that if I have solid evidence that rejects someone else’s entry (which may have been based on no evidence at all, or weak evidence), the view updates to reflect that automatically. Likewise, if I have weak evidence that contradicts someone else’s strong evidence, it should by no means change the view to my new data, but I should be able to record it for posterity (that rejection is important to record so that when someone else stumbles on the same weak evidence they can see that it was given full consideration).

Also, the new.familysearch.org merging stuff is both not transparent enough and too transparent: you have to really dig to figure out where each bit of data came from, and yet every single alternate spelling or date is right there in your face whether the differences are important or not. But these issues are things that can incrementally improve. The important thing is that they’re fundamentally on the right track.
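To make the “web of belief” idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how new.familysearch.org or Fowler's article models things: the names `Observation`, `reject`, and `current_view`, and the integer `strength` scale, are all hypothetical. The point is that observations are never deleted; the view is derived from whatever evidence has not been rejected.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    subject: str      # hypothetical person identifier, e.g. "I1"
    attribute: str    # what is being asserted, e.g. "birth_year"
    value: str
    source: str       # where the evidence came from
    strength: int     # arbitrary scale (an assumption): higher is stronger
    rejected_because: Optional[str] = None  # reason, if this was rejected


def reject(obs: Observation, reason: str) -> Observation:
    """Return a copy marked as rejected; the evidence itself is kept."""
    return replace(obs, rejected_because=reason)


def current_view(observations: List[Observation],
                 subject: str, attribute: str) -> Optional[str]:
    """Value backed by the strongest observation that has not been rejected."""
    live = [o for o in observations
            if o.subject == subject
            and o.attribute == attribute
            and o.rejected_because is None]
    if not live:
        return None
    return max(live, key=lambda o: o.strength).value
```

Recording a weak, contradictory observation leaves the view alone, while rejecting the strong one changes the view without throwing anything away.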

I have a book on genealogical evidence (thanks Mom!) that I’m reading. When I finish it I plan to pontificate in depth about data models and genealogy, and maybe even put some code where my mouth is.


Aug 9 2008

New FamilySearch

So I finally got around to trying out the “New FamilySearch” today. I am both impressed and disappointed.

First the good parts. NFS (you didn’t think I was going to type “New FamilySearch” over and over, did you?) has an impressive goal and paradigm. The goal is to create one hugemongous centralized database for all church members. The idea is to get away from the half dozen church databases (Ancestral File, IGI, etc.), and the half gazillion individual databases. A noble goal but a very scary one. It would be easy to screw this up and make a bigger mess than that with which we started. In fact this is why I have been reluctant to check it out—I didn’t want to be disappointed and I wholly expected to be.

Well, they actually pull it off quite well. The new paradigm is to keep everything and to promote recording evidence. In short, genealogy done right. When you merge a person, that is recorded and available for others to see. When you want to change information, you don’t change it directly (as you would in a conclusion-based program like PAF), but instead you “add an opinion” complete with sources and/or notes. If you think that a piece of information is wrong and you have evidence against it, you can dispute it (again, giving source and notes). The old “wrong” information isn’t eliminated, but it is marked as disputed. The changes and choices you make about people show up in the pedigree chart etc. This is multi-user genealogy done well (I might call it “distributed genealogy”, but I’ll reserve that term for something better, as you’ll see later).

From the perspective of an LDS member this is a fantastic system. When ordinances are performed in the temple they are immediately reflected in the database. When you want to do temple work for so-and-so, you state your intentions in the system and print out a page to take to the temple with you. If anyone else tried to do the same work, they’d see it was in process. This will drastically reduce—perhaps even essentially eliminate—duplicated effort in the temple. I have to say it’s about time. It would have been cool 10 years ago. It was expected 7 years ago. Now it’s finally here.

There are some other cool tidbits, too, like the pedigree view which combines couples to make better use of space (are they the first to think of it? Probably not, though I haven’t personally seen this approach before):

NFS pedigree

There’s an info box at the bottom with different tabs, one of which is “possible duplicates”. I much much much prefer working with duplicates in this manner, rather than a global “match and merge”. Very nice. There are also time lines and Google Maps integration (see where your ancestor was born, married, died, etc.). And those little temple icons unobtrusively notify you of potential temple work to do (or that has already been done). Overall they make nice use of AJAX, too.

But there are problems. Big problems.

It’s slow. Painfully slow. It’s slow enough to be a real pain for doing actual genealogical work. Maybe people with limited computer skills wouldn’t find it slow, because it moves at about the pace they can keep up with. But for those of us in the computer age (read: almost everyone in my generation or younger) it is painful and restrictive. Why is it slow? Because it’s a web app. News flash! Even AJAX web apps are slow.

Ok, it’s slow. No big deal, right? Just download the GEDCOM, do your research, and upload the changes. Right? I have news for you. There’s no exporting data from NFS. The help center has this to say:

Exporting Information from FamilySearch for Use in Your Personal Computer

This topic describes how to get information from FamilySearch into your family history computer program.

If you find information in FamilySearch that you do not have, you will need to either use the cut and paste features of your operating system or retype it into your computer program.

Currently, FamilySearch does not support downloading information for use with Personal Ancestral File or similar computer programs. Family history computer programs may choose to support this feature when it becomes available from FamilySearch.

Really. Cut and paste!

It is a big black hole waiting to consume your information and display it to you on its terms only. Its slow terms. You want to make a family pedigree website? Write a script to spit out all the place names of your ancestors so you can put blue dots on a map? Make a Google Maps mashup? Do any number of other useful things with a GEDCOM export, including actually being able to work with it at a reasonable speed, or putting it on your handheld for reference at the family history library? Print out reports? No way. Uh-uh. Remember how I avoided using the term “distributed genealogy”? It’s like having your genealogy in a distributed revision control system like Mercurial or Git, but you can only access the one single repository with a web interface. You can’t check out the code. You can’t work offline. You can’t use your own tools. You can’t write emergent scripts. You’re screwed.
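For a sense of how little it would take if an export existed, here is a rough sketch of the place-name script. It assumes an ordinary GEDCOM file in which places appear on `PLAC` lines of the form `level PLAC value`; the function name `place_names` is made up for illustration.

```python
def place_names(gedcom_lines):
    """Collect the distinct place names found on PLAC lines, sorted."""
    places = set()
    for line in gedcom_lines:
        # A GEDCOM line is "level tag value"; split off at most three fields.
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[1] == "PLAC":
            places.add(parts[2].strip())
    return sorted(places)
```

Feed it the lines of a file (for example `place_names(open("foo.ged", encoding="utf-8"))`, where `foo.ged` is a placeholder name) and you have the list to geocode for those blue dots.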

For understandable reasons, you can’t see information on living people, and they don’t show up as search results. You do get access to your own ancestors and descendants and your spouse, but apparently not your spouse’s family, your siblings, or any information on living people (like your parents’ birthdays, etc.). You can enter this information in, or upload it in a GEDCOM. But the first rule of genealogy is to start with your 4 generations. If everyone starts with their 4 generations, but most of those people are still alive, then how much effort is duplicated? How many duplicate versions of my dad will there be? Well let’s see, he has 11 siblings, various aunts and uncles who are into genealogy, 7 children (who should all see the same record, but might conceivably enter conflicting information). Not a huge problem, but an annoyance. Once you fill out the tree to the dead people (hint: upload a GEDCOM of what you already have here, but only those first couple generations), then you find and link the dead people into the tree, then you have a nice resource. So far, it’s just a research resource. I wouldn’t trust a lot of things further than I can throw them, but they make good research jumping-off points. Maybe eventually through the hard work of thousands it will converge to a respectable database, in the spirit of a wiki.

Also, it’s presently restricted to LDS members (you need your membership number and confirmation date to register). The best genealogists I know aren’t LDS. Certainly the bulk of decent genealogists I know aren’t LDS. Most of the lousy genealogists I know are LDS. (Of course, that doesn’t mean we have a monopoly on lousy genealogists, I just haven’t had reason or opportunity to mingle with lousy non-LDS genealogists much). So this seems like a drawback across the board.

Maybe down the road (I think it’s still beta, though they never use that word) it will allow GEDCOM export and be available to all genealogists. Maybe the speed issue will be addressed, or they’ll come up with a desktop client. Maybe this will be the rockingest genealogy database ever. Or maybe it will be of marginal interest—a great way to prepare names for the temple and avoid duplicate temple work, but not a good tool for daily genealogical work. Time will tell.

I am impressed by the no-information-loss implementation. I’d like to propose taking it a step further. What if we could publish genealogical repositories on websites like we do with Mercurial or Git? What if we had the genealogical equivalent of GitHub? What if you and all the other genealogists out there could match and merge and add information and correct information and unmerge faulty merges and… all without loss of information, with the ability to go back in time (like you can with a revision control system), etc.? A global genealogical database, a global record of genealogical discovery. Now, one huge database doesn’t make a lot of sense. It’d be a pain to push and pull. So you’d have to be able to push and pull only pieces of the tree. And of course the merging, confidence, dispute, etc. aspects would have to be dealt with well (as they mostly are in NFS, though there would be unique challenges for it in a truly distributed genealogical system). Just imagine the potential. And feel free to expound on your imaginings in the comments.
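As one small imagining of my own: the “no information loss” and “go back in time” parts could be sketched as an append-only log of merge and unmerge events that is replayed to get the state at any point. This is purely a toy under my own assumptions; `Event` and `active_merges` are hypothetical names, and a real system would also have to handle pushing and pulling pieces of the tree, which this ignores.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Event:
    kind: str   # "merge" or "unmerge"
    a: str      # hypothetical record identifiers
    b: str


def active_merges(log: List[Event],
                  upto: Optional[int] = None) -> Set[FrozenSet[str]]:
    """Replay the first `upto` events (all if None) and return merged pairs."""
    merged: Set[FrozenSet[str]] = set()
    for event in log[:upto]:
        pair = frozenset((event.a, event.b))
        if event.kind == "merge":
            merged.add(pair)
        elif event.kind == "unmerge":
            merged.discard(pair)  # the merge event itself stays in the log
    return merged
```

Undoing a faulty merge is just another event appended to the log, so the history of who merged what, and when it was undone, is never lost.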


Jul 30 2008

gedtag

Have you ever tried to import Aunt Millie’s n-thousand-person GEDCOM into your database? You either ended up with a reeking mess of a database, gave up and restored from a backup, or went insane trying to clean up the mess. Believe me, I know. And my family GEDCOMs are fairly well-behaved. But then there’s always Ancestral File or online-generated GEDCOMs.

This is no laughing matter. In fact, it has been the single most debilitating roadblock to me doing any real genealogy since I got the bug as a teenager.

I think I finally have a way to tame the beast. It’s not a magic bullet—there will still be a lot of mind-numbing match/merge. But it will maintain order and the integrity of the database.

First, start with a clean slate. If you have an existing database, export it to GEDCOM and make a new database. This step isn’t strictly necessary but keeps things ultra clean. If you’re afraid you’ll lose information in the export/import, you need a different genealogy program.

Now, sort your GEDCOMs to import by their importance and reliability. Your original database export probably comes near the top of this list, although not necessarily. Write this down. In fact, write everything down when doing anything in genealogy.

Now, take that first GEDCOM and run it through my magic filter. This will add REFN tags to your GEDCOM that look something like this: hans@fugal.net,2008-07-30:foo.ged/INDI/I1. This tag tells you the submitter’s email (or name), the date in the GEDCOM file (or today’s date), the name of the original GEDCOM file, and the identifying information for this particular record. In short, it keeps track of where that record came from. It will show up in PAF as the custom ID, and likely in other software in a similar manner.
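To give a feel for what such a filter does, here is a minimal sketch in Python. It is not the actual gedtag code, just an illustration of the tag format described above. It assumes record header lines look like `0 @I1@ INDI`, tags only INDI and FAM records (my assumption), and takes the submitter, date, and file name as arguments rather than reading them from the GEDCOM header. The name `add_refn_tags` is made up.

```python
import re

# Header line of an individual or family record, e.g. "0 @I1@ INDI".
RECORD_RE = re.compile(r"^0 @([^@]+)@ (INDI|FAM)\s*$")


def add_refn_tags(lines, submitter, date, filename):
    """Return the GEDCOM lines with a provenance REFN added to each record."""
    out = []
    for line in lines:
        line = line.rstrip("\r\n")
        out.append(line)
        match = RECORD_RE.match(line)
        if match:
            xref, tag = match.groups()
            # Format: submitter,date:file/RECORDTYPE/ID
            out.append(f"1 REFN {submitter},{date}:{filename}/{tag}/{xref}")
    return out
```

Calling `add_refn_tags(lines, "hans@fugal.net", "2008-07-30", "foo.ged")` on a file containing `0 @I1@ INDI` adds the line `1 REFN hans@fugal.net,2008-07-30:foo.ged/INDI/I1` right after that record header.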

Now import the GEDCOM. In PAF, there is an option on import to reuse RINs. Uncheck this option. The import screen tells you the highest RIN currently used. Take note of this RIN. Now every record in the import will have a RIN above this RIN. The RIN is easier to use in match and merge (it’s right there, you don’t have to dig for it), so the tags we added are for posterity’s sake.

Now, do the match and merge. Did you know that PAF has the ability to match and merge based on the _UID tags it spits out in GEDCOMs? That means if this GEDCOM and the GEDCOM(s) you’ve already imported have a common ancestor, the universally unique IDs will match, and you know without a doubt that someone thought they were the same person already. You can breeze through these merges with confidence that you won’t merge people you shouldn’t. Likewise there is an AFN match and merge, which is almost as trustworthy. (I’m a bit paranoid so I always double-check anything coming from Ancestral File. Maybe it’s because there are about 5 versions of me in Ancestral File, most of which can’t even spell my name right.) Finally, go through the other options (name, soundex) and do a thorough match/merge.
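If you want to see ahead of time which people two GEDCOMs have in common by `_UID`, the idea can be sketched outside of PAF too. This is not PAF's algorithm, only a rough illustration that assumes each individual carries a level-1 `_UID` line; `uid_index` and `uid_matches` are hypothetical names.

```python
def uid_index(lines):
    """Map each _UID value to the ID of the INDI record that carries it."""
    index = {}
    current = None  # ID of the INDI record we are inside, if any
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            is_indi = (len(parts) == 3 and parts[1].startswith("@")
                       and parts[2].strip() == "INDI")
            current = parts[1] if is_indi else None
        elif parts[0] == "1" and parts[1] == "_UID" and len(parts) == 3 and current:
            index[parts[2].strip()] = current
    return index


def uid_matches(a_lines, b_lines):
    """Pairs of record IDs (from file A, from file B) that share a _UID."""
    a, b = uid_index(a_lines), uid_index(b_lines)
    return sorted((a[uid], b[uid]) for uid in a.keys() & b.keys())
```

Each pair it returns is a candidate that someone already considered to be the same person, which is exactly the set you can breeze through.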

Now, go through all the remaining RINs greater than the RIN you noted earlier. These are the new people in your database. Get to know them. See where they sit in the pedigree. Read the notes. Make sure they meet your quality standards. Add sources if you know of them. Make notes of missing information, questionable stuff to research, etc. You should have a whole truckload of research tasks to do after this import—and some of them you should do before the next import (you’ll recognize these if you take the time to think of them and write them down). Actually you should do that with every person you merge in the previous step as well, since they will merge into lower RINs. Don’t hit that merge button until you’ve done the quality check!

After weeks, months, or years of doing this on Sunday afternoons, you will have a meticulous database that works for you. You will have laid a solid foundation which will empower your future research efforts. You will not be sorry.


Dec 3 2007

Another Fugal

Friday, November 30. It was a dark and stormy night. Erin began having regular contractions and about 5 hours later Lachlan Pádraig Fugal was born. He escaped just in time to see most of the last half hour of November.

He weighed 3.86kg and was 49.5cm long. For those of you stuck in the past, that’s 0.6 stone and 58 barleycorns.

We’re glad to have him with us and we’re happy to report that we’re all doing fine, even Jonathan.


Lachlan Pádraig Fugal

I suppose you’re dying to know more about the name. Lachlan is of Gaelic origin
and means “from the land of the lochs,” i.e. the Vikings. Pádraig is the old
Gaelic spelling of Patrick, and is pronounced something like PAH-drig but
leaning towards Patrick. You can just say Patrick, but don’t say “pa-DRAYG” or
we’ll laugh at you. Fugal of course comes from the Danish Fugl which means
bird. Erin wasn’t going for my “crazy” Scandinavian names, and I was not too
keen on Christopher or Christian, so we were at an impasse for a while. Somehow
Lachlan slipped through her “craziness” filters and we tried it out for a day.
We think it’s a delightful name; strong and cute and with lots of meaning and
family heritage, just like Jonathan Frederik Fugal, his older brother.