The Fugue Counterpoint by Hans Fugal


Calipers and Science

Just for kicks I dug up the original Jackson/Pollock paper for skinfold measurements for determining body fat percentage. Turns out there's also a 7-point equation that also takes circumference of waist and forearm into account.

Here's a snapshot of the equations for men from the paper ("Generalized equations for predicting body density of men" by A.S. Jackson and M.L. Pollock, 1978. I couldn't find the PDF for the women paper online).
Generalized body density equations

Important notes: skinfolds are in millimeters, circumferences are in meters, and log is the natural log (ln in most computer languages). I plugged my values from two weeks back into a spreadsheet and got the following results:

JP Equation Density %BF
Sum of seven skinfolds
S, S^2, age 1.0518 20.62%
S, S^2, age,C 1.0476 22.51%
log S, age 1.0506 21.15%
log S, age, C 1.0482 22.25%
Sum of three skinfolds
S, S^2, age (5) 1.0607 16.69%
S, S^2, age,C (6) 1.0549 19.24%
log S, age (7) 1.0578 17.95%
log S, age, C (8) 1.0574 18.14%

The most interesting thing here is that there's a large difference between 7 and 3 site measurements, and the 3 site range is significantly larger. Also very interesting to note is that the one-site (suprailiac) AccuMeasure chart is, for me, in line with the 7-site measurement (22.1%). Given other measurements I've taken and just general guesswork based on what I see in the mirror, I think that is a decent estimate.

It's also curious that there are two sets of equations given, one using logs and one using squares.

Moral of the story: more data is better, sometimes not-enough more data is worse than a simpler estimate, and interesting things can be learned when you go to the original source. (This is just a quick note, but the paper is very interesting and reading it will be an interesting exercise that sets proper expectations for, and understanding of, the JP7 skinfold method).


Observations in Data Models

Martin Fowler has an excellent article on contradictory observations and data models, which I think should be required reading for everyone who even thinks about writing genealogical software.

I had never thought about the specific examples that he brings up in the health care profession, though they make perfect sense. I have thought about these very issues quite a bit in the realm of genealogical data and it is my firm belief that software that doesn't allow for building up a "web of belief" from evidence (or observations as he calls them here)—including contradictory and rejected evidence—is fundamentally broken. That means almost every piece of genealogical software ever written. Certainly all of the commercial ones. Thankfully we're seeing some progress on this front. The site gets us part of the way there—you have separate observations that are merged and disputed and give a view of the data. Unfortunately there are still holes in the conclusions drawn from observations, especially contradictory ones. Hopefully this will get worked out, so that if I have solid evidence that rejects someone else's entry (which may have been based on no evidence at all, or weak evidence), the view should update to reflect that automatically. Likewise, if I have weak evidence that contradicts someone else's strong evidence, it should by no means change the view to my new data, but I should be able to record it for posterity (that rejection is important to record so that when someone else stumbles on the same weak evidence they can see that it was given full consideration). Also the merging stuff is both not transparent enough and too transparent—you have to really dig to figure out where each bit of data came from and yet every single alternate spelling or date is right there in your face whether the differences are important or not. But these issues are things that can incrementally improve. The important thing is that they're fundamentally on the right track.

I have a book on genealogical evidence (thanks Mom!) that I'm reading. When I finish it I plan to pontificate in depth about data models and genealogy, and maybe even put some code where my mouth is.


Sensible Graphs with Cacti

I love Cacti. It's an excellent tool for visualizing interesting statistics like bandwidth usage, CPU and load average, memory usage, etc. It's relatively straightforward to set up, if slightly klunky, and it takes a lot of guesswork out of questions that are otherwise difficult to answer. (I should note here that Cacti is a sort of front-end to RRDtool which does all the hard work as far as the visualization is concerned.)

But some of the default graphs that come with Cacti are absolute rubbish. I took it upon myself to fix the two worst offenders this week: the load average graph and the memory usage graph. Let's compare, shall we?

Here's the default load average graph:

default load average graph

This graph is just plain wrong. It stacks the load averages one on top of the other which makes it impossible to get a real reading for the 5 and 15 minute averages, and makes things look worse than they are. If that textual explanation went over your head, compare with this repaired load average graph and all will be made clear:

my load average graph

Wow, you can actually see how the averages are, well, averages. Funny thing about proper graphs.

This change is simple enough to do yourself so I won't provide a template download in the interest of expanding your mind (hopefully without exploding your skull). Right after I show you my pretty memory usage graph, that is.

First, let's see the default memory usage graph:

default memory usage graph

If you can tell what that graph is saying at a glance, you're better than I. This one doesn't so much lie as beat around the bush. The vital information is there, if you know how to read it. The key is that the stuff you see totals the RAM that is available for programs to consume (free+buffers+cache), so the smaller the area of the graph, the less memory you have available. It also doesn't show swap. Swap is available on another graph (also in terms of free swap not swap used), but on a separate graph you miss out on the relative comparison.

Here's the memory graph I came up with:

my memory usage graph

I think it is self-explanatory and that it has all the information you could ask of a memory usage graph presented in the clearest possible way. Maybe I'm a bit biased, but you have to admit it's better.

So how do we modify and create graphs in Cacti for fun and profit? Let's begin with the load average graph. No, scratch that. Let's begin with some terminology.

Cacti has graph templates that define what the graph will look like. We'll spend a lot of time creating and modifying those. It also has data templates for telling it how to get the data (e.g. the SNMP OID or the script to run). You use a data template to create a data source which actually fetches and stores that data, and you use a graph template to create a graph that is associated with a device (host) and its data sources. Data sources are usually created automatically when you create a graph. There's one more oddball thing called a CDEF which is basically a rudimentary RPN calculator that you have to define the expressions for ahead of time in the most excruciatingly painful way. But we'll need a couple for the memory usage graph.

SNMP stands for Simple Network Management Protocol, which naturally means that it's the antithesis of simple and that it is mostly used for monitoring instead of management (though you can indeed use it for management, which is way beyond the scope here). The short of it is, you have devices that talk SNMP and you can get info about interesting things that you'd like to graph with Cacti over the network. If you have a linux box, it can be made to talk SNMP by installing Net-SNMP and configuring it.

SNMP version 3 is a complicated mess to configure because you have to have a PhD in network security to understand its authentication schemes (in which case you might conclude that it's not secure enough). Versions 1 and 2c are both sufficient for my needs, and from our point of view they're essentially identical and simple enough to explain. I'll assume you use version 2c. There's a cleartext password for read-only access and optionally one for read-write access (for that management thing that we don't do). In order to keep things (anti)simple, they're not called passwords but rather "community strings". The default community strings for when you really can't be bothered to change them are "public" and "private", and most SNMP devices come with these defaults preset. What's that? You didn't realize you had several (dozens?) of devices on your network just waiting for some bored employee to start playing with its settings from the comfort of his workstation because you didn't change the default read-write community string? Well, you do.

Here's the snmpd config file I use, which I don't mind sharing because the only way you can get to it is over my LAN or my VPN, and it's read-only anyway and I have no secrets about my host stats.

rocommunity  yoursecrethere
syslocation  "Las Cruces"
sysservices 79

If you can't figure out how to tweak the configuration file included with your distro (which is no doubt hundreds of lines long with loads of comments), you can replace it with something like that and you'll be up and running with SNMP version 2c.

Ok, now you can install Cacti. Then create a device using the ucd/net SNMP device template for the host you want to monitor (you don't technically have to do that with localhost but you'd have to modify my graphs to use the non-SNMP data sources). When the device is created and it says it was able to connect to it ok, then you can create graphs for the device. Go ahead and create the "ucd/net - Load Average" graph. Then you'll no doubt dash over to the graphs "tab" and be totally dismayed that the graph seems broken. Fear not, it'll show up once it's had some time to gather data (check back in 5 minutes).

In the meantime we can go fix the load average graph template. Any changes we make will apply to the graph we just created as well as any new graphs we create with that template. Go to "Graph templates" on the left then find the graph of interest and click on its name. Take a moment familiarizing yourself with this page, then click on the 5 minute average item to edit it. Here you change the graph item type from STACK to LINE1. I also changed the color to 002ABF which shows up better. Do the same for the 15 minute average item (LINE1, I left the color alone). Now go refresh your graph and you'll see the changes. Et voilà, you are a Cacti graph template hacker. At this point you may feel the irresistable urge to change the colors of some of the more ugly but functional graphs, and I won't hinder you. I'll wait right here.

Ok, the memory usage graph is a bit more work. I won't take you through it step by step but I'll point out a couple of gotchas that I encountered when creating it. First, I realize that others have made memory usage graphs and provided them on forums and such to download. After the third one failed to work I decided it was better to just make my own. Hopefully mine will work for you—I put a bit of effort into making sure it would import cleanly.

There's actually a reason why the memory usage graphs are so backwards: because most devices provide total and free stats but not used stats. Obviously they expect you to calculate used yourself. So directly graphing the bits provided by SNMP was the easy way out.

We, on the other hand, have chosen the path of pain. We need to calculate memory used (which is total-(free+cache+buffers)). We could do this with a script but that's sticky and not very portable (depending on the target distro, version of Cacti, etc.). The better thing is to use a CDEF. If you click on graph management the CDEFs link is revealed. We want a CDEF that calculates (total-free-cache-buffers)*1024 (the sources are kilobytes). Now, a CDEF uses a positional reference system. The first data source used by your graph is a, the second is b, and so on. So the CDEF string will look something like d,a,-,b,-,c,-,1024,*. But here's where things get dodgy—it's hard to know what order the data sources will settle on until after you've created the graph. If you create the graph in the right order (no shuffling) and you realize that the AVERAGE and MAX consolidation functions create separate data source (but not LAST), and who-knows-what other pitfalls, then you can be confident ahead of time. Or, you can just create the template, create a graph using the template, and look at the graph debug output to figure out which source is which.

So now you create a new graph template, and referring to a template similar to what you want you fill in all the right fields, leave most at their defaults, add graph items, tweak and refresh a sample graph using your template a gazillion times, go back and forth with the CDEFs getting things right, then create new (temporary) graphs to make sure it works.

Luckily for you, if all you want is a cool memory graph, I did all this for you. Download and import my memory usage graph template, create a graph, and in a day or so you'll have a memory usage graph as pretty as mine. Oh, alright, I'll provide a load average template for you as well.


Gnuplot in Action

One of the oldest and most universally useful tools we have is gnuplot. It is also one of the least understood and most underutilized tools we have.

I can hear you now. "What do I need gnuplot for? I don't make graphs." Well that's exactly the problem. Everyone who works with data should be making graphs, and lots of them. Do you write programs that manipulate data? You need gnuplot. Do you want to evaluate performance or traffic on your website? You need gnuplot. Do you want to impress your friends with cool graphs of the growth rates of yeast and bacteria in sourdough or your weight loss and percent body fat? You need gnuplot.

I've been using gnuplot for years. I scraped up enough gnuplot skillz to make basic graphs and it has been invaluable. But I knew gnuplot could do more than I knew how to make it do, and whenever I tried to do something advanced it was only with great pain that I succeeded. Often I failed. Let's face it, gnuplot can be a bear to learn. Why? Well, mostly because of the documentation. Not that there isn't any, almost the contrary. There's a lot of documentation, but it's very much reference documentation. What the world has been lacking is a good introduction to gnuplot that isn't afraid to get nitty-gritty where it needs to, but doesn't just parrot the abundant but obscure documentation that's already out there.

We no longer need to wait. The book is called Gnuplot in Action by Philipp Janert, and it is an absolutely fantastic book. Really, I can't say enough good about it.

Janert walks the fine line between cheesy tutorial and dense reference with the skill of a circus acrobat. The writing is approachable, yet chock full of useful information. Nothing is rushed, but it doesn't plod. The text is sprinkled with beautiful graphs that expand your imagination and open your eyes to the possibilities of gnuplot.

In chapter 2, "Essential Gnuplot", the impatient reader is given a whirlwind tour of gnuplot basics. After just 11 pages you will know everything you need to know for 90% of the graphs you will ever need to create. In fact, you'll know more than I knew when I began reading it—I learned a couple things that I kick myself for never having discovered on my own.

Chapter 3 goes into more detail on dealing with data, and in that chapter I learned a ton. Several of the things I learned in this chapter have saved me numerous hours this semester alone. Chapter 4 picks up the remaining miscellany.

In part 2, all those nagging questions of polish are addressed. This is where I used to spend the most time banging my head against the wall, searching, plodding through various newsgroup threads. "How do I get this or that to look just right?" These types of questions are hard to find answers to in search engines. Janert takes us by the hand and explains each and every question I've ever had and a few I hadn't yet dared to have. Truly beautiful graphs are now within my grasp. What's more, it no longer seems like an exercise in pain but a simple recipe for success. After Janert explains these techniques they seem plain as the nose on your face, yet he's not condescending.

Part 3 dives into the deep dark secrets of gnuplot. 3D plots, color, multiplots, different coordinate systems, fitting, terminals, and a dozen other things you didn't even know that you didn't want to know. No doubt you'll skim this section the first time and come back to it when you need those dark magic tidbits.

Part 4 is arguably the most important part of this book, or perhaps second after part 1. Part 4 is a crash course on graphical analysis. What kinds of graphs you can create, when you should and shouldn't use them, how not to lie with graphs (and how to pick out people lying with graphs), and most importantly, how to go from raw data that you don't understand to organized data that you do understand and have pretty graphs to demonstrate to boot. All
with practical examples that you can tweak for your own use.

Finally, there's a gnuplot reference in the appendix. This is a deluxe package and has everything you need to become a gnuplot guru. I am thrilled that this book is coming to dispel the darkness surrounding gnuplot.

I really have no cons to speak of, other than the prerelease PDF I had access to had some minor problems—the sort of problem I would expect to be resolved in the final stages of editing. I don't have experience with other Manning books, but having seen prerelease versions of other books from other publishers I'd say the current copy is par for the course. I'm certain they'll
fix those things up and have an outstanding PDF in the end. I recommend springing for the dead tree version though, as I expect the reference at the end of the book and the examples throughout will be more accessible next to your computer instead of on the screen. (You already use quite a bit of real estate running gnuplot and/or editing a gnuplot file and displaying graphs.)