I love Cacti. It's an excellent tool for visualizing interesting statistics like bandwidth usage, CPU and load average, memory usage, etc. It's relatively straightforward to set up, if slightly klunky, and it takes a lot of guesswork out of questions that are otherwise difficult to answer. (I should note here that Cacti is a sort of front-end to RRDtool which does all the hard work as far as the visualization is concerned.)
But some of the default graphs that come with Cacti are absolute rubbish. I took it upon myself to fix the two worst offenders this week: the load average graph and the memory usage graph. Let's compare, shall we?
Here's the default load average graph:
This graph is just plain wrong. It stacks the load averages one on top of the other which makes it impossible to get a real reading for the 5 and 15 minute averages, and makes things look worse than they are. If that textual explanation went over your head, compare with this repaired load average graph and all will be made clear:
Wow, you can actually see how the averages are, well, averages. Funny thing about proper graphs.
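If the difference between stacking and overlaying is still fuzzy, here is a minimal Python sketch of the arithmetic. The function name and the sample load values are made up for illustration; the point is only that a stacked graph draws each series on top of the running total of the ones below it.

```python
def stacked_tops(values):
    """Return the height at which each series is drawn when stacked.

    Each value is added to the running total of everything beneath it,
    which is what a stacked graph does.
    """
    tops = []
    running = 0.0
    for value in values:
        running += value
        tops.append(running)
    return tops


# Hypothetical 1, 5 and 15 minute load averages.
loads = [1.0, 0.5, 0.25]

if __name__ == "__main__":
    print("drawn as lines:", loads)
    print("drawn stacked: ", stacked_tops(loads))
```

With these sample numbers the 15 minute average of 0.25 ends up drawn at 1.75 on the stacked graph, which is why the default looks worse than reality.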
This change is simple enough to do yourself so I won't provide a template download in the interest of expanding your mind (hopefully without exploding your skull). Right after I show you my pretty memory usage graph, that is.
First, let's see the default memory usage graph:
If you can tell what that graph is saying at a glance, you're better than I. This one doesn't so much lie as beat around the bush. The vital information is there, if you know how to read it. The key is that the areas you see add up to the RAM that is available for programs to consume (free+buffers+cache), so the smaller the area of the graph, the less memory you have available. It also doesn't show swap. Swap is available on another graph (also in terms of free swap, not swap used), but on a separate graph you miss out on the relative comparison.
Here's the memory graph I came up with:
I think it is self-explanatory and that it has all the information you could ask of a memory usage graph presented in the clearest possible way. Maybe I'm a bit biased, but you have to admit it's better.
So how do we modify and create graphs in Cacti for fun and profit? Let's begin with the load average graph. No, scratch that. Let's begin with some terminology.
Cacti has graph templates that define what the graph will look like. We'll spend a lot of time creating and modifying those. It also has data templates for telling it how to get the data (e.g. the SNMP OID or the script to run). You use a data template to create a data source which actually fetches and stores that data, and you use a graph template to create a graph that is associated with a device (host) and its data sources. Data sources are usually created automatically when you create a graph. There's one more oddball thing called a CDEF which is basically a rudimentary RPN calculator that you have to define the expressions for ahead of time in the most excruciatingly painful way. But we'll need a couple for the memory usage graph.
SNMP stands for Simple Network Management Protocol, which naturally means that it's the antithesis of simple and that it is mostly used for monitoring instead of management (though you can indeed use it for management, which is way beyond the scope here). The short of it is, you have devices that talk SNMP and you can get info about interesting things that you'd like to graph with Cacti over the network. If you have a Linux box, it can be made to talk SNMP by installing Net-SNMP and configuring it.
SNMP version 3 is a complicated mess to configure because you have to have a PhD in network security to understand its authentication schemes (in which case you might conclude that it's not secure enough). Versions 1 and 2c are both sufficient for my needs, and from our point of view they're essentially identical and simple enough to explain. I'll assume you use version 2c. There's a cleartext password for read-only access and optionally one for read-write access (for that management thing that we don't do). In order to keep things (anti)simple, they're not called passwords but rather "community strings". The default community strings for when you really can't be bothered to change them are "public" and "private", and most SNMP devices come with these defaults preset. What's that? You didn't realize you had several (dozens?) devices on your network just waiting for some bored employee to start playing with their settings from the comfort of his workstation because you didn't change the default read-write community string? Well, you do.
Here's the snmpd config file I use, which I don't mind sharing because the only way you can get to it is over my LAN or my VPN, and it's read-only anyway and I have no secrets about my host stats.
rocommunity yoursecrethere
syslocation "Las Cruces"
syscontact email@example.com
sysservices 79
If you can't figure out how to tweak the configuration file included with your distro (which is no doubt hundreds of lines long with loads of comments), you can replace it with something like that and you'll be up and running with SNMP version 2c.
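To check that the agent answers before involving Cacti, you can query it with Net-SNMP's snmpget. The Python sketch below only builds the command line rather than running it, so it works even where the tools aren't installed. The function name and dictionary keys are my own invention, and the OIDs are the UCD-SNMP-MIB load and memory objects as I know them; it's worth confirming them with snmptranslate on your own system.

```python
import shlex

# UCD-SNMP-MIB objects (numeric OIDs; verify against your MIB files).
UCD_OIDS = {
    "load_1min": ".1.3.6.1.4.1.2021.10.1.3.1",
    "load_5min": ".1.3.6.1.4.1.2021.10.1.3.2",
    "load_15min": ".1.3.6.1.4.1.2021.10.1.3.3",
    "mem_total_real": ".1.3.6.1.4.1.2021.4.5.0",
    "mem_avail_real": ".1.3.6.1.4.1.2021.4.6.0",
    "mem_buffer": ".1.3.6.1.4.1.2021.4.14.0",
    "mem_cached": ".1.3.6.1.4.1.2021.4.15.0",
}


def snmpget_command(host, community, names):
    """Build an snmpget argument list for SNMP version 2c.

    host and community are whatever you configured; names are keys
    of UCD_OIDS (a hypothetical naming scheme for this sketch).
    """
    oids = [UCD_OIDS[name] for name in names]
    return ["snmpget", "-v", "2c", "-c", community, host] + oids


if __name__ == "__main__":
    cmd = snmpget_command("localhost", "yoursecrethere",
                          ["load_1min", "mem_total_real"])
    # Print a copy-and-paste friendly shell command.
    print(shlex.join(cmd))
```

Paste the printed command into a shell on the Cacti host. If it returns values instead of a timeout, the community string and network path are fine.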
Ok, now you can install Cacti. Then create a device using the ucd/net SNMP device template for the host you want to monitor (you don't technically have to do that with localhost but you'd have to modify my graphs to use the non-SNMP data sources). When the device is created and it says it was able to connect to it ok, then you can create graphs for the device. Go ahead and create the "ucd/net - Load Average" graph. Then you'll no doubt dash over to the graphs "tab" and be totally dismayed that the graph seems broken. Fear not, it'll show up once it's had some time to gather data (check back in 5 minutes).
In the meantime we can go fix the load average graph template. Any changes we make will apply to the graph we just created as well as any new graphs we create with that template. Go to "Graph templates" on the left, then find the graph of interest and click on its name. Take a moment to familiarize yourself with this page, then click on the 5 minute average item to edit it. Here you change the graph item type from STACK to LINE1. I also changed the color to 002ABF, which shows up better. Do the same for the 15 minute average item (LINE1, I left the color alone). Now go refresh your graph and you'll see the changes. Et voilà, you are a Cacti graph template hacker. At this point you may feel the irresistible urge to change the colors of some of the more ugly but functional graphs, and I won't hinder you. I'll wait right here.
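Since Cacti hands the drawing off to RRDtool, the item type you pick ends up as an element in an rrdtool graph command. Here is a rough Python sketch of how the before and after might differ. It is not what Cacti actually generates: the .rrd path, the data source names, the assumption that the 1 minute item is an AREA, and every color except 002ABF are hypothetical.

```python
def load_graph_args(rrd_path, stacked):
    """Build rrdtool-graph style elements for a load average graph.

    stacked=True mimics the default template (items piled up),
    stacked=False mimics the repaired one (items drawn as lines).
    """
    # (variable, data source name, color, legend); names are hypothetical.
    items = [
        ("a", "load_1min", "EACC00", "1 Minute Average"),
        ("b", "load_5min", "002ABF", "5 Minute Average"),
        ("c", "load_15min", "FF0000", "15 Minute Average"),
    ]
    # One DEF per series: DEF:<vname>=<rrdfile>:<ds-name>:<CF>
    args = [f"DEF:{var}={rrd_path}:{ds}:AVERAGE" for var, ds, _, _ in items]
    for index, (var, _, color, legend) in enumerate(items):
        if index == 0:
            # Assumed: the first item is a filled area at the bottom.
            args.append(f"AREA:{var}#{color}:{legend}")
        elif stacked:
            # Stacked on top of whatever was drawn before it.
            args.append(f"AREA:{var}#{color}:{legend}:STACK")
        else:
            # Drawn at its own value as a thin line.
            args.append(f"LINE1:{var}#{color}:{legend}")
    return args


if __name__ == "__main__":
    for element in load_graph_args("load.rrd", stacked=False):
        print(element)
```

The only thing that changes between the two variants is the element type of the 5 and 15 minute items, which matches the two edits made in the template.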
Ok, the memory usage graph is a bit more work. I won't take you through it step by step but I'll point out a couple of gotchas that I encountered when creating it. First, I realize that others have made memory usage graphs and provided them on forums and such to download. After the third one failed to work I decided it was better to just make my own. Hopefully mine will work for you—I put a bit of effort into making sure it would import cleanly.
There's actually a reason why the memory usage graphs are so backwards: because most devices provide total and free stats but not used stats. Obviously they expect you to calculate used yourself. So directly graphing the bits provided by SNMP was the easy way out.
We, on the other hand, have chosen the path of pain. We need to calculate memory used (which is total-(free+cache+buffers)). We could do this with a script but that's sticky and not very portable (depending on the target distro, version of Cacti, etc.). The better thing is to use a CDEF. If you click on graph management the CDEFs link is revealed. We want a CDEF that calculates (total-free-cache-buffers)*1024 (the sources are in kilobytes). Now, a CDEF uses a positional reference system. The first data source used by your graph is a, the second is b, and so on. So the CDEF string will look something like
d,a,-,b,-,c,-,1024,*
But here's where things get dodgy: it's hard to know what order the data sources will settle on until after you've created the graph. If you create the graph in the right order (no shuffling) and you realize that the AVERAGE and MAX consolidation functions create separate data sources (but LAST does not), and who-knows-what other pitfalls, then you can be confident ahead of time. Or, you can just create the template, create a graph using the template, and look at the graph debug output to figure out which source is which.
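If RPN isn't second nature, a tiny evaluator makes the CDEF string easier to sanity-check before you type it into Cacti. This is a minimal Python sketch that handles only the four basic operators (real RRDtool CDEFs support many more). The mapping of a, b, c and d to free, buffers, cache and total is an assumption for the example, and the kilobyte values are invented.

```python
import operator

# Only the four arithmetic operators; real CDEFs have many more.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_cdef(expression, sources):
    """Evaluate a comma-separated RPN expression.

    sources maps the positional letters (a, b, c, ...) to values.
    """
    stack = []
    for token in expression.split(","):
        if token in OPERATORS:
            # The operand pushed last is the right-hand side.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in sources:
            stack.append(float(sources[token]))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expression)
    return stack[0]


# Assumed order and made-up kilobyte values.
sample = {"a": 1000, "b": 200, "c": 300, "d": 4000}  # free, buffers, cache, total

if __name__ == "__main__":
    used_bytes = eval_cdef("d,a,-,b,-,c,-,1024,*", sample)
    print(used_bytes)  # (4000 - 1000 - 200 - 300) * 1024 = 2560000.0
```

If the letters turn out to map differently once you check the graph debug output, only the expression string needs to change.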
So now you create a new graph template and, referring to a template similar to what you want, fill in all the right fields, leave most at their defaults, add graph items, tweak and refresh a sample graph using your template a gazillion times, go back and forth with the CDEFs getting things right, then create new (temporary) graphs to make sure it works.
Luckily for you, if all you want is a cool memory graph, I did all this for you. Download and import my memory usage graph template, create a graph, and in a day or so you'll have a memory usage graph as pretty as mine. Oh, alright, I'll provide a load average template for you as well.