dylan - if you don't mind, I'd really like to hear what you don't like about collectl, as I'm always trying to make it better, and I can't help but wonder whether the issues are with figuring out all the switches, with its output, or something else. Not only will collectl record load averages as often as you like, down to sub-second intervals, but you can also easily import the data into a spreadsheet or use collectl-utils to plot it.
If you like the top command, collectl even has a --top switch that will display a list of the top I/O users, if your kernel provides those stats.
If you want to collect load average history for your server, then I would recommend the 'sar' tool. If you want to collect traffic history, you can use 'vnstat'. Both are handy GPL tools for keeping good track of your server's load and traffic.
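For anyone who hasn't used it, a quick sketch of pulling load-average history out of sar once the sysstat package is installed and its cron job has been collecting data (file locations and defaults vary by distro, so treat the specifics as assumptions):

```shell
# 'sar -q' reports run-queue length and load averages (ldavg-1/-5/-15)
# from the data sysstat has already recorded.

# Load averages for today, from the current daily data file:
sar -q

# Restrict the report to a time window, e.g. 09:00 to 12:00:
sar -q -s 09:00:00 -e 12:00:00
```

The -s and -e switches take start and end times, which is handy when you're chasing a spike that happened at a known hour.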
re collectl documentation: I'm surprised you found the documentation minimal. Were you looking at collectl.sourceforge.net/Documentation.html?
As for installation, quite honestly I haven't heard of any problems before, other than a couple of bugs in the INSTALL script. I guess I thought providing RPMs and a tarball was self-explanatory. When I first packaged collectl I simply followed convention and included a README and INSTALL in the tarball. Is there something else I could do to make this easier?
re usage: On that same documentation page is a tutorial, whose first paragraph tells you to just run the command 'collectl' to get started and then goes into a number of examples. Like sar, there are a lot of different options which require either playing around or doing a lot of reading. Unlike sar, collectl provides a number of different output formats for the same data to increase flexibility, though that does come at a cost.
For example, to just look at CPU data you'd use the command "collectl -sc" and get the following. Note that the output is normally formatted into nice columns, but this forum seems to squash multiple spaces into one, which makes the results harder to read:
of course if you want to see data on individual CPUs you use an uppercase C (this convention also applies to disks, networks, NFS, and a couple of other subsystems that have per-instance data), so "collectl -sC -oT" yields the following for the first sample:
but I know you were interested in load averages, and that requires the verbose form of output, which I reserve for less commonly used data such as load averages. To see it you'd use the command "collectl -sc --verbose -oT" and would see:
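For reference, here are the three invocations from above side by side, with the switch meanings spelled out (assuming a stock collectl install):

```shell
# -s selects subsystems: lowercase 'c' = CPU summary, uppercase 'C' = per-CPU detail
# -oT prefixes each output line with a timestamp

collectl -sc                  # summary CPU stats
collectl -sC -oT              # per-CPU stats, timestamped
collectl -sc --verbose -oT    # verbose CPU output, which includes load averages
```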
And the best part is you don't need to know all this before using collectl, though I'd certainly recommend trying a few interactive commands first. However, after you install it you can simply
start it as a daemon the standard way you'd start any daemon (unlike sar): "/etc/init.d/collectl start". It will start collecting a lot more than CPU data and write it to the standard log directory, /var/log/collectl. You can then play back that data as often as you like with any combination of switches you want.
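To make the record-then-replay workflow concrete, here's a sketch; the exact filename pattern under /var/log/collectl depends on your configuration, so the wildcard below is an assumption:

```shell
# Start the daemon; it writes compressed raw files under /var/log/collectl
/etc/init.d/collectl start

# Later, play a recorded file back with whatever switches you want,
# e.g. timestamped CPU data from the collected logs:
collectl -p /var/log/collectl/*.raw.gz -sc -oT
```

The point is that -p replays the same raw data through the same formatting switches you'd use interactively, so you never have to decide up front which reports you'll want.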
For whatever it's worth, some of the largest clusters in the world (I'm talking >2000 nodes, or 16K CPUs) run collectl.
Actually I almost forgot - if you do choose to run sar, don't use the default monitoring interval of 10 minutes - it's pretty useless if you want any meaningful data, though I suspect there are many out there happily SARing away at that level. If I've learned nothing else from collectl, it's that data sampled coarser than about every 10 seconds misses too much key information, such as spikes that get averaged out and are therefore never seen.
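On RHEL-style systems the sar interval lives in a cron entry; a sketch of tightening it follows, though the sa1 path varies by distro (it may be /usr/lib/sa/sa1 or /usr/lib/sysstat/sa1), so check yours first:

```shell
# /etc/cron.d/sysstat -- typical default runs sa1 every 10 minutes:
#   */10 * * * * root /usr/lib64/sa/sa1 1 1
#
# cron can't fire more often than once a minute, but sa1 takes an
# interval and a count, so you can run it every minute and have it
# take six samples 10 seconds apart, covering the whole minute:
* * * * * root /usr/lib64/sa/sa1 10 6
```

That gets sar's sampling down to the 10-second neighborhood where spikes actually show up.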