re collectl documentation: I'm surprised you found the documentation minimal. Were you looking at collectl.sourceforge.net/Documentation.html?
As for installation, quite honestly I haven't heard of any problems before, other than a couple of bugs in the INSTALL script. I guess I thought providing RPMs and a tarball was self-explanatory. When I first packaged collectl I simply followed convention and included a README and INSTALL in the tarball. Is there something else I could do to make this easier?
re usage: On that same documentation page is a tutorial whose first paragraph tells you to just run the command 'collectl' to get started, and then it walks through a number of examples. As with sar, there are a lot of different options, which require either experimenting or doing a lot of reading. Unlike sar, collectl provides a number of different output formats for the same data to increase flexibility, though that does come at a cost.
For example, to look at just CPU data you'd use the command
"collectl -sc" and get the following. Note that most of the output below is normally formatted as neat columns, but this forum seems to squash multiple spaces into a single one, so the results are harder to read:
#<--------CPU-------->
#cpu sys inter ctxsw
25 5 8745 42276
23 4 8651 45057
27 5 9189 46796
21 4 8264 44922
23 4 8662 46608
26 4 9670 48484
If you want timestamps, just add -oT and you get:
# <--------CPU-------->
#Time cpu sys inter ctxsw
13:01:50 25 5 8745 42276
13:02:00 23 4 8651 45057
13:02:10 27 5 9189 46796
13:02:20 21 4 8264 44922
Of course, if you want to see data on individual CPUs you use an uppercase C (this convention also applies to disks, network, NFS and a couple of other subsystems that have per-instance data), so doing "collectl -sC -oT" yields the following for the first sample:
# SINGLE CPU STATISTICS
# Cpu User Nice Sys Wait IRQ Soft Steal Idle
13:01:50 0 54 0 8 0 0 7 0 28
13:01:50 1 11 0 4 0 0 0 0 83
13:01:50 2 14 0 4 0 0 0 0 81
13:01:50 3 10 0 4 0 0 0 0 85
13:01:50 4 29 0 5 0 0 0 0 64
13:01:50 5 13 0 4 1 0 0 0 79
13:01:50 6 8 0 4 0 0 0 0 86
13:01:50 7 13 0 4 0 0 0 0 81
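If you'd rather post-process this than eyeball it, here's a rough sketch of reading the per-CPU lines after capturing "collectl -sC -oT" output to a text file. The helper name busy_by_cpu is hypothetical and not part of collectl; it assumes each data line is Time, Cpu, then the eight percentages in the header order shown above, and it treats "busy" as simply 100 minus Idle.

```python
# Hypothetical helper (not part of collectl): summarize per-CPU lines captured
# from "collectl -sC -oT". Each data line is assumed to be:
#   Time Cpu User Nice Sys Wait IRQ Soft Steal Idle

SAMPLE = """\
# SINGLE CPU STATISTICS
# Cpu User Nice Sys Wait IRQ Soft Steal Idle
13:01:50 0 54 0 8 0 0 7 0 28
13:01:50 1 11 0 4 0 0 0 0 83
13:01:50 2 14 0 4 0 0 0 0 81
13:01:50 3 10 0 4 0 0 0 0 85
13:01:50 4 29 0 5 0 0 0 0 64
13:01:50 5 13 0 4 1 0 0 0 79
13:01:50 6 8 0 4 0 0 0 0 86
13:01:50 7 13 0 4 0 0 0 0 81
"""


def busy_by_cpu(text):
    """Return {(time, cpu): busy%}, where busy% = 100 - Idle."""
    busy = {}
    for line in text.splitlines():
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # split() copes with squashed or aligned columns
        time, cpu = fields[0], int(fields[1])
        idle = int(fields[-1])  # Idle is the last column
        busy[(time, cpu)] = 100 - idle
    return busy


if __name__ == "__main__":
    for (time, cpu), pct in sorted(busy_by_cpu(SAMPLE).items()):
        print(f"{time} cpu{cpu}: {pct}% busy")
```

Because it splits on any whitespace, the same code works whether or not the columns got squashed the way they did in this post.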
But I know you were interested in load averages, and those require the verbose form of output, which I reserved for less commonly used data such as load averages. To see it, use the command "collectl -sc --verbose -oT" and you'd see:
# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User Nice Sys Wait IRQ Soft Steal Idle Intr Ctxsw Proc RunQ Run Avg1 Avg5 Avg15
13:01:50 19 0 4 0 0 1 0 74 8745 42K 3 436 2 0.24 0.05 0.02
13:02:00 18 0 3 0 0 1 0 76 8651 45K 1 439 2 14.27 3.17 1.04
13:02:10 21 0 3 0 0 1 0 72 9189 46K 5 439 2 31.63 7.18 2.36
13:02:20 17 0 3 0 0 1 0 78 8264 44K 0 439 0 41.26 9.95 3.31
13:02:30 18 0 3 0 0 1 0 76 8662 46K 1 439 4 35.07 9.65 3.28
13:02:40 21 0 3 0 0 1 0 73 9670 48K 1 442 1 56.79 15.21 5.16
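Since the load averages are always the last three columns of the verbose CPU summary, pulling them out of captured output is easy. This is only a sketch: load_averages and the THRESHOLD value are hypothetical names chosen for illustration, and it assumes the output was saved as plain text with '#' header lines as above.

```python
# Hypothetical helper (not part of collectl): extract 1/5/15-minute load
# averages from "collectl -sc --verbose -oT" output captured as text.

SAMPLE = """\
# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User Nice Sys Wait IRQ Soft Steal Idle Intr Ctxsw Proc RunQ Run Avg1 Avg5 Avg15
13:01:50 19 0 4 0 0 1 0 74 8745 42K 3 436 2 0.24 0.05 0.02
13:02:00 18 0 3 0 0 1 0 76 8651 45K 1 439 2 14.27 3.17 1.04
13:02:10 21 0 3 0 0 1 0 72 9189 46K 5 439 2 31.63 7.18 2.36
13:02:20 17 0 3 0 0 1 0 78 8264 44K 0 439 0 41.26 9.95 3.31
13:02:30 18 0 3 0 0 1 0 76 8662 46K 1 439 4 35.07 9.65 3.28
13:02:40 21 0 3 0 0 1 0 73 9670 48K 1 442 1 56.79 15.21 5.16
"""

THRESHOLD = 30.0  # hypothetical alert level for the 1-minute load average


def load_averages(text):
    """Return [(time, avg1, avg5, avg15), ...] for each data line."""
    samples = []
    for line in text.splitlines():
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # The three load averages are the last three columns.
        avg1, avg5, avg15 = (float(x) for x in fields[-3:])
        samples.append((fields[0], avg1, avg5, avg15))
    return samples


if __name__ == "__main__":
    for time, avg1, avg5, avg15 in load_averages(SAMPLE):
        flag = "  <-- high" if avg1 > THRESHOLD else ""
        print(f"{time} {avg1:6.2f} {avg5:6.2f} {avg15:6.2f}{flag}")
```

Note that columns such as Ctxsw are printed with a K suffix in this format, which is one reason to only convert the fields you actually need.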
Finally, "collectl -sc -P" provides the output in a format very easy to plot:
#Date Time [CPU]User% [CPU]Nice% [CPU]Sys% [CPU]Wait% [CPU]Irq% [CPU]Soft% [CPU]Steal% [CPU]Idle% [CPU]Totl% [CPU]Intrpt/sec [CPU]Ctx/sec [CPU]Proc/sec [CPU]ProcQue [CPU]ProcRun [CPU]L-Avg1 [CPU]L-Avg5 [CPU]L-Avg15
20090608 13:01:50 20 0 5 0 0 1 0 74 26 8745 42276 4 436 2 0.24 0.05 0.02
20090608 13:02:00 19 0 4 0 0 1 0 76 24 8651 45057 2 439 2 14.27 3.17 1.04
20090608 13:02:10 22 0 4 0 0 1 0 73 27 9189 46796 6 439 2 31.63 7.18 2.36
20090608 13:02:20 17 0 3 0 0 1 0 78 21 8264 44922 0 439 0 41.26 9.95 3.31
20090608 13:02:30 19 0 3 0 0 1 0 77 23 8662 46608 1 439 4 35.07 9.65 3.28
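The reason this format is easy to plot is that there's one header line naming every column and one whitespace-separated line per sample. As a rough sketch, here is how you might turn it into per-column series in Python. parse_plot_format is a hypothetical name, and it assumes, as in the sample above, that the column header is the last '#' line before the data; actually drawing the plot (gnuplot, a spreadsheet, matplotlib, etc.) is left out.

```python
# Hypothetical helper (not part of collectl): parse "collectl -sc -P" output
# into one dict per sample, keyed by the names in the '#' header line.

SAMPLE = """\
#Date Time [CPU]User% [CPU]Nice% [CPU]Sys% [CPU]Wait% [CPU]Irq% [CPU]Soft% [CPU]Steal% [CPU]Idle% [CPU]Totl% [CPU]Intrpt/sec [CPU]Ctx/sec [CPU]Proc/sec [CPU]ProcQue [CPU]ProcRun [CPU]L-Avg1 [CPU]L-Avg5 [CPU]L-Avg15
20090608 13:01:50 20 0 5 0 0 1 0 74 26 8745 42276 4 436 2 0.24 0.05 0.02
20090608 13:02:00 19 0 4 0 0 1 0 76 24 8651 45057 2 439 2 14.27 3.17 1.04
20090608 13:02:10 22 0 4 0 0 1 0 73 27 9189 46796 6 439 2 31.63 7.18 2.36
20090608 13:02:20 17 0 3 0 0 1 0 78 21 8264 44922 0 439 0 41.26 9.95 3.31
20090608 13:02:30 19 0 3 0 0 1 0 77 23 8662 46608 1 439 4 35.07 9.65 3.28
"""


def parse_plot_format(text):
    """Return a list of {column_name: value_string} dicts, one per sample."""
    columns = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Assumption: the last '#' line before the data is the column header.
            columns = line[1:].split()
            continue
        if columns is None:
            raise ValueError("data line found before a '#' header line")
        rows.append(dict(zip(columns, line.split())))
    return rows


if __name__ == "__main__":
    rows = parse_plot_format(SAMPLE)
    # x/y series ready to hand to whatever plotting tool you prefer.
    times = [r["Time"] for r in rows]
    total_cpu = [int(r["[CPU]Totl%"]) for r in rows]
    print(list(zip(times, total_cpu)))
```

Keeping the values as strings until you pick a column avoids guessing which fields are integers and which are floats.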
And the best part is you don't need to know all this before using collectl, though I'd certainly recommend trying a few interactive commands to see how it works. However, after you install it you can simply
start it as a daemon the standard way you'd start any daemon (except sar): "/etc/init.d/collectl start". It will start collecting a lot more than CPU data and writing it to the standard log directory, /var/log/collectl. You can then play back that data as often as you like with any combination of switches you want.
For whatever it's worth, some of the largest clusters in the world (I'm talking >2000 nodes or 16K CPUs) run collectl.
-mark