Client interval question

list Scott Walters
Tue, 13 Dec 2005 12:24:54 -0500
Message-Id: <user-5bb518cf97b6@xymon.invalid>

On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:

I wanted to move from a 5 minute interval on all my clients to a 1 minute
interval.

In all my years of systems administration, things that run every minute, all the time, usually end up being a "Bad Idea".

How will a smaller sampling period improve the service you provide?

the script is "vmstat 300 2". So do I need to update that to reflect 1 minute
as well (i.e. vmstat 60 2)? Or is this by design? Are there others that might
need to change that I don't know about? Is the way I am going about this wrong?

That's an interesting question :-)

My job requires data to be useful, not just interesting. That is not to say there aren't jobs where interesting is good enough.

The graph DBs that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else is really just a waste of
resources.

With the stock larrd/hobbit RRD definitions you are correct. It will only use one of the five samples, and whine about the timestamps of the other four.
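
A toy model of that behaviour, assuming a 300-second step where only the first sample in each step is kept and the rest are rejected. This is a sketch of the effect described above, not the actual larrd/hobbit or RRDtool logic; the function name and sample timestamps are made up for illustration.

```python
STEP = 300  # seconds; the 5-minute step assumed by the stock RRD definitions


def accept_samples(timestamps, step=STEP):
    """Split timestamps into (accepted, rejected).

    Simplified assumption: the first sample that lands in each
    step-sized window is kept, later ones in the same window are dropped.
    """
    accepted, rejected = [], []
    last_window = None
    for ts in sorted(timestamps):
        window = ts // step
        if window != last_window:
            accepted.append(ts)
            last_window = window
        else:
            rejected.append(ts)
    return accepted, rejected


# Ten minutes of hypothetical 1-minute client reports (seconds since start)
one_minute_run = [0, 60, 120, 180, 240, 300, 360, 420, 480, 540]
accepted, rejected = accept_samples(one_minute_run)
print(len(accepted), "kept,", len(rejected), "dropped")
```

Under this model a 1-minute client does five times the work for the same stored data.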

(I do have a patch here from a user that would allow you to configure
the RRD files for different data-collection frequencies, but that has
not been merged yet - primarily due to me being overloaded).

The design goal of larrd (I can't speak for Henrik and hobbit/RRD) was capacity planning and trending. 5m samples are more than adequate for that activity.

IMO, sampling at a high frequency implies real-time performance analysis, and I've always felt that is outside the scope of capacity planning and trending. E.g., we don't run sendmail in debug all the time . . . .

All that being said, those long-term trends are very helpful for problem resolution. One can compare a single 5m sample against an aggregate of 5m samples and determine if things are 'normal'. But the art of judging whether all the activity within a single 5m sample is normal is very, very difficult.
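
A minimal sketch of comparing one 5m sample against an aggregate, assuming 'normal' means "within a few standard deviations of the historical mean". The threshold, the function name and the sample values are all illustrative assumptions, not anything larrd or hobbit does.

```python
from statistics import mean, stdev


def is_normal(sample, history, max_sigma=3.0):
    """Return True if sample is within max_sigma standard deviations
    of the mean of the historical 5-minute samples."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: only an identical value counts as normal
        return sample == mu
    return abs(sample - mu) <= max_sigma * sigma


# Hypothetical CPU-idle percentages from earlier 5-minute samples
history = [92.0, 90.5, 93.1, 91.2, 89.8, 92.4]
print(is_normal(91.0, history))  # close to the historical mean
print(is_normal(40.0, history))  # far outside the usual range
```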

So no - you shouldn't change that vmstat command. But it is bad design
on my part to assume that the client polling period would always be
5 minutes - it's perfectly valid to run the client checks differently.

That's my design you inherited, and because of the complexity of the parts, I think it is a very solid design. To become flexible enough to handle different sampling rates, the server would need to know the frequency of the tests. And then changing the RRD in the future is 'almost' impossible (very difficult at the least). And I've never seen what happens to 1.5 years of data when you start messing with the RRD.

In the end, I think you'd get the worst of both worlds.

I'll think about what's the most sensible solution. It probably would be
to only start the vmstat command if one isn't running; that does assume
that you will run the client scripts *at least once* every 5 minutes.
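
One way the "only start it if one isn't running" idea could look, as a rough sketch for a Unix host. The pid file path, the helper names and the injectable launcher are assumptions for illustration; this is not how the hobbit client actually does it, and a pid file can be fooled if the pid gets reused by another process.

```python
import os
import subprocess

PIDFILE = "/tmp/vmstat-collector.pid"  # hypothetical location


def pid_alive(pid):
    """Return True if a process with this pid exists (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def start_if_idle(cmd, pidfile=PIDFILE, launch=subprocess.Popen):
    """Start cmd unless the pid recorded in pidfile is still running.

    Returns the new process object, or None if a collector was
    already running and nothing was started.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
        if pid_alive(pid):
            return None
    except (FileNotFoundError, ValueError):
        pass  # no usable pid file: treat as not running
    proc = launch(cmd)
    with open(pidfile, "w") as f:
        f.write(str(proc.pid))
    return proc


# Usage from a client script, however often it runs:
#   start_if_idle(["vmstat", "300", "2"])
```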

I disagree. If real-time performance analysis is needed, I would pick other tools -- "vmstat 5" works for me ;) Or construct/fork a client agent specifically designed for such a task, and run it on an as-needed basis.

Then try to decide, for real-time perf analysis, whether the sampling rate should be 5s or 1m ;)

scott