Client interval question

17 messages in this thread

list Jeff Newman · Mon, 12 Dec 2005 13:12:20 -0600 ·

Hi,

I wanted to move from a 5 minute interval on all my clients to a 1 minute
interval.

I went to my first AIX host to test, and changed the interval in
/usr/local/hobbit/client/etc/clientlaunch.cfg to 1m (I assume this is
correct).

Well, sure enough, hobbit launches stuff every minute, but the problem is
that I see "vmstat 300 2" running a ton. So looking at
/usr/local/hobbit/client/bin/hobbitclient-aix.sh, I see that hardcoded into
the script is "vmstat 300 2". So do I need to update that to reflect 1 minute
as well (i.e. vmstat 60 2)?
Or is this by design? Are there others that might need to change that I
don't know about? Is the way I am going about this wrong?

Thanks,
Jeff

list Henrik Størner · Mon, 12 Dec 2005 22:56:36 +0100 ·

▸ quoted from Jeff Newman

On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:

I wanted to move from a 5 minute interval on all my clients to a 1 minute
interval.

I went to my first AIX host to test, and changed
/usr/local/hobbit/client/etc/clientlaunch.cfg
and changed the interval to 1m (I assume this is correct)

Yep.

▸ quoted from Jeff Newman

Well, sure enough, hobbit launches stuff every minute, but the problem is
that I see "vmstat 300 2" running a ton. So looking at 
/usr/local/hobbit/client/bin/hobbitclient-aix.sh, I see that hardcoded into 
the script is "vmstat 300 2" So do I need to update that to reflect 1 minute 
as well (i.e. vmstat 60 2)?
Or is this by design? Are there others that might need to change that I
don't know about? Is the way I am going about this wrong?

That's an interesting question :-)

The graph DB's that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else is really just a waste of
resources.

(I do have a patch here from a user that would allow you to configure
the RRD files for different data-collection frequencies, but that has
not been merged yet - primarily due to me being overloaded).

So no - you shouldn't change that vmstat command. But it is bad design
on my part to assume that the client polling period would always be 
5 minutes - it's perfectly valid to run the client checks differently.

I'll think about what's the most sensible solution. It probably would be
to only start the vmstat command if one isn't running; that does assume 
that you will run the client scripts *at least once* every 5 minutes.
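
A minimal sketch of that idea, assuming a pid file is an acceptable way to remember the running sampler. The function name start_if_idle and the file paths are hypothetical, not part of the Hobbit client scripts:

```shell
#!/bin/sh
# Start a command in the background unless the one recorded in the
# pid file is still alive. All names are illustrative assumptions.
start_if_idle() {
    pidfile=$1
    outfile=$2
    shift 2
    # kill -0 sends no signal; it only checks that the pid exists.
    if [ -s "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 1    # previous sampler still running, do nothing
    fi
    "$@" > "$outfile" 2>&1 &
    echo $! > "$pidfile"
    return 0
}
```

Called once a minute as `start_if_idle /tmp/vmstat.pid /tmp/vmstat.out vmstat 300 2`, this would launch a new vmstat only after the previous one has exited. A pid file can go stale if the pid is reused by another process, so treat this as an illustration only.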


Henrik

list Scott Walters · Tue, 13 Dec 2005 12:24:54 -0500 ·

▸ quoted from Jeff Newman

On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:

I wanted to move from a 5 minute interval on all my clients to a 1  minute
interval.

In all my years of Systems Administration, things that run every  minute all the time usually end up being a "Bad Idea".

How will a smaller sampling period improve the service you provide?

▸ quoted from Henrik Størner

the script is "vmstat 300 2" So do I need to update that to  reflect 1 minute
as well (i.e. vmstat 60 2)?
Or is this by design? Are there others that might need to change  that I
don't know about? Is the way I am going about this wrong?

That's an interesting question :-)

My job requires data be useful, not just interesting. That is not to say there aren't jobs where useful is good enough.

▸ quoted from Henrik Størner

The graph DB's that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else really just a waste of
ressources.

With the stock larrd/hobbit RRD definitions you are correct.  He'll  only use one of the five, and whine about the timestamp of the other  four.

▸ quoted from Henrik Størner

(I do have a patch here from a user that would allow you to configure
the RRD files for different data-collection frequencies, but that has
not been merged yet - primarily due to me being overloaded).

The design goal of larrd, (I can't speak for Henrik and hobbit/RRD)  was capacity planning and trending.  5m samples are  more than  adequate for that activity.

IMO, sampling at a high frequency implies real-time performance  analysis, and I've always felt that outside the scope of capacity  planning and trending.  EG. We don't run sendmail in debug all the  time . . . .

All that being said, those long term trends are very helpful for  problem resolution.  One can compare a single 5m sample against an  aggregate of 5m samples and determine if things are 'normal'.  But  the art of comparing all the activity within a single 5m sample for  normal is very very difficult.

▸ quoted from Henrik Størner

So no - you shouldn't change that vmstat command. But it is bad design
on my part to assume that the client polling period would always be
5 minutes - it's perfectly valid to run the client checks differently.

That's my design you inherited and because of the complexity of the  parts, I think it is a very solid design.  To become flexible enough  to handle different sampling rates, the server would need to know the  frequency of the tests.  And then changing the RRD in the future is  'almost' impossible (very difficult at the least).  And I've never  seen what happens to 1.5 years of data when you start messing with  the RRD.

In the end, I think you'd get the worst of both worlds.

▸ quoted from Henrik Størner

I'll think about what's the most sensible solution. It probably  would be
to only start the vmstat command if one isn't running; that does  assume
that you will run the client scripts *at least once* every 5 minutes.

I disagree.  If real-time performance analysis is needed, I would  pick other tools --  "vmstat 5"  works for me;)  Or construct/fork  the client agent specifically designed for such a task, and run it on  an as-needed basis.

Then try and decide for real time perf analysis if the sampling rate  should be 5s or 1m ;)

scott

list Tracy J. di Marco White · Tue, 13 Dec 2005 13:47:48 CST ·

▸ quoted from Scott Walters

In message <user-5bb518cf97b6@xymon.invalid>, Scott Walters writes:
}
}> On Mon, Dec 12, 2005 at 01:12:20PM -0600, Jeff Newman wrote:
}>>
}>> I wanted to move from a 5 minute interval on all my clients to a 1  
}>> minute
}>> interval.
}>>
}
}In all my years of Systems Administration, things that run every  
}minute all the time usually end up being a "Bad Idea".
}
}How will a smaller sampling period improve the service you provide?


We run pretty much all of our big brother tests every minute.  On
our new hobbit servers, we're running them at the default intervals.

BB shows us that our primary name server is going out for less than
a minute, about every 62 minutes.  Hobbit is missing most of those
outages, although the longer "xxxx events received in the last xxx
minutes" is what helped us spot the problem, as a whole bunch of
machines' services don't respond well when our primary name server
is out, and having a mass of servers go yellow then green, in
unison, is sort of eye catching.

Tracy J. Di Marco White
Information Technology Services
Iowa State University

list Scott Walters · Tue, 13 Dec 2005 15:08:03 -0500 ·

▸ quoted from Tracy J. di Marco White

We run pretty much all of our big brother tests every minute.  On
our new hobbit servers, we're running them at the default intervals.

BB shows us that our primary name server is going out for less than
a minute, about every 62 minutes.
Hobbit is missing most of those
outages, although the longer "xxxx events received in the last xxx
minutes" is what helped us spot the problem, as a whole bunch of
machines' services don't respond well when our primary name server
is out, and having a mass of servers go yellow then green, in
unison, is sort of eye catching.

So hobbit with the xxx events (running every 5m) did provide enough  information to indicate an intermittent problem with DNS?

Things running every 5m will collide with a problem that happens for  a minute frequently enough to 'show up on the radar'
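
To put rough numbers on this, here is a small sketch that counts how many short outages a fixed polling interval would land on. It assumes each outage lasts exactly 60 seconds, recurs every 62 minutes, and that polls fall on exact multiples of the polling interval. Those figures are illustrative assumptions, not measurements from this thread:

```shell
#!/bin/sh
# count_detected PERIOD DURATION POLL OUTAGES
# Outage k starts at k*PERIOD seconds and lasts DURATION seconds.
# Polls happen at 0, POLL, 2*POLL, ... seconds.
# Prints how many of the first OUTAGES outages contain a poll.
count_detected() {
    period=$1
    duration=$2
    poll=$3
    outages=$4
    detected=0
    k=0
    while [ "$k" -lt "$outages" ]; do
        start=$(( k * period ))
        # First poll at or after the start of this outage.
        next_poll=$(( (start + poll - 1) / poll * poll ))
        if [ "$next_poll" -lt $(( start + duration )) ]; then
            detected=$(( detected + 1 ))
        fi
        k=$(( k + 1 ))
    done
    echo "$detected"
}
```

With these assumptions, `count_detected 3720 60 300 100` prints 20 and `count_detected 3720 60 60 100` prints 100: a 5-minute poll still hits the problem regularly, but only about one outage in five.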

But every site has different requirements.  It's just been my  experience that sampling more frequently than 5m hits the knee-bend  of diminishing returns.  It also increases the potential for state  changes, which chews up the filesystem with the history info.

ymmv

scott

list Jeff Newman · Wed, 14 Dec 2005 17:31:28 -0600 ·

▸ quoted from Scott Walters

On 12/13/05, Scott Walters <user-2c405ccfe1ee@xymon.invalid> wrote:

In all my years of Systems Administration, things that run every
minute all the time usually end up being a "Bad Idea".

How will a smaller sampling period improve the service you provide?


It can be a bad idea sometimes, others not (for example, the reply from
the person catching intermittent problems with BB running every minute).

A smaller sampling period can show things at a finer granularity. For
example, a process kicks off and 5 minutes later you see 100 errors (I'm
keeping things generic for illustrative purposes). Were those 100 errors in
the first minute? The last? Spread constantly throughout the 5 minutes?

I'm not saying you're wrong, simply pointing out that it's not as black and
white as you're making it.

▸ quoted from Scott Walters

My job requires data be useful, not just interesting.  That is not to
say there aren't jobs where useful is good enough.


Something being just interesting initially can sometimes uncover problems
that you didn't see before.

▸ quoted from Scott Walters

The graph DB's that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else is really just a waste of
resources.

With the stock larrd/hobbit RRD definitions you are correct.  He'll
only use one of the five, and whine about the timestamp of the other
four.


Firstly, can you explain your comment in more detail? Secondly,
I'm confused as to why you would state that I would "whine" about anything
when you have no basis for a conclusion to that effect. It seems to be a
rather pointed comment in a discussion that hasn't involved the use of
language that would dictate a response like that.

▸ quoted from Scott Walters

The design goal of larrd, (I can't speak for Henrik and hobbit/RRD)
was capacity planning and trending.  5m samples are more than
adequate for that activity.

IMO, sampling at a high frequency implies real-time performance
analysis, and I've always felt that outside the scope of capacity
planning and trending.  EG. We don't run sendmail in debug all the
time . . . .

All that being said, those long term trends are very helpful for
problem resolution.  One can compare a single 5m sample against an
aggregate of 5m samples and determine if things are 'normal'.  But
the art of comparing all the activity within a single 5m sample for
normal is very very difficult.


That is a very good point you make. There is a difference between
real-time analysis and capacity planning/trending. However, I don't think
that it is that far outside of hobbit's scope to try and leverage it for
a more pointed analysis. My goal isn't to take every machine in my
environment and make them into 1 minute sampling period machines. Having
the ability to do so on a machine-by-machine basis could be useful.

▸ quoted from Scott Walters

That's my design you inherited and because of the complexity of the
parts, I think it is a very solid design.


I don't think anyone is really questioning that.

▸ quoted from Scott Walters

To become flexible enough
to handle different sampling rates, the server would need to know the
frequency of the tests.  And then changing the RRD in the future is
'almost' impossible (very difficult at the least).  And I've never
seen what happens to 1.5 years of data when you start messing with
the RRD.

In the end, I think you'd get the worst of both worlds.


Honestly, I don't claim to know anything about the way larrd and hobbit
are coded in the slightest. There are difficulties to be sure, but part of
having a community such as this is to foster ideas and innovation. Just
because you don't think it's useful or that it's hard doesn't mean the same
is true for everyone out there. What if you could add a high-frequency tag
to a server and it generated a separate high-frequency graph for that, as
well as updating the normal trend graph for whatever resource you wanted?
That way you could choose to look at a graph for resource x every minute
for a day, then turn it off. There are lots of ideas and I don't know if
mine would even work, but you shouldn't just kill the idea.

▸ quoted from Scott Walters

I disagree.  If real-time performance analysis is needed, I would
pick other tools --  "vmstat 5"  works for me;)  Or construct/fork
the client agent specifically designed for such a task, and run it on
an as-needed basis.


There are other tools, yes. I am trying to leverage hobbit. If it's not
possible and nobody wants to do it, then yes, I'll look into other tools.
By the same token, I don't want to kill my performance by running lots of
different monitoring on the server. Hobbit is extraordinarily lightweight
on the client (as opposed to other solutions out there), so I think
something like this is possible without overloading a client.

Just my two cents.

list Adam Goryachev · Thu, 15 Dec 2005 13:50:49 +1100 ·

▸ quoted from Jeff Newman

On Wed, 2005-12-14 at 17:31 -0600, Jeff Newman wrote:


On 12/13/05, Scott Walters <user-2c405ccfe1ee@xymon.invalid> wrote:

The graph DB's that vmstat feeds data into (the RRD files) are
constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else really just a waste of
ressources.

With the stock larrd/hobbit RRD definitions you are correct.  He'll
only use one of the five, and whine about the timestamp of the other
four.

 
Firstly, can you explain your comment in more detail? Secondly,
I'm confused as to why you would state that I would "whine" about anything
when you have no basis for a conclusion to that effect. It seems to be a
rather pointed comment in a discussion that hasn't involved the use of
language that would dictate a response like that.

I think he was referring to the error message that would be generated
because of the extra data compared to the interval configured in the rrd
files.

▸ quoted from Jeff Newman

To become flexible enough
to handle different sampling rates, the server would need to know the
frequency of the tests.  And then changing the RRD in the future is
'almost' impossible (very difficult at the least).  And I've never
seen what happens to 1.5 years of data when you start messing with
the RRD.

In the end, I think you'd get the worst of both worlds.

Well, we could take this the other way, and say that we only wanted to
run our tests once every 10 minutes, because it was causing too much
overhead to run the tests every 5 minutes. How would we deal with that?

I think there are benefits. On the client side, the client should pass
the tests the frequency at which it is calling them, so that, for
example, the vmstat test can adjust how long it runs: 600 seconds,
60 seconds, or whatever else is needed.
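
A hedged sketch of what that could look like in a client script. It assumes the interval arrives as a string such as 1m, 5m or 300, with m, h and d suffixes and plain seconds otherwise. The function names are hypothetical and nothing here is taken from the actual hobbitclient scripts:

```shell
#!/bin/sh
# Convert an interval such as "5m", "1h", "2d" or "300" to seconds.
# The accepted suffixes are an assumption made for this illustration.
interval_to_seconds() {
    case $1 in
        *d) echo $(( ${1%d} * 86400 )) ;;
        *h) echo $(( ${1%h} * 3600 )) ;;
        *m) echo $(( ${1%m} * 60 )) ;;
        *)  echo $(( $1 )) ;;
    esac
}

# Print the vmstat invocation matching a given client interval:
# one sample spanning the whole interval, as "vmstat 300 2" does for 5m.
vmstat_command() {
    echo "vmstat $(interval_to_seconds "$1") 2"
}
```

For example, `vmstat_command 1m` prints "vmstat 60 2" and `vmstat_command 10m` prints "vmstat 600 2". The server side would still need to cope with the changed sampling rate, which is the harder part discussed in this thread.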

Further to that, there is some additional work on the server (which I feel
is where all the real work is; the client side stuff would be quite simple,
or so it sounds). On the server, we either need to re-create/adjust the rrd
file so that we can insert data more frequently, or else we need to somehow
summarise the data before insertion (which means there was no benefit in
collecting the data more frequently in the first place).

So, the question becomes: how difficult is it to convert an rrd file,
which was initially created to store data-points every 300 seconds, such
that we can now store data-points every X seconds?

The second part to this question is how hobbit knows how frequently
you want to send your reports. It can't be based on 'however often
they are received', because that value would change very frequently:
the reports are done every 300 seconds, but by the time a report is
submitted/processed by hobbit, it might be 1 or 2 seconds late/early
compared to last time. Could the hobbit server 'learn' the frequency
from the client (which is where this is configured anyway), because the
client would report that value to the server as a part of the vmstat
output?

Of course, even once both of those questions are satisfactorily
answered, you (yes, you) need to convince somebody to take the effort to
actually do it. Simply seeing the methods, and the interest some people
have taken in the possibility might be enough for someone with the
coding skills to do it, or, sometimes you might need to provide other
incentives (even paid).  Let me state clearly right now before I go on,
no, don't pay me, I don't have that level of coding skills in C to do it
for you, nor the knowledge of RRD. No, I don't know who you might pay,
or anything else, I have no interest in any of it, I'm just stating a
simple fact of life (you want something done, you can't do it, so find
someone who can do it, and motivate them to the point where they will do
it whether they want to or not)...

▸ quoted from Jeff Newman

I disagree.  If real-time performance analysis is needed, I would
pick other tools --  "vmstat 5"  works for me;)  Or construct/fork
the client agent specifically designed for such a task, and run it on
an as-needed basis.

 
There are other tools, yes. I am trying to leverage hobbit. If it's not
possible and nobody wants to do it, then yes, I'll look into other tools.
By the same token, I don't want to kill my performance by running lots of
different monitoring on the server. Hobbit is extraordinarily lightweight
on the client (as opposed to other solutions out there), so I think
something like this is possible without overloading a client.

Agreed, it would be nice to not have to run hobbit plus something else
when they are both collecting the same data (just a different
frequency).

Just my two cents.

Here's a couple of mine also :)

Regards,
Adam

list Tracy J. di Marco White · Thu, 15 Dec 2005 00:09:39 CST ·

▸ quoted from Scott Walters

In message <user-68f464cf21f6@xymon.invalid>, Scott Walters writes:
}>
}> We run pretty much all of our big brother tests every minute.  On
}> our new hobbit servers, we're running them at the default intervals.
}>
}> BB shows us that our primary name server is going out for less than
}> a minute, about every 62 minutes.
}> Hobbit is missing most of those
}> outages, although the longer "xxxx events received in the last xxx
}> minutes" is what helped us spot the problem, as a whole bunch of
}> machines' services don't respond well when our primary name server
}> is out, and having a mass of servers go yellow then green, in
}> unison, is sort of eye catching.
}
}So hobbit with the xxx events (running every 5m) did provide enough  
}information to indicate an intermittent problem with DNS?


Hobbit's non-green page, with last xxx events, gave us a large
enough view that we could see all the machine services going yellow
at the same time dns went red.  We're monitoring a bit over 260
machines with a whole lot of different services, so there's often
something going red or yellow.  With BB's older default of the last
25 events, there wasn't ever that much on screen to notice a group
of swings to yellow, then back to green.

▸ quoted from Scott Walters


}Things running every 5m will collide with a problem that happens for  
}a minute frequently enough to 'show up on the radar'


Sure, but we'd see up to 13 hours between dns 'red', when BB would
get several in that period.

I haven't changed hobbit yet to 1 minute checks.  I've even made
an explicit explanation that I wasn't planning to shorten it to
1 minute checks when we officially switched over, and that was
agreed to.  However, with the fact that the 1 minute checks did
actually make a difference in tracking down and solving the problem
with DNS, I may yet have to work on that change.  We'll see what
kind of feedback I get after today.  Even then, the only thing I'd
really be willing to shorten to that frequency of checks are the
remote checks, over the network.

▸ quoted from Scott Walters


}But every site has different requirements.  It's just been my  
}experience that sampling more frequently than 5m hits the knee-bend  
}of diminishing returns.  It also increases the potential for state  
}changes, which chews up the filesystem with the history info.


I thought it was unnecessary when I originally brought BB into
production years ago, but it was one of the requirements I ended
up with to sell switching to BB.  Some things can't be checked
every minute, I have raid checks that can take more than a
minute to run.

Tracy J. Di Marco White
Information Technology Services
Iowa State University

list Scott Walters · Thu, 15 Dec 2005 03:16:22 -0500 ·

First off, I know I can come off terse in e-mail, but they are not  personal attacks.

▸ quoted from Jeff Newman

It can be a bad idea sometimes, others not (for example, the reply  from
the person catching intermittant problems with BB running every  minute)

Who ended up stating  the anomaly *was* detected in 5m intervals, but  only once every 13h instead of every hour.   But I still don't  understand how it will help *you*.

A smaller sampling period can show things in a more granular  aspect. For example, a process kicks off and 5 minutes later you  see 100 errors (im keeping things generic for illustrative  purposes) Were those 100 errors in the first minute? the last?  constantly throughout the 5 minutes?

The 5m averages over a week would be quite low compared to a single 5m plot. From that, one could extrapolate that in the last 5m things have not been 'normal'.

Im not saying your wrong, simply pointing out that it's not as  black and white as your making it.

And I am disagreeing with you ;)  I've been watching the data in  these graphs for many many years now, and I have yet to come across a  situation where having a 1m sampling/graphing period would have  helped me fix/improve something . . .

It's like a story problem with too much information: it makes coming up with the real answer harder in the end. Most people don't have the time/energy/brains to be able to sift all the data correctly. And if they do, the 5m samples are good enough.

Most people (including really smart people that are forgetful) can't  deal with an auto-scaling y-axis.

▸ quoted from Jeff Newman

Something being just interesting initially can sometimes uncover  problems that
you didn't see before.

Like I said, if you have a job where interesting is worthwhile, wonderful. In my experience, most folks that are running the BB/hobbit tools are involved in the operational aspects of infrastructure, not R&D.

▸ quoted from Adam Goryachev

With the stock larrd/hobbit RRD definitions you are correct.  He'll
only use one of the five, and whine about the timestamp of the other
four.

Firstly, can you explain your comment in more detail?

RRD interpolates Time Series Data to put a value at a fixed interval. That is why you hardly ever see integers in the data. If your sample comes in at 299s, RRD interpolates that value to what it would have been at 300s. How this is done can be tuned. The default settings with the RRAs expect data to happen every 300s. RRD will only insert data one time within that interval.
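
The effect on 1-minute samples can be illustrated with plain arithmetic. This sketch assumes that step boundaries fall on multiples of the step counted from the Unix epoch, which is my understanding of how rrdtool aligns its intervals. step_start and show_steps are hypothetical helpers, not rrdtool commands:

```shell
#!/bin/sh
# step_start TIMESTAMP STEP
# Print the start of the STEP-second interval that TIMESTAMP falls in,
# assuming intervals begin at multiples of STEP.
step_start() {
    echo $(( $1 - $1 % $2 ))
}

# show_steps STEP TIMESTAMP...
# Print each timestamp next to the start of its interval.
show_steps() {
    step=$1
    shift
    for ts in "$@"; do
        echo "$ts -> $(step_start "$ts" "$step")"
    done
}
```

Running `show_steps 300 900 960 1020 1080 1140 1200` maps 900, 960, 1020, 1080 and 1140 all to 900, and 1200 to 1200: five 1-minute samples share a single 300-second interval.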

▸ quoted from Adam Goryachev

Secondly,
im confused as to why you would state that I would "whine" about  anything
when you have no basis for a conclusion to that effect. It seems to  be a rather
pointed comment in a discussion that hasn't involved the use of  language that
would dictate a response like that.

"He'll whine" meant rrdtool, not you:

ERROR: illegal attempt to update using time 1042731000 when last update time is 1043099100 (minimum one second step)

That's whining in my book.  Sorry you thought I was speaking about you.

▸ quoted from Jeff Newman

That is a very good point you make. There is a difference between
real-time analysis and capacity planning/trending. I don't however  think
that it is that far outside of hobbit's scope to try and leverage  it for
a more pointed analysis.

From a software development standpoint there is a lot to be said for: "Do one thing and do it well". If architecting the RRD framework for RTA breaks trending, bad idea.

▸ quoted from Jeff Newman

My goal isn't to take every machine in my environment
and make them into 1 minute sampling period machines. To have the  ability to do
so on a machine-by-machine basis could be useful

Which is why I proposed another client collector for this activity.

▸ quoted from Jeff Newman

That's my design you inherited and because of the complexity of the
parts, I think it is a very solid design.

I don't think anyone is really questioning that.

You are questioning that.  And that is fine.  I don't take it  personally you think there may be a better way.  I know my way may  not be the best, but I sure know exactly *why* I chose it.

▸ quoted from Jeff Newman

Honestly, I don't claim to know anything about the way larrd and  hobbit
are coded in the slightest. There are difficulties to be sure, but  part of having a
community such as this is to foster ideas and innovation. Just  because you
don't think it's useful or that it's hard doesn't mean the same is  true for everyone out
there.

Ahhhhh, to the heart of the matter.   Don't suggest ideas in a public  forum if you are not prepared to defend them.  Fostering ideas comes  from intelligent discussions.  I merely wanted to understand why you  felt you needed a higher sampling rate from a business perspective.


scott

list Scott Walters · Thu, 15 Dec 2005 03:23:40 -0500 ·

▸ quoted from Tracy J. di Marco White

Sure, but we'd see up to 13 hours between dns 'red', when BB would
get several in that period.

I haven't changed hobbit yet to 1 minute checks.  I've even made
an explicit explanation that I wasn't planning to shorten it to
1 minute checks when we officially switched over, and that was
agreed to.  However, with the fact that the 1 minute checks did
actually make a difference in tracking down and solving the problem
with DNS, I may yet have to work on that change.

It sounds like your shop is so tidy that you would have found it and  
fixed it anyway.  It was just a little brighter with the shorter  
interval.

▸ quoted from Tracy J. di Marco White

We'll see what
kind of feedback I get after today.  Even then, the only thing I'd
really be willing to shorten to that frequency of checks are the
remote checks, over the network.

I believe hobbit has great 're-test' logic.  So if it is down, it  
will test more frequently . . . .

▸ quoted from Tracy J. di Marco White

I thought it was unnecessary when I originally brought BB into
production years ago, but it was one of the requirements I ended
up with to sell switching to BB.  Some things can't be checked
every minute, I have raid checks that can take more than a
minute to run.

On the client, 1m samples are opening Pandora's Box . . . .

scott

list Scott Walters · Thu, 15 Dec 2005 03:41:49 -0500 ·

▸ quoted from Adam Goryachev

Well, we could take this the other way, and say that we only wanted to
run our tests once every 10 minutes, because it was causing too much
overhead to run the tests every 5 minutes. How would we deal with  that?

Same issue, the server needs to know the sampling rate when the RRD  is created.

▸ quoted from Adam Goryachev

I think there are benefits, on the client side, the client should pass
the frequency which it is calling the tests at, to them, so that for
example, the vmstat test can adjust how long it will run for to either
600 seconds, or 60 seconds, or whatever else is needed.

As long as those never need to change, that wouldn't be too bad.  But  then you run into the display logic needing help depending on the  granularity of the data/RRAs in the RRDs.

▸ quoted from Adam Goryachev

Further to that, there is some additional work (which I feel is the  real
place that all the work is involved, the client side stuff would be
quite simple, or so it sounds).

Yes.

▸ quoted from Adam Goryachev

So, the question becomes, how difficult is it to convert an rrd file,
which was initially created to store data-points every 300 seconds,  such
that we can now store data-points every X seconds?

I am not aware of a way to  change the granularity of RRAs (the  things inside the RRDs) once they are created.  You'd have to rrdtool  export; create a new rrd with different RRA's, then rrdtool import.    Basically export/import the database.  You can't even add an RRA to  an existing RRD.
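
For reference, the rrdtool subcommands for this kind of round trip are `dump` (RRD to XML) and `restore` (XML to RRD). A minimal sketch follows. The function name, the file names and the RRDTOOL override variable are hypothetical, and it does not attempt the hard part, which is rewriting the RRA definitions and data in the XML in between:

```shell
#!/bin/sh
# RRDTOOL may be overridden (for example RRDTOOL=echo) for a dry run.
RRDTOOL=${RRDTOOL:-rrdtool}

# rrd_roundtrip OLD_RRD XML_FILE NEW_RRD
# Dump OLD_RRD to XML_FILE, then build NEW_RRD from that XML.
rrd_roundtrip() {
    old_rrd=$1
    xml=$2
    new_rrd=$3
    "$RRDTOOL" dump "$old_rrd" > "$xml" || return 1
    # Any change to the RRA layout would have to be made to "$xml" here.
    "$RRDTOOL" restore "$xml" "$new_rrd"
}
```

As written this only reproduces the same database under a new name. Whether hand-editing the XML is a practical way to change the granularity of years of data is exactly the open question raised above.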

▸ quoted from Adam Goryachev

The second part to this question, is how does hobbit know how  frequently
you want to send your reports? ie, it can't be based on 'however often
they are received', because that value would change very  frequantly, ie,
the reports are done every 300 seconds, but by the time the report is
submitted/processed by hobbit, it might be 1 or 2 seconds late/early
compared to last time..... Could hobbit server 'learn' the frequency
from the client (which is where this is configured anyway), because  the
client would report that value to the server as a part of the vmstat
output?

Yes, but that is not what makes all this really hard, it's the server  logic.  I can think of  ways to do it, but it would involve a lot of  changes to the server side parsing, many small client changes,  restructuring/redefining existing rrds, and some potentially hairy  presentation logic to make the server smarter about what to show  based on what is in the RRD.   I wrote larrd with Christian, and I  can tell you, this would not be a weekend hack.

Time Series Data (telemetry data) is all about data on regular  intervals.  Changing that regular interval is a very significant thing.

▸ quoted from Adam Goryachev

or anything else, I have no interest in any of it, I'm just stating a
simple fact of life (you want something done, you can't do it, so find
someone who can do it, and motivate them to the point where they  will do
it whether they want to or not)...

The real trick there is convincing them they want to do it.  Forcing  someone to do something might work, but is no good over the long term.

▸ quoted from Adam Goryachev

Agreed, it would be nice to not have to run hobbit plus something else
when they are both collecting the same data (just a different
frequency).

I'm tellin' ya: vmstat 5

scott

list Henrik Størner · Thu, 15 Dec 2005 14:50:35 +0100 ·

▸ quoted from Jeff Newman

On Wed, Dec 14, 2005 at 05:31:28PM -0600, Jeff Newman wrote:

On 12/13/05, Scott Walters <user-2c405ccfe1ee@xymon.invalid> wrote:

The graph DB's that vmstat feeds data into (the RRD files) are

constructed in such a way that a 5-minute interval is what makes
sense. So running them with anything else really just a waste of
ressources.

With the stock larrd/hobbit RRD definitions you are correct.  He'll
only use one of the five, and whine about the timestamp of the other
four.

Firstly, can you explain your comment in more detail? Secondly,
im confused as to why you would state that I would "whine" about anything
when you have no basis for a conclusion to that effect. It seems to be a
rather pointed comment in a discussion that hasn't involved the use 
of language that would dictate a response like that.

Jeff, I think you misunderstood what Scott wrote. The "he" that is doing
the whining is the rrdtool library; if you feed data into an rrd file
more often than the minimum interval between updates, it will complain
about this in the logs and just ignore the extra updates.

▸ quoted from Jeff Newman

The design goal of larrd, (I can't speak for Henrik and hobbit/RRD)
was capacity planning and trending.  5m samples are  more than
adequate for that activity.

[interesting discussion about using hobbit for capacity-planning vs.
 real-time analysis snipped]

The Hobbit design - as far as the graphing and trending is concerned -
was really just to re-implement the LARRD features. So in that respect
I have adopted Scott's design goals - even though I wasn't aware what
they were.

However, that doesn't mean hobbit cannot be leveraged to support other
uses for the data we collect about our systems. As Jeff writes:

▸ quoted from Jeff Newman

part of having a community such as this is to foster ideas and innovation.

[snip]

What if you could add a high-frequency tag to a server and it generates a
separate high-frequency graph for that

I've picked up quite a few ideas from the discussions that have occurred
here, and Hobbit wouldn't be as good as it is without it. So please - 
keep those ideas coming, even though they might seem to be "off-topic"
or just plain weird. There's no guarantee that I'll use any of it, but
it is still interesting to discuss.

I actually think that Hobbit could support both uses, e.g. in the way
that Jeff suggests with a special high-frequency graph, in addition to
the normal ones. Hobbit does have several building-blocks that you could
use to implement this:
  - a method for the client to send data to the Hobbit server for
    processing without affecting the status display (the "data"
    messagetype)
  - a plugin-mechanism where hobbit "worker modules" can pick up these
    data and process them
  - a simple unix-pipe can be used to feed data into the normal graph
    handling module

E.g. one way Jeff could get his real-time graph goes like this:
  - On the client, run a job every minute to grab the data and send it
    to the Hobbit server using 
        $BB $BBDISP "data $MACHINE.xdata ...
  - On the Hobbit server, write a Hobbit module that grabs messages
    off the Hobbit "data" channel. This really just means reading
    messages from stdin - each message begins with a "@@data" line,
    and ends with a "@@" line. You can easily then pick out those
    messages that are "xdata" sent by the once-a-minute job.
  - This module then feeds the message into an RRD file. If it's
    one of the standard tests (e.g. disk), you can just change
    the "xdata" into "disk" and feed it into a child process running
    the normal hobbitd_rrd program. Start hobbitd_rrd for this
    purpose with a different BBRRDS setting, so the RRD files go
    into a separate directory (perhaps on a RAM disk, if you are
    really updating once a minute).
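
A minimal Python sketch of the server-side module described in the steps
above, for illustration only. It follows the framing given in this message:
a message starts with a line beginning "@@data" and ends with a "@@" line.
The layout of the header line ("|"-separated fields with the test name among
them) is an assumption and should be checked against the channel output of
your Hobbit version. The names read_messages and is_xdata, and the sample
host and values, are hypothetical.

```python
import io


def read_messages(stream):
    """Yield (header, body_lines) for each "@@data" ... "@@" message."""
    header = None
    body = []
    for raw in stream:
        line = raw.rstrip("\n")
        if header is None:
            # Outside a message: wait for the next "@@data" start line.
            if line.startswith("@@data"):
                header = line
                body = []
        elif line == "@@":
            # End-of-message marker: hand the collected message over.
            yield header, body
            header = None
        else:
            body.append(line)


def is_xdata(header, name="xdata"):
    # Assumption: header fields are "|"-separated and one of them is
    # the test name. Verify this against your own channel output.
    return name in header.split("|")


# Made-up sample input standing in for the "data" channel on stdin.
sample = io.StringIO(
    "@@data#1/web01|1134658235.0|10.0.0.5||web01|xdata\n"
    "cpu 42\n"
    "@@\n"
    "@@data#2/web01|1134658236.0|10.0.0.5||web01|disk\n"
    "/ 71%\n"
    "@@\n"
)

picked = [(h, b) for h, b in read_messages(sample) if is_xdata(h)]
for header, body in picked:
    print(header)
    print("\n".join(body))
```

In a real module you would pass sys.stdin to read_messages instead of the
sample buffer, and write each picked message to the stdin of a child
hobbitd_rrd process rather than printing it.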

What's missing then is to get the RRD file created in a way so that it
will accept such frequent updates, and perhaps only store the last 
1 or 2 hours of data. So you'll have to dig into the "rrdtool create"
command to get the RRD file setup correctly, before you start feeding
data into it.
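
As a starting point for that digging, here is a small Python sketch that
works out the numbers for a 1-minute RRD file holding the last 2 hours and
builds the matching "rrdtool create" argument list without running it. The
file name, the data-source name "cpu", the GAUGE type and the heartbeat of
twice the step are assumptions for illustration; a file that hobbitd_rrd is
to update would need whatever data-source names that module expects, which
are not shown in this thread.

```python
def rrd_create_args(path, ds_name, step_s=60, keep_s=2 * 3600):
    """Build an 'rrdtool create' command for one GAUGE data source."""
    rows = keep_s // step_s      # one row per step: 7200 / 60 = 120
    heartbeat = 2 * step_s       # tolerate one missed update (assumption)
    return [
        "rrdtool", "create", path,
        "--step", str(step_s),
        # DS:name:type:heartbeat:min:max
        f"DS:{ds_name}:GAUGE:{heartbeat}:0:U",
        # RRA:cf:xff:steps:rows, i.e. keep every 1-minute sample as-is
        f"RRA:AVERAGE:0.5:1:{rows}",
    ]


args = rrd_create_args("xdata.rrd", "cpu")
print(" ".join(args))
```

The list can be handed to subprocess.run once the names have been adjusted
to your setup.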


Regards,
Henrik

list Tracy J. di Marco White · Thu, 15 Dec 2005 10:53:21 CST ·

▸ quoted from Scott Walters

In message <user-4257cb743ca6@xymon.invalid>, Scott Walters writes:
}>
}> Sure, but we'd see up to 13 hours between dns 'red', when BB would
}> get several in that period.
}>
}> I haven't changed hobbit yet to 1 minute checks.  I've even made
}> an explicit explanation that I wasn't planning to shorten it to
}> 1 minute checks when we officially switched over, and that was
}> agreed to.  However, with the fact that the 1 minute checks did
}> actually make a difference in tracking down and solving the problem
}> with DNS, I may yet have to work on that change.
}
}It sounds like your shop is so tidy that you would have found it and  
}fixed it anyway.  It was just a little brighter with the shorter  
}interval.


I agree we would have found it.  I'm amused at the thought of our
shop being tidy, but thanks.

▸ quoted from Scott Walters


}> We'll see what
}> kind of feedback I get after today.  Even then, the only thing I'd
}> really be willing to shorten to that frequency of checks are the
}> remote checks, over the network.
}
}I believe hobbit has great 're-test' logic.  So if it is down, it  
}will test more frequently . . . .


And that's how I sold the 5 minute testing interval.  And how I
think I'll not have to shorten the interval for hobbit now.

▸ quoted from Scott Walters


}> I thought it was unnecessary when I originally brought BB into
}> production years ago, but it was one of the requirements I ended
}> up with to sell switching to BB.  Some things can't be checked
}> every minute, I have raid checks that can take more than a
}> minute to run.
}
}On the client, 1m samples are opening Pandora's Box . . . .


Everyone really needs to consider what all the effects are of 
the frequency of the monitoring.

Tracy J. Di Marco White
Information Technology Services
Iowa State University

list Scott Walters · Thu, 15 Dec 2005 14:45:30 -0500 (EST) ·

▸ quoted from Tracy J. di Marco White

On Thu, 15 Dec 2005, Tracy J. Di Marco White wrote:

Everyone really needs to consider what all the effects are of
the frequency of the monitoring.

I understand a more frequent sampling period is an easy sell, but I don't
think it is a valid one when the rubber meets the road.

Plus, I try and make sure all technical decisions have a business reason.

I dislike technology and its advocates that try and drive the business.
I guess I've gotten old and 'kewl' is no longer good enough ;)

Businessmen don't think in terms of technology.  It's our job as
professionals to make technology help the business.  If we cannot clearly
articulate how technology (or architecture changes) can help the business,
it probably won't.

Unfortunately, mailing lists are not the best forum for these discussions.

-- 
Scott Walters
-PacketPusher

list Jeff Newman · Fri, 23 Dec 2005 09:19:50 -0600 ·

Scott,

I wanted to respond to you regarding technical reasons on a decreased
interval.

I agree that in most cases where people would want an increase in frequency
it
would be for real-time performance analysis, whereas hobbit/bb are more for
capacity planning/trending.

In my business, we deal with receiving all financial data and pushing that
data around servers. A graph would have little data until the stock market
opens, then the floodgates open :-)
The graph then fluctuates with another surge at market close.

The interval being at 1 minute for specifically CPU and network is important
to us
for capacity planning purposes because during, say, market open, there are
huge peaks
that a 5m interval doesn't catch. We need to plan capacity based around
those spikes, as those are indicative of future market trends in stock
volume. It's not that the 5m interval does nothing, indeed it is helpful,
but from a business perspective, a 1m interval allows us to plan capacity
because it helps us catch the spikes that we want to see.

So something like a low-interval cpu/network column would be beneficial.
Those tests could use separate rrd files etc...

I recently integrated mrtg into hobbit. I assume that the 5m interval
"issue" (not really an issue I know) exists with it as well since it
utilizes the same rrd structure? Or can I set the interval of mrtg to be 1
minute? That would solve my networking interval problem.

Anyway, I hope I have explained the business reason well enough, feel free
to ask any questions. I feel that while not all circumstances are ideal for
a 1m polling sample, there
are some situations where this is ideal.

-Jeff


list Jeff Newman · Fri, 23 Dec 2005 11:08:12 -0600 ·

Sorry, one more thing (don't mean to add to message volume)

I discovered that if I updated the hobbitlaunch.cfg to have mrtg start at 1m
intervals, AND specified Interval: 1 in the mrtg.cfg file, hobbit handles it
just fine (draws the graphs with the correct 1m delineations and updates
accordingly).

-Jeff


list Scott Walters · Fri, 23 Dec 2005 12:16:49 -0500 ·

This helps *immensely*. Now we'll be able to justify shiny new gear to management to reliably provide an IT infrastructure capable of meeting the long term growth of trade volumes.

▸ quoted from Jeff Newman

On Dec 23, 2005, at 10:19 AM, Jeff Newman wrote:

servers. a graph would have little data until the stock market  opens, then the floodgates open :-)
The graph then fluctuates with another surge at market close.

gotcha

▸ quoted from Jeff Newman

The interval being at 1 minute for specifically CPU and network is  important to us
for capacity planning purposes because during, say, market open,  there are huge peaks
that a 5m interval doesn't catch. We need to plan capacity based  around those spikes, as those are indicative of future market  trends in stock volume. It's not that the 5m interval does nothing,  indeed it is helpful, but from a business perspective, a 1m  interval allows us to plan capacity because it helps us catch the  spikes that we want to see.

Absolutely. I am glad to hear you are aware you must plan for the peaks.

Busy doesn't mean slow.

The server stats are generally only 1/2 the equation. They are the impact on the machine. Ideally, for these types of situations, you are also able to measure the load, e.g. trade volumes and their average execution times.

Knowing the RPMs of your motor doesn't tell you your MPH. If you could see/prove that when CPU is 100% execution times can grow outside of SLAs, it's easier to convince management you need a bigger/better environment and/or testing/QA/integration.

I hear there's a few nickels on Wall Street ;)

▸ quoted from Jeff Newman

So something like a low-interval cpu/network column would be beneficial. Those tests could
use separate rrd files etc...

I am still going to argue this isn't the right way to measure the  data in your environment to provide the information you are looking for.

1)  RRD makes the presumption that the older data gets, the less important it is. In your case that is *not true*. Each 'peak' is a set of data where the granularity needs to be preserved. So even if the RRA gets configured to keep 1m samples, which might help 'see' the last 2 days or so of peaks, it won't help when you want to review the data set of Black Monday. Those peaks will have been averaged down.

2)  One cannot assume causation of UNIX statistics and performance in the business environment. If you need to know your servers will handle 5 million trades in 5 minutes, you need to throw 5 million trades at the boxes and see what happens. "If it ain't tested, it doesn't work."

3)  When environments reach bottlenecks, it's impossible to say what the real peak is. If your CPU is at 100%, one cannot know (without testing) what the real demand for CPU is . . . .

4)  It's always the code/SQL/CICS anyway ;)
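
To put made-up numbers on point 1: the Python sketch below consolidates one
hour of 1-minute CPU samples into a single hourly value, the way older data
ends up in a coarser archive. The sample values and the consolidate helper
are hypothetical. rrdtool does offer MAX as a consolidation function
alongside AVERAGE, which would keep the height of a peak, though not its
shape or duration.

```python
def consolidate(samples, per_bucket, func):
    """Reduce each run of per_bucket samples to one value using func."""
    return [
        func(samples[i:i + per_bucket])
        for i in range(0, len(samples), per_bucket)
    ]


def mean(values):
    return sum(values) / len(values)


# One hour of 1-minute CPU% samples: idle at 20%, with a 3-minute spike.
cpu_1m = [20] * 60
cpu_1m[30:33] = [95, 100, 90]

hourly_avg = consolidate(cpu_1m, 60, mean)   # what an AVERAGE archive keeps
hourly_max = consolidate(cpu_1m, 60, max)    # what a MAX archive would keep

print(hourly_avg, hourly_max)
```

With these numbers the hourly average comes out at 23.75 while the spike
reached 100, so the averaged archive alone would not show that the box was
ever saturated.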

I recently integrated mrtg into hobbit. I assume that the 5m  interval "issue" (not really an issue I know) exists with it as  well since it utilizes the same rrd structure? Or can I set the  interval of mrtg to be 1 minute? That would solve my networking  interval problem.

But that is only one sample per minute.  For your application, you  need something *much* more granular.

▸ quoted from Jeff Newman

Anyway, I hope I have explained the business reason well enough,  feel free to ask any questions. I feel that while not all  circumstances are ideal for a 1m polling sample, there
are some situations where this is ideal.

You have legitimate business needs for sure, and an idea for a  feature which would be very *useful*.

A high interval/sampling for 'stress testing' impact with the data  being preserved.

That would be a great addition to hobbit. I am not sure if RRD is the right backend, but it might work if the solution is clever.

I'll let it rattle around . . . .

scott
