More Granular data than 300 second samples, duh!

5 messages in this thread

list Scott Walters · Sun, 14 Oct 2007 22:19:22 -0400 ·

One of the most common requests to the trending of data is "How do I
make the charts graph data samples which are smaller than 300
seconds?"  And the answer has been, you have the source, have fun.

The original design decision that Henrik inherited was larrd should
only be for capacity planning and NOT real-time performance analysis.
Do one thing and do it well.

I had a thought the other day, and I think we could possibly get the
"best of both worlds."

Instead of

$ vmstat 300 2 (resulting in one 300 second sample)

why not

$ vmstat 5 61 (resulting in sixty 5 second samples)

The data would still only be transported every five minutes, but
contain more granular samples.
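
The odd-looking counts (2 and 61) come from the fact that the first report vmstat prints is an average since boot, so one extra report is needed to get the wanted number of interval samples. A minimal sketch of that arithmetic follows; the function name `vmstat_args` is made up for illustration and is not part of Hobbit.

```python
from typing import Tuple


def vmstat_args(window_s: int, interval_s: int) -> Tuple[int, int]:
    """Return the (interval, count) arguments for a vmstat run covering window_s seconds."""
    if interval_s <= 0 or window_s % interval_s != 0:
        raise ValueError("interval must be positive and divide the window evenly")
    usable_samples = window_s // interval_s
    # One extra report: the first line vmstat prints is the since-boot average.
    return interval_s, usable_samples + 1


if __name__ == "__main__":
    print(vmstat_args(300, 300))  # one 300 second sample
    print(vmstat_args(300, 5))    # sixty 5 second samples
```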

This could not be done for all metrics, but many.  This would also
require the RRAs of all the RRDs to be re-made (export, re-create,
import).  But that's been on my mind anyway, because the original
RRA structure was based on screen sizes for 800x600 instead of
business requirements.

Henrik, do you follow my thinking?  It's kinda hard for me to believe
it's taken me over five years to think of this!

My biggest concern is not the technical details of the collectors and
RRD/RRA restructuring, but inflicting resource usage on servers
measuring themselves.

$ vmstat 1 301 would definitely be a bad idea.

Scott Walters
-PacketPusher

list Stef Coene · Mon, 15 Oct 2007 08:47:15 +0200 ·

▸ quoted from Scott Walters

On Monday 15 October 2007, Scott Walters wrote:

One of the most common requests to the trending of data is "How do I
make the charts graph data samples which are smaller than 300
seconds?"  And the answer has been, you have the source, have fun.

The original design decision that Henrik inherited was larrd should
only be for capacity planning and NOT real-time performance analysis.
Do one thing and do it well.

I had a thought the other day, and I think we could possibly get the
"best of both worlds."

I disagree.  Hobbit is designed for monitoring with a 5-minute interval.  It will also create a 100x load if you go from 5-minute to 3-second intervals.  Same for the rrd size.  My rrd dir is currently 154.  I'm migrating to a new hobbit and I changed the rrds so I have 5760 and not 576 data points / rrd.  The new rrds are 5921 MB.  Going to a 3-second interval means 592100 MB, about 592 GB ...
It's also possible that the hobbit client will generate more load than the applications ...
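
The storage estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes RRD disk usage grows linearly with the number of stored rows (header overhead is ignored), and the helper names are invented for illustration.

```python
SECONDS_PER_DAY = 86400


def retention_days(rows: int, step_s: int) -> float:
    """How many days of history a given number of rows covers at one row per step."""
    return rows * step_s / SECONDS_PER_DAY


def scaled_size_mb(current_mb: float, old_step_s: int, new_step_s: int) -> float:
    """Estimated size when the same time span is stored at a finer step.

    Assumption: size is proportional to the number of rows.
    """
    return current_mb * old_step_s / new_step_s


if __name__ == "__main__":
    print(retention_days(576, 300))        # stock 5-minute RRA
    print(retention_days(5760, 300))       # enlarged 5-minute RRA
    print(scaled_size_mb(5921, 300, 3))    # same span at 3-second samples
```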

For real time monitoring, take a look at nmon.  http://www.ibm.com/developerworks/aix/library/au-analyze_aix/
The tool was originally AIX only, but Linux is also supported in the latest releases.  It combines all the other monitoring tools in one screen.  Great tool.


Stef

list Stef Coene · Mon, 15 Oct 2007 09:37:53 +0200 ·

▸ quoted from Stef Coene

On Monday 15 October 2007, Stef Coene wrote:

On Monday 15 October 2007, Scott Walters wrote:

One of the most common requests to the trending of data is "How do I
make the charts graph data samples which are smaller than 300
seconds?"  And the answer has been, you have the source, have fun.

The original design decision that Henrik inherited was larrd should
only be for capacity planning and NOT real-time performance analysis.
Do one thing and do it well.

I had a thought the other day, and I think we could possibly get the
"best of both worlds."

I once had the plan (but not the time) to change how the info is collected.  For vmstat, you can run
vmstat 1 300
Process the data on the client, find the max, min and average and report the data to the hobbit server.  So you still have the 5 minute updates (the average is the same number as what is reported now), but you also have the maximum and minimum from the 5 minutes.
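
A minimal sketch of that client-side reduction, assuming the one-second values for a single vmstat column have already been parsed into a list of numbers. The function name `summarize` and the returned keys are illustrative, not an existing Hobbit format.

```python
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Reduce one reporting window of fine-grained samples to min, max and average."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": sum(samples) / len(samples),
    }


if __name__ == "__main__":
    # Example: CPU idle percentages collected once per second (shortened here).
    print(summarize([97, 12, 95, 96]))
```

Only three numbers per metric travel to the server every five minutes, so the RRD update rate stays the same while short spikes remain visible in the min and max.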


Stef

list Henrik Størner · Mon, 15 Oct 2007 13:38:10 +0200 ·

▸ quoted from Scott Walters

On Sun, Oct 14, 2007 at 10:19:22PM -0400, Scott Walters wrote:

One of the most common requests to the trending of data is "How do I
make the charts graph data samples which are smaller than 300
seconds?"  And the answer has been, you have the source, have fun.

The original design decision that Henrik inherited was larrd should
only be for capacity planning and NOT real-time performance analysis.
Do one thing and do it well.

I had a thought the other day, and I think we could possibly get the
"best of both worlds."

Instead of

$ vmstat 300 2 (resulting in one 300 second sample)

why not

$ vmstat 5 61 (resulting in sixty 5 second samples)

The data would still only be transported every five minutes, but
contain more granular samples.

Scott, You have obviously been on the receiving end of LARRD related
questions for a long time, so I guess you know what the users have
asked for.

I haven't had a lot of requests for more granular data to begin with;
most of the requests have been for the fine-grained (5-minute) data to
be maintained for a longer period of time than the current 48 hours.
In the next version (or the current snapshot), you can define RRAs
individually for each type of RRD file. So you can configure the vmstat
RRDs to maintain the fine-grained data for a longer time. That should
take care of this issue.

I think your idea is worth looking into.

▸ quoted from Scott Walters

This could not be done for all metrics, but many.  This would also
require the RRAs of all the RRDs to be re-made (export, re-create,
import).  But that's been on my mind anyway, because the original
RRA structure was based on screen sizes for 800x600 instead of
business requirements.

If I understand your suggestion correctly, you would change the client
to run "vmstat 5 61" (for instance), collect all 60 samples, and then
send them off to Hobbit every 5 minutes. So we would essentially be
caching data for 5 minutes on the client, then send it off to the Hobbit
server and do a single multi-update of the RRD data when it arrives.

One complication with this is that Hobbit needs to determine the
timestamps for each of the samples, because RRDtool needs each
measurement timestamped. In the current setup, Hobbit just uses the time
that the data arrives from the client - this will be "close enough" to
the time the measurement was done to work. But if the client caches the
data for some amount of time, we have to find a way of generating the
correct timestamps. Just having the client timestamp it with its own
local time won't work - there are too many hosts where the clocks are
way off. I guess this could be done by having the client timestamp the
data, but then use these as relative timestamps (so we can see sample 10
was done 236 seconds before the last sample) and then work out the exact
timestamps over on the Hobbit server, like we do today.
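
The relative-timestamp idea can be sketched as follows. The assumption (stated here, not taken from Hobbit's code) is that the newest sample in a batch was taken at roughly the moment the report arrived at the server; every older sample keeps its distance from that newest one. The function name is hypothetical.

```python
from typing import List, Sequence


def rebase_timestamps(client_times: Sequence[int], server_arrival: int) -> List[int]:
    """Translate client-local sample times into server time.

    Only the differences between client timestamps are trusted, so a client
    clock that is hours off does not matter.
    """
    newest = max(client_times)
    return [server_arrival - (newest - t) for t in client_times]


if __name__ == "__main__":
    # Client clock is far off, but the 5-second spacing is still correct.
    print(rebase_timestamps([1000, 1005, 1010], server_arrival=1192448290))
```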

This could be done - it would require a bit of change to the clients,
but I'm not really happy with the current way the vmstat data collection 
works (it usually leaves a vmstat process hanging around when the client
is stopped), so I wouldn't mind having to do some code for this. I'd 
probably write a small tool to run "vmstat 60" so it runs forever, and
then the tool would pick up the data, timestamp it and then regularly 
feed it into the client report.
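
A rough sketch of what such a helper tool could look like. It is not Henrik's implementation; it assumes Linux-style vmstat output where data lines contain only integers, and the name `timestamp_samples` is invented for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def timestamp_samples(
    lines: Iterable[str],
    clock: Callable[[], float] = time.time,
) -> Iterator[Tuple[int, List[int]]]:
    """Attach a local timestamp to each vmstat data line as it is read."""
    for line in lines:
        fields = line.split()
        # Header and blank lines contain no fields or non-numeric fields; skip them.
        if not fields or not all(f.isdigit() for f in fields):
            continue
        yield int(clock()), [int(f) for f in fields]


if __name__ == "__main__":
    sample_output = [
        "procs -----------memory----------",
        " r  b   swpd   free",
        " 1  0      0 812344",
        " 0  0      0 811900",
    ]
    for ts, values in timestamp_samples(sample_output):
        print(ts, values)
```

In a real tool the `lines` argument would be the stdout of a long-running vmstat child process (for example one started with `subprocess.Popen`), and the tool would still have to terminate that child when the client shuts down.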

And of course the server end would need changing to accommodate the new
data format and the multiple updates.  It's certainly doable, without a
whole lot of re-designing.

But I think we should consider which datasets one might want to have
these frequent updates for. vmstat is obvious; but what about memory
utilisation? Disk utilisation rarely changes rapidly - or perhaps it
does ? Process counts? Network test response times ? Once we start doing
it for vmstat, I'd expect everyone to come forward and ask for it for
lots of other datasets - so instead of doing a quick hack just for
vmstat, we should consider what would be the "right" way of doing it for
all/most of the data.

▸ quoted from Scott Walters

Henrik, do you follow my thinking?  It's kinda hard for me to believe
it's taken me over five years to think of this!

Things take time - and you often don't get it right until the third try.

▸ quoted from Scott Walters

My biggest concern is not the technical details of the collectors and
RRD/RRA restructuring, but inflicting resource usage on servers
measuring themselves.

$ vmstat 1 301 would definitely be a bad idea.

Agreed - but I don't think that should be something Hobbit decides. I
can easily imagine a scenario where you would do that for some
troubleshooting situation, and if that is what is needed then Hobbit
should let you do it. No reason to set up arbitrary restrictions.
(This is in line with Unix thinking - "if you insist on shooting your
foot off, it's your decision to do so". Just as "rm -rf /" is not
recommended, but still possible).


Regards,
Henrik

list Scott Walters · Sat, 2 Feb 2008 23:27:40 -0500 ·

A wee bit of lag on this response.

▸ quoted from Henrik Størner


On Oct 15, 2007 6:38 AM, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:

I haven't had a lot of requests for more granular data to begin with;
most of the requests have been for the fine-grained (5-minute) data to
be maintained for a longer period of time than the current 48 hours.
In the next version (or the current snapshot), you can define RRAs
individually for each type of RRD file. So you can configure the vmstat
RRDs to maintain the fine-grained data for a longer time. That should
take care of this issue.


Adding the ability to define RRAs for each RRD is a very nice feature.  I
definitely believe the stock RRAs should be adjusted to meet user
requests. As I've mentioned before, the original RRA definitions were very
arbitrary.

Thinking out loud:  I believe we could redefine the stock RRAs for each RRD
so that they handle both the need to keep data longer and the ability to
keep more granular data.  This *should* be a simple change that is completely
backwards compatible and would not mandate a more granular sample rate.
But by doing so, Hobbit would from this point forward allow for both,
without RRA/RRD manipulations.  "Migrating old RRDs" would then be an
independent task.

For example, create a new RRA structure which allows for (rough draft):

5 second samples for 48 hours (the granular RRA we don't have today)
1 hour samples for 400 days (this gives the ability to run business
"reports" over at least the last year)
24 hour samples for 9600 days (24*400=9600)

I think I understand what the Y2K bug creators went through: I can't
imagine this code running 9600 days from now!  But then, how many times
have I missed data flowing off the 576 day chart!
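
The rough draft can be turned into rrdtool RRA definitions with a little arithmetic. The sketch below assumes a 5 second base step and the AVERAGE consolidation function with an xff of 0.5; those choices, and the helper name `rra`, are assumptions for illustration rather than anything decided in this thread.

```python
SECONDS_PER_DAY = 86400


def rra(base_step_s: int, sample_s: int, keep_days: int,
        cf: str = "AVERAGE", xff: float = 0.5) -> str:
    """Build an 'RRA:CF:xff:steps:rows' string for rrdtool create."""
    steps = sample_s // base_step_s                 # base steps per stored row
    rows = keep_days * SECONDS_PER_DAY // sample_s  # rows needed for the retention
    return f"RRA:{cf}:{xff}:{steps}:{rows}"


BASE_STEP = 5  # seconds; assumed base step for the RRD

DRAFT_RRAS = [
    rra(BASE_STEP, 5, 2),          # 5 second samples for 48 hours
    rra(BASE_STEP, 3600, 400),     # 1 hour samples for 400 days
    rra(BASE_STEP, 86400, 9600),   # 24 hour samples for 9600 days
]

if __name__ == "__main__":
    for definition in DRAFT_RRAS:
        print(definition)
```

Note that the 5 second archive is by far the largest: 34560 rows, against 9600 rows for each of the other two.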

▸ quoted from Henrik Størner

If I understand your suggestion correctly, you would change the client
to run "vmstat 5 61" (for instance), collect all 60 samples, and then
send them off to Hobbit every 5 minutes. So we would essentially be
caching data for 5 minutes on the client, then send it off to the Hobbit
server and do a single multi-update of the RRD data when it arrives.


Exactly.  I don't know if rrdtool has the ability to handle "batch" inputs,
but it would be nice for this.  Since file IO is such an issue anyway, I'd
hate to aggravate the condition.
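
For what it's worth, `rrdtool update` does accept several `timestamp:value[:value...]` arguments in a single invocation, as long as the timestamps are increasing. A sketch of building such a command line from a cached batch follows; the function name and the file name `vmstat.rrd` are placeholders.

```python
from typing import List, Sequence, Tuple


def rrd_update_argv(rrd_path: str,
                    samples: Sequence[Tuple[int, Sequence[float]]]) -> List[str]:
    """Build one 'rrdtool update' command line covering a whole batch of samples."""
    argv = ["rrdtool", "update", rrd_path]
    # rrdtool rejects timestamps that go backwards, so sort by time first.
    for ts, values in sorted(samples, key=lambda sample: sample[0]):
        argv.append(":".join([str(int(ts))] + [str(v) for v in values]))
    return argv


if __name__ == "__main__":
    batch = [(1192448290, [2, 0]), (1192448285, [1, 0])]
    print(" ".join(rrd_update_argv("vmstat.rrd", batch)))
```

The resulting list could be handed to `subprocess.run`, so a batch of sixty samples costs one process start and one pass over the file instead of sixty.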

▸ quoted from Henrik Størner

way off. I guess this could be done by having the client timestamp the
data, but then use these as relative timestamps (so we can see sample 10
was done 236 seconds before the last sample) and then work out the exact
timestamps over on the Hobbit server, like we do today.

I'd say keep the time "interpretation" exactly the same as before since it
has worked.  The additional samples are merely offsets from the current
single input.  I don't see any reason to change the logic.

▸ quoted from Henrik Størner

This could be done - it would require a bit of change to the clients,
but I'm not really happy with the current way the vmstat data collection
works (it usually leaves a vmstat process hanging around when the client
is stopped), so I wouldn't mind having to do some code for this. I'd
probably write a small tool to run "vmstat 60" so it runs forever, and
then the tool would pick up the data, timestamp it and then regularly
feed it into the client report.


Heh, it's annoying how such an ugly shell hack can work so well.  I think
going to a "vmstat 60" running all the time would only move where the
ugliness happens (from the collector, to the parser--plus you'd still have
to keep track of the PID, and kill it on hobbit shutdowns).  The only
"clean" way I can think of is to make the collectors run once per sample:
e.g. "vmstat 5 2".  We are also bumping into statistical problems because
vmstat info is a "rate" (something/second) and other data is a gauge
(something).  I think it's a scalar vs. vector issue. RRD can of course deal
with input streams of GAUGE or COUNTER (DERIVE was used as a "poor man's"
solution to kill spikes), but not all metrics are available as COUNTERs
(e.g. load average is not, although you can get a COUNTER for system calls).
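
To make the rate vs. gauge distinction concrete, here is a sketch of how an ever-increasing counter becomes a per-second rate, and why a negative delta is better discarded than graphed. This mimics the effect of a DERIVE data source with a minimum of 0; it is an illustration, not rrdtool's actual code, and the function name is made up.

```python
from typing import Optional


def counter_to_rate(prev_count: int, prev_ts: int,
                    cur_count: int, cur_ts: int) -> Optional[float]:
    """Convert two readings of a cumulative counter into a per-second rate."""
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    delta = cur_count - prev_count
    if delta < 0:
        # Counter went backwards (reboot or reset): report unknown, not a spike.
        return None
    return delta / elapsed


if __name__ == "__main__":
    print(counter_to_rate(1000, 0, 1600, 300))   # system calls per second
    print(counter_to_rate(1600, 300, 20, 600))   # counter reset, unknown
```

A gauge such as load average has no such conversion: the value read at sample time is stored as it is.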

The current collection of the metrics might be ugly but it works.  I'd be
really amazed if a cleaner way could be developed.  The *stat commands give a
nice abstraction layer over kernel metrics, which I think overall would be a
nightmare to normalize across platforms.

▸ quoted from Henrik Størner

And of course the server end would need changing to accommodate the new
data format and the multiple updates.  It's certainly doable, without a
whole lot of re-designing.

But I think we should consider which datasets one might want to have
these frequent updates for. vmstat is obvious; but what about memory
utilisation? Disk utilisation rarely changes rapidly - or perhaps it
does ? Process counts? Network test response times ? Once we start doing
it for vmstat, I'd expect everyone to come forward and ask for it for
lots of other datasets - so instead of doing a quick hack just for
vmstat, we should consider what would be the "right" way of doing it for
all/most of the data.


This is a symptom of one of the bigger issues I ran into with larrd
development:  the chicken and egg dilemma regarding collecting, parsing, and
reporting metrics.  You need to define all three, but you don't necessarily
know how you'd like to see data until you see it, which of course affects
how you collect it.

Over the years, I've been drawn towards the "industrial strength" "one size
fits all" architectures vs. "super custom elite" configurations. Curtis
Preston, in his backup book, has a saying: "special is bad."  I agree
wholeheartedly.

This means despite the fact I can't think of one good reason why disk usage
should be kept at five second intervals, customizing each RRD for the data
would be a pain.  If the heartbeats were general enough, we could define a
"stock RRA structure" that could handle data inputted at 5 or 300 second
intervals.
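
A sketch of what "general enough" could mean in rrdtool terms. A data source is defined as `DS:name:type:heartbeat:min:max`, and an update is only treated as unknown when more than `heartbeat` seconds have passed since the previous one. The heartbeat of 600 seconds and the data source name `cpu_idl` below are assumptions chosen for illustration.

```python
def ds_def(name: str, dst: str, heartbeat_s: int,
           minimum: str = "U", maximum: str = "U") -> str:
    """Build a 'DS:name:type:heartbeat:min:max' string for rrdtool create."""
    return f"DS:{name}:{dst}:{heartbeat_s}:{minimum}:{maximum}"


def update_is_accepted(update_interval_s: int, heartbeat_s: int) -> bool:
    """True when updates arriving at this interval stay within the heartbeat."""
    return update_interval_s <= heartbeat_s


if __name__ == "__main__":
    print(ds_def("cpu_idl", "GAUGE", 600))
    for interval in (5, 300, 900):
        print(interval, update_is_accepted(interval, 600))
```

With a small base step and a heartbeat above 300 seconds, the same definition should accept both 5 second and 300 second feeds; in the latter case the single value is simply spread over all the steps since the previous update.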

Since you've already got the code to handle custom RRA configurations for
each RRD this may be a moot point.

▸ quoted from Henrik Størner

$ vmstat 1 301 would definitely be a bad idea.

Agreed - but I don't think that should be something Hobbit decides. I
can easily imagine a scenario where you would do that for some
troubleshooting situation, and if that is what is needed then Hobbit
should let you do it. No reason to set up arbitrary restrictions.
(This is in line with Unix thinking - "if you insist on shooting your
foot off, it's your decision to do so". Just as "rm -rf /" is not
recommended, but still possible).


As long as the stock configuration is not brain dead, I won't lose sleep
over giving administrators enough rope to hang themselves.  Although
generally speaking, I prefer systems that protect users against themselves.

So the long and short of all of this is a request: with the 4.3.0 release,
the standard RRAs within newly created RRDs should handle 5 second samples,
1 hour samples kept for at least one year, and "keep one day samples longer
than you think you could possibly need them."

-Scott
