Xymon Mailing List Archive

Client interval question

Scott Walters
Fri, 23 Dec 2005 12:16:49 -0500
Message-Id: <user-5725903ea078@xymon.invalid>

This helps *immensely*.  Now we'll be able to justify shiny new gear  to management to reliably provide an IT infrastructure capable of  meeting the long term growth of trade volumes.
On Dec 23, 2005, at 10:19 AM, Jeff Newman wrote:

servers. a graph would have little data until the stock market  opens, then the floodgates open :-)
The graph then fluctuates with another surge at market close.
gotcha
The interval being at 1 minute for specifically CPU and network is  important to us
for capacity planning purposes because during, say, market open,  there are huge peaks
that a 5m interval doesn't catch. We need to plan capacity based  around those spikes, as those are indicative of future market  trends in stock volume. It's not that the 5m interval does nothing,  indeed it is helpful, but from a business perspective, a 1m  interval allows us to plan capacity because it helps us catch the  spikes that we want to see.
Absolutely.  I am glad to hear you are aware you must plan for the peaks.

Busy doesn't mean slow.

The server stats are generally only half the equation.  They are the impact on the machine.  Ideally, for these types of situations, you are also able to measure the load, e.g. trade volumes and their average execution times.

Knowing the RPMs of your motor doesn't tell you your MPH.  If you could see/prove that when CPU is at 100% execution times can grow outside of SLAs, it's easier to convince management you need a bigger/better environment and/or testing/QA/integration.

I hear there's a few nickels on Wall Street ;)
So something like a low-interval cpu/network column would be  beneficial. Those tests could
use separate rrd files etc...
I am still going to argue this isn't the right way to measure the  data in your environment to provide the information you are looking for.

1)  RRD makes the presumption the older data gets, the less important  it is.  In your case that is *not true*.  Each 'peak' is a set of  data where the granularity needs to be preserved.  So even if the RRA  gets configured to keep 1m samples, which might help 'see'  the last  2 days or so of peaks,  it won't help when you want to review the  data set of Black Monday.  Those peaks will have been averaged down.

2)  One cannot assume causation between UNIX statistics and performance in the business environment.  If you need to know your servers will handle 5 million trades in 5 minutes, you need to throw 5 million trades at the boxes and see what happens.  "If it ain't tested, it doesn't work."

3)  When environments reach bottlenecks, it's impossible to say what  the real peak is.  If your CPU is at 100%, one cannot know (without  testing) what the real demand for CPU is . . . .

4)  It's always the code/SQL/CICS anyway ;)
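
To illustrate point 1, here is a minimal Python sketch of how averaging flattens a short spike.  The sample values and the `consolidate` helper are made up for illustration; this is not RRDtool code, only the same kind of arithmetic an averaging archive versus a maximum-keeping archive would apply to five 1-minute samples.

```python
def consolidate(samples, steps, func):
    """Collapse each run of `steps` consecutive samples into one value."""
    return [func(samples[i:i + steps]) for i in range(0, len(samples), steps)]


def average(values):
    return sum(values) / len(values)


# Hypothetical 1-minute CPU readings (%) around market open.
cpu_1m = [10, 12, 98, 15, 10, 11, 9, 12, 10, 8]

cpu_5m_avg = consolidate(cpu_1m, 5, average)  # what an averaged 5m archive keeps
cpu_5m_max = consolidate(cpu_1m, 5, max)      # what a max-keeping 5m archive keeps

print(cpu_5m_avg)  # [29.0, 10.0]
print(cpu_5m_max)  # [98, 12]
```

The 98% spike shows up as 29% once averaged.  Keeping the maximum preserves the height of the peak, but still loses when inside the five minutes it happened and how long it lasted.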
I recently integrated mrtg into hobbit. I assume that the 5m  interval "issue" (not really an issue I know) exists with it as  well since it utilizes the same rrd structure? Or can I set the  interval of mrtg to be 1 minute? That would solve my networking  interval problem.
But that is only one sample per minute.  For your application, you  need something *much* more granular.
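
A rough sketch of why one sample per minute can still hide a burst.  The per-second numbers below are invented, and how a given client actually samples (point reading versus average over the interval) is an assumption here, so both cases are shown.

```python
# Hypothetical per-second CPU utilisation (%) for one minute:
# a 10-second burst at 100% in an otherwise quiet minute.
per_second = [5] * 20 + [100] * 10 + [5] * 30

minute_average = sum(per_second) / len(per_second)  # averaged over the interval
point_sample = per_second[-1]   # a single reading taken at the end of the minute
true_peak = max(per_second)

print(round(minute_average, 1))  # 20.8
print(point_sample)              # 5
print(true_peak)                 # 100
```

Either way, the one value recorded for that minute says little about the ten seconds the box spent pegged.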
Anyway, I hope I have explained the business reason well enough,  feel free to ask any questions. I feel that while not all  circumstances are ideal for a 1m polling sample, there
are some situations where this is ideal.
You have legitimate business needs for sure, and an idea for a  feature which would be very *useful*.

A high-frequency sampling mode for 'stress testing' impact, with the data being preserved at full granularity.

That would be a great addition to hobbit.  I am not sure if RRD is the right backend, but it might work if the solution is clever.

I'll let it rattle around . . . .

scott