diskstat.sh/RRD oddity

list Vernon Everett
Thu, 29 Mar 2012 09:46:41 +0800
Message-Id: <user-f9f88c7d59e7@xymon.invalid>

Hi Steve

I think this is the script that I wrote.
Apologies that it got you into a bit of a mess, but I am quite thrilled
to hear that it worked as-is on RHEL. It was originally written for Solaris
10, and I made no effort to test it on anything else.

The other fix I would suggest is to change the sample time.
Change the DURATION variable at the top of the script.
The script takes a default 10-second "sample" of disk activity, and uses
those values for graphing.
It may just be that, through a curious alignment of system times,
multiple clients were sending their data to the Xymon server at the same
time as your system was sampling the IO.
Change the sample to something like 30 seconds, and you might find you get
a better average.

This was one of the risks with that script.
Make the sample time too long, and we see too much of an average: a very
smooth graph.
Too short, and we might pick up peaks (as in your case).
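
As a rough illustration of that trade-off, here is a small Python sketch. The
per-second figures are made up for the example (a steady 15 sectors/s with a
5-second burst of 600 sectors/s); they are not taken from the script or from
any real system, and window_mean is a hypothetical helper, not part of
diskstat.sh.

```python
def window_mean(rates, start, duration):
    """Average rate over `duration` one-second readings beginning at `start`."""
    window = rates[start:start + duration]
    return sum(window) / len(window)


# Hypothetical per-second sectors-written figures: a steady 15/s
# with a 5-second burst of 600/s starting at second 2.
rates = [15.0] * 60
for second in range(2, 7):
    rates[second] = 600.0

short = window_mean(rates, 0, 10)   # 10-second sample (the default mentioned above)
longer = window_mean(rates, 0, 30)  # 30-second sample

print(f"10 s sample: {short:.1f} sectors/s")
print(f"30 s sample: {longer:.1f} sectors/s")
```

With these invented numbers the same burst dominates the 10-second sample far
more than the 30-second one, which is the effect described above.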

What I was originally looking for when I wrote the script was sustained
high IO, in which case any sample size would have done the trick. So, for
me, 10 seconds was as good a value as any, but feel free to experiment.
YMMV.
If you find some settings give significantly better results than others,
feel free to add these notes to the Description or Known Bugs & Issues
sections on Xymonton.
And while you are there, if you can update the Compatibility entry to
include your OS, that would be great.

Regards
      Vernon


On 29 March 2012 06:02, Steve Holmes <user-ec1bf77b1b44@xymon.invalid> wrote:

This is just a comment on an oddity with respect to diskstat.sh and RRD.

We make pretty heavy use of the diskstat.sh script, which I believe I
downloaded from Xymonton. When I installed it I used the standard
clientlaunch.cfg stanza for the configuration and everything worked great.

I was called to task today because we have been having some disk IO issues
on the RHEL VMs. Someone was looking at the trend graphs for some servers
to see if there was anything they could learn, and they noticed that,
beginning at about 4pm local time on Monday, the graphs for the number of
sectors written per second on a couple of file systems on several VMs
jumped from the 10 to 20 range to the 300 to 340 range and stayed there.
The graph for number of disk writes per second had a corresponding jump up
to about 40 or 50 from close to zero.

In analyzing the data I discovered that the file system that was
displaying this behavior is the same file system to which the diskstat.sh
script is writing its temp files. It appears that for some reason, starting
at 4pm on Monday the 5 minute test interval and the 5 minute average for
RRD got in sync and all it was seeing was the data point that corresponded
to its own writing activity and RRD was using it for the entire 5 minute
average (of course, that's what RRD does).

I 'fixed' it by changing the test interval to something less than 5
minutes. I tried 2, 3, and 4 minutes and they all had the effect of
reducing the data in the plot back to the expected level, i.e. to the level
it was before 4pm on Monday.
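
One way to picture why a shorter test interval might help is the sketch
below. It assumes, purely for illustration, that some burst of writes recurs
every 300 seconds and lasts 5 seconds, and that each test run samples for 10
seconds from the moment it starts. Those numbers and the function name are
hypothetical; the sketch does not model diskstat.sh or RRD themselves.

```python
def samples_hitting_burst(interval, burst_period=300, burst_len=5,
                          sample_len=10, runs=20):
    """Count test runs whose sampling window overlaps a periodic burst.

    Bursts occupy [k*burst_period, k*burst_period + burst_len) for k = 0, 1, ...
    Run n samples during [n*interval, n*interval + sample_len).
    Assumes sample_len is shorter than burst_period.
    """
    hits = 0
    for n in range(runs):
        # Position of the sample start within the current burst cycle
        offset = (n * interval) % burst_period
        # Either the sample starts inside this cycle's burst,
        # or it runs on into the start of the next cycle's burst.
        if offset < burst_len or offset + sample_len > burst_period:
            hits += 1
    return hits


for minutes in (5, 4, 3, 2):
    hits = samples_hitting_burst(minutes * 60)
    print(f"{minutes}-minute interval: {hits} of 20 samples include the burst")
```

Under these assumptions a 5-minute interval that starts in step with the
burst catches it on every run, while the 2, 3 and 4 minute intervals only
line up with it on every fifth run.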

The mystery remains why it suddenly started seeing and using its own disk
activity at the same time on several different servers.

Steve Holmes
ITaP/Purdue University
