external script problem - question


list Steve Holmes · Wed, 5 Oct 2011 11:41:43 -0400 ·

Xymonphiles:

[Disclaimer: I don't think this is a Xymon problem but my boss thinks it
might be and has directed me to ask the list for advice.]

Running Xymon 4.2.3, display server is a Solaris box, monitoring a few
hundred servers which are mostly RHEL 5, many of them are VMs hosted on ESX
VMware.

The test is an external script which basically does a 'sudo touch foo' on
each file system and waits for it to either return with no problem, or
return with an error indicating that the file system is read-only, or after
60 seconds declares that the file system is 'hung'. We had been having
problems particularly with the read-only file system problem cropping up on
the VMs, which is what prompted the implementation of this test.

The problem with the test is that once or twice a day we get a flurry of
alerts from a dozen or so servers, all at about the same time reporting that
there is a hung file system. Other file systems on the same server are
reporting that it takes longer to do the touch than we think it should (e.g.
12 to 25 or even 60 seconds). The alerts all go away the next test cycle.
The file systems are on local fiber channel disks (i.e. not NFS mounted).
The servers getting the alerts are not all VMs and it is not always the same
set of servers that show up.

Occasionally we catch a file system that is read-only and we are going to
modify the test to not send a panic on the hung file system condition so we
don't miss the read-only condition, but we really would like to figure out
why we are getting these bursts of hung, or at least very slow writes on
file systems and why they come in bursts.

Thanks for any insight you might have.
Steve
Purdue University/ITaP

-- 
If they give you ruled paper, write the other way. -Juan Ramon Jimenez,
poet, Nobel Prize in literature (1881-1958)

Truth never damages a cause that is just. -Mohandas Karamchand Gandhi
(1869-1948)

list Henrik Størner · Wed, 05 Oct 2011 22:37:37 +0200 ·

▸ quoted from Steve Holmes

On 05-10-2011 17:41, Steve Holmes wrote:

The test is an external script which basically does a 'sudo touch foo'
on each file system and waits for it to either return with no problem,
or return with an error indicating that the file system is read-only, or
after 60 seconds declares that the file system is 'hung'.

[snip]

The problem with the test is that once or twice a day we get a flurry of
alerts from a dozen or so servers, all at about the same time reporting
that there is a hung file system. Other file systems on the same server
are reporting that it takes longer to do the touch than we think it
should (e.g. 12 to 25 or even 60 seconds). The alerts all go away the
next test cycle. The file systems are on local fiber channel disks (i.e.
not NFS mounted). The servers getting the alerts are not all VMs and it
is not always the same set of servers that show up.

Sounds nasty; troubleshooting that kind of "only happens occasionally" problem is really difficult.

OK, off the top of my head here are some ideas:

* sudo - what kind of user authentication are you using? If it's LDAP or NIS, could that explain why the test suddenly takes longer?

* clocks - how do you measure the time it takes to run the test? If you just use "date" before and after the touch command, what happens if your server's clock is stepped (jumps a few seconds) while the test is running? In my experience, clocks on virtual machines are horrible at keeping correct time and can quite easily skip a couple of seconds if set to follow the clock of the host OS.

* Have you looked at the vmstat1 graphs for these systems? How is the "I/O wait" on them? Some types of I/O on Linux systems can cause quite a slowdown; deleting large files on ext2 or ext3 file systems can be quite time-consuming and cause the whole system to stall. Also, doing things that touch a lot of files (a large find, or grepping through a large number of files), especially if you don't mount file systems with the "noatime" option, can cause a lot of I/O that slows down file system operations.

* I've seen VMware Workstation consistently bring a system to its knees when a VM was being shut down. Apparently some bad interaction between the kernel version (2.6.18, if memory serves me right) and the way it was updating the virtual disk images - it would just churn away for 5 or 10 minutes doing nothing but hitting the disk. Disappeared when I upgraded the kernel on the box. No idea if your combination of RHEL and ESX could do the same thing. But it was quite reproducible here, so it should be easy to spot.
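
For the clock question, one option is to time the touch against /proc/uptime rather than "date", since the uptime counter should not move when the wall clock is stepped. The following is only a minimal sketch, assuming a Linux host with awk available. The names uptime_cs and timed_touch are hypothetical helpers, and the sudo from the original test is left out.

```shell
#!/bin/sh
# Sketch: measure how long a touch takes using /proc/uptime (seconds since
# boot) instead of the wall clock. Helper names are illustrative only.

# Print the system uptime in centiseconds (Linux-specific).
uptime_cs() {
    awk '{ printf "%.0f\n", $1 * 100 }' /proc/uptime
}

# Touch the given file and print the elapsed time in centiseconds.
# The return status is the exit status of touch.
timed_touch() {
    tt_start=$(uptime_cs)
    touch "$1"
    tt_rc=$?
    tt_end=$(uptime_cs)
    echo $(( tt_end - tt_start ))
    return "$tt_rc"
}
```

Calling timed_touch /some/fs/foo prints the elapsed centiseconds. A large value here, together with a small difference in "date" output (or the reverse), would point at the clock rather than the disk.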

Just some thoughts.

Regards,
Henrik

list Jeremy Laidman · Tue, 11 Oct 2011 16:56:14 +1100 ·

▸ quoted from Henrik Størner

On Thu, Oct 6, 2011 at 7:37 AM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:

* Have you looked at the vmstat1 graphs for these systems ? How is the "I/O
wait" on them ? Some types of I/O on Linux systems can cause quite a
slow-down; deleting large files on ext2 or ext3 systems could be quite
time-consuming and cause the whole system to really stall. Also doing things
that touch a lot of files - a large find, or grep'ing through a large number
of files, especially if you don't mount filesystems with the "noatime"
option - can cause a lot of I/O that slows down filesystem operations.

I've seen this happen during log rotation and compression, shown by "sar -d"
(I recommend installing sar if you haven't already).  The I/O contention
during removal of a big file after compression is sufficient to cause any
filesystem operation to block for a loooong time.  For me, this causes
sufficient back-pressure that syslog-ng starts dropping UDP packets while it
waits for the logfile to become writeable.

To prove that it's not a Xymon problem, why not create a cron task that does
the same thing, but logs how long it takes.  If the log shows delays, then
it's got nothing to do with Xymon.  Perhaps create /etc/cron.d/touchtest
with the following:

 * * * * * root time touch /path/to/file >> /tmp/touchlog 2>&1
 50 23 * * * root cp /dev/null /tmp/touchlog
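
One caveat, assuming cron runs the job under bash (as /bin/sh is on RHEL): "time" is then a shell keyword and its report goes to the shell's stderr rather than through the command's redirection, so the timings may land in cron mail instead of /tmp/touchlog. A small wrapper script avoids that and also adds a timestamp to each entry. This is a rough sketch, not a tested production script. The path /usr/local/bin/touchtest.sh and the function name log_touch are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/bin/touchtest.sh and run
# from /etc/cron.d/touchtest with:
#   * * * * * root /usr/local/bin/touchtest.sh /path/to/file /tmp/touchlog

# Append one line per run: "<timestamp> <elapsed>s rc=<status> <file> <error>".
# Elapsed time has whole-second resolution, which is enough to spot
# multi-second stalls.
log_touch() {
    lt_file=$1
    lt_log=$2
    lt_start=$(date +%s)
    lt_err=$(touch "$lt_file" 2>&1)   # capture any error text from touch
    lt_rc=$?
    lt_end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(( lt_end - lt_start ))s rc=$lt_rc $lt_file $lt_err" >> "$lt_log"
    return "$lt_rc"
}

# Run only when both arguments are supplied on the command line.
if [ $# -eq 2 ]; then
    log_touch "$1" "$2"
fi
```

Keeping the logic in a script also sidesteps crontab's special handling of the "%" character, which would otherwise need escaping in the date formats.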

Cheers
Jeremy

list Steve Holmes · Wed, 19 Oct 2011 17:02:02 -0400 ·

Thanks all. The problem has quieted down considerably this week so it isn't
as much of a priority as before. But see my replies below.

▸ quoted from Henrik Størner



On Wed, Oct 5, 2011 at 4:37 PM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:

On 05-10-2011 17:41, Steve Holmes wrote:

The test is an external script which basically does a 'sudo touch foo'
on each file system and waits for it to either return with no problem,
or return with an error indicating that the file system is read-only, or
after 60 seconds declares that the file system is 'hung'.

[snip]

The problem with the test is that once or twice a day we get a flurry of
alerts from a dozen or so servers, all at about the same time reporting
that there is a hung file system. Other file systems on the same server
are reporting that it takes longer to do the touch than we think it
should (e.g. 12 to 25 or even 60 seconds). The alerts all go away the
next test cycle. The file systems are on local fiber channel disks (i.e.
not NFS mounted). The servers getting the alerts are not all VMs and it
is not always the same set of servers that show up.

Sounds nasty; troubleshooting that kind of "only happens occasionally"
problem is really difficult.

OK, off the top of my head here are some ideas:

* sudo - what kind of user authentication are you using ? If it's LDAP or
NIS, could that explain why the test suddenly takes longer ?

Local authentication. So that isn't it.

▸ quoted from Henrik Størner

* clocks - how do you measure the time it takes to run the test ? If you
just use "date" before and after the touch-command, what happens if your
server's clocks are stepped (jump a few seconds) while the test is running?
In my experience, clocks on virtual machines are horrible at keeping correct
time and can quite easily skip a couple of seconds if set to follow the
clock of the host OS.

The script is sleeping for N seconds and adding N to a counter.  So, I don't
think that is the problem. We increased the total time for a test to 60
seconds (it was 30), but that didn't help. On some systems if we go much
more than 60 seconds per file system the whole test could break the 5 minute
limit (and I'd have to reduce the frequency of the test).
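
For readers following along, the sleep-and-count approach described here might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual script: the function name check_fs, the one-second step, and the result keywords are hypothetical, and the sudo from the real test is omitted.

```shell
#!/bin/sh
# check_fs DIR [LIMIT] [STEP]
# Touch DIR/foo in the background and poll until it finishes or LIMIT
# seconds have been counted. Prints one of: ok, readonly, hung, error.
check_fs() {
    cf_dir=$1
    cf_limit=${2:-60}
    cf_step=${3:-1}
    cf_err=$(mktemp) || return 2      # touch's error text is collected here

    # Detach stdout so a stuck touch cannot hold a caller's pipe open.
    LC_ALL=C touch "$cf_dir/foo" > /dev/null 2> "$cf_err" &
    cf_pid=$!

    cf_waited=0
    while kill -0 "$cf_pid" 2>/dev/null; do
        if [ "$cf_waited" -ge "$cf_limit" ]; then
            # A process stuck in disk wait usually cannot be killed,
            # so it is simply left behind.
            rm -f "$cf_err"
            echo hung
            return 1
        fi
        sleep "$cf_step"
        cf_waited=$(( cf_waited + cf_step ))   # counter of seconds slept
    done

    wait "$cf_pid"
    cf_rc=$?
    if [ "$cf_rc" -eq 0 ]; then
        cf_result=ok
    elif grep -q 'Read-only file system' "$cf_err"; then
        cf_result=readonly
    else
        cf_result=error
    fi
    rm -f "$cf_err"
    echo "$cf_result"
    [ "$cf_result" = ok ]
}
```

Because the counter only adds up the sleeps, it is independent of the wall clock, which matches the reasoning above. Reporting "hung" and "readonly" as separate results also makes it straightforward to alert on them at different severities.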

▸ quoted from Henrik Størner


* Have you looked at the vmstat1 graphs for these systems ? How is the "I/O
wait" on them ? Some types of I/O on Linux systems can cause quite a
slow-down; deleting large files on ext2 or ext3 systems could be quite
time-consuming and cause the whole system to really stall. Also doing things
that touch a lot of files - a large find, or grep'ing through a large number
of files, especially if you don't mount filesystems with the "noatime"
option - can cause a lot of I/O that slows down filesystem operations.

For some reason I don't have vmstat1 graphs. Old problem I've never gone
back to try to fix. So, no...

▸ quoted from Henrik Størner

* I've seen VMware Workstation consistently bring a system to its knees
when a VM was being shut down. Apparently some bad interaction between the
kernel version (2.6.18, if memory serves me right) and the way it was
updating the virtual disk images - it would just churn away for 5 or 10
minutes doing nothing but hitting the disk. Disappeared when I upgraded the
kernel on the box. No idea if your combination of RHEL and ESX could do the
same thing. But it was quite reproducible here, so it should be easy to
spot.

Our problem is really spotty, so probably not this. We've had the VMware
admins looking into underlying problems (which is where we think the problem
really is), but they've not come up with anything.

Just some thoughts.

Thanks!

Regards,
Henrik

From Jeremy:

▸ quoted from Jeremy Laidman


To prove that it's not a Xymon problem, why not create a cron task that does
the same thing, but logs how long it takes.  If the log shows delays, then
it's got nothing to do with Xymon.  Perhaps create /etc/cron.d/touchtest
with the following:

 * * * * * root time touch /path/to/file >> /tmp/touchlog 2>&1
 50 23 * * * root cp /dev/null /tmp/touchlog

We thought about doing exactly this, but have not yet implemented it.
Thanks,

Steve

