Critical System Page -- HTTP 500 Error

3 messages in this thread

list Schminke_Erik_D · Fri, 28 Jul 2017 10:06:32 -0500 ·

My Data Center operators have informed me that the Critical Systems Page
becomes unavailable for them due to an "Internal Server Error".  This
occurs predictably every night for them at approximately 2:45 AM (GMT-5)
and lasts approximately 30 minutes.

This problem has been impacting their ability to ensure that they properly
inform us if any of our systems with sensitive SLAs become unavailable.  A
quick solution would be appreciated!!

I have been monitoring the Critical Systems Page through Xymon itself for
about a week now and have included the history.  I have found no useful
information in any Xymon or Apache logs.  I also see no way of enabling
debug logging for this script.  I enabled the --debug option in
cgioptions.cfg, but that caused a bunch of debug output to be displayed on
the page, which produced even more complaints, so I turned it back off.

There is a series of entries in /var/log/messages every time the URL for
the page is requested.  During the time when the Critical Systems Page is
unavailable, no other CGI script that is part of Xymon has any issues.  (I
offer that up because all of these scripts are the same, hard-linked file.)

The /var/log/messages file shows me that ABRT is also capturing crash
information.  It *was* discarding this information until I enabled the
ProcessUnpackaged option in /etc/abrt/abrt-action-save-package-data.conf.
I have included all of the files (excluding the sosreport data; I am not
sure whether that might contain sensitive information).

All of this information I have placed into my GitHub repo for people to
view: https://github.com/edschminke/xymon

If I can include any other information about this crash, please let me
know.

Thanks!


Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com

list Schminke_Erik_D · Fri, 4 Aug 2017 10:53:21 -0500 ·

I think I can point to a specific cause for this issue.  It seems to be a
combination of the "uptime" test being in an alert condition and the same
test failing during an exclusion window on the Critical Systems Page.

I have a number of Windows systems monitored for uptime.

In analysis.cfg:
UP 10m 37d yellow

In critical.cfg:
CTX_Template|uptime|||*:0400:2400|1|EPD|System has rebooted|rchicks 2017-08-04 07:58:11

I also set Xymon to send me alerts for ALL systems between 2:30AM and
3:30AM, which is the typical window during which the Critical Systems Page
goes down.

In alerts.cfg:
HOST=%.*
    MAIL user-15513f33c451@xymon.invalid FORMAT=text REPEAT=1h TIME=*:0230:0330 FORMAT=text
    MAIL user-15513f33c451@xymon.invalid FORMAT=text TIME=*:0230:0330 FORMAT=text RECOVERED


Last night, around 2:45, 4 of these systems were rebooted.  As soon as the
first email was sent that a system went yellow for uptime, I got the alert
that http went red for the Critical Systems Page.  When the last email was
sent that uptime recovered, I got the alert that http recovered.

This morning, I rebooted a different Windows host.  I watched the test go
yellow, but the Critical Systems Page was fine.  In this case, the current
time was within the "Monitoring Time" window.  I then went into the
Critical Systems Editor and moved the "Monitoring Time" window so that the
current time fell outside it (e.g. current time 8AM, window: 12PM-12AM).
As soon as I refreshed the Critical Systems Page, it crashed.  When I
changed the "Monitoring Time" so that the current time was back inside the
window (e.g. starting at 4AM) and refreshed, the page loaded fine.

I repeated the same process with a few other tests: disk, memory and cpu.
I could not duplicate the problem with those tests.  I think the problem is
limited to uptime, but it could very well affect others.  It also does not
seem to matter
whether it is the actual host config, or a "cloned" host config.  The crash
happens with both.

If it matters, here's my environment:

I'm currently running Xymon v4.3.27.  The OS is Red Hat Enterprise Linux
v6.8.  Kernel is 2.6.32-431.el6.  Architecture is x86_64.  glibc version is
2.12-1.192.el6; for what it's worth, both the i686 and x86_64 packages are
installed.

A gdb backtrace shows that the crash occurs in a "strncmp" call in
lib/loadcriticalconf.c on line 249:

(gdb) backtrace
#0  0x0000003603729420 in __strncmp_sse42 () from /lib64/libc.so.6
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
#2  0x00000000004030eb in loadstatus (maxprio=3, maxage=31536000, mincolor=3, wantacked=0) at criticalview.c:115
#3  0x00000000004036f0 in main (argc=<value optimized out>, argv=<value optimized out>) at criticalview.c:513
(gdb) frame 1
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
249					if (strncmp(realkey, rec->key, strlen(realkey)) != 0) handle=xtreeEnd(rbconf);
(gdb) print realkey
$1 = 0x1c20c80 "CTX_Template|uptime"
(gdb) print *rec
$2 = {key = 0x435f6c65746e6957 <Address 0x435f6c65746e6957 out of bounds>, priority = 1769236850, starttime = 7310575213499737428, endtime = 0, crittime = 0x1c1d8e0 "Wintel_Critical_Template", ttgroup = 0x21 <Address 0x21 out of bounds>, ttextra = 0x6364727673737763 <Address 0x6364727673737763 out of bounds>, updinfo = 0x3603003d31 <Address 0x3603003d31 out of bounds>}

All of the crash details are still in my GitHub repo at
https://github.com/edschminke/xymon  ...including the coredump file.  I
suspect better C developers than myself can put that to a lot better use.


Thanks!


list Schminke_Erik_D · Mon, 28 Aug 2017 16:00:23 -0500 ·

I have another update to this issue.  I think I can pinpoint the problem a
little more specifically.  In my previous message, I blamed the "uptime"
test for causing the Critical Systems Page to crash.  After this weekend, I
discovered that there's a little more to it than that.

This weekend, the Critical Systems Page crashed due to a "disk" test being
non-green.  After picking random hosts/tests with varying degrees of
success, I finally noticed that it isn't necessarily "disk" or "uptime"
that causes it, but rather whichever test is the LAST test defined for a
host (or cloned host).  It just so happens that "uptime" usually ends up
being the LAST test defined for most of my hosts, since the Critical
Systems Page Editor sorts the entries as the file gets written.

To test, I made "disk" the last (only) test defined for a host.  I then
modified the thresholds for memory, disk and procs to put the tests into a
non-green state.  I set the "monitoring time" window to 11:58PM-11:59PM.
First, disk crashed the page.  I then duplicated the "disk" entry to
"memory", making that the last test for that host.  Disk no longer crashed
the page, but when I put memory into a non-green state, it would crash.  I
then made "procs" the last test for the host.  Memory no longer crashed the
page, but procs would after putting that test into a non-green state.

So in short, the conditions that trigger the crash are:
- The current time is OUTSIDE the monitoring time window.
- The target test is in a non-green state.
- The target test is the last (or only) test defined for a given host.

I imagine this must be caused by something running off the end of the loop.


