Xymon Mailing List Archive search

xymongen crashes in 4.3.29

8 messages in this thread

list Matt Vander Werf · Tue, 20 Aug 2019 11:10:53 -0400 ·
Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28), I've
gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like
clockwork. I have xymongen set to run every 1 minute and it has no problems
running any other time of the day. We are using the Terabithia RPMs and the
Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around that
time via cron, etc. and haven't found anything. The system logs don't show
anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able to
find anything unusual in either logs around that time. But it is dumping
core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and I've
found that they all show the same thing. It shows the same host in the
backtrace too (although I'm farily confident it isn't specific or isolated
to that host but just the first one it runs into that it has issues with
when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be the
cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce <Address 0xfce out of
bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    format=format at entry=0xfce <Address 0xfce out of bounds>,
ap=ap at entry=0x7ffd5dabcd38)
at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffd5dabcd38)
at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x558938a3aea0, repinfo=<optimized out>,
    hostname=0x558938a335d0 "<client hostname>", servicename=0x5589383b6d70
"procs",
    fromtime=<optimized out>, totime=1566273599,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005589375c38cc in init_state (filename=<optimized out>,
    filename at entry=0x7ffd5dacf210 "<client hostname>.procs", log=log at entry
=0x7ffd5dacf120)
    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead at entry=0x558937809d48
<dispsums>) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
xymongen.c:599


-- 
Matt Vander Werf
list Torsten Richter · Wed, 21 Aug 2019 11:43:00 +0200 (CEST) ·

Hi Matt,

dumb question: is there any cron job running at this time that is restarting XYmon fiddling with the network, like restarting the network for some reason?

Regards,
Torsten

Matt Vander Werf hat am 20. August 2019 um 17:10 geschrieben:

quoted from Matt Vander Werf
Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28), I've gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like clockwork. I have xymongen set to run every 1 minute and it has no problems running any other time of the day. We are using the Terabithia RPMs and the Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around that time via cron, etc. and haven't found anything. The system logs don't show anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able to find anything unusual in either logs around that time. But it is dumping core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and I've found that they all show the same thing. It shows the same host in the backtrace too (although I'm farily confident it isn't specific or isolated to that host but just the first one it runs into that it has issues with when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be the cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen --reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=) at sig.c:57
#3  
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce ) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s@entry=0x7ffd5dabcc00,
    format=format@entry=0xfce , ap=ap@entry=0x7ffd5dabcd38) at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "", flags=1, slen=18446744073709551615,
    format=0xfce , args=args@entry=0x7ffd5dabcd38) at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=, flags=flags@entry=1,
    slen=slen@entry=18446744073709551615, format=) at sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=, __s=)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 "Wed_Sep_2_19:34:55_2015", servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "") at availability.c:174
#11 parse_historyfile (fd=fd@entry=0x558938a3aea0, repinfo=,
    hostname=0x558938a335d0 "", servicename=0x5589383b6d70 "procs",
    fromtime=, totime=1566273599, for_history=for_history@entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at availability.c:475
#12 0x00005589375c38cc in init_state (filename=,
    filename@entry=0x7ffd5dacf210 ".procs", log=log@entry=0x7ffd5dacf120)
    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead@entry=0x558937809d48 ) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at xymongen.c:599


--
Matt Vander Werf

Xymon@xymon.com
list Matt Vander Werf · Thu, 22 Aug 2019 14:34:00 -0400 ·
Hi Torsten,

No, there wasn't anything running from cron or anything else around that
time, let alone anything that restarts the network or Xymon.

Thanks.

--
Matt Vander Werf
quoted from Torsten Richter


On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <user-c862b499d9fa@xymon.invalid> wrote:
Hi Matt,

dumb question: is there any cron job running at this time that is
restarting XYmon fiddling with the network, like restarting the network for
some reason?

Regards,
Torsten

Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid> hat am 20. August 2019 um 17:10
quoted from Torsten Richter
geschrieben:

Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28), I've
gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like
clockwork. I have xymongen set to run every 1 minute and it has no problems
running any other time of the day. We are using the Terabithia RPMs and the
Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around that
time via cron, etc. and haven't found anything. The system logs don't show
anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able to
find anything unusual in either logs around that time. But it is dumping
core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and I've
found that they all show the same thing. It shows the same host in the
backtrace too (although I'm farily confident it isn't specific or isolated
to that host but just the first one it runs into that it has issues with
when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be the
cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce <Address 0xfce out
of bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    format=format at entry=0xfce <Address 0xfce out of bounds>, ap=ap at entry=0x7ffd5dabcd38)
at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffd5dabcd38)
at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x558938a3aea0, repinfo=<optimized
out>,
    hostname=0x558938a335d0 "<client hostname>",
servicename=0x5589383b6d70 "procs",
    fromtime=<optimized out>, totime=1566273599,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005589375c38cc in init_state (filename=<optimized out>,

    filename at entry=0x7ffd5dacf210 "<client hostname>.procs", log=log at entry=0x7ffd5dacf120)
quoted from Torsten Richter

    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead at entry=0x558937809d48
<dispsums>) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
xymongen.c:599


--
Matt Vander Werf

list Japheth Cleaver · Thu, 22 Aug 2019 14:11:59 -0700 ·
Hi,

I think this might be xymongen in report mode from the "dailyreport" 
file in /tasks.d/; the timing would check out.? I believe the problem 
here is one of the Terabithia patches now doing the wrong thing after 
some of the string-handling changes in 4.3.29 -- causing core dumps in 
certain situations.

If you're running actual RHEL7 on this (not CentOS, which hasn't 
released 7.7 yet), would you mind checking the xymon-4.3.30-0.5 package 
in the EL7 Terabithia testing repo and see if this helps?
https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/

Regards,
-jc
quoted from Matt Vander Werf

On 8/22/2019 11:34 AM, Matt Vander Werf wrote:
Hi Torsten,

No, there wasn't anything running from cron or anything else around 
that time, let alone anything that restarts the network or Xymon.

Thanks.

--
Matt Vander Werf


On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <user-c862b499d9fa@xymon.invalid 
<mailto:user-c862b499d9fa@xymon.invalid>> wrote:

    Hi Matt,

    dumb question: is there any cron job running at this time that is
    restarting XYmon fiddling with the network, like restarting the
    network for some reason?

    Regards,
    Torsten
    Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid <mailto:user-dfc3cf2ca434@xymon.invalid>>
quoted from Matt Vander Werf
    hat am 20. August 2019 um 17:10 geschrieben:

    Hi all,

    Every day since we updated our Xymon server to 4.3.29 (from
    4.3.28), I've gotten an e-mail alert due to xymond turning red
    that reads:

    red xymongen program crashed

    Fatal signal caught!

    The strange thing is that this has happened at 1:04 AM every
    day...like clockwork. I have xymongen set to run every 1 minute
    and it has no problems running any other time of the day. We are
    using the Terabithia RPMs and the Xymon server is running RHEL 7.

    I've scoured the system to find anything that is set to run
    at/around that time via cron, etc. and haven't found anything.
    The system logs don't show anything is happening around that time
    either.

    I turned on debug logging for xymond and xymongen and haven't
    been able to find anything unusual in either logs around that
    time. But it is dumping core files for xymongen every time it
    crashes.

    I used gdb to get the backtrace on all of the core files (so far)
    and I've found that they all show the same thing. It shows the
    same host in the backtrace too (although I'm farily confident it
    isn't specific or isolated to that host but just the first one it
    runs into that it has issues with when processing).

    I've included an example gdb output below (the most recent one) [1].

    Is anyone else running into this by chance? Or any idea what
    might be the cause?

    Thanks!


    [1]
    # gdb -q /usr/libexec/xymon/xymongen core.16327
    Reading symbols from /usr/libexec/xymon/xymongen...Reading
    symbols from /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
    done.
    [New LWP 16327]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `/usr/libexec/xymon/xymongen
    --reportopts=1566187200:1566273599:0:nongr --recent'.
    Program terminated with signal 6, Aborted.
    #0 ?0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
    ../nptl/sysdeps/unix/sysv/linux/raise.c:55
    55 ?return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
    (gdb) bt
    #0 ?0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
    ../nptl/sysdeps/unix/sysv/linux/raise.c:55
    #1 ?0x00007f4657c4aa68 in __GI_abort () at abort.c:90
    #2 ?0x00005589375dd455 in sigsegv_handler (signum=<optimized
    out>) at sig.c:57
    #3 ?<signal handler called>
    #4 ?strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
    #5 ?0x00007f4657c5b681 in __find_specmb (format=0xfce <Address
    0xfce out of bounds>) at printf-parse.h:109
    #6 ?_IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    ? ? format=format at entry=0xfce <Address 0xfce out of bounds>,
    ap=ap at entry=0x7ffd5dabcd38) at vfprintf.c:1308
    #7 ?0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "",
    flags=1, slen=18446744073709551615,
    ? ? format=0xfce <Address 0xfce out of bounds>,
    args=args at entry=0x7ffd5dabcd38) at vsprintf_chk.c:83
    #8 ?0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
    flags=flags at entry=1,
    ? ? slen=slen at entry=18446744073709551615, format=<optimized out>)
    at sprintf_chk.c:32
    #9 ?0x00005589375ce8ca in sprintf (__fmt=<optimized out>,
    __s=<optimized out>)
    ? ? at /usr/include/bits/stdio2.h:33
    #10 parse_histlogfile (starttime=1566187200,
    ? ? timespec=0x558937840f50 <timespec.7157>
    "Wed_Sep_2_19:34:55_2015", servicename=0x5589383b6d70 "procs",
    ? ? hostname=0x558938a335d0 "<client hostname>") at
    availability.c:174
    #11 parse_historyfile (fd=fd at entry=0x558938a3aea0,
    repinfo=<optimized out>,
    ? ? hostname=0x558938a335d0 "<client hostname>",
    servicename=0x5589383b6d70 "procs",
    ? ? fromtime=<optimized out>, totime=1566273599,
    for_history=for_history at entry=0, warnlevel=97,
    ? ? greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0)
    at availability.c:475
    #12 0x00005589375c38cc in init_state (filename=<optimized out>,
    ? ? filename at entry=0x7ffd5dacf210 "<client hostname>.procs",
    log=log at entry=0x7ffd5dacf120)
    ? ? at loaddata.c:275
    #13 0x00005589375c45ee in load_state
    (sumhead=sumhead at entry=0x558937809d48 <dispsums>) at loaddata.c:626
    #14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
    xymongen.c:599


    -- 
    Matt Vander Werf
    
list Matt Vander Werf · Thu, 22 Aug 2019 17:29:12 -0400 ·
Hi JC,

Ah ha! That is one place I did not look and the timing certainly matches up!

I have installed that new version on my Xymon server (running actual RHEL
7) and we'll see how it fares tomorrow morning....

Thanks!

--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:12 PM Japheth Cleaver <user-87556346d4af@xymon.invalid>
quoted from Japheth Cleaver
wrote:
Hi,

I think this might be xymongen in report mode from the "dailyreport" file
in /tasks.d/; the timing would check out.  I believe the problem here is
one of the Terabithia patches now doing the wrong thing after some of the
string-handling changes in 4.3.29 -- causing core dumps in certain
situations.

If you're running actual RHEL7 on this (not CentOS, which hasn't released
7.7 yet), would you mind checking the xymon-4.3.30-0.5 package in the EL7
Terabithia testing repo and see if this helps?
https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/

Regards,
-jc

On 8/22/2019 11:34 AM, Matt Vander Werf wrote:

Hi Torsten,

No, there wasn't anything running from cron or anything else around that
time, let alone anything that restarts the network or Xymon.

Thanks.

--
Matt Vander Werf


On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <user-c862b499d9fa@xymon.invalid>
wrote:
Hi Matt,

dumb question: is there any cron job running at this time that is
restarting XYmon fiddling with the network, like restarting the network for
some reason?

Regards,
Torsten

Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid> hat am 20. August 2019 um 17:10
geschrieben:

Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28), I've
gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like
clockwork. I have xymongen set to run every 1 minute and it has no problems
running any other time of the day. We are using the Terabithia RPMs and the
Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around
that time via cron, etc. and haven't found anything. The system logs don't
show anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able
to find anything unusual in either logs around that time. But it is dumping
core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and
I've found that they all show the same thing. It shows the same host in the
backtrace too (although I'm farily confident it isn't specific or isolated
to that host but just the first one it runs into that it has issues with
when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be the
cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce <Address 0xfce out
of bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    format=format at entry=0xfce <Address 0xfce out of bounds>, ap=ap at entry=0x7ffd5dabcd38)
at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffd5dabcd38)
at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x558938a3aea0, repinfo=<optimized
out>,
    hostname=0x558938a335d0 "<client hostname>",
servicename=0x5589383b6d70 "procs",
    fromtime=<optimized out>, totime=1566273599,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005589375c38cc in init_state (filename=<optimized out>,
    filename at entry=0x7ffd5dacf210 "<client hostname>.procs",
log=log at entry=0x7ffd5dacf120)
    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead at entry=0x558937809d48
<dispsums>) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
xymongen.c:599


--
Matt Vander Werf

Xymon mailing user-d459c9d661b6@xymon.invalid

list Matt Vander Werf · Fri, 23 Aug 2019 12:00:31 -0400 ·
Hi JC,

Unfortunately, this didn't seem to fix the crashes. Today I got the e-mail
at 1:05 AM, though the core file has a timestamp of 1:04 AM. This time
frame still matches up with it being the dailyreport task that is
triggering the crashes (since there are no crashes any other time of the
day).

[root@<xymon server> ~]# xymoncmd --version
Xymon version 4.3.30-0.5.el7.terabithia
[root@<xymon server> ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)

The latest core backtrace looks to be the same as previously (same client
and service too), but I'm including it here [1] just for completeness.

Let me know if there's anything else I can provide to debug this.

Thanks!


[1]
[root@<xymon server> ~]# gdb -q /usr/libexec/xymon/xymongen core.1312
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 1312]
quoted from Matt Vander Werf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen

--reportopts=1566446400:1566532799:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fc746f7fa68 in __GI_abort () at abort.c:90
#2  0x00005559fc1ce4f5 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007fc746f90681 in __find_specmb (format=0xfce <Address 0xfce out of
bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffeb3e6d340,
    format=format at entry=0xfce <Address 0xfce out of bounds>,
ap=ap at entry=0x7ffeb3e6d478)
at vfprintf.c:1308
#7  0x00007fc74705dc78 in ___vsprintf_chk (s=0x7ffeb3e6d6c2 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffeb3e6d478)
at vsprintf_chk.c:83
#8  0x00007fc74705dbcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005559fc1bf96a in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566446400,
    timespec=0x5559fc431f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5559fd61b2d0 "procs",
    hostname=0x5559fdc94520 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x5559fdc9be00, repinfo=<optimized out>,
    hostname=0x5559fdc94520 "<client hostname>", servicename=0x5559fd61b2d0
"procs",
    fromtime=<optimized out>, totime=1566532799,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005559fc1b496c in init_state (filename=<optimized out>,
    filename at entry=0x7ffeb3e7f950 "<client hostname>.procs", log=log at entry
=0x7ffeb3e7f860)
    at loaddata.c:275
#13 0x00005559fc1b568e in load_state (sumhead=sumhead at entry=0x5559fc3fad48
<dispsums>) at loaddata.c:626
#14 0x00005559fc1af794 in main (argc=5, argv=0x7ffeb3e84b58) at
xymongen.c:599
quoted from Matt Vander Werf


--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:29 PM Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid> wrote:
Hi JC,

Ah ha! That is one place I did not look and the timing certainly matches
up!

I have installed that new version on my Xymon server (running actual RHEL
7) and we'll see how it fares tomorrow morning....

Thanks!

--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:12 PM Japheth Cleaver <user-87556346d4af@xymon.invalid>
wrote:
Hi,

I think this might be xymongen in report mode from the "dailyreport" file
in /tasks.d/; the timing would check out.  I believe the problem here is
one of the Terabithia patches now doing the wrong thing after some of the
string-handling changes in 4.3.29 -- causing core dumps in certain
situations.

If you're running actual RHEL7 on this (not CentOS, which hasn't released
7.7 yet), would you mind checking the xymon-4.3.30-0.5 package in the EL7
Terabithia testing repo and see if this helps?
https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/

Regards,
-jc

On 8/22/2019 11:34 AM, Matt Vander Werf wrote:

Hi Torsten,

No, there wasn't anything running from cron or anything else around that
time, let alone anything that restarts the network or Xymon.

Thanks.

--
Matt Vander Werf


On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <user-c862b499d9fa@xymon.invalid>
wrote:
Hi Matt,

dumb question: is there any cron job running at this time that is
restarting XYmon fiddling with the network, like restarting the network for
some reason?

Regards,
Torsten

Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid> hat am 20. August 2019 um 17:10
geschrieben:

Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28),
I've gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like
clockwork. I have xymongen set to run every 1 minute and it has no problems
running any other time of the day. We are using the Terabithia RPMs and the
Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around
that time via cron, etc. and haven't found anything. The system logs don't
show anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able
to find anything unusual in either logs around that time. But it is dumping
core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and
I've found that they all show the same thing. It shows the same host in the
backtrace too (although I'm farily confident it isn't specific or isolated
to that host but just the first one it runs into that it has issues with
when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be
the cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce <Address 0xfce out
of bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    format=format at entry=0xfce <Address 0xfce out of bounds>, ap=ap at entry=0x7ffd5dabcd38)
at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffd5dabcd38)
at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x558938a3aea0, repinfo=<optimized
out>,
    hostname=0x558938a335d0 "<client hostname>",
servicename=0x5589383b6d70 "procs",
    fromtime=<optimized out>, totime=1566273599,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005589375c38cc in init_state (filename=<optimized out>,
    filename at entry=0x7ffd5dacf210 "<client hostname>.procs",
log=log at entry=0x7ffd5dacf120)
    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead at entry=0x558937809d48
<dispsums>) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
xymongen.c:599


--
Matt Vander Werf

Xymon mailing user-d459c9d661b6@xymon.invalid

list Japheth Cleaver · Tue, 27 Aug 2019 13:15:01 -0700 ·
Alright, I believe I have fixed the issue here... Multiple issues within 
the availability code (fixed in 
https://sourceforge.net/p/xymon/code/8081/), and then a typo in a 
Terabithia patch. Please try out 4.3.30-0.6 in the /testing/ repo if 
possible.

You can also manually perform a run of this by executing: `xymoncmd 
xymonreports.sh daily` as the xymon user. That should give a 
reproduceable crash in 4.3.30-0.5 and be clean in 4.3.30-0.6.

I've also built the EL7 packages on a CentOS 7 box, which should provide 
proper compatibility while we're in a mixed 7.6/7.7 state in the ecosystem.

Regards,
-jc
quoted from Matt Vander Werf


On 8/23/2019 9:00 AM, Matt Vander Werf wrote:
Hi JC,

Unfortunately, this didn't seem to fix the crashes. Today I got the 
e-mail at 1:05 AM, though the core file has a timestamp of 1:04 AM. 
This time frame still matches up with it being the dailyreport task 
that is triggering the crashes (since there are no crashes any other 
time of the day).

[root@<xymon server> ~]# xymoncmd --version
Xymon version 4.3.30-0.5.el7.terabithia
[root@<xymon server> ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)

The latest core backtrace looks to be the same as previously (same 
client and service too), but I'm including it here [1] just for 
completeness.

Let me know if there's anything else I can provide to debug this.

Thanks!


[1]
[root@<xymon server> ~]# gdb -q /usr/libexec/xymon/xymongen core.1312
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols 
from /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 1312]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen 
--reportopts=1566446400:1566532799:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0 ?0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 ?return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 ?0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 ?0x00007fc746f7fa68 in __GI_abort () at abort.c:90
#2 ?0x00005559fc1ce4f5 in sigsegv_handler (signum=<optimized out>) at 
sig.c:57
#3 ?<signal handler called>
#4 ?strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5 ?0x00007fc746f90681 in __find_specmb (format=0xfce <Address 0xfce 
out of bounds>) at printf-parse.h:109
#6 ?_IO_vfprintf_internal (s=s at entry=0x7ffeb3e6d340,
? ? format=format at entry=0xfce <Address 0xfce out of bounds>, 
ap=ap at entry=0x7ffeb3e6d478) at vfprintf.c:1308
#7 ?0x00007fc74705dc78 in ___vsprintf_chk (s=0x7ffeb3e6d6c2 "", 
flags=1, slen=18446744073709551615,
? ? format=0xfce <Address 0xfce out of bounds>, 
args=args at entry=0x7ffeb3e6d478) at vsprintf_chk.c:83
#8 ?0x00007fc74705dbcd in ___sprintf_chk (s=<optimized out>, 
flags=flags at entry=1,
? ? slen=slen at entry=18446744073709551615, format=<optimized out>) at 
sprintf_chk.c:32
#9 ?0x00005559fc1bf96a in sprintf (__fmt=<optimized out>, 
__s=<optimized out>)
? ? at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566446400,
? ? timespec=0x5559fc431f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015", 
servicename=0x5559fd61b2d0 "procs",
? ? hostname=0x5559fdc94520 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x5559fdc9be00, repinfo=<optimized 
out>,
? ? hostname=0x5559fdc94520 "<client hostname>", 
servicename=0x5559fd61b2d0 "procs",
? ? fromtime=<optimized out>, totime=1566532799, 
for_history=for_history at entry=0, warnlevel=97,
? ? greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at 
availability.c:475
#12 0x00005559fc1b496c in init_state (filename=<optimized out>,
? ? filename at entry=0x7ffeb3e7f950 "<client hostname>.procs", 

log=log at entry=0x7ffeb3e7f860)
quoted from Matt Vander Werf
? ? at loaddata.c:275
#13 0x00005559fc1b568e in load_state 
(sumhead=sumhead at entry=0x5559fc3fad48 <dispsums>) at loaddata.c:626
#14 0x00005559fc1af794 in main (argc=5, argv=0x7ffeb3e84b58) at 
xymongen.c:599


--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:29 PM Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid 
<mailto:user-dfc3cf2ca434@xymon.invalid>> wrote:

    Hi JC,

    Ah ha! That is one place I did not look and the timing certainly
    matches up!

    I have installed that new version on my Xymon server (running
    actual RHEL 7) and we'll see how it fares tomorrow morning....

    Thanks!

    --
    Matt Vander Werf


    On Thu, Aug 22, 2019 at 5:12 PM Japheth Cleaver
    <user-87556346d4af@xymon.invalid <mailto:user-87556346d4af@xymon.invalid>> wrote:

        Hi,

        I think this might be xymongen in report mode from the
        "dailyreport" file in /tasks.d/; the timing would check out.?
        I believe the problem here is one of the Terabithia patches
        now doing the wrong thing after some of the string-handling
        changes in 4.3.29 -- causing core dumps in certain situations.

        If you're running actual RHEL7 on this (not CentOS, which
        hasn't released 7.7 yet), would you mind checking the
        xymon-4.3.30-0.5 package in the EL7 Terabithia testing repo
        and see if this helps?
        https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/

        Regards,
        -jc

        On 8/22/2019 11:34 AM, Matt Vander Werf wrote:
        Hi Torsten,

        No, there wasn't anything running from cron or anything else
        around that time, let alone anything that restarts the
        network or Xymon.

        Thanks.

        --
        Matt Vander Werf


        On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter
        <user-c862b499d9fa@xymon.invalid <mailto:user-c862b499d9fa@xymon.invalid>> wrote:

            Hi Matt,

            dumb question: is there any cron job running at this time
            that is restarting XYmon fiddling with the network, like
            restarting the network for some reason?

            Regards,
            Torsten
            Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid
            <mailto:user-dfc3cf2ca434@xymon.invalid>> hat am 20. August 2019 um
            17:10 geschrieben:

            Hi all,

            Every day since we updated our Xymon server to 4.3.29
            (from 4.3.28), I've gotten an e-mail alert due to xymond
            turning red that reads:

            red xymongen program crashed

            Fatal signal caught!

            The strange thing is that this has happened at 1:04 AM
            every day...like clockwork. I have xymongen set to run
            every 1 minute and it has no problems running any other
            time of the day. We are using the Terabithia RPMs and
            the Xymon server is running RHEL 7.

            I've scoured the system to find anything that is set to
            run at/around that time via cron, etc. and haven't found
            anything. The system logs don't show anything is
            happening around that time either.

            I turned on debug logging for xymond and xymongen and
            haven't been able to find anything unusual in either
            logs around that time. But it is dumping core files for
            xymongen every time it crashes.

            I used gdb to get the backtrace on all of the core files
            (so far) and I've found that they all show the same
            thing. It shows the same host in the backtrace too
            (although I'm farily confident it isn't specific or
            isolated to that host but just the first one it runs
            into that it has issues with when processing).

            I've included an example gdb output below (the most
            recent one) [1].

            Is anyone else running into this by chance? Or any idea
            what might be the cause?

            Thanks!


            [1]
            # gdb -q /usr/libexec/xymon/xymongen core.16327
            Reading symbols from
            /usr/libexec/xymon/xymongen...Reading symbols from
            /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
            done.
            [New LWP 16327]
            [Thread debugging using libthread_db enabled]
            Using host libthread_db library "/lib64/libthread_db.so.1".
            Core was generated by `/usr/libexec/xymon/xymongen
            --reportopts=1566187200:1566273599:0:nongr --recent'.
            Program terminated with signal 6, Aborted.
            #0 ?0x00007f4657c49377 in __GI_raise (sig=sig at entry=6)
            at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
            55 ?return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
            (gdb) bt
            #0 ?0x00007f4657c49377 in __GI_raise (sig=sig at entry=6)
            at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
            #1 ?0x00007f4657c4aa68 in __GI_abort () at abort.c:90
            #2 ?0x00005589375dd455 in sigsegv_handler
            (signum=<optimized out>) at sig.c:57
            #3 ?<signal handler called>
            #4 ?strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
            #5 ?0x00007f4657c5b681 in __find_specmb (format=0xfce
            <Address 0xfce out of bounds>) at printf-parse.h:109
            #6 ?_IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
            ? ? format=format at entry=0xfce <Address 0xfce out of
            bounds>, ap=ap at entry=0x7ffd5dabcd38) at vfprintf.c:1308
            #7 ?0x00007f4657d28c78 in ___vsprintf_chk
            (s=0x7ffd5dabcf82 "", flags=1, slen=18446744073709551615,
            ? ? format=0xfce <Address 0xfce out of bounds>,
            args=args at entry=0x7ffd5dabcd38) at vsprintf_chk.c:83
            #8 ?0x00007f4657d28bcd in ___sprintf_chk (s=<optimized
            out>, flags=flags at entry=1,
            slen=slen at entry=18446744073709551615, format=<optimized
            out>) at sprintf_chk.c:32
            #9 ?0x00005589375ce8ca in sprintf (__fmt=<optimized
            out>, __s=<optimized out>)
            ? ? at /usr/include/bits/stdio2.h:33
            #10 parse_histlogfile (starttime=1566187200,
            ? ? timespec=0x558937840f50 <timespec.7157>
            "Wed_Sep_2_19:34:55_2015", servicename=0x5589383b6d70
            "procs",
            ? ? hostname=0x558938a335d0 "<client hostname>") at
            availability.c:174
            #11 parse_historyfile (fd=fd at entry=0x558938a3aea0,
            repinfo=<optimized out>,
            ? ? hostname=0x558938a335d0 "<client hostname>",
            servicename=0x5589383b6d70 "procs",
            ? ? fromtime=<optimized out>, totime=1566273599,
            for_history=for_history at entry=0, warnlevel=97,
            ? ? greenlevel=99.995000000000005, warnstops=-1,
            reporttime=0x0) at availability.c:475
            #12 0x00005589375c38cc in init_state
            (filename=<optimized out>,
            ? ? filename at entry=0x7ffd5dacf210 "<client
            hostname>.procs", log=log at entry=0x7ffd5dacf120)
            ? ? at loaddata.c:275
            #13 0x00005589375c45ee in load_state
            (sumhead=sumhead at entry=0x558937809d48 <dispsums>) at
            loaddata.c:626
            #14 0x00005589375be6f4 in main (argc=5,
            argv=0x7ffd5dad4418) at xymongen.c:599


            -- 
            Matt Vander Werf
            
list Matt Vander Werf · Wed, 28 Aug 2019 12:16:16 -0400 ·
Hi JC,

Awesome! Thanks for putting this together!

I installed 4.3.30-0.6.el7.terabithia yesterday evening and there was no
xymongen crash this morning (no e-mail and no core dump)! I'll keep an eye
out the next couple days, but looks like these changes have fixed the
crashes we've been seeing!

Thanks for your help as always!

--
Matt Vander Werf


On Tue, Aug 27, 2019 at 4:15 PM Japheth Cleaver <user-87556346d4af@xymon.invalid>
quoted from Japheth Cleaver
wrote:
Alright, I believe I have fixed the issue here... Multiple issues within
the availability code (fixed in https://sourceforge.net/p/xymon/code/8081/),
and then a typo in a Terabithia patch. Please try out 4.3.30-0.6 in the
/testing/ repo if possible.

You can also manually perform a run of this by executing: `xymoncmd
xymonreports.sh daily` as the xymon user. That should give a reproduceable
crash in 4.3.30-0.5 and be clean in 4.3.30-0.6.

I've also built the EL7 packages on a CentOS 7 box, which should provide
proper compatibility while we're in a mixed 7.6/7.7 state in the ecosystem.

Regards,
-jc


On 8/23/2019 9:00 AM, Matt Vander Werf wrote:

Hi JC,

Unfortunately, this didn't seem to fix the crashes. Today I got the e-mail
at 1:05 AM, though the core file has a timestamp of 1:04 AM. This time
frame still matches up with it being the dailyreport task that is
triggering the crashes (since there are no crashes any other time of the
day).

[root@<xymon server> ~]# xymoncmd --version
Xymon version 4.3.30-0.5.el7.terabithia
[root@<xymon server> ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)

The latest core backtrace looks to be the same as previously (same client
and service too), but I'm including it here [1] just for completeness.

Let me know if there's anything else I can provide to debug this.

Thanks!


[1]
[root@<xymon server> ~]# gdb -q /usr/libexec/xymon/xymongen core.1312
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 1312]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566446400:1566532799:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007fc746f7e377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fc746f7fa68 in __GI_abort () at abort.c:90
#2  0x00005559fc1ce4f5 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007fc746f90681 in __find_specmb (format=0xfce <Address 0xfce out
of bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffeb3e6d340,
    format=format at entry=0xfce <Address 0xfce out of bounds>, ap=ap at entry=0x7ffeb3e6d478)
at vfprintf.c:1308
#7  0x00007fc74705dc78 in ___vsprintf_chk (s=0x7ffeb3e6d6c2 "", flags=1,
slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffeb3e6d478)
at vsprintf_chk.c:83
#8  0x00007fc74705dbcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005559fc1bf96a in sprintf (__fmt=<optimized out>, __s=<optimized
out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566446400,
    timespec=0x5559fc431f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5559fd61b2d0 "procs",
    hostname=0x5559fdc94520 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x5559fdc9be00, repinfo=<optimized
out>,
    hostname=0x5559fdc94520 "<client hostname>",
servicename=0x5559fd61b2d0 "procs",
    fromtime=<optimized out>, totime=1566532799,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005559fc1b496c in init_state (filename=<optimized out>,
    filename at entry=0x7ffeb3e7f950 "<client hostname>.procs", log=log at entry
=0x7ffeb3e7f860)
    at loaddata.c:275
#13 0x00005559fc1b568e in load_state (sumhead=sumhead at entry=0x5559fc3fad48
<dispsums>) at loaddata.c:626
#14 0x00005559fc1af794 in main (argc=5, argv=0x7ffeb3e84b58) at
xymongen.c:599


--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:29 PM Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid>
wrote:
Hi JC,

Ah ha! That is one place I did not look and the timing certainly matches
up!

I have installed that new version on my Xymon server (running actual RHEL
7) and we'll see how it fares tomorrow morning....

Thanks!

--
Matt Vander Werf


On Thu, Aug 22, 2019 at 5:12 PM Japheth Cleaver <user-87556346d4af@xymon.invalid>
wrote:
Hi,

I think this might be xymongen in report mode from the "dailyreport"
file in /tasks.d/; the timing would check out.  I believe the problem here
is one of the Terabithia patches now doing the wrong thing after some of
the string-handling changes in 4.3.29 -- causing core dumps in certain
situations.

If you're running actual RHEL7 on this (not CentOS, which hasn't
released 7.7 yet), would you mind checking the xymon-4.3.30-0.5 package in
the EL7 Terabithia testing repo and see if this helps?
https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/

Regards,
-jc

On 8/22/2019 11:34 AM, Matt Vander Werf wrote:

Hi Torsten,

No, there wasn't anything running from cron or anything else around that
time, let alone anything that restarts the network or Xymon.

Thanks.

--
Matt Vander Werf


On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <user-c862b499d9fa@xymon.invalid>
wrote:
Hi Matt,

dumb question: is there any cron job running at this time that is
restarting XYmon fiddling with the network, like restarting the network for
some reason?

Regards,
Torsten

Matt Vander Werf <user-dfc3cf2ca434@xymon.invalid> hat am 20. August 2019 um 17:10
geschrieben:

Hi all,

Every day since we updated our Xymon server to 4.3.29 (from 4.3.28),
I've gotten an e-mail alert due to xymond turning red that reads:

red xymongen program crashed

Fatal signal caught!

The strange thing is that this has happened at 1:04 AM every day...like
clockwork. I have xymongen set to run every 1 minute and it has no problems
running any other time of the day. We are using the Terabithia RPMs and the
Xymon server is running RHEL 7.

I've scoured the system to find anything that is set to run at/around
that time via cron, etc. and haven't found anything. The system logs don't
show anything is happening around that time either.

I turned on debug logging for xymond and xymongen and haven't been able
to find anything unusual in either logs around that time. But it is dumping
core files for xymongen every time it crashes.

I used gdb to get the backtrace on all of the core files (so far) and
I've found that they all show the same thing. It shows the same host in the
backtrace too (although I'm farily confident it isn't specific or isolated
to that host but just the first one it runs into that it has issues with
when processing).

I've included an example gdb output below (the most recent one) [1].

Is anyone else running into this by chance? Or any idea what might be
the cause?

Thanks!


[1]
# gdb -q /usr/libexec/xymon/xymongen core.16327
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from
/usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 16327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/xymon/xymongen
--reportopts=1566187200:1566273599:0:nongr --recent'.
Program terminated with signal 6, Aborted.
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
55  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f4657c49377 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f4657c4aa68 in __GI_abort () at abort.c:90
#2  0x00005589375dd455 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#5  0x00007f4657c5b681 in __find_specmb (format=0xfce <Address 0xfce
out of bounds>) at printf-parse.h:109
#6  _IO_vfprintf_internal (s=s at entry=0x7ffd5dabcc00,
    format=format at entry=0xfce <Address 0xfce out of bounds>,
ap=ap at entry=0x7ffd5dabcd38) at vfprintf.c:1308
#7  0x00007f4657d28c78 in ___vsprintf_chk (s=0x7ffd5dabcf82 "",
flags=1, slen=18446744073709551615,
    format=0xfce <Address 0xfce out of bounds>, args=args at entry=0x7ffd5dabcd38)
at vsprintf_chk.c:83
#8  0x00007f4657d28bcd in ___sprintf_chk (s=<optimized out>,
flags=flags at entry=1,
    slen=slen at entry=18446744073709551615, format=<optimized out>) at
sprintf_chk.c:32
#9  0x00005589375ce8ca in sprintf (__fmt=<optimized out>,
__s=<optimized out>)
    at /usr/include/bits/stdio2.h:33
#10 parse_histlogfile (starttime=1566187200,
    timespec=0x558937840f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
servicename=0x5589383b6d70 "procs",
    hostname=0x558938a335d0 "<client hostname>") at availability.c:174
#11 parse_historyfile (fd=fd at entry=0x558938a3aea0, repinfo=<optimized
out>,
    hostname=0x558938a335d0 "<client hostname>",
servicename=0x5589383b6d70 "procs",
    fromtime=<optimized out>, totime=1566273599,
for_history=for_history at entry=0, warnlevel=97,
    greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
availability.c:475
#12 0x00005589375c38cc in init_state (filename=<optimized out>,
    filename at entry=0x7ffd5dacf210 "<client hostname>.procs",
log=log at entry=0x7ffd5dacf120)
    at loaddata.c:275
#13 0x00005589375c45ee in load_state (sumhead=sumhead at entry=0x558937809d48
<dispsums>) at loaddata.c:626
#14 0x00005589375be6f4 in main (argc=5, argv=0x7ffd5dad4418) at
xymongen.c:599


--
Matt Vander Werf

Xymon mailing user-d459c9d661b6@xymon.invalid