To get more information, I enabled "--debug" on both channels (status and data).
With that, rrd-status.log shows a bit more detail:
....
2019-10-17 13:40:02.376153 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
408 2019-10-17 13:40:02.376181 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376185 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376203 xymond_rrd: Got message 612 @@status#612/synologyhost.domain.eu|1571308802.357389|83.99.221.6||synologyhost.domain.eu|procs|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376210 startpos 95710, fillpos 99309, endpos 97006
408 2019-10-17 13:40:02.376227 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376233 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376244 xymond_rrd: Got message 613 @@status#613/synologyhost.domain.eu|1571308802.357673|83.99.221.6||synologyhost.domain.eu|raid|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376251 startpos 97010, fillpos 99309, endpos 97945
408 2019-10-17 13:40:02.376269 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376276 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376288 xymond_rrd: Got message 614 @@status#614/synologyhost.domain.eu|1571308802.368308|83.99.221.6||synologyhost.domain.eu|temperature|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376294 startpos 97949, fillpos 99309, endpos 98645
2019-10-17 13:40:02.381339 Child process 408 died: Signal 6
2019-10-17 13:40:04.432302 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-17 13:40:04.452708 Peer not up, flushing message queue
13920 2019-10-17 13:40:04.557656 setup_feedback_queue: got ID -1 for key 0xA03EB91
13920 2019-10-17 13:40:04.558141 Opening file /u01/app/xymon/product/xymon4.3.30/server/etc/rrddefinitions.cfg
13920 2019-10-17 13:40:04.558326 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=1052671
13920 2019-10-17 13:40:04.558359 Got 6716 bytes
...
This is the processing of data from our Synology NAS, which has Synology Monitoring Tool 1.4.8 (http://www.sysco.ch/synomon/) enabled.
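To find the offending test without reading the whole debug log by eye, the log can be scanned for each "Child process N died" line and paired with the last "Got message" line logged by that PID. The following is only a rough sketch: the regular expressions and the field positions (host in field 4, test name in field 5 of the "|"-separated message) are assumptions taken from the excerpt above, not from Xymon documentation, and the function name is made up for illustration.

```python
import re

# Patterns modelled on the debug lines quoted above (assumed format).
GOT_MSG = re.compile(r"^(\d+) \S+ \S+ xymond_rrd: Got message \d+ (@@\S+)")
DIED = re.compile(r"Child process (\d+) died: Signal (\d+)")


def last_message_before_crash(lines):
    """Return (pid, signal, host, test) for every worker reported dead."""
    last_seen = {}  # pid -> (host, test) of the most recent message
    crashes = []
    for line in lines:
        got = GOT_MSG.match(line)
        if got:
            fields = got.group(2).split("|")
            if len(fields) > 5:
                # Assumption from the excerpt: field 4 is the host name,
                # field 5 is the test name.
                last_seen[got.group(1)] = (fields[4], fields[5])
            continue
        died = DIED.search(line)
        if died:
            pid, signal = died.groups()
            host, test = last_seen.pop(pid, (None, None))
            crashes.append((pid, int(signal), host, test))
    return crashes


# Two lines copied from the excerpt above.
sample = [
    "408 2019-10-17 13:40:02.376288 xymond_rrd: Got message 614 "
    "@@status#614/synologyhost.domain.eu|1571308802.368308|83.99.221.6||"
    "synologyhost.domain.eu|temperature|1571326802|green||green|1570620002"
    "|0||0||1571051696||p_cominder|0|",
    "2019-10-17 13:40:02.381339 Child process 408 died: Signal 6",
]

for pid, signal, host, test in last_message_before_crash(sample):
    print(f"pid {pid} died with signal {signal}; last message: {host}/{test}")
```

To run it against the real file, pass an open file object instead of the sample list, for example `last_message_before_crash(open("rrd-status.log", errors="replace"))`.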
Note that despite the xymond_rrd crash, the "temperature" status itself is shown correctly, with text like:
--
Device Temp(C) Temp(F)
green system 52 125
green /dev/sda 36 96
green /dev/sdb 38 100
green /dev/sdd 36 96
Synology Monitoring Tool 1.4.8, http://www.sysco.ch/synomon/
Model: RS812+ (synologyhost,domain.eu)
Processor: Intel(R) Atom(TM) CPU D2701 @ 2.13GHz
System temperature: 52°C
Serial number: serialnumberdata-replaced
Firmware: 6.2-24922
MAC address(s): number-replaced, number-replaced
Linux version 3.10.105 (root@build10) (gcc version 4.9.3 20150311 (prerelease) (crosstool-NG 1.20.0) ) #24922 SMP Fri May 10 02:51:01 CST 2019
--
After we stopped the plugin on the Synology, no more data arrived from it and xymond_rrd stopped crashing (red changed to purple, as expected).
I am not sure where the bug is, so I have added the Synology Monitoring Tool developers' e-mail address to this thread.
Please review and give us a hint on how to fix the problem: monitoring the state of our NAS is critical for us.
The suspicion is also confirmed by the GDB output (obtained as instructed at http://www.robertandrobert.com/xymon/help/known-issues.html ):
--
[xymon@synologyhost server]$ /bin/gdb /u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd tmp/core.408
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
... copyright...
...
Reading symbols from /u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd...done.
[New LWP 408]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond_rrd --rrddir=/u01/app/xymon/product/xymon4.3.30/data/rrd --debug'.
Program terminated with signal 6, Aborted.
#0 0x00007f62fcd85337 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-10.el7_3.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7.x86_64 fribidi-1.0.2-1.el7.x86_64 glib2-2.56.1-5.el7.x86_64 glibc-2.17-292.el7.x86_64 graphite2-1.3.10-1.el7_3.x86_64 harfbuzz-1.7.5-2.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libffi-3.0.13-18.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64 libselinux-2.5-14.1.el7.x86_64 libthai-0.1.14-9.el7.x86_64 libtirpc-0.2.4-0.16.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pango-1.42.4-4.el7_7.x86_64 pcre-8.32-17.el7.x86_64 pixman-0.34.0-1.el7.x86_64 rrdtool-1.4.8-9.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb)
(gdb)
(gdb) bt
#0 0x00007f62fcd85337 in raise () at /lib64/libc.so.6
#1 0x00007f62fcd86a28 in abort () at /lib64/libc.so.6
#2 0x0000000000428e63 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 0x00007f62fcd853b0 in <signal handler called> () at /lib64/libc.so.6
#4 0x00007f62fcd89f97 in ____strtoll_l_internal () at /lib64/libc.so.6
#5 0x000000000040f9c2 in do_temperature_rrd (__nptr=0x0) at /usr/include/stdlib.h:280
#6 0x000000000040f9c2 in do_temperature_rrd (hostname=hostname@entry=0x7f62fdfceb43 "synologyhost.domain.eu", testname=testname@entry=0x7f62fdfceb58 "temperature", classname=classname@entry=0x7f62fdfceb99 "p_cominder", pagepaths=pagepaths@entry=0x7f62fdfceba4 "0", msg=msg@entry=0x7f62fdfceba7 "status+300 synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 [synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, "Temp(C) Temp(F)\n", '-' <repeats 39 times>, "\n&green system"..., tstamp=tstamp@entry=1571308802) at rrd/do_temperature.c:100
#7 0x000000000041316b in update_rrd (hostname=hostname@entry=0x7f62fdfceb43 "synologyhost.domain.eu", testname=<optimized out>,
testname@entry=0x7f62fdfceb58 "temperature", msg=msg@entry=0x7f62fdfceba7 "status+300 synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 [synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, "Temp(C) Temp(F)\n", '-' <repeats 39 times>, "\n&green system"..., tstamp=tstamp@entry=1571308802, sender=sender@entry=0x7f62fdfceb36 "83.99.221.6", ldef=<optimized out>, classname=classname@entry=0x7f62fdfceb99 "p_cominder", pagepaths=pagepaths@entry=0x7f62fdfceba4 "0") at do_rrd.c:714
#8 0x0000000000403434 in main (argc=<optimized out>, argv=0x7ffffb4bd4b8) at xymond_rrd.c:391
(gdb)
--
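Frames #4 and #5 of the backtrace show a string-to-integer conversion receiving a NULL pointer (__nptr=0x0) inside do_temperature_rrd, at rrd/do_temperature.c:100. I do not know which part of the message triggers it. As an illustration only, here is a small sketch of the kind of check that could be run on the plugin output to spot rows that do not have both numeric columns. It is not the code from do_temperature.c; the row pattern is an assumption based on the "&green system" fragment visible in the backtrace, and the incomplete "/dev/sdc" row is hypothetical.

```python
import re

# Assumed row layout: "&<color> <device> <temp C> <temp F>".
ROW = re.compile(r"^&(\w+)\s+(.+?)\s+(-?\d+)\s+(-?\d+)\s*$")


def parse_temperature_rows(message):
    """Split '&color' rows into parsed rows and rows that cannot be parsed."""
    rows, skipped = [], []
    for line in message.splitlines():
        if not line.startswith("&"):
            continue  # header, separator and footer lines are ignored
        match = ROW.match(line)
        if match:
            _color, device, temp_c, temp_f = match.groups()
            rows.append((device, int(temp_c), int(temp_f)))
        else:
            skipped.append(line)  # never converted, so nothing can fail
    return rows, skipped


message = "\n".join([
    "Device             Temp(C) Temp(F)",
    "-" * 39,
    "&green system 52 125",
    "&green /dev/sda 36 96",
    "&green /dev/sdc",  # hypothetical incomplete row, not from the real data
])

rows, skipped = parse_temperature_rows(message)
print("parsed:", rows)
print("skipped:", skipped)
```

The point of the sketch is the order of operations: a row is validated as a whole before any field is converted to a number, so an incomplete row is reported instead of being passed to the conversion.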
So we know which metric causes the xymond_rrd crash, and we have a workaround that keeps RRD generating graphs for the other metrics,
but we need a proper solution so that everything works as expected.
P.S. The real hostname has been replaced in all output included in this e-mail (mentioned in case any checksums are involved).
Best regards,
Andrey Chervonets
CoMinder Support
http://www.cominder.eu/
mobile: +XXX XXXXXXXX
"Xymon" <xymon-bounces at xymon.com> wrote on 15.10.2019 13:00:01:
From: xymon-request at xymon.com
To: xymon at xymon.com
Date: 15.10.2019 13:00
Subject: Xymon Digest, Vol 105, Issue 9
Sent by: "Xymon" <xymon-bounces at xymon.com>
Message: 1
Date: Mon, 14 Oct 2019 15:09:53 +0300
From: Andrey Chervonets <user-e7fb5c02322c@xymon.invalid>
To: xymon at xymon.com
Subject: [Xymon] xymond_rrd - Program crashed after fresh install of Xymon 4.3.30 and data from Xymon 4.3.17
Good day!
We recently installed Xymon 4.3.30 on a new VM (CentOS Linux release 7.7.1908 (Core), running as a guest under KVM).
Guest kernel: 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Everything works except that xymond_rrd crashes frequently: the "xymond_rrd" status is always red (it has never been green), with the message:
- Program crashed
Fatal signal caught!
In rrd-status.log we can find frequent messages like:
2019-10-14 14:35:03.609265 Child process 2997 died: Signal 6
2019-10-14 14:35:04.239677 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:35:08.886124 Peer not up, flushing message queue
2019-10-14 14:36:45.883398 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6
2019-10-14 14:36:52.510319 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:36:52.510720 Peer not up, flushing message queue
2019-10-14 14:40:02.689062 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:40:02.694320 Child process 28158 died: Signal 6
2019-10-14 14:40:05.119354 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:40:05.250422 Peer not up, flushing message queue
Note: lines like "Host 'synologyhost.domain.eu' reports netstat for an unknown OS" come from a Synology NAS with the monitoring package installed.
I am sure they are not related: this worked on the old Xymon 4.3.17 (CentOS 6.6).
After the fresh installation we simply remapped the data directory (with a symbolic link) so that we could keep using the old data logs and RRAs.
There are plenty of core files under server/tmp/:
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:40 rrdctl.572
-rw------- 1 xymon monitor 3252224 Oct 14 14:45 core.572
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:45 rrdctl.17027
-rw------- 1 xymon monitor 3248128 Oct 14 14:50 core.17027
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:50 rrdctl.30574
-rw------- 1 xymon monitor 3248128 Oct 14 14:55 core.30574
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:55 rrdctl.13275
-rw------- 1 xymon monitor 3239936 Oct 14 15:00 core.13275
-rw-r--r-- 1 xymon monitor 1887355 Oct 14 15:02 xymond.chk
-rw-r--r-- 1 xymon monitor 0 Oct 14 15:02 alert.chk.sub
-rw-r--r-- 1 xymon monitor 70921 Oct 14 15:02 alert.chk
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:02 rrdctl.5887
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:02 rrdctl.5954
-rw------- 1 xymon monitor 3764224 Oct 14 15:05 core.5887
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:05 rrdctl.10234
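With this many core files, one practical first step is to pick the newest one and open it in gdb together with the xymond_rrd binary. The sketch below only selects the file and prints a command line to run by hand. The helper names are made up for illustration, and the binary path in the example is a placeholder, not a real installation path.

```python
from pathlib import Path


def newest_core(tmp_dir):
    """Return the most recently modified core.* file in tmp_dir, or None."""
    cores = [p for p in Path(tmp_dir).glob("core.*") if p.is_file()]
    return max(cores, key=lambda p: p.stat().st_mtime, default=None)


def gdb_command(binary, core):
    """Build a 'gdb <binary> <core>' command line for manual use."""
    return f"gdb {binary} {core}"


# Example: look in the current directory; adjust to the server/tmp directory.
core = newest_core(".")
if core is not None:
    # Placeholder path: replace with the actual location of xymond_rrd.
    print(gdb_command("/path/to/server/bin/xymond_rrd", core))
```

Once gdb has loaded the core, typing `bt` at the (gdb) prompt prints the backtrace of the crashed process.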
Question: how can we diagnose the cause of the problem?
Best regards,
Andrey Chervonets
SIA CoMinder
http://www.cominder.eu/
mobile: +XXX XXXXXXXX