rrd logs and graphs
list Vernon Everett
Hi all Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help. As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me. I might as well explain why I am doing this. I am running an experimental script, zresources.ksh Its supposed to allow me to graph the resources being used by individual zones on Solaris. Here's part of the code $XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT " $XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem" ;echo;echo)" $XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu" ;echo;echo)" $XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat "$TEMPFILE.swap" ;echo;echo)" echo $XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT " echo $XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem" ;echo;echo)" echo $XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu" ;echo;echo)" echo $XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat "$TEMPFILE.swap" ;echo;echo)" cat $TEMPFILE.swap cat $TEMPFILE.cpu cat $TEMPFILE.mem The first 4 lines are "real" and the rest I added for debug. It's supposed to send data containing the zone names, and the amount of resources it's using. global : 293 zone1 : 575 zone2 : 899 zone3 : 754 All good, it makes nice pretty graphs and what have you. And I expect and get rrd files as follows zmem,global.rrd zcpu,global.rrd zswap,global.rrd zmem,zone1.rrd zcpu,zone.rrd zswap,zone1.rrd etc.etc. Except every now and then, I get something like zmem,c2t0d1.rrd I get others too, like zmem,rmt_0 and sometimes even with database SIDs too. And not just for zmem, for zcpu and zswap with about equal frequency. And often, the database SID is not even a database running on that host! I look inside the rrd file, with rrdtool dump, and there will usualy be only a single data point. Looking in the client logs, which contains the output of the debug code above, there is no indication that the client ever sent any data using anything else but the zone names. So where are these spurious data points coming from? Has anybody seen anything like this? Regards Vernon -- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
▸
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help. As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question: http://lists.xymon.com/archive/2014-April/039461.html
▸
Except every now and then, I get something likezmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate. J
list Vernon Everett
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap. Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating? I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either. Regards Vernon On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
▸
wrote:
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help. As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.It's usually normal. See Henrik's response to a similar question: http://lists.xymon.com/archive/2014-April/039461.html Except every now and then, I get something likezmem,c2t0d1.rrdHas anybody seen anything like this?Yes. It's puzzling, but rare enough that I haven't had time to investigate. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
I'm assuming you've checked your debug output from your script to see if the $TEMPFILE.* file contents look OK. Perhaps run your own instance of "xymond_channel --channel=data" to capture the messages as they come from xymond to xymond_rrd. This will generate a lot of output, so you'll want to use "--filter" and perhaps "grep" to trim it down. You could also run snoop/tcpdump at the same time and try to capture the data message as it arrives at your Xymon server. If you have lots of Xymon traffic it might be better to do so on the client side. The trick is to get a snapshot at the time that the RRD file is created, without collecting so much data that you run out of disk! So doing things like this: while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host blabla; gzip dump.out; mv dump.out.gz dump.out-`date +%s`; done This will capture 10k of packets at a time, then compress and rotate. You can also run xymond in a host-specific debug mode, by appending "--dbghost=HOSTNAME". That will spit out all the traffic into /tmp/xymond.dbg for analysis. Again, you might need to periodically rotate that file and signal xymond to re-open output files (I'm guessing a HUP signal might do this, or just kill the process and have xymonlaunch restart it). The path the data take would be: [script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file] What we want to do is to watch the traffic/messages to determine which of these components is causing the problem. My first step would be to try to isolate whether it's a client or server problem, hence watching the traffic with tcpdump/snoop. If the traffic is transmitted over the wire in the correct form, then I'd look at what xymond gives to xymond_channel. And so on. Once we can identify the process that creates the phantom entity, we can look for the root cause and then work-arounds/solutions. J On 24 February 2015 at 16:46, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
▸
wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap. Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating? I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either. Regards Vernon On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help. As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.It's usually normal. See Henrik's response to a similar question: http://lists.xymon.com/archive/2014-April/039461.html Except every now and then, I get something likezmem,c2t0d1.rrdHas anybody seen anything like this?Yes. It's puzzling, but rare enough that I haven't had time to investigate. J-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list John Thurston
▸
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
It is my observation that the rrd listener is extremely brittle. If a badly formatted message of the "data" type arrives, it will kill the rrd listener. This leads to gaps in all of the data charts, not just for the client which sent the garbage message. If I remember correctly, and "empty" message is something which will not parse and kills the listener. When the rrd listener crashes, it leaves a tmp file behind (which causes other problems). I've written a script which I run from time to time. It hunts down the orphaned tmp files and kills them off. The pertinent part of it is identifying the orphans:
rrdfiles=~/server/tmp/rrdctl; comm -13 <(pgrep -u xymon xymond_rrd | sort -n) <(for i in `ls -1 ${rrdfiles}.* | sort -n`;do echo ${i##*.};done)
Take a look and see how many orphans you have, and how frequently they are occurring. Maybe you can identify the culprit.
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list John Thurston
▸
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap. Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I was looking through my rrd notes. Back in August and September of 2014 I was chasing an RRD failure. While trying to find the cause, I enabled the --debug option on the rrd daemon which caused the daemon to core dump. http://lists.xymon.com/archive/2014-August/040113.html So be careful what steps you take to chase down your problems. You may make them worse :)
▸
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Vernon Everett
Hi Jeremy
Added some debug code to my script. Here's an extract.
DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$2 }') # Current IO
latency
$XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo
"$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo
"$DATA"; echo)"
DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Max IO
latency
$XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo
"$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo
"$DATA"; echo)"
DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Avg IO
latency
$XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo
"$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo
"$DATA"; echo)"
And I managed to get a couple of bizarre data files.
e-series-dcuriolat,icmpOutParmProbs.rrd
e-series-dcuriolat,icmpOutRedirects.rrd
e-series-dcuriolat,ipv6InTruncatedPkts.rrd
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd
And if I grep in my log file for icmp or any of those terms, I come up with
nothing.
So I am guessing it's not coming from the client.
I want to try the snoop, but this client script is running on the server,
as a client script.
It collects data from a bunch of NetApp E-series devices, and sends it to
the server in the normal way.
So you can imagine what the snoop data is going to look like.
But I will give it a go, and see if there is something in it.
As for debugging the rrd tasks, John was right.
Adding --debug to the rrd config causes it to crash.
Then I just het heaps of this.
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
And the occasional
19073 2015-02-25 11:31:14 2015-02-25 11:31:15 Child process 19073 died:
Signal 6
But I think I am reasonably happy that the strange data isn't coming from
the client script.
Martin Flemming is a list member in Germany (think) who is helping me test
this script.
I will ask him if he's seeing the same issues. If not, I think we can rule
out the script.
Regards
Vernon
On 24 February 2015 at 14:26, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
▸
wrote:
I'm assuming you've checked your debug output from your script to see if the $TEMPFILE.* file contents look OK. Perhaps run your own instance of "xymond_channel --channel=data" to capture the messages as they come from xymond to xymond_rrd. This will generate a lot of output, so you'll want to use "--filter" and perhaps "grep" to trim it down. You could also run snoop/tcpdump at the same time and try to capture the data message as it arrives at your Xymon server. If you have lots of Xymon traffic it might be better to do so on the client side. The trick is to get a snapshot at the time that the RRD file is created, without collecting so much data that you run out of disk! So doing things like this: while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host blabla; gzip dump.out; mv dump.out.gz dump.out-`date +%s`; done This will capture 10k of packets at a time, then compress and rotate. You can also run xymond in a host-specific debug mode, by appending "--dbghost=HOSTNAME". That will spit out all the traffic into /tmp/xymond.dbg for analysis. Again, you might need to periodically rotate that file and signal xymond to re-open output files (I'm guessing a HUP signal might do this, or just kill the process and have xymonlaunch restart it). The path the data take would be: [script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file] What we want to do is to watch the traffic/messages to determine which of these components is causing the problem. My first step would be to try to isolate whether it's a client or server problem, hence watching the traffic with tcpdump/snoop. If the traffic is transmitted over the wire in the correct form, then I'd look at what xymond gives to xymond_channel. And so on. Once we can identify the process that creates the phantom entity, we can look for the root cause and then work-arounds/solutions. J On 24 February 2015 at 16:46, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap. Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating? I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either. Regards Vernon On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help. As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.It's usually normal. See Henrik's response to a similar question: http://lists.xymon.com/archive/2014-April/039461.html Except every now and then, I get something likezmem,c2t0d1.rrdHas anybody seen anything like this?Yes. It's puzzling, but rare enough that I haven't had time to investigate. J-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
Yes, I'd be surprised if it's a problem in the script. On 25 February 2015 at 14:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
e-series-dcuriolat,ipv6OutFragFails.rrd e-series-dcuriolat,UDP_udpInDatagrams.rrd e-series-dcuriolat,udpInCksumErrs.rrd
These all look to be coming from the [netstat] client data. Perhaps have a look at the server's client data, scan through to the [netstat] section, and see if there are an unusual characters or corruption, either within [netstat] or the sections before and after it. I'm not sure what you'd be looking for, but worth a look. J
list Vernon Everett
Hmm, that theory makes a lot of sense. Since I have also seen files like e-series-dcuriolat,vnet.rrd which looks like it comes from the ifstat section. But of course, that would be too easy. :-) I am also seeing e-series-dmaxiolat,syndev72.rrd And syndev72 is the hostname of a system completely unrelated to the test where I am seeing the issues. There are a few others like that which have hostnames in them. I am also seeing the likes of e-series-dcuriolat,vdc1.rrd There are only 2 hosts throwing out "vdc?" data points, and they also has nothing to do with the e-series. The vdc? is from another custom script, which logs and graphs the % busy of LUNs on 2 servers that are sensitive to disk latency fluctuation. These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring. I have asked Martin if he's seeing the same issue, but he hasn't got back to me yet. The Xymon version I have here is 4.3.12, but I have also seen it at my other customers on version 4.3.17 and 4.3.18 (although to a far lesser degree) Regards Vernon On 25 February 2015 at 12:30, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
▸
wrote:
Yes, I'd be surprised if it's a problem in the script. On 25 February 2015 at 14:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:e-series-dcuriolat,ipv6OutFragFails.rrd e-series-dcuriolat,UDP_udpInDatagrams.rrd e-series-dcuriolat,udpInCksumErrs.rrdThese all look to be coming from the [netstat] client data. Perhaps have a look at the server's client data, scan through to the [netstat] section, and see if there are an unusual characters or corruption, either within [netstat] or the sections before and after it. I'm not sure what you'd be looking for, but worth a look. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
▸
wrote:
These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.
Yup, makes sense.
My best guess is memory corruption within xymond. So let's see if the
corruption is visible in the messages being passed between xymond and
xymond_channel. If we see corrupt messages in there, we can start to delve
into the source code to see if there's a bug somewhere. Are you able to
run your own instance of xymond_channel? Maybe something like this:
sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat
One you get an idea what it looks like, change "cat" for something like
"egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file
until you notice a dodgy RRD file and then kill the process.
Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too
voluminous, but might be worth a try, if you can manage to snag the traffic
at the right time.
J
list Vernon Everett
Only back at that client again on Wednesday. But I do have remote access, so will see what I can do this evening. On 25 February 2015 at 18:06, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
▸
wrote:
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.Yup, makes sense. My best guess is memory corruption within xymond. So let's see if the corruption is visible in the messages being passed between xymond and xymond_channel. If we see corrupt messages in there, we can start to delve into the source code to see if there's a bug somewhere. Are you able to run your own instance of xymond_channel? Maybe something like this: sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat One you get an idea what it looks like, change "cat" for something like "egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file until you notice a dodgy RRD file and then kill the process. Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too voluminous, but might be worth a try, if you can manage to snag the traffic at the right time. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
Hi all Back at the customer's site, and back on this problem. I just captured data as this happened. Ran the command you suggested, slightly modified for our environment. I also redirected it to a file for analysis. Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low. What is that GOCLIENT thing? It might be relevant, since it occurred just *after *some errant data files were created. (Note timestamp) The errant data files are -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionsize.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionrss.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionmemory.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversioncpu.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversion.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,energisesize.rrd This is supposed to graph the average IO latency of the disks in the e-series, so we expect output to look like this. -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:24 SINFSPNA01/e-series-davgiolat,Tray99_Slot1.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:24 SINFSPNA01/e-series-davgiolat,Tray0_Slot1.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:22 SINFSPNA03/e-series-davgiolat,Tray99_Slot12.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:22 SINFSPNA03/e-series-davgiolat,Tray0_Slot12.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot8.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot7.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot6.rrd The subversion and energize are actually host names for completely unrelated servers. The subversionsize and subversionrsss and similar data points are being sent for another host, but are not related to the e-series graphs. In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either. Does this narrow the field at all? Based on Jeremy's earlier email, it looks like the issue is in xymond_rrd, unless that GOCLIENT error can tell us something more?
▸
Regards
Vernon
On 25 February 2015 at 18:06, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.Yup, makes sense. My best guess is memory corruption within xymond. So let's see if the corruption is visible in the messages being passed between xymond and xymond_channel. If we see corrupt messages in there, we can start to delve into the source code to see if there's a bug somewhere. Are you able to run your own instance of xymond_channel? Maybe something like this: sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat One you get an idea what it looks like, change "cat" for something like "egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file until you notice a dodgy RRD file and then kill the process. Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too voluminous, but might be worth a try, if you can manage to snag the traffic at the right time. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
▸
On 4 March 2015 at 12:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low. What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all of the xymond_channel instances. When there are several channel readers, they all get sent the message address, and as each one accepts the message, she decrements GOCLIENT. When GOCLIENT is zero, it means all readers have received (and probably copied) the message, and the memory can be freed. Each reader waits until GOCLIENT goes back to zero before waiting for the next message. There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go back to zero. If the time is exceeded in a channel reader, it means another reader is taking too long to handle a message, and so the first reader gives up, logs the error you saw, and carries on with the next message loop. I'm not sure if this is a sign of trouble. Or it might be normal when you're running your own instance of xymond_channel. Or it might be a side-effect of the "cat" command blocking when writing to your output file due to a high message rate and contention on whatever filesystem has /var/tmp/. There's a description of how GOCLIENT works in the file new-daemon.txt, in the source code.
▸
In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names? We can't draw any conclusions if it's not collecting the data we expect. Did any of the RRD files skip an update at the time the new rogue files were created? Do these files match up with entries in xymon.out? Or anything interesting at the same time as the rogue entries were created? If you're seeing correct entries in xymon.out, but not the bogus entries, then I'm inclined to agree that xymond_rrd is at fault, and is possibly using memory it's not supposed to. I wonder if running xymond_rrd with "--no-cache" might have an effect. Obviously, it's better if you can cache updates to the RRD files, but it might narrow down the region of code that's responsible. This is not conclusive. It's possible that when you have two instances of xymond_channel, only one is corrupting data names, and it just so happened that it was the one being used by xymond_rrd. Could be that another time you would see your extra reader getting the bogus entries. That's the problem with using a second instance for analysis, rather than somehow getting the analysis happening on the one that writes to the RRD files. On the other hand, if you ran two instances of xymond_rrd, both on the same data channel, and if both instances create the bogus RRD files, then you know that you can probably use the second instance to narrow down the fault, without impacting the creation of RRD files for real work. Are you still running xymond_rrd with "--debug"? Did this show anything interesting when the bogus RRD files were created? What version of Xymon are you running? Did this start happening after an upgrade? I wonder if it's a bug with some versions but not others. J
list Vernon Everett
Thanks for your assistance Jeremy. Quite a bit to digest. Everything in the xymon.out file I am collecting looks exactly like I would expect it to look. There is a line, prefixed by @@data#<sequence> (or at least I think it's a sequence number),followed by pipe-separated data that looks like the host name, time stamp, IP address, an empty field, host name again, the rrd-file prefix, another blank field and the last field is other. I then get a blank line, and the data looks like what I am trying to send. Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either. At this site, we are running Xymon 4.3.12, but I have seen similar behaviour (although not to such an extent) elsewhere with 4.3.17, and I think I also saw it with 4.3.18, but I no longer have access to that site. I am not seeing any lost data points in the other graphs. But that could be difficult to spot. Will run a few rrdtool dumps, and look for gaps at that timestamp. Let you know what I find. As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go. It gets ugly. Earlier in this thread, John Thurston mentioned this behaviour too. It also creates a red xymond_rrd button on the xymon server, but the results are not overly helpful. - Program crashedFatal signal caught! Don't think it started after an upgrade. Something I did notice, the problem appears to be limited to data only, used to display graphs in trends. I am not seeing this for data when there is a status and data component. Or at least I haven't seen it yet. What are the implications of running with "--no-cache"? I have implemented this by adding "--no-cache" but if it's going to have a long-term impact, I don't want to leave it that way indefinitely. Regards Vernon
▸
On 4 March 2015 at 14:03, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 4 March 2015 at 12:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low. What is that GOCLIENT thing?From what I can understand, it's a semaphore shared between xymond and all of the xymond_channel instances. When there are several channel readers, they all get sent the message address, and as each one accepts the message, she decrements GOCLIENT. When GOCLIENT is zero, it means all readers have received (and probably copied) the message, and the memory can be freed. Each reader waits until GOCLIENT goes back to zero before waiting for the next message. There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go back to zero. If the time is exceeded in a channel reader, it means another reader is taking too long to handle a message, and so the first reader gives up, logs the error you saw, and carries on with the next message loop. I'm not sure if this is a sign of trouble. Or it might be normal when you're running your own instance of xymond_channel. Or it might be a side-effect of the "cat" command blocking when writing to your output file due to a high message rate and contention on whatever filesystem has /var/tmp/. There's a description of how GOCLIENT works in the file new-daemon.txt, in the source code.In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either.Does it have mention of the correct data set names? We can't draw any conclusions if it's not collecting the data we expect. Did any of the RRD files skip an update at the time the new rogue files were created? Do these files match up with entries in xymon.out? Or anything interesting at the same time as the rogue entries were created? If you're seeing correct entries in xymon.out, but not the bogus entries, then I'm inclined to agree that xymond_rrd is at fault, and is possibly using memory it's not supposed to. I wonder if running xymond_rrd with "--no-cache" might have an effect. Obviously, it's better if you can cache updates to the RRD files, but it might narrow down the region of code that's responsible. This is not conclusive. It's possible that when you have two instances of xymond_channel, only one is corrupting data names, and it just so happened that it was the one being used by xymond_rrd. Could be that another time you would see your extra reader getting the bogus entries. That's the problem with using a second instance for analysis, rather than somehow getting the analysis happening on the one that writes to the RRD files. On the other hand, if you ran two instances of xymond_rrd, both on the same data channel, and if both instances create the bogus RRD files, then you know that you can probably use the second instance to narrow down the fault, without impacting the creation of RRD files for real work. Are you still running xymond_rrd with "--debug"? Did this show anything interesting when the bogus RRD files were created? What version of Xymon are you running? Did this start happening after an upgrade? I wonder if it's a bug with some versions but not others. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
▸
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
▸
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
▸
Don't think it started after an upgrade. Something I did notice, the problem appears to be limited to data only, used to display graphs in trends. I am not seeing this for data when there is a status and data component. Or at least I haven't seen it yet.
Interesting. Maybe you could modify your script to send a status also?
What are the implications of running with "--no-cache"?
I think then cache is only to save on disk I/O. If your disks aren't "running hot" (ie queueing requests long enough to cause problems) then youre probably OK to leave it.
▸
I have implemented this by adding "--no-cache" but if it's going to have a long-term impact, I don't want to leave it that way indefinitely.
Some people have reported solving some other problems by disabling the cache, with presumably no I'll effects. For example http://lists.xymon.com/archive/2009-November/026572.html. It's a performance improvement that hasn't always been there and so is probably not necessary for average size monitoring service. Also I think that if you have operations staff that refresh lots of graphs (hence forcing the cache to flush) you get minimal benefit from the cache. But it all depends on how many RRD updates per second your server needs to make, and how fast your disks and filesystem can update. If you have enough RAM, your OS might have sufficient buffers to not need the cache anyway. J
list Japheth Cleaver
▸
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.If there's a correlation it might help us work out where the fault is. But it might be only a symptom.As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd. The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here. Regards, -jc
list Vernon Everett
Hi all And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace. # adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program `` /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform SUNW,SPARC-Enterprise-T2000 SIGABRT: Abort $c libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0) sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160) libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400) init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23) rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e) main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) _start+0x5c(0, 0, 0, 0, 0, 0) Anything else I can offer that will assist? Regards Vernon
▸
On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.If there's a correlation it might help us work out where the fault is.Butit might be only a symptom.As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd. The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here. Regards, -jc
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
If it helps, here's the same core, with pstack. # pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899: xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8 fee329b0 abort (0, 1, feebafec, ffb3c, fef35518, 0) + 110 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fee22d10 strlen (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4 00016614 main (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) + 948 00015a40 _start (0, 0, 0, 0, 0, 0) + 5c
▸
On 11 March 2015 at 08:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi all And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace. # adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program `` /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform SUNW,SPARC-Enterprise-T2000 SIGABRT: Abort $c libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0) sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160) libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400) init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23) rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e) main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) _start+0x5c(0, 0, 0, 0, 0, 0) Anything else I can offer that will assist? Regards Vernon On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.If there's a correlation it might help us work out where the fault is.Butit might be only a symptom.As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd. The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here. Regards, -jc-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
This might help too.
# pflags ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899:
xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache
data model = _ILP32 flags = MSACCT|MSFORK
sigpend = 0x00020000,0x00000000
/1: flags = 0
sigmask = 0xffffbefc,0x0000ffff cursig = SIGABRT
▸
On 11 March 2015 at 09:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
If it helps, here's the same core, with pstack. # pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899: xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8 fee329b0 abort (0, 1, feebafec, ffb3c, fef35518, 0) + 110 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fee22d10 strlen (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4 00016614 main (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) + 948 00015a40 _start (0, 0, 0, 0, 0, 0) + 5c On 11 March 2015 at 08:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace. # adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program `` /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform SUNW,SPARC-Enterprise-T2000 SIGABRT: Abort $c libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0) sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160) libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400) init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23) rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e) main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) _start+0x5c(0, 0, 0, 0, 0, 0) Anything else I can offer that will assist? Regards Vernon On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.If there's a correlation it might help us work out where the fault is.Butit might be only a symptom.As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd. The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here. Regards, -jc-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
▸
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
▸
I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem. Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown. It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing. If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down. Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)? J
list Vernon Everett
Looking at the logs below, it appears something might be broken in the [telnet] definition. Here's the first few lines from my protocols.cfg file [ftp] send "quit\r\n" expect "220" options banner port 21 [ftps] send "quit\r\n" expect "220" options ssl,banner port 990 [ssh|ssh1|ssh2] send "SSH-2.0-OpenSSH_4.1\r\n" expect "SSH" options banner port 22 [telnet] options banner,telnet port 23 From rrd-data.log 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 5614 2015-03-11 08:29:07 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=2101247 5614 2015-03-11 08:29:07 Got 403 bytes 5614 2015-03-11 08:29:07 xymond_rrd: Got message 849930 @@data#849930/transfer|1426033746.620045|172.21.219.192||transfer|iostatdisk|sunos|unix/UnixInf 5614 2015-03-11 08:29:07 startpos 403, fillpos 403, endpos -1 5614 2015-03-11 08:29:07 Transport setup is: 5614 2015-03-11 08:29:07 xymondportnumber = 1984 5614 2015-03-11 08:29:07 xymonproxyhost = NONE 5614 2015-03-11 08:29:07 xymonproxyport = 0 5614 2015-03-11 08:29:07 Recipient listed as '172.21.219.192' 5614 2015-03-11 08:29:07 Standard protocol on port 1984 5614 2015-03-11 08:29:07 Will connect to address 172.21.219.192 port 1984 5614 2015-03-11 08:29:07 Connect status is 0 5614 2015-03-11 08:29:07 Sent 16 bytes 5614 2015-03-11 08:29:07 Read 8192 bytes 5614 2015-03-11 08:29:07 Read 7657 bytes 5614 2015-03-11 08:29:07 Closing connection 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/analysis.cfg 5614 2015-03-11 08:29:07 Compiling regex ctrldom[A-D] 5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1 5614 2015-03-11 08:29:07 Compiling regex .+dom.. 5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1|notesdomD1 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex syn.*,%oradom.* 5614 2015-03-11 08:29:07 Compiling regex syn.*|oradom.*|SINFV.* 5614 2015-03-11 08:29:07 Compiling regex [.:]514$ 5614 2015-03-11 08:29:07 Compiling regex [.:]3389$ 5614 2015-03-11 08:29:07 Compiling regex energise.* 5614 2015-03-11 08:29:07 Compiling regex [.:]8989$ 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex [.:]4848$ 5614 2015-03-11 08:29:07 Compiling regex [.:]8080$ 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Compiling regex sinfct.*|domino.* 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex [.:]25$ 5614 2015-03-11 08:29:07 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 5614 2015-03-11 08:29:07 Compiling regex SINFV.* 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^error.* 5614 2015-03-11 08:29:07 Compiling regex ^error.* 5614 2015-03-11 08:29:07 Compiling regex .*restart/shutdown.*|.*Citrix 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^warning.* 5614 2015-03-11 08:29:07 Compiling regex ^warning.* 5614 2015-03-11 08:29:07 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^failure.* 5614 2015-03-11 08:29:07 Compiling regex ^failure.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/protocols.cfg 5614 2015-03-11 08:29:07 Service list dump 5614 2015-03-11 08:29:07 Name : ftp 5614 2015-03-11 08:29:07 Sendtext: quit\r\n 5614 2015-03-11 08:29:07 Sendlen : 6 5614 2015-03-11 08:29:07 Exp.text: 220 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 21 5614 2015-03-11 08:29:07 Name : ftps 5614 2015-03-11 08:29:07 Sendtext: quit\r\n 5614 2015-03-11 08:29:07 Sendlen : 6 5614 2015-03-11 08:29:07 Exp.text: 220 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 5 5614 2015-03-11 08:29:07 Port : 990 5614 2015-03-11 08:29:07 Name : ssh 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : ssh1 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : ssh2 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : telnet 5614 2015-03-11 08:29:07 2015-03-11 08:29:08 Child process 5614 died: Signal 6 From rrd-status.log 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 5376 2015-03-11 08:29:06 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=2101247 5376 2015-03-11 08:29:06 Got 6075 bytes 5376 2015-03-11 08:29:06 xymond_rrd: Got message 621101 @@status#621101/sinfctap01|1426033745.004291|172.21.198.73||sinfctap01|svcs|1426035545|green||green|1417272435|0||0||1426033594|sunos|unix/appzones|0| 5376 2015-03-11 08:29:06 startpos 6075, fillpos 6075, endpos -1 5376 2015-03-11 08:29:06 Transport setup is: 5376 2015-03-11 08:29:06 xymondportnumber = 1984 5376 2015-03-11 08:29:06 xymonproxyhost = NONE 5376 2015-03-11 08:29:06 xymonproxyport = 0 5376 2015-03-11 08:29:06 Recipient listed as '172.21.219.192' 5376 2015-03-11 08:29:06 Standard protocol on port 1984 5376 2015-03-11 08:29:06 Will connect to address 172.21.219.192 port 1984 5376 2015-03-11 08:29:06 Connect status is 0 5376 2015-03-11 08:29:06 Sent 16 bytes 5376 2015-03-11 08:29:06 Read 8192 bytes 5376 2015-03-11 08:29:06 Read 7657 bytes 5376 2015-03-11 08:29:06 Closing connection 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/analysis.cfg 5376 2015-03-11 08:29:06 Compiling regex ctrldom[A-D] 5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1 5376 2015-03-11 08:29:06 Compiling regex .+dom.. 5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1|notesdomD1 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex syn.*,%oradom.* 5376 2015-03-11 08:29:06 Compiling regex syn.*|oradom.*|SINFV.* 5376 2015-03-11 08:29:06 Compiling regex [.:]514$ 5376 2015-03-11 08:29:06 Compiling regex [.:]3389$ 5376 2015-03-11 08:29:06 Compiling regex energise.* 5376 2015-03-11 08:29:06 Compiling regex [.:]8989$ 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex [.:]4848$ 5376 2015-03-11 08:29:06 Compiling regex [.:]8080$ 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Compiling regex sinfct.*|domino.* 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex [.:]25$ 5376 2015-03-11 08:29:06 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 5376 2015-03-11 08:29:06 Compiling regex SINFV.* 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^error.* 5376 2015-03-11 08:29:06 Compiling regex ^error.* 5376 2015-03-11 08:29:06 Compiling regex .*restart/shutdown.*|.*Citrix 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^warning.* 5376 2015-03-11 08:29:06 Compiling regex ^warning.* 5376 2015-03-11 08:29:06 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^failure.* 5376 2015-03-11 08:29:06 Compiling regex ^failure.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/protocols.cfg 5376 2015-03-11 08:29:06 Service list dump 5376 2015-03-11 08:29:06 Name : ftp 5376 2015-03-11 08:29:06 Sendtext: quit\r\n 5376 2015-03-11 08:29:06 Sendlen : 6 5376 2015-03-11 08:29:06 Exp.text: 220 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 21 5376 2015-03-11 08:29:06 Name : ftps 5376 2015-03-11 08:29:06 Sendtext: quit\r\n 5376 2015-03-11 08:29:06 Sendlen : 6 5376 2015-03-11 08:29:06 Exp.text: 220 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 5 5376 2015-03-11 08:29:06 Port : 990 5376 2015-03-11 08:29:06 Name : ssh 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : ssh1 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : ssh2 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : telnet 5376 2015-03-11 08:29:06 2015-03-11 08:29:07 Peer at 0.0.0.0:0 failed: Broken pipe 2015-03-11 08:29:07 Child process 5376 died: Signal 6
▸
On 11 March 2015 at 09:52, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:And even with --no-cache, I am still getting these corrupted rrd files.
:-(
▸
I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace.libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem. Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown. It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing. If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down. Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)? J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
Hi Jeremy As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-( I doubt changing source code and recompiling is going to get a green light. As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing. Regards Vernon
▸
On 11 March 2015 at 09:52, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:And even with --no-cache, I am still getting these corrupted rrd files.
:-(
▸
I tried again with --debug (and --no-cache) and it core dumps. Here's the backtrace.libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem. Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown. It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing. If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down. Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)? J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
▸
On 11 March 2015 at 13:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-( I doubt changing source code and recompiling is going to get a green light.
Yep, fair enough. I'm guessing you can't reproduce the core dump on a non-production system.
▸
As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing.
Haha, looks like it's an internal environment variable that Xymon uses to pass around the list of xymonnet services. It's mentioned in the man page for xymonnet, but otherwise it appears to be unexposed and intended to be for-internal-use-only. Interesting about the telnet definition. Do you do any telnet testing? If not, can you comment the stanzas for telnet and telnets out of protocols.cfg and try again? Something else to try is to put in a "send" and "expect" lines such as: [telnet] send "" expect "" options banner,telnet port 23 I think this might be the cause of the core dumps. For protocols.cfg stanzas that don't have "send" or "expect" strings, the debug code tries to print such strings, and they turn out to be null pointers or something like that. J
list Vernon Everett
Hi Jeremy I think you got that one nailed. I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets [telnets] options ssl,banner,telnet port 992 Again, no send and expect. Updated telnets, and it crapped out at ldap. [ldap] port 389 [ldaps] options ssl port 636 [rsync] expect "@RSYNCD" options banner port 873 About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing. Any suggestions? I think we have some debug code update recommendations for JC though. :-) A good win, but still doesn't solve my original problem. :-(
▸
Regards
Vernon
On 11 March 2015 at 11:02, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 13:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-( I doubt changing source code and recompiling is going to get a green light.Yep, fair enough. I'm guessing you can't reproduce the core dump on a non-production system.As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing.Haha, looks like it's an internal environment variable that Xymon uses to pass around the list of xymonnet services. It's mentioned in the man page for xymonnet, but otherwise it appears to be unexposed and intended to be for-internal-use-only. Interesting about the telnet definition. Do you do any telnet testing? If not, can you comment the stanzas for telnet and telnets out of protocols.cfg and try again? Something else to try is to put in a "send" and "expect" lines such as: [telnet] send "" expect "" options banner,telnet port 23 I think this might be the cause of the core dumps. For protocols.cfg stanzas that don't have "send" or "expect" strings, the debug code tries to print such strings, and they turn out to be null pointers or something like that. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
I think you got that one nailed.
You might be overselling this finding a little. But it's progress nonetheless.
▸
I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets
I run v4.3.10, and I can't get it to dump core the same way yours does, when "--debug" is used. Mine says "(null)" when there is no send or expect defined. The source code file containing the code that triggers the core dump is identical to one from 4.3.18. I've also tested it on a 4.3.18 installation (using the RPM package from Terabithia) but also won't fail. Please can you try this: printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug (adjust path to xymond_rrd and xymon username if required) For me, this displays the debug output at least as far as when parsing the protocols.cfg file. If this also dumps core for you then we can work on fixing the core dump without risking your production Xymon config. You could also build a binary and run it from an alternative location so that the production binary isn't touched. I realize this doesn't fix your problem for you. But I'm optimistic that the debug output might point to where your problem is. J
list Vernon Everett
We have 4.3.12 And the command dumps core in the same way. $ printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 16816 2015-03-11 12:38:02 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383 16816 2015-03-11 12:38:02 Got 46 bytes 16816 2015-03-11 12:38:02 /opt/local/xymon/server/bin/xymond_rrd: Got message 0 @@data#0/host|||||conn|| 16816 2015-03-11 12:38:02 startpos 46, fillpos 46, endpos -1 16816 2015-03-11 12:38:02 Transport setup is: 16816 2015-03-11 12:38:02 xymondportnumber = 1984 16816 2015-03-11 12:38:02 xymonproxyhost = NONE 16816 2015-03-11 12:38:02 xymonproxyport = 0 16816 2015-03-11 12:38:02 Recipient listed as '172.21.219.192' 16816 2015-03-11 12:38:02 Standard protocol on port 1984 16816 2015-03-11 12:38:02 Will connect to address 172.21.219.192 port 1984 16816 2015-03-11 12:38:02 Connect status is 0 16816 2015-03-11 12:38:02 Sent 16 bytes 16816 2015-03-11 12:38:02 Read 15849 bytes 16816 2015-03-11 12:38:02 Closing connection 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/analysis.cfg 16816 2015-03-11 12:38:02 Compiling regex ctrldom[A-D] 16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1 16816 2015-03-11 12:38:02 Compiling regex .+dom.. 16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1|notesdomD1 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex syn.*,%oradom.* 16816 2015-03-11 12:38:02 Compiling regex syn.*|oradom.*|SINFV.* 16816 2015-03-11 12:38:02 Compiling regex [.:]514$ 16816 2015-03-11 12:38:02 Compiling regex [.:]3389$ 16816 2015-03-11 12:38:02 Compiling regex energise.* 16816 2015-03-11 12:38:02 Compiling regex [.:]8989$ 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex [.:]4848$ 16816 2015-03-11 12:38:02 Compiling regex [.:]8080$ 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Compiling regex sinfct.*|domino.* 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex [.:]25$ 16816 2015-03-11 12:38:02 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 16816 2015-03-11 12:38:02 Compiling regex SINFV.* 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^error.* 16816 2015-03-11 12:38:02 Compiling regex ^error.* 16816 2015-03-11 12:38:02 Compiling regex .*restart/shutdown.*|.*Citrix 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^warning.* 16816 2015-03-11 12:38:02 Compiling regex ^warning.* 16816 2015-03-11 12:38:02 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^failure.* 16816 2015-03-11 12:38:02 Compiling regex ^failure.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/protocols.cfg 16816 2015-03-11 12:38:02 Service list dump 16816 2015-03-11 12:38:02 Name : ftp 16816 2015-03-11 12:38:02 Sendtext: quit\r\n 16816 2015-03-11 12:38:02 Sendlen : 6 16816 2015-03-11 12:38:02 Exp.text: 220 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 21 16816 2015-03-11 12:38:02 Name : ftps 16816 2015-03-11 12:38:02 Sendtext: quit\r\n 16816 2015-03-11 12:38:02 Sendlen : 6 16816 2015-03-11 12:38:02 Exp.text: 220 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 5 16816 2015-03-11 12:38:02 Port : 990 16816 2015-03-11 12:38:02 Name : ssh 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : ssh1 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : ssh2 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : telnet 16816 2015-03-11 12:38:02 Abort (core dumped)
▸
On 11 March 2015 at 12:02, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:I think you got that one nailed.You might be overselling this finding a little. But it's progress nonetheless.I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnetsI run v4.3.10, and I can't get it to dump core the same way yours does, when "--debug" is used. Mine says "(null)" when there is no send or expect defined. The source code file containing the code that triggers the core dump is identical to one from 4.3.18. I've also tested it on a 4.3.18 installation (using the RPM package from Terabithia) but also won't fail. Please can you try this: printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug (adjust path to xymond_rrd and xymon username if required) For me, this displays the debug output at least as far as when parsing the protocols.cfg file. If this also dumps core for you then we can work on fixing the core dump without risking your production Xymon config. You could also build a binary and run it from an alternative location so that the production binary isn't touched. I realize this doesn't fix your problem for you. But I'm optimistic that the debug output might point to where your problem is. J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
Yes, apparently, "(null)" is what GNU's printf outputs when asked to print
a string pointer that is null. The POSIX behaviour is not defined, and so
in some systems, it dumps core rather than detecting the illegal string.
This bit of code demonstrates:
#include <stdio.h>
int main() {
printf("test: %s\n",NULL);
}
On Linux and FreeBSD, this prints out "test: (null)" but on Solaris 10 this
dumps core.
Fixing this is beyond my skill level. But I would start looking at the
functions errprintf() and dbgprintf() to see if NULL strings can be
detected and prevented from being passed through to *printf.
But that might be somewhat tricky. Alternatively, updating
dump_tcp_services() to do the same, is probably a much simpler task, but
doesn't safeguard any other code from the same fate.
J
On 11 March 2015 at 17:38, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Even simpler: #include <stdio.h> int main() { printf("test: %s\n",NULL); } On 11 March 2015 at 17:33, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:On 11 March 2015 at 15:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
▸
wrote:We have 4.3.12 And the command dumps core in the same way.
That's good. Can you try this out? Plop into a text file, eg test.c, then "make test" and then "./test". #include <stdio.h> #include <stdarg.h> void dbgprintf(const char *fmt, ...) { va_list args; va_start(args, fmt); vfprintf(stdout, fmt, args); va_end(args); fflush(stdout); } int main() { dbgprintf("test: %s\n",NULL); } This is the essence of the Xymon code that prints the empty send/expect strings. But I've pared away all of the Xymon code to leave just the call to vprintf() and whatever is needed to make it run. On my systems I get the following output: test: (null) I'm guessing on your system you'll get a core dump. If so, it would suggest a difference in implementations of the standard library functions. If that's the case, we simply (!) need to catch the attempts to print NULL pointers, and substitute our own "(null)" string. Indeed, I just tested this on a Solaris system and it dumped core. On Linux and FreeBSD systems, it printed "(null)". J
list Vernon Everett
[tmp]$ make CC=/usr/local/bin/gcc test /usr/local/bin/gcc -o test test.c [tmp]$ ./test Segmentation Fault (core dumped) As you predicted.
▸
On 11 March 2015 at 14:50, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Yes, apparently, "(null)" is what GNU's printf outputs when asked to print a string pointer that is null. The POSIX behaviour is not defined, and so in some systems, it dumps core rather than detecting the illegal string. This bit of code demonstrates: #include <stdio.h> int main() { printf("test: %s\n",NULL); } On Linux and FreeBSD, this prints out "test: (null)" but on Solaris 10 this dumps core. Fixing this is beyond my skill level. But I would start looking at the functions errprintf() and dbgprintf() to see if NULL strings can be detected and prevented from being passed through to *printf. But that might be somewhat tricky. Alternatively, updating dump_tcp_services() to do the same, is probably a much simpler task, but doesn't safeguard any other code from the same fate. J On 11 March 2015 at 17:38, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:Even simpler: #include <stdio.h> int main() { printf("test: %s\n",NULL); } On 11 March 2015 at 17:33, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:On 11 March 2015 at 15:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:We have 4.3.12 And the command dumps core in the same way.That's good. Can you try this out? Plop into a text file, eg test.c, then "make test" and then "./test". #include <stdio.h> #include <stdarg.h> void dbgprintf(const char *fmt, ...) { va_list args; va_start(args, fmt); vfprintf(stdout, fmt, args); va_end(args); fflush(stdout); } int main() { dbgprintf("test: %s\n",NULL); } This is the essence of the Xymon code that prints the empty send/expect strings. But I've pared away all of the Xymon code to leave just the call to vprintf() and whatever is needed to make it run. On my systems I get the following output: test: (null) I'm guessing on your system you'll get a core dump. If so, it would suggest a difference in implementations of the standard library functions. If that's the case, we simply (!) need to catch the attempts to print NULL pointers, and substitute our own "(null)" string. Indeed, I just tested this on a Solaris system and it dumped core. On Linux and FreeBSD systems, it printed "(null)". J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list John Thurston
▸
On 3/10/2015 7:18 PM, Vernon Everett wrote:
Hi Jeremy
I think you got that one nailed.
I added
send ""
expect ""
as you requested, and it cores again.
This time on the next entry, which is telnets
[telnets]
options ssl,banner,telnet
port 992
Again, no send and expect.
Updated telnets, and it crapped out at ldap.
[ldap]
port 389
[ldaps]
options ssl
port 636
[rsync]
expect "@RSYNCD"
options banner
port 873
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
A good win, but still doesn't solve my original problem. :-(I identified and shared this exact situation in August and September of last year: http://lists.xymon.com/pipermail/xymon/2014-August/040130.html and http://lists.xymon.com/pipermail/xymon/2014-September/040163.html My C isn't good enough to offer a correction for the defect, but at least someone had now duplicated the problem. My solution has been not to enable the --debug option on rrd handler....and hope I don't have a problem I need to debug.
▸
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Jeremy Laidman
▸
On 12/03/2015 3:47 am, "John Thurston" <user-ce4d79d99bab@xymon.invalid> wrote:
I identified and shared this exact situation in August and September of last year:
Oh, how embarrassment. I even contributed to that thread and didn't remember. I must be getting old!
▸
My C isn't good enough to offer a correction for the defect, but at least someone had now duplicated the problem. My solution has been not to enable the --debug option on rrd handler....and hope I don't have a problem I need to debug.
OK, I'll work on a patch to stop these core dumps and try to get it into the mainstream. I don't think we can make any more progress on the original problem without a working debug mode. J
list Jeremy Laidman
▸
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
▸
Any suggestions?I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion
in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
dbgprintf("Service list dump\n");
for (i=0; (svcinfo[i].svcname); i++) {
dbgprintf(" Name : %s\n", svcinfo[i].svcname);
- dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+ dbgprintf(" Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen);
- dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext,
svcinfo[i].explen));
+ dbgprintf(" Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
dbgprintf(" Exp.len : %d\n", svcinfo[i].explen);
dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs);
dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS,
to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it make
RRD files and log file to a different location. This shouldn't interfere
with your production Xymon. Here's one I prepared earlier that works for
me:
sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your
problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644,
endpos -1, usedbytes=0, bufleft=1884603
J
list Japheth Cleaver
▸
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion
in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
dbgprintf("Service list dump\n");
for (i=0; (svcinfo[i].svcname); i++) {
dbgprintf(" Name : %s\n", svcinfo[i].svcname);
- dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+ dbgprintf(" Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen);
- dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext,
svcinfo[i].explen));
+ dbgprintf(" Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
dbgprintf(" Exp.len : %d\n", svcinfo[i].explen);
dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs);
dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based
OS,
to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it make
RRD files and log file to a different location. This shouldn't interfere
with your production Xymon. Here's one I prepared earlier that works for
me:
sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving
your
problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode...
letting us move on to the larger oddness. Unfortunately, I have a feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from
below.
HTH,
-jc
--- lib/netservices.c (revision 7598)
+++ lib/netservices.c (working copy)
@@ -81,9 +81,9 @@
unsigned char *inp, *outp;
int i;
- if (!buf) return NULL;
+ if (result) xfree(result);
+ if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result);
if (buf && (buflen == 0)) buflen = strlen(buf);
result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
list Vernon Everett
Hi all
I was only back at the client today, and unfortunately have not managed to
get that patch in yet.
(As I mentioned before, it's a production system)
However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the extra
values, but it's not happening in the test itself, despite the existence of
the additional rrd files.
Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER
99 0 145 Watts
99 1 151 Watts
0 0 99 Watts
0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values
are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd <--- Valid
power,Tray0_PSU1.rrd <--- Valid
power,Tray99_PSU0.rrd <--- Valid
power,Tray99_PSU1.rrd <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd
So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS= ,power::9,
And further down
SPLITNCV_power="*:GAUGE"
And in graphs.cfg
[power]
FNPATTERN power,(.*).rrd
TITLE Database Power Consumption Per Tray PSU
YAXIS Watts
-l 0
DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and
we can get more info, but I thought the extra entries in trends, but not in
the test was interesting.
Regards
Vernon
▸
On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J
▸
On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon
▸
On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Jeremy Laidman
So the URLs are different? But both have service=power in the URLs?
▸
On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on aGNU-basedOS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04:'DS:cpu_r:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 05:'DS:cpu_b:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 06:'DS:cpu_w:GAUGE:600:0:U'... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph. A need to get that debug output fixed. Regards Vernon
▸
On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
So the URLs are different? But both have service=power in the URLs? On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on aGNU-basedOS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04:'DS:cpu_r:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 05:'DS:cpu_b:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 06:'DS:cpu_w:GAUGE:600:0:U'... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Galen Johnson
Stupid question but can't you just set up a test Xymon server and have the client system point to both servers? That way you can update the test instance and do whatever you need to tweak it without impacting prod. On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
▸
wrote:
Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph. A need to get that debug output fixed. Regards Vernon On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:So the URLs are different? But both have service=power in the URLs? On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on aGNU-basedOS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04:'DS:cpu_r:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 05:'DS:cpu_b:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 06:'DS:cpu_w:GAUGE:600:0:U'... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
I was hoping to get approval to patch the production system, but change control at $CLIENT is sclerotic on a good day. Your suggestion is probably the way I will do it. Back there on Wednesday. Watch this space. Cheers
▸
Vernon
On 21 March 2015 at 22:40, Galen Johnson <user-fc632e705d24@xymon.invalid> wrote:
Stupid question but can't you just set up a test Xymon server and have the client system point to both servers? That way you can update the test instance and do whatever you need to tweak it without impacting prod. On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph. A need to get that debug output fixed. Regards Vernon On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:So the URLs are different? But both have service=power in the URLs? On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE" And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:On 11 March 2015 at 14:18, Vernon Everett < user-b3f8dacb72c8@xymon.invalid> wrote:About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?I think we have some debug code update recommendations for JC though.
:-)
▸
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on aGNU-basedOS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message165619@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04:'DS:cpu_r:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 05:'DS:cpu_b:GAUGE:600:0:U'15306 2015-03-12 11:36:28 RRD create param 06:'DS:cpu_w:GAUGE:600:0:U'... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message165737@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos-115306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 JThis is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
list Vernon Everett
Hi all Patched and reinstalled. But it didn't go quite as well as expected. Lots of shit broke. Then I remembered it overwrites most of the config files. Which will do that. There were a few minutes of panic there, until I remembered it makes backups of everything. :-) Copied the backup configs back, and it started to look good again. Whew! And, it looks like we have debug data. Now to catch it when it goes wrong. Regards Vernon On 25 March 2015 at 13:07, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 25 March 2015 at 15:54, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:Ah! There are two patches for this. Yours, and JC's. I have been trying to apply JC's Attempting yours fails too. :-(Mine would possibly fail if JC's has already been applied, and vice versa. Both patches do a similar thing, just at different places in the code execution path. Mine catches the null pointer and avoids calling binview(), whereas JC's catches the null pointer inside binview(). I suspect JC's patch will end up going into a future release of Xymon, so I'd probably go with his.
▸
J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton