Xymon Mailing List Archive search

rrd logs and graphs

40 messages in this thread

list Vernon Everett · Fri, 14 Nov 2014 11:43:53 +0800 ·
Hi all

Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.

I might as well explain why I am doing this.
I am running an experimental script, zresources.ksh
Its supposed to allow me to graph the resources being used by individual
zones on Solaris.
Here's part of the code
$XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT "
$XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem"
;echo;echo)"
$XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu"
;echo;echo)"
$XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat "$TEMPFILE.swap"
;echo;echo)"
echo $XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT "
echo $XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem"
;echo;echo)"
echo $XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu"
;echo;echo)"
echo $XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat
"$TEMPFILE.swap" ;echo;echo)"
cat $TEMPFILE.swap
cat $TEMPFILE.cpu
cat $TEMPFILE.mem
The first 4 lines are "real" and the rest I added for debug.
It's supposed to send data containing the zone names, and the amount of
resources it's using.
global : 293
zone1 : 575
zone2 : 899
zone3 : 754
All good, it makes nice pretty graphs and what have you. And I expect and
get rrd files as follows
zmem,global.rrd zcpu,global.rrd zswap,global.rrd
zmem,zone1.rrd zcpu,zone.rrd zswap,zone1.rrd
etc.etc.

Except every now and then, I get something like
zmem,c2t0d1.rrd
I get others too, like zmem,rmt_0 and sometimes even with database SIDs
too. And not just for zmem, for zcpu and zswap with about equal frequency.
And often, the database SID is not even a database running on that host!
I look inside the rrd file, with rrdtool dump, and there will usualy be
only a single data point.
Looking in the client logs, which contains the output of the debug code
above, there is no indication that the client ever sent any data using
anything else but the zone names.
So where are these spurious data points coming from?

Has anybody seen anything like this?

Regards
Vernon


-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Tue, 24 Feb 2015 16:14:48 +1100 ·
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
quoted from Vernon Everett
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.
It's usually normal.  See Henrik's response to a similar question:

http://lists.xymon.com/archive/2014-April/039461.html
quoted from Vernon Everett

Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes.  It's puzzling, but rare enough that I haven't had time to investigate.

J
list Vernon Everett · Tue, 24 Feb 2015 13:46:26 +0800 ·
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and your
graphs start to look like crap.

Tomorrow I am off to a client where it's happening all the time.
What can I send you to assist with investigating?

I am trying to figure out if it's a bug in Xymon, or a bug in my script.
So far I have no evidence to support it being either.

Regards
Vernon


On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
quoted from Jeremy Laidman
wrote:
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.
It's usually normal.  See Henrik's response to a similar question:

http://lists.xymon.com/archive/2014-April/039461.html

Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes.  It's puzzling, but rare enough that I haven't had time to
investigate.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Tue, 24 Feb 2015 17:26:57 +1100 ·
I'm assuming you've checked your debug output from your script to see if
the $TEMPFILE.* file contents look OK.

Perhaps run your own instance of "xymond_channel --channel=data" to capture
the messages as they come from xymond to xymond_rrd.  This will generate a
lot of output, so you'll want to use "--filter" and perhaps "grep" to trim
it down.

You could also run snoop/tcpdump at the same time and try to capture the
data message as it arrives at your Xymon server.  If you have lots of Xymon
traffic it might be better to do so on the client side.

The trick is to get a snapshot at the time that the RRD file is created,
without collecting so much data that you run out of disk!  So doing things
like this:

while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host
blabla; gzip dump.out; mv dump.out.gz dump.out-`date +%s`; done

This will capture 10k of packets at a time, then compress and rotate.

You can also run xymond in a host-specific debug mode, by appending
"--dbghost=HOSTNAME".  That will spit out all the traffic into
/tmp/xymond.dbg for analysis.  Again, you might need to periodically rotate
that file and signal xymond to re-open output files (I'm guessing a HUP
signal might do this, or just kill the process and have xymonlaunch restart
it).

The path the data take would be:

[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] ->
[xymond_rrd] -> [rrd file]

What we want to do is to watch the traffic/messages to determine which of
these components is causing the problem.  My first step would be to try to
isolate whether it's a client or server problem, hence watching the traffic
with tcpdump/snoop.  If the traffic is transmitted over the wire in the
correct form, then I'd look at what xymond gives to xymond_channel.  And so
on.  Once we can identify the process that creates the phantom entity, we
can look for the root cause and then work-arounds/solutions.

J


On 24 February 2015 at 16:46, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
quoted from Vernon Everett
wrote:
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and
your graphs start to look like crap.

Tomorrow I am off to a client where it's happening all the time.
What can I send you to assist with investigating?

I am trying to figure out if it's a bug in Xymon, or a bug in my script.
So far I have no evidence to support it being either.

Regards
Vernon


On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.
It's usually normal.  See Henrik's response to a similar question:

http://lists.xymon.com/archive/2014-April/039461.html

Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes.  It's puzzling, but rare enough that I haven't had time to
investigate.

J

--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list John Thurston · Tue, 24 Feb 2015 08:28:34 -0900 ·
quoted from Vernon Everett
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and
your graphs start to look like crap.
It is my observation that the rrd listener is extremely brittle. If a badly formatted message of the "data" type arrives, it will kill the rrd listener. This leads to gaps in all of the data charts, not just for the client which sent the garbage message. If I remember correctly, and "empty" message is something which will not parse and kills the listener.

When the rrd listener crashes, it leaves a tmp file behind (which causes other problems). I've written a script which I run from time to time. It hunts down the orphaned tmp files and kills them off. The pertinent part of it is identifying the orphans:
rrdfiles=~/server/tmp/rrdctl; comm -13 <(pgrep -u xymon xymond_rrd | sort -n) <(for i in `ls -1 ${rrdfiles}.* | sort -n`;do echo ${i##*.};done)
Take a look and see how many orphans you have, and how frequently they are occurring. Maybe you can identify the culprit.

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list John Thurston · Tue, 24 Feb 2015 08:51:35 -0900 ·
quoted from Vernon Everett
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and
your graphs start to look like crap.

Tomorrow I am off to a client where it's happening all the time.
What can I send you to assist with investigating?
I was looking through my rrd notes. Back in August and September of 2014 
I was chasing an RRD failure. While trying to find the cause, I enabled 
the --debug option on the rrd daemon which caused the daemon to core dump.

http://lists.xymon.com/archive/2014-August/040113.html

So be careful what steps you take to chase down your problems. You may 
make them worse :)
quoted from John Thurston

-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Vernon Everett · Wed, 25 Feb 2015 11:40:58 +0800 ·
Hi Jeremy

Added some debug code to my script. Here's an extract.
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$2 }') # Current IO
latency
      $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo
"$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo
"$DATA"; echo)"
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Max IO
latency
      $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo
"$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo
"$DATA"; echo)"
      DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Avg IO
latency
      $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo
"$DATA"; echo)"
echo      $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo
"$DATA"; echo)"

And I managed to get a couple of bizarre data files.
e-series-dcuriolat,icmpOutParmProbs.rrd
e-series-dcuriolat,icmpOutRedirects.rrd
e-series-dcuriolat,ipv6InTruncatedPkts.rrd
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd

And if I grep in my log file for icmp or any of those terms, I come up with
nothing.
So I am guessing it's not coming from the client.

I want to try the snoop, but this client script is running on the server,
as a client script.
It collects data from a bunch of NetApp E-series devices, and sends it to
the server in the normal way.
So you can imagine what the snoop data is going to look like.
But I will give it a go, and see if there is something in it.

As for debugging the rrd tasks, John was right.
Adding --debug to the rrd config causes it to crash.
Then I just het heaps of this.
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
And the occasional
19073 2015-02-25 11:31:14 2015-02-25 11:31:15 Child process 19073 died:
Signal 6

But I think I am reasonably happy that the strange data isn't coming from
the client script.
Martin Flemming is a list member in Germany (think) who is helping me test
this script.
I will ask him if he's seeing the same issues. If not, I think we can rule
out the script.

Regards
Vernon


On 24 February 2015 at 14:26, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
quoted from Jeremy Laidman
wrote:
I'm assuming you've checked your debug output from your script to see if
the $TEMPFILE.* file contents look OK.

Perhaps run your own instance of "xymond_channel --channel=data" to
capture the messages as they come from xymond to xymond_rrd.  This will
generate a lot of output, so you'll want to use "--filter" and perhaps
"grep" to trim it down.

You could also run snoop/tcpdump at the same time and try to capture the
data message as it arrives at your Xymon server.  If you have lots of Xymon
traffic it might be better to do so on the client side.

The trick is to get a snapshot at the time that the RRD file is created,
without collecting so much data that you run out of disk!  So doing things
like this:

while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host
blabla; gzip dump.out; mv dump.out.gz dump.out-`date +%s`; done

This will capture 10k of packets at a time, then compress and rotate.

You can also run xymond in a host-specific debug mode, by appending
"--dbghost=HOSTNAME".  That will spit out all the traffic into
/tmp/xymond.dbg for analysis.  Again, you might need to periodically rotate
that file and signal xymond to re-open output files (I'm guessing a HUP
signal might do this, or just kill the process and have xymonlaunch restart
it).

The path the data take would be:

[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel]
-> [xymond_rrd] -> [rrd file]

What we want to do is to watch the traffic/messages to determine which of
these components is causing the problem.  My first step would be to try to
isolate whether it's a client or server problem, hence watching the traffic
with tcpdump/snoop.  If the traffic is transmitted over the wire in the
correct form, then I'd look at what xymond gives to xymond_channel.  And so
on.  Once we can identify the process that creates the phantom entity, we
can look for the root cause and then work-arounds/solutions.

J


On 24 February 2015 at 16:46, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
I am getting those sporadic .rrd files in spades. :-(
Sometimes, only a single data point in the file. But enough files, and
your graphs start to look like crap.

Tomorrow I am off to a client where it's happening all the time.
What can I send you to assist with investigating?

I am trying to figure out if it's a bug in Xymon, or a bug in my script.
So far I have no evidence to support it being either.

Regards
Vernon


On 24 February 2015 at 13:14, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 14 November 2014 at 14:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I
stumbled on something else I don't understand, and was hoping somebody out
there could help.

As part of my investigation, I added --debug to the [rrdstatus] and
[rrddata] entries on the server tasks.cfg
And the logs started showing heaps of the message
2014-11-14 10:41:36 Peer not up, flushing message queue
What is that?
It doesn't look right to me.
It's usually normal.  See Henrik's response to a similar question:

http://lists.xymon.com/archive/2014-April/039461.html

Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes.  It's puzzling, but rare enough that I haven't had time to
investigate.

J

--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 25 Feb 2015 15:30:18 +1100 ·
Yes, I'd be surprised if it's a problem in the script.

On 25 February 2015 at 14:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd
These all look to be coming from the [netstat] client data.  Perhaps have a
look at the server's client data, scan through to the [netstat] section,
and see if there are an unusual characters or corruption, either within
[netstat] or the sections before and after it.

I'm not sure what you'd be looking for, but worth a look.

J
list Vernon Everett · Wed, 25 Feb 2015 16:16:13 +0800 ·
Hmm, that theory makes a lot of sense.
Since I have also seen files like
e-series-dcuriolat,vnet.rrd
which looks like it comes from the ifstat section.

But of course, that would be too easy. :-)
I am also seeing
e-series-dmaxiolat,syndev72.rrd
And syndev72 is the hostname of a system completely unrelated to the test
where I am seeing the issues.
There are a few others like that which have hostnames in them.

I am also seeing the likes of
e-series-dcuriolat,vdc1.rrd
There are only 2 hosts throwing out "vdc?" data points, and they also has
nothing to do with the e-series.
The vdc? is from another custom script, which logs and graphs the % busy of
LUNs on 2 servers that are sensitive to disk latency fluctuation.

These hosts all have nothing at all to do with the storage arrays being
monitored, which makes me think the client data might be a red herring.

I have asked Martin if he's seeing the same issue, but he hasn't got back
to me yet.

The Xymon version I have here is 4.3.12, but I have also seen it at my
other customers on version 4.3.17 and 4.3.18 (although to a far lesser
degree)

Regards
Vernon


On 25 February 2015 at 12:30, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
quoted from Jeremy Laidman
wrote:
Yes, I'd be surprised if it's a problem in the script.

On 25 February 2015 at 14:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd
These all look to be coming from the [netstat] client data.  Perhaps have
a look at the server's client data, scan through to the [netstat] section,
and see if there are an unusual characters or corruption, either within
[netstat] or the sections before and after it.

I'm not sure what you'd be looking for, but worth a look.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 25 Feb 2015 21:06:57 +1100 ·
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
quoted from Vernon Everett
wrote:
These hosts all have nothing at all to do with the storage arrays being
monitored, which makes me think the client data might be a red herring.

Yup, makes sense.

My best guess is memory corruption within xymond.  So let's see if the
corruption is visible in the messages being passed between xymond and
xymond_channel.  If we see corrupt messages in there, we can start to delve
into the source code to see if there's a bug somewhere.  Are you able to
run your own instance of xymond_channel?  Maybe something like this:

sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat

One you get an idea what it looks like, change "cat" for something like
"egrep -A5 ^@" to get only the first 5 lines.  Also, redirect to a file
until you notice a dodgy RRD file and then kill the process.

Did you try running xymond with "--dbghost=HOSTNAME" ?  It might be too
voluminous, but might be worth a try, if you can manage to snag the traffic
at the right time.

J
list Vernon Everett · Thu, 26 Feb 2015 15:09:24 +0800 ·
Only back at that client again on Wednesday.
But I do have remote access, so will see what I can do this evening.


On 25 February 2015 at 18:06, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
quoted from Jeremy Laidman
wrote:
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
These hosts all have nothing at all to do with the storage arrays being
monitored, which makes me think the client data might be a red herring.

Yup, makes sense.

My best guess is memory corruption within xymond.  So let's see if the
corruption is visible in the messages being passed between xymond and
xymond_channel.  If we see corrupt messages in there, we can start to delve
into the source code to see if there's a bug somewhere.  Are you able to
run your own instance of xymond_channel?  Maybe something like this:

sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat

One you get an idea what it looks like, change "cat" for something like
"egrep -A5 ^@" to get only the first 5 lines.  Also, redirect to a file
until you notice a dodgy RRD file and then kill the process.

Did you try running xymond with "--dbghost=HOSTNAME" ?  It might be too
voluminous, but might be worth a try, if you can manage to snag the traffic
at the right time.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Wed, 4 Mar 2015 09:40:58 +0800 ·
Hi all

Back at the customer's site, and back on this problem.

I just captured data as this happened.
Ran the command you suggested, slightly modified for our environment. I
also redirected it to a file for analysis.
Here's what I ran, with error output.
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
2015-03-04 08:45:22 Using default environment file
/opt/local/xymon/server/etc/xymonserver.cfg
2015-03-04 08:45:58 Peer not up, flushing message queue
2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.

What is that GOCLIENT thing?
It might be relevant, since it occurred just *after *some errant data files
were created. (Note timestamp)
The errant data files are
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,subversionsize.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,subversionrss.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,subversionmemory.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,subversioncpu.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,subversion.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:02
SINFSPNA01/e-series-davgiolat,energisesize.rrd

This is supposed to graph the average IO latency of the disks in the
e-series, so we expect output to look like this.
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:24
SINFSPNA01/e-series-davgiolat,Tray99_Slot1.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:24
SINFSPNA01/e-series-davgiolat,Tray0_Slot1.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:22
SINFSPNA03/e-series-davgiolat,Tray99_Slot12.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:22
SINFSPNA03/e-series-davgiolat,Tray0_Slot12.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:21
SINFSPNA01/e-series-davgiolat,Tray99_Slot8.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:21
SINFSPNA01/e-series-davgiolat,Tray99_Slot7.rrd
-rw-r--r--   1 xymon    xymon      19580 Mar  4 09:21
SINFSPNA01/e-series-davgiolat,Tray99_Slot6.rrd

The subversion and energize are actually host names for completely
unrelated servers.
The subversionsize and subversionrsss and similar data points are being
sent for another host, but are not related to the e-series graphs.

In the output file, /var/tmp/xymon.out from
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
there is no mention of the subversion or energise stuff either.

Does this narrow the field at all?
Based on Jeremy's earlier email, it looks like the issue is in xymond_rrd,
unless that GOCLIENT error can tell us something more?
quoted from Vernon Everett

Regards
Vernon


On 25 February 2015 at 18:06, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 25 February 2015 at 19:16, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
These hosts all have nothing at all to do with the storage arrays being
monitored, which makes me think the client data might be a red herring.

Yup, makes sense.

My best guess is memory corruption within xymond.  So let's see if the
corruption is visible in the messages being passed between xymond and
xymond_channel.  If we see corrupt messages in there, we can start to delve
into the source code to see if there's a bug somewhere.  Are you able to
run your own instance of xymond_channel?  Maybe something like this:

sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat

One you get an idea what it looks like, change "cat" for something like
"egrep -A5 ^@" to get only the first 5 lines.  Also, redirect to a file
until you notice a dodgy RRD file and then kill the process.

Did you try running xymond with "--dbghost=HOSTNAME" ?  It might be too
voluminous, but might be worth a try, if you can manage to snag the traffic
at the right time.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 4 Mar 2015 17:03:02 +1100 ·
quoted from Vernon Everett
On 4 March 2015 at 12:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Here's what I ran, with error output.
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
2015-03-04 08:45:22 Using default environment file
/opt/local/xymon/server/etc/xymonserver.cfg
2015-03-04 08:45:58 Peer not up, flushing message queue
2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.

What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all
of the xymond_channel instances.  When there are several channel readers,
they all get sent the message address, and as each one accepts the message,
she decrements GOCLIENT.  When GOCLIENT is zero, it means all readers have
received (and probably copied) the message, and the memory can be freed.
Each reader waits until GOCLIENT goes back to zero before waiting for the
next message.

There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go
back to zero.  If the time is exceeded in a channel reader, it means
another reader is taking too long to handle a message, and so the first
reader gives up, logs the error you saw, and carries on with the next
message loop.  I'm not sure if this is a sign of trouble.  Or it might be
normal when you're running your own instance of xymond_channel.  Or it
might be a side-effect of the "cat" command blocking when writing to your
output file due to a high message rate and contention on whatever
filesystem has /var/tmp/.

There's a description of how GOCLIENT works in the file new-daemon.txt, in
the source code.
quoted from Vernon Everett

In the output file, /var/tmp/xymon.out from
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names?  We can't draw any
conclusions if it's not collecting the data we expect.

Did any of the RRD files skip an update at the time the new rogue files
were created?  Do these files match up with entries in xymon.out?  Or
anything interesting at the same time as the rogue entries were created?

If you're seeing correct entries in xymon.out, but not the bogus entries,
then I'm inclined to agree that xymond_rrd is at fault, and is possibly
using memory it's not supposed to.  I wonder if running xymond_rrd with
"--no-cache" might have an effect.  Obviously, it's better if you can cache
updates to the RRD files, but it might narrow down the region of code
that's responsible.

This is not conclusive.  It's possible that when you have two instances of
xymond_channel, only one is corrupting data names, and it just so happened
that it was the one being used by xymond_rrd.  Could be that another time
you would see your extra reader getting the bogus entries.  That's the
problem with using a second instance for analysis, rather than somehow
getting the analysis happening on the one that writes to the RRD files.

On the other hand, if you ran two instances of xymond_rrd, both on the same
data channel, and if both instances create the bogus RRD files, then you
know that you can probably use the second instance to narrow down the
fault, without impacting the creation of RRD files for real work.

Are you still running xymond_rrd with "--debug"?  Did this show anything
interesting when the bogus RRD files were created?

What version of Xymon are you running?  Did this start happening after an
upgrade?  I wonder if it's a bug with some versions but not others.

J
list Vernon Everett · Wed, 4 Mar 2015 15:02:55 +0800 ·
Thanks for your assistance Jeremy.
Quite a bit to digest.

Everything in the xymon.out file I am collecting looks exactly like I would
expect it to look.
There is a line, prefixed by @@data#<sequence> (or at least I think it's a
sequence number),followed by pipe-separated data that looks like the host
name, time stamp, IP address, an empty field, host name again, the rrd-file
prefix, another blank field and the last field is other.
I then get a blank line, and the data looks like what I am trying to send.


Looks like we might need to check with JC for more on that GOCLIENT thing.
I just find it odd that it happened about the same time as the corruption.
I haven't seen it again today, and haven't seen any other corruption either.

At this site, we are running Xymon 4.3.12, but I have seen similar
behaviour (although not to such an extent) elsewhere with 4.3.17, and I
think I also saw it with 4.3.18, but I no longer have access to that site.

I am not seeing any lost data points in the other graphs. But that could be
difficult to spot.
Will run a few rrdtool dumps, and look for gaps at that timestamp. Let you
know what I find.

As for the --debug option, it caused xymond_rrd to crash and burn, dumping
cores as we go.
It gets ugly.
Earlier in this thread, John Thurston mentioned this behaviour too.
It also creates a red xymond_rrd button on the xymon server, but the
results are not overly helpful.
- Program crashedFatal signal caught!

Don't think it started after an upgrade.
Something I did notice, the problem appears to be limited to data only,
used to display graphs in trends.
I am not seeing this for data when there is a status and data component.
Or at least I haven't seen it yet.

What are the implications of running with "--no-cache"?
I have implemented this by adding "--no-cache" but if it's going to have a
long-term impact, I don't want to leave it that way indefinitely.

Regards
Vernon
quoted from Jeremy Laidman


On 4 March 2015 at 14:03, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 4 March 2015 at 12:40, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Here's what I ran, with error output.
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
2015-03-04 08:45:22 Using default environment file
/opt/local/xymon/server/etc/xymonserver.cfg
2015-03-04 08:45:58 Peer not up, flushing message queue
2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.

What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all
of the xymond_channel instances.  When there are several channel readers,
they all get sent the message address, and as each one accepts the message,
she decrements GOCLIENT.  When GOCLIENT is zero, it means all readers have
received (and probably copied) the message, and the memory can be freed.
Each reader waits until GOCLIENT goes back to zero before waiting for the
next message.

There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go
back to zero.  If the time is exceeded in a channel reader, it means
another reader is taking too long to handle a message, and so the first
reader gives up, logs the error you saw, and carries on with the next
message loop.  I'm not sure if this is a sign of trouble.  Or it might be
normal when you're running your own instance of xymond_channel.  Or it
might be a side-effect of the "cat" command blocking when writing to your
output file due to a high message rate and contention on whatever
filesystem has /var/tmp/.

There's a description of how GOCLIENT works in the file new-daemon.txt, in
the source code.

In the output file, /var/tmp/xymon.out from
./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
/var/tmp/xymon.out
there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names?  We can't draw any
conclusions if it's not collecting the data we expect.

Did any of the RRD files skip an update at the time the new rogue files
were created?  Do these files match up with entries in xymon.out?  Or
anything interesting at the same time as the rogue entries were created?

If you're seeing correct entries in xymon.out, but not the bogus entries,
then I'm inclined to agree that xymond_rrd is at fault, and is possibly
using memory it's not supposed to.  I wonder if running xymond_rrd with
"--no-cache" might have an effect.  Obviously, it's better if you can cache
updates to the RRD files, but it might narrow down the region of code
that's responsible.

This is not conclusive.  It's possible that when you have two instances of
xymond_channel, only one is corrupting data names, and it just so happened
that it was the one being used by xymond_rrd.  Could be that another time
you would see your extra reader getting the bogus entries.  That's the
problem with using a second instance for analysis, rather than somehow
getting the analysis happening on the one that writes to the RRD files.

On the other hand, if you ran two instances of xymond_rrd, both on the
same data channel, and if both instances create the bogus RRD files, then
you know that you can probably use the second instance to narrow down the
fault, without impacting the creation of RRD files for real work.

Are you still running xymond_rrd with "--debug"?  Did this show anything
interesting when the bogus RRD files were created?

What version of Xymon are you running?  Did this start happening after an
upgrade?  I wonder if it's a bug with some versions but not others.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Thu, 5 Mar 2015 09:52:38 +1100 ·
quoted from Vernon Everett
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing.
I just find it odd that it happened about the same time as the corruption.
I haven't seen it again today, and haven't seen any other corruption
either.
If there's a correlation it might help us work out where the fault is. But
it might be only a symptom.
quoted from Vernon Everett
As for the --debug option, it caused xymond_rrd to crash and burn,
dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing
the corrupt filename. Have you analyzed the core dumps?
quoted from Vernon Everett
Don't think it started after an upgrade.
Something I did notice, the problem appears to be limited to data only,
used to display graphs in trends.
I am not seeing this for data when there is a status and data component.
Or at least I haven't seen it yet.
Interesting. Maybe you could modify your script to send a status also?
What are the implications of running with "--no-cache"?
I think then cache is only to save on disk I/O. If your disks aren't
"running hot" (ie queueing requests long enough to cause problems) then
youre probably OK to leave it.
quoted from Vernon Everett
I have implemented this by adding "--no-cache" but if it's going to have
a long-term impact, I don't want to leave it that way indefinitely.
Some people have reported solving some other problems by disabling the
cache, with presumably no I'll effects.  For example
http://lists.xymon.com/archive/2009-November/026572.html. It's a
performance improvement that hasn't always been there and so is probably
not necessary for average size monitoring service. Also I think that if you
have operations staff that refresh lots of graphs (hence forcing the cache
to flush) you get minimal benefit from the cache.

But it all depends on how many RRD updates per second your server needs to
make, and how fast your disks and filesystem can update. If you have enough
RAM, your OS might have sufficient buffers to not need the cache anyway.

J
list Japheth Cleaver · Wed, 4 Mar 2015 17:11:19 -0800 ·
quoted from Jeremy Laidman
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid> wrote:
Looks like we might need to check with JC for more on that GOCLIENT
thing.
I just find it odd that it happened about the same time as the
corruption.
I haven't seen it again today, and haven't seen any other corruption
either.
If there's a correlation it might help us work out where the fault is. But
it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn,
dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing
the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate
with xymond for the picking up of messages over SysV IPC. I believe the
messages there are just a side effect of it re-launching the channel
listener pipe to xymond_rrd.


The cache routines in xymond_rrd should be stable at this point. Can you
send a backtrace in from one of the cores? I'm curious where things could
be acting up here.


Regards,

-jc
list Vernon Everett · Wed, 11 Mar 2015 08:37:54 +0800 ·
Hi all

And even with --no-cache, I am still getting these corrupted rrd files.
I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
# adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program
``
/zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform
SUNW,SPARC-Enterprise-T2000
SIGABRT: Abort
$c
libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6)
libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0)
sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160)
libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1)
libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160)
libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924)
libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0)
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23)
rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400)
find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e)
main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374)
_start+0x5c(0, 0, 0, 0, 0, 0)

Anything else I can offer that will assist?

Regards
Vernon
quoted from Japheth Cleaver


On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid>
wrote:
Looks like we might need to check with JC for more on that GOCLIENT
thing.
I just find it odd that it happened about the same time as the
corruption.
I haven't seen it again today, and haven't seen any other corruption
either.
If there's a correlation it might help us work out where the fault is.
But
it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn,
dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing
the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate
with xymond for the picking up of messages over SysV IPC. I believe the
messages there are just a side effect of it re-launching the channel
listener pipe to xymond_rrd.


The cache routines in xymond_rrd should be stable at this point. Can you
send a backtrace in from one of the cores? I'm curious where things could
be acting up here.


Regards,

-jc

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Wed, 11 Mar 2015 09:13:47 +0800 ·
If it helps, here's the same core, with pstack.

# pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899:
xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache
 feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8
 fee329b0 abort    (0, 1, feebafec, ffb3c, fef35518, 0) + 110
 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30
 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c
 feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8
 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60
 --- called from signal handler with signal 11 (SIGSEGV) ---
 fee22d10 strlen   (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50
 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec
 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4
 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74
 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c
 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c
 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4
 00016614 main     (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) + 948
 00015a40 _start   (0, 0, 0, 0, 0, 0) + 5c
quoted from Vernon Everett


On 11 March 2015 at 08:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi all

And even with --no-cache, I am still getting these corrupted rrd files.
I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
# adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program
``
/zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform
SUNW,SPARC-Enterprise-T2000
SIGABRT: Abort
$c
libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6)
libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0)
sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160)
libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1)
libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160)
libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924)
libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0)
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23)
rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400)
find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e)
main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374)
_start+0x5c(0, 0, 0, 0, 0, 0)

Anything else I can offer that will assist?

Regards
Vernon


On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid>
wrote:
Looks like we might need to check with JC for more on that GOCLIENT
thing.
I just find it odd that it happened about the same time as the
corruption.
I haven't seen it again today, and haven't seen any other corruption
either.
If there's a correlation it might help us work out where the fault is.
But
it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn,
dumping cores as we go.
Could be that thensame bug causing the crash during debug is also
causing
the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate
with xymond for the picking up of messages over SysV IPC. I believe the
messages there are just a side effect of it re-launching the channel
listener pipe to xymond_rrd.


The cache routines in xymond_rrd should be stable at this point. Can you
send a backtrace in from one of the cores? I'm curious where things could
be acting up here.


Regards,

-jc

--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Wed, 11 Mar 2015 09:45:06 +0800 ·
This might help too.

# pflags ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899:
xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache
        data model = _ILP32  flags = MSACCT|MSFORK
        sigpend = 0x00020000,0x00000000
 /1:    flags = 0
        sigmask = 0xffffbefc,0x0000ffff  cursig = SIGABRT
quoted from Vernon Everett


On 11 March 2015 at 09:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
If it helps, here's the same core, with pstack.

# pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899:
xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache
 feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8
 fee329b0 abort    (0, 1, feebafec, ffb3c, fef35518, 0) + 110
 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30
 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c
 feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8
 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60
 --- called from signal handler with signal 11 (SIGSEGV) ---
 fee22d10 strlen   (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50
 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec
 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4
 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74
 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c
 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c
 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4
 00016614 main     (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) +
948
 00015a40 _start   (0, 0, 0, 0, 0, 0) + 5c


On 11 March 2015 at 08:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi all

And even with --no-cache, I am still getting these corrupted rrd files.
I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
# adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 --
program ``
/zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on
platform SUNW,SPARC-Enterprise-T2000
SIGABRT: Abort
$c
libc.so.1`_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6)
libc.so.1`abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0)
sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160)
libc.so.1`__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1)
libc.so.1`call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160)
libc.so.1`sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924)
libc.so.1`strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0)
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23)
rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400)
find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e)
main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374)
_start+0x5c(0, 0, 0, 0, 0, 0)

Anything else I can offer that will assist?

Regards
Vernon


On 5 March 2015 at 09:11, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <user-b3f8dacb72c8@xymon.invalid>
wrote:
Looks like we might need to check with JC for more on that GOCLIENT
thing.
I just find it odd that it happened about the same time as the
corruption.
I haven't seen it again today, and haven't seen any other corruption
either.
If there's a correlation it might help us work out where the fault is.
But
it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn,
dumping cores as we go.
Could be that thensame bug causing the crash during debug is also
causing
the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners
communicate
with xymond for the picking up of messages over SysV IPC. I believe the
messages there are just a side effect of it re-launching the channel
listener pipe to xymond_rrd.


The cache routines in xymond_rrd should be stable at this point. Can you
send a backtrace in from one of the cores? I'm curious where things could
be acting up here.


Regards,

-jc

--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 11 Mar 2015 12:52:14 +1100 ·
quoted from Vernon Everett
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
quoted from Vernon Everett

I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which
in turn calls vprintf() from libc, but with bad parameters.  I've had a
look through the code in dump_tcp_services() and I don't know enough C to
recognize any problems.  But it might be useful to know which call to
dbgprintf() is causing the problem.

Does the log file for xymond_rrd show any debug output at all?  If so,
what's the last line that is shown.

It might be helpful if you can recompile xymond_rrd with
dump_tcp_services() modified.  Initially, I would simply try it with
"return" added after the first call to dbgprintf().  That is,
dump_tcp_services() will output "Service list dump" and return.  This might
stop the core dumps so that we can get debug output for other parts of the
xymond_rrd processing.

If adding "return" at that point fixes this core dump, more diagnostic
lines would be useful to determine what the problem is.  For example,
there's a global array called svcinfo that is iterated over, but if the
array is empty, it might cause the core dump.  So adding a line that checks
whether the array is empty and displays the result would help to pin this
down.

Note that "svcinfo" appears to be populated from the protocols.cfg file
and/or XYMONNETSVCS.  Is it possible that your protocols.cfg file is empty,
or has some syntax error that causes it to be unparseable?  The same for
XYMONNETSVCS (in xymonserver.cfg)?

J
list Vernon Everett · Wed, 11 Mar 2015 10:03:09 +0800 ·
Looking at the logs below, it appears something might be broken in the
[telnet] definition.
Here's the first few lines from my protocols.cfg file
[ftp]
   send "quit\r\n"
   expect "220"
   options banner
   port 21

[ftps]
   send "quit\r\n"
   expect "220"
   options ssl,banner
   port 990

[ssh|ssh1|ssh2]
   send "SSH-2.0-OpenSSH_4.1\r\n"
   expect "SSH"
   options banner
   port 22

[telnet]
   options banner,telnet
   port 23


From rrd-data.log
5614 2015-03-11 08:29:07 Opening file
/opt/local/xymon/server/etc/rrddefinitions.cfg
5614 2015-03-11 08:29:07 Want msg 1, startpos 0, fillpos 0, endpos -1,
usedbytes=0, bufleft=2101247
5614 2015-03-11 08:29:07 Got 403 bytes
5614 2015-03-11 08:29:07 xymond_rrd: Got message 849930
@@data#849930/transfer|1426033746.620045|172.21.219.192||transfer|iostatdisk|sunos|unix/UnixInf
5614 2015-03-11 08:29:07 startpos 403, fillpos 403, endpos -1
5614 2015-03-11 08:29:07 Transport setup is:
5614 2015-03-11 08:29:07 xymondportnumber = 1984
5614 2015-03-11 08:29:07 xymonproxyhost = NONE
5614 2015-03-11 08:29:07 xymonproxyport = 0
5614 2015-03-11 08:29:07 Recipient listed as '172.21.219.192'
5614 2015-03-11 08:29:07 Standard protocol on port 1984
5614 2015-03-11 08:29:07 Will connect to address 172.21.219.192 port 1984
5614 2015-03-11 08:29:07 Connect status is 0
5614 2015-03-11 08:29:07 Sent 16 bytes
5614 2015-03-11 08:29:07 Read 8192 bytes
5614 2015-03-11 08:29:07 Read 7657 bytes
5614 2015-03-11 08:29:07 Closing connection
5614 2015-03-11 08:29:07 Opening file
/opt/local/xymon/server/etc/analysis.cfg
5614 2015-03-11 08:29:07 Compiling regex ctrldom[A-D]
5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1
5614 2015-03-11 08:29:07 Compiling regex .+dom..
5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1|notesdomD1
5614 2015-03-11 08:29:07 Compiling regex [.:]80$
5614 2015-03-11 08:29:07 Compiling regex syn.*,%oradom.*
5614 2015-03-11 08:29:07 Compiling regex syn.*|oradom.*|SINFV.*
5614 2015-03-11 08:29:07 Compiling regex [.:]514$
5614 2015-03-11 08:29:07 Compiling regex [.:]3389$
5614 2015-03-11 08:29:07 Compiling regex energise.*
5614 2015-03-11 08:29:07 Compiling regex [.:]8989$
5614 2015-03-11 08:29:07 Compiling regex [.:]80$
5614 2015-03-11 08:29:07 Compiling regex [.:]4848$
5614 2015-03-11 08:29:07 Compiling regex [.:]8080$
5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.*
5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.*
5614 2015-03-11 08:29:07 Compiling regex [.:]22$
5614 2015-03-11 08:29:07 Compiling regex sinfct.*|domino.*
5614 2015-03-11 08:29:07 Compiling regex [.:]80$
5614 2015-03-11 08:29:07 Compiling regex [.:]25$
5614 2015-03-11 08:29:07 Compiling regex
sinfct.*|domino.*|SINFV.*|hermes|sirocco
5614 2015-03-11 08:29:07 Compiling regex SINFV.*
5614 2015-03-11 08:29:07 Compiling regex .*
5614 2015-03-11 08:29:07 Compiling regex ^error.*
5614 2015-03-11 08:29:07 Compiling regex ^error.*
5614 2015-03-11 08:29:07 Compiling regex .*restart/shutdown.*|.*Citrix
5614 2015-03-11 08:29:07 Compiling regex .*
5614 2015-03-11 08:29:07 Compiling regex ^warning.*
5614 2015-03-11 08:29:07 Compiling regex ^warning.*
5614 2015-03-11 08:29:07 Compiling regex
.*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer
5614 2015-03-11 08:29:07 Compiling regex .*
5614 2015-03-11 08:29:07 Compiling regex ^failure.*
5614 2015-03-11 08:29:07 Compiling regex ^failure.*
5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.*
5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.*
5614 2015-03-11 08:29:07 Compiling regex [.:]22$
5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.*
5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.*
5614 2015-03-11 08:29:07 Compiling regex [.:]22$
5614 2015-03-11 08:29:07 Opening file
/opt/local/xymon/server/etc/protocols.cfg
5614 2015-03-11 08:29:07 Service list dump
5614 2015-03-11 08:29:07  Name      : ftp
5614 2015-03-11 08:29:07    Sendtext: quit\r\n
5614 2015-03-11 08:29:07    Sendlen : 6
5614 2015-03-11 08:29:07    Exp.text: 220
5614 2015-03-11 08:29:07    Exp.len : 3
5614 2015-03-11 08:29:07    Exp.ofs : 0
5614 2015-03-11 08:29:07    Flags   : 1
5614 2015-03-11 08:29:07    Port    : 21
5614 2015-03-11 08:29:07  Name      : ftps
5614 2015-03-11 08:29:07    Sendtext: quit\r\n
5614 2015-03-11 08:29:07    Sendlen : 6
5614 2015-03-11 08:29:07    Exp.text: 220
5614 2015-03-11 08:29:07    Exp.len : 3
5614 2015-03-11 08:29:07    Exp.ofs : 0
5614 2015-03-11 08:29:07    Flags   : 5
5614 2015-03-11 08:29:07    Port    : 990
5614 2015-03-11 08:29:07  Name      : ssh
5614 2015-03-11 08:29:07    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5614 2015-03-11 08:29:07    Sendlen : 21
5614 2015-03-11 08:29:07    Exp.text: SSH
5614 2015-03-11 08:29:07    Exp.len : 3
5614 2015-03-11 08:29:07    Exp.ofs : 0
5614 2015-03-11 08:29:07    Flags   : 1
5614 2015-03-11 08:29:07    Port    : 22
5614 2015-03-11 08:29:07  Name      : ssh1
5614 2015-03-11 08:29:07    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5614 2015-03-11 08:29:07    Sendlen : 21
5614 2015-03-11 08:29:07    Exp.text: SSH
5614 2015-03-11 08:29:07    Exp.len : 3
5614 2015-03-11 08:29:07    Exp.ofs : 0
5614 2015-03-11 08:29:07    Flags   : 1
5614 2015-03-11 08:29:07    Port    : 22
5614 2015-03-11 08:29:07  Name      : ssh2
5614 2015-03-11 08:29:07    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5614 2015-03-11 08:29:07    Sendlen : 21
5614 2015-03-11 08:29:07    Exp.text: SSH
5614 2015-03-11 08:29:07    Exp.len : 3
5614 2015-03-11 08:29:07    Exp.ofs : 0
5614 2015-03-11 08:29:07    Flags   : 1
5614 2015-03-11 08:29:07    Port    : 22
5614 2015-03-11 08:29:07  Name      : telnet
5614 2015-03-11 08:29:07 2015-03-11 08:29:08 Child process 5614 died:
Signal 6

From rrd-status.log
5376 2015-03-11 08:29:06 Opening file
/opt/local/xymon/server/etc/rrddefinitions.cfg
5376 2015-03-11 08:29:06 Want msg 1, startpos 0, fillpos 0, endpos -1,
usedbytes=0, bufleft=2101247
5376 2015-03-11 08:29:06 Got 6075 bytes
5376 2015-03-11 08:29:06 xymond_rrd: Got message 621101
@@status#621101/sinfctap01|1426033745.004291|172.21.198.73||sinfctap01|svcs|1426035545|green||green|1417272435|0||0||1426033594|sunos|unix/appzones|0|
5376 2015-03-11 08:29:06 startpos 6075, fillpos 6075, endpos -1
5376 2015-03-11 08:29:06 Transport setup is:
5376 2015-03-11 08:29:06 xymondportnumber = 1984
5376 2015-03-11 08:29:06 xymonproxyhost = NONE
5376 2015-03-11 08:29:06 xymonproxyport = 0
5376 2015-03-11 08:29:06 Recipient listed as '172.21.219.192'
5376 2015-03-11 08:29:06 Standard protocol on port 1984
5376 2015-03-11 08:29:06 Will connect to address 172.21.219.192 port 1984
5376 2015-03-11 08:29:06 Connect status is 0
5376 2015-03-11 08:29:06 Sent 16 bytes
5376 2015-03-11 08:29:06 Read 8192 bytes
5376 2015-03-11 08:29:06 Read 7657 bytes
5376 2015-03-11 08:29:06 Closing connection
5376 2015-03-11 08:29:06 Opening file
/opt/local/xymon/server/etc/analysis.cfg
5376 2015-03-11 08:29:06 Compiling regex ctrldom[A-D]
5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1
5376 2015-03-11 08:29:06 Compiling regex .+dom..
5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1|notesdomD1
5376 2015-03-11 08:29:06 Compiling regex [.:]80$
5376 2015-03-11 08:29:06 Compiling regex syn.*,%oradom.*
5376 2015-03-11 08:29:06 Compiling regex syn.*|oradom.*|SINFV.*
5376 2015-03-11 08:29:06 Compiling regex [.:]514$
5376 2015-03-11 08:29:06 Compiling regex [.:]3389$
5376 2015-03-11 08:29:06 Compiling regex energise.*
5376 2015-03-11 08:29:06 Compiling regex [.:]8989$
5376 2015-03-11 08:29:06 Compiling regex [.:]80$
5376 2015-03-11 08:29:06 Compiling regex [.:]4848$
5376 2015-03-11 08:29:06 Compiling regex [.:]8080$
5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.*
5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.*
5376 2015-03-11 08:29:06 Compiling regex [.:]22$
5376 2015-03-11 08:29:06 Compiling regex sinfct.*|domino.*
5376 2015-03-11 08:29:06 Compiling regex [.:]80$
5376 2015-03-11 08:29:06 Compiling regex [.:]25$
5376 2015-03-11 08:29:06 Compiling regex
sinfct.*|domino.*|SINFV.*|hermes|sirocco
5376 2015-03-11 08:29:06 Compiling regex SINFV.*
5376 2015-03-11 08:29:06 Compiling regex .*
5376 2015-03-11 08:29:06 Compiling regex ^error.*
5376 2015-03-11 08:29:06 Compiling regex ^error.*
5376 2015-03-11 08:29:06 Compiling regex .*restart/shutdown.*|.*Citrix
5376 2015-03-11 08:29:06 Compiling regex .*
5376 2015-03-11 08:29:06 Compiling regex ^warning.*
5376 2015-03-11 08:29:06 Compiling regex ^warning.*
5376 2015-03-11 08:29:06 Compiling regex
.*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer
5376 2015-03-11 08:29:06 Compiling regex .*
5376 2015-03-11 08:29:06 Compiling regex ^failure.*
5376 2015-03-11 08:29:06 Compiling regex ^failure.*
5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.*
5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.*
5376 2015-03-11 08:29:06 Compiling regex [.:]22$
5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.*
5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.*
5376 2015-03-11 08:29:06 Compiling regex [.:]22$
5376 2015-03-11 08:29:06 Opening file
/opt/local/xymon/server/etc/protocols.cfg
5376 2015-03-11 08:29:06 Service list dump
5376 2015-03-11 08:29:06  Name      : ftp
5376 2015-03-11 08:29:06    Sendtext: quit\r\n
5376 2015-03-11 08:29:06    Sendlen : 6
5376 2015-03-11 08:29:06    Exp.text: 220
5376 2015-03-11 08:29:06    Exp.len : 3
5376 2015-03-11 08:29:06    Exp.ofs : 0
5376 2015-03-11 08:29:06    Flags   : 1
5376 2015-03-11 08:29:06    Port    : 21
5376 2015-03-11 08:29:06  Name      : ftps
5376 2015-03-11 08:29:06    Sendtext: quit\r\n
5376 2015-03-11 08:29:06    Sendlen : 6
5376 2015-03-11 08:29:06    Exp.text: 220
5376 2015-03-11 08:29:06    Exp.len : 3
5376 2015-03-11 08:29:06    Exp.ofs : 0
5376 2015-03-11 08:29:06    Flags   : 5
5376 2015-03-11 08:29:06    Port    : 990
5376 2015-03-11 08:29:06  Name      : ssh
5376 2015-03-11 08:29:06    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5376 2015-03-11 08:29:06    Sendlen : 21
5376 2015-03-11 08:29:06    Exp.text: SSH
5376 2015-03-11 08:29:06    Exp.len : 3
5376 2015-03-11 08:29:06    Exp.ofs : 0
5376 2015-03-11 08:29:06    Flags   : 1
5376 2015-03-11 08:29:06    Port    : 22
5376 2015-03-11 08:29:06  Name      : ssh1
5376 2015-03-11 08:29:06    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5376 2015-03-11 08:29:06    Sendlen : 21
5376 2015-03-11 08:29:06    Exp.text: SSH
5376 2015-03-11 08:29:06    Exp.len : 3
5376 2015-03-11 08:29:06    Exp.ofs : 0
5376 2015-03-11 08:29:06    Flags   : 1
5376 2015-03-11 08:29:06    Port    : 22
5376 2015-03-11 08:29:06  Name      : ssh2
5376 2015-03-11 08:29:06    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
5376 2015-03-11 08:29:06    Sendlen : 21
5376 2015-03-11 08:29:06    Exp.text: SSH
5376 2015-03-11 08:29:06    Exp.len : 3
5376 2015-03-11 08:29:06    Exp.ofs : 0
5376 2015-03-11 08:29:06    Flags   : 1
5376 2015-03-11 08:29:06    Port    : 22
5376 2015-03-11 08:29:06  Name      : telnet
5376 2015-03-11 08:29:06 2015-03-11 08:29:07 Peer at 0.0.0.0:0 failed:
Broken pipe
2015-03-11 08:29:07 Child process 5376 died: Signal 6
quoted from Jeremy Laidman


On 11 March 2015 at 09:52, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
quoted from Jeremy Laidman

I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which
in turn calls vprintf() from libc, but with bad parameters.  I've had a
look through the code in dump_tcp_services() and I don't know enough C to
recognize any problems.  But it might be useful to know which call to
dbgprintf() is causing the problem.

Does the log file for xymond_rrd show any debug output at all?  If so,
what's the last line that is shown.

It might be helpful if you can recompile xymond_rrd with
dump_tcp_services() modified.  Initially, I would simply try it with
"return" added after the first call to dbgprintf().  That is,
dump_tcp_services() will output "Service list dump" and return.  This might
stop the core dumps so that we can get debug output for other parts of the
xymond_rrd processing.

If adding "return" at that point fixes this core dump, more diagnostic
lines would be useful to determine what the problem is.  For example,
there's a global array called svcinfo that is iterated over, but if the
array is empty, it might cause the core dump.  So adding a line that checks
whether the array is empty and displays the result would help to pin this
down.

Note that "svcinfo" appears to be populated from the protocols.cfg file
and/or XYMONNETSVCS.  Is it possible that your protocols.cfg file is empty,
or has some syntax error that causes it to be unparseable?  The same for
XYMONNETSVCS (in xymonserver.cfg)?

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Wed, 11 Mar 2015 10:13:08 +0800 ·
Hi Jeremy

As much as I like the idea of modifying the code and recompiling, we need
to remember, this is a production system.
I get frowned upon when I add --debug just to create a core dump. :-(

I doubt changing source code and recompiling is going to get a green light.

As for your question about XYMONNETSVC, it's not a variable that's defined
anywhere in my config.
etc]# grep XYMONNETSVC *
return nothing.

Regards
Vernon
quoted from Jeremy Laidman


On 11 March 2015 at 09:52, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 11:37, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
quoted from Vernon Everett

I tried again with --debug (and --no-cache) and it core dumps.

Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which
in turn calls vprintf() from libc, but with bad parameters.  I've had a
look through the code in dump_tcp_services() and I don't know enough C to
recognize any problems.  But it might be useful to know which call to
dbgprintf() is causing the problem.

Does the log file for xymond_rrd show any debug output at all?  If so,
what's the last line that is shown.

It might be helpful if you can recompile xymond_rrd with
dump_tcp_services() modified.  Initially, I would simply try it with
"return" added after the first call to dbgprintf().  That is,
dump_tcp_services() will output "Service list dump" and return.  This might
stop the core dumps so that we can get debug output for other parts of the
xymond_rrd processing.

If adding "return" at that point fixes this core dump, more diagnostic
lines would be useful to determine what the problem is.  For example,
there's a global array called svcinfo that is iterated over, but if the
array is empty, it might cause the core dump.  So adding a line that checks
whether the array is empty and displays the result would help to pin this
down.

Note that "svcinfo" appears to be populated from the protocols.cfg file
and/or XYMONNETSVCS.  Is it possible that your protocols.cfg file is empty,
or has some syntax error that causes it to be unparseable?  The same for
XYMONNETSVCS (in xymonserver.cfg)?

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 11 Mar 2015 14:02:24 +1100 ·
quoted from Vernon Everett
On 11 March 2015 at 13:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
As much as I like the idea of modifying the code and recompiling, we need
to remember, this is a production system.
I get frowned upon when I add --debug just to create a core dump. :-(

I doubt changing source code and recompiling is going to get a green light.
Yep, fair enough.  I'm guessing you can't reproduce the core dump on a
non-production system.
quoted from Vernon Everett

As for your question about XYMONNETSVC, it's not a variable that's defined
anywhere in my config.
etc]# grep XYMONNETSVC *
return nothing.
Haha, looks like it's an internal environment variable that Xymon uses to
pass around the list of xymonnet services.  It's mentioned in the man page
for xymonnet, but otherwise it appears to be unexposed and intended to be
for-internal-use-only.

Interesting about the telnet definition.  Do you do any telnet testing?  If
not, can you comment the stanzas for telnet and telnets out of
protocols.cfg and try again?

Something else to try is to put in a "send" and "expect" lines such as:

[telnet]
  send ""
  expect ""
  options banner,telnet
  port 23

I think this might be the cause of the core dumps.  For protocols.cfg
stanzas that don't have "send" or "expect" strings, the debug code tries to
print such strings, and they turn out to be null pointers or something like
that.

J
list Vernon Everett · Wed, 11 Mar 2015 11:18:50 +0800 ·
Hi Jeremy

I think you got that one nailed.
I added
  send ""
  expect ""
as you requested, and it cores again.
This time on the next entry, which is telnets
[telnets]
   options ssl,banner,telnet
   port 992
Again, no send and expect.
Updated telnets, and it crapped out at ldap.
[ldap]
   port 389

[ldaps]
   options ssl
   port 636

[rsync]
   expect "@RSYNCD"
   options banner
   port 873
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.

Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
A good win, but still doesn't solve my original problem. :-(
quoted from Jeremy Laidman

Regards
Vernon


On 11 March 2015 at 11:02, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 13:13, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
As much as I like the idea of modifying the code and recompiling, we need
to remember, this is a production system.
I get frowned upon when I add --debug just to create a core dump. :-(

I doubt changing source code and recompiling is going to get a green
light.
Yep, fair enough.  I'm guessing you can't reproduce the core dump on a
non-production system.

As for your question about XYMONNETSVC, it's not a variable that's
defined anywhere in my config.
etc]# grep XYMONNETSVC *
return nothing.
Haha, looks like it's an internal environment variable that Xymon uses to
pass around the list of xymonnet services.  It's mentioned in the man page
for xymonnet, but otherwise it appears to be unexposed and intended to be
for-internal-use-only.

Interesting about the telnet definition.  Do you do any telnet testing?
If not, can you comment the stanzas for telnet and telnets out of
protocols.cfg and try again?

Something else to try is to put in a "send" and "expect" lines such as:

[telnet]
  send ""
  expect ""
  options banner,telnet
  port 23

I think this might be the cause of the core dumps.  For protocols.cfg
stanzas that don't have "send" or "expect" strings, the debug code tries to
print such strings, and they turn out to be null pointers or something like
that.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 11 Mar 2015 15:02:51 +1100 ·
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
I think you got that one nailed.
You might be overselling this finding a little.  But it's progress
nonetheless.
quoted from Vernon Everett

I added
  send ""
  expect ""
as you requested, and it cores again.
This time on the next entry, which is telnets
I run v4.3.10, and I can't get it to dump core the same way yours does,
when "--debug" is used.  Mine says "(null)" when there is no send or expect
defined.  The source code file containing the code that triggers the core
dump is identical to one from 4.3.18.  I've also tested it on a 4.3.18
installation (using the RPM package from Terabithia) but also won't fail.

Please can you try this:

printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp
~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug

(adjust path to xymond_rrd and xymon username if required)

For me, this displays the debug output at least as far as when parsing the
protocols.cfg file.  If this also dumps core for you then we can work on
fixing the core dump without risking your production Xymon config.  You
could also build a binary and run it from an alternative location so that
the production binary isn't touched.

I realize this doesn't fix your problem for you.  But I'm optimistic that
the debug output might point to where your problem is.

J
list Vernon Everett · Wed, 11 Mar 2015 12:43:18 +0800 ·
We have 4.3.12
And the command dumps core in the same way.


$ printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" |
XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug
16816 2015-03-11 12:38:02 Opening file
/opt/local/xymon/server/etc/rrddefinitions.cfg
16816 2015-03-11 12:38:02 Want msg 1, startpos 0, fillpos 0, endpos -1,
usedbytes=0, bufleft=528383
16816 2015-03-11 12:38:02 Got 46 bytes
16816 2015-03-11 12:38:02 /opt/local/xymon/server/bin/xymond_rrd: Got
message 0 @@data#0/host|||||conn||
16816 2015-03-11 12:38:02 startpos 46, fillpos 46, endpos -1
16816 2015-03-11 12:38:02 Transport setup is:
16816 2015-03-11 12:38:02 xymondportnumber = 1984
16816 2015-03-11 12:38:02 xymonproxyhost = NONE
16816 2015-03-11 12:38:02 xymonproxyport = 0
16816 2015-03-11 12:38:02 Recipient listed as '172.21.219.192'
16816 2015-03-11 12:38:02 Standard protocol on port 1984
16816 2015-03-11 12:38:02 Will connect to address 172.21.219.192 port 1984
16816 2015-03-11 12:38:02 Connect status is 0
16816 2015-03-11 12:38:02 Sent 16 bytes
16816 2015-03-11 12:38:02 Read 15849 bytes
16816 2015-03-11 12:38:02 Closing connection
16816 2015-03-11 12:38:02 Opening file
/opt/local/xymon/server/etc/analysis.cfg
16816 2015-03-11 12:38:02 Compiling regex ctrldom[A-D]
16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1
16816 2015-03-11 12:38:02 Compiling regex .+dom..
16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1|notesdomD1
16816 2015-03-11 12:38:02 Compiling regex [.:]80$
16816 2015-03-11 12:38:02 Compiling regex syn.*,%oradom.*
16816 2015-03-11 12:38:02 Compiling regex syn.*|oradom.*|SINFV.*
16816 2015-03-11 12:38:02 Compiling regex [.:]514$
16816 2015-03-11 12:38:02 Compiling regex [.:]3389$
16816 2015-03-11 12:38:02 Compiling regex energise.*
16816 2015-03-11 12:38:02 Compiling regex [.:]8989$
16816 2015-03-11 12:38:02 Compiling regex [.:]80$
16816 2015-03-11 12:38:02 Compiling regex [.:]4848$
16816 2015-03-11 12:38:02 Compiling regex [.:]8080$
16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.*
16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.*
16816 2015-03-11 12:38:02 Compiling regex [.:]22$
16816 2015-03-11 12:38:02 Compiling regex sinfct.*|domino.*
16816 2015-03-11 12:38:02 Compiling regex [.:]80$
16816 2015-03-11 12:38:02 Compiling regex [.:]25$
16816 2015-03-11 12:38:02 Compiling regex
sinfct.*|domino.*|SINFV.*|hermes|sirocco
16816 2015-03-11 12:38:02 Compiling regex SINFV.*
16816 2015-03-11 12:38:02 Compiling regex .*
16816 2015-03-11 12:38:02 Compiling regex ^error.*
16816 2015-03-11 12:38:02 Compiling regex ^error.*
16816 2015-03-11 12:38:02 Compiling regex .*restart/shutdown.*|.*Citrix
16816 2015-03-11 12:38:02 Compiling regex .*
16816 2015-03-11 12:38:02 Compiling regex ^warning.*
16816 2015-03-11 12:38:02 Compiling regex ^warning.*
16816 2015-03-11 12:38:02 Compiling regex
.*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer
16816 2015-03-11 12:38:02 Compiling regex .*
16816 2015-03-11 12:38:02 Compiling regex ^failure.*
16816 2015-03-11 12:38:02 Compiling regex ^failure.*
16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.*
16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.*
16816 2015-03-11 12:38:02 Compiling regex [.:]22$
16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.*
16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.*
16816 2015-03-11 12:38:02 Compiling regex [.:]22$
16816 2015-03-11 12:38:02 Opening file
/opt/local/xymon/server/etc/protocols.cfg
16816 2015-03-11 12:38:02 Service list dump
16816 2015-03-11 12:38:02  Name      : ftp
16816 2015-03-11 12:38:02    Sendtext: quit\r\n
16816 2015-03-11 12:38:02    Sendlen : 6
16816 2015-03-11 12:38:02    Exp.text: 220
16816 2015-03-11 12:38:02    Exp.len : 3
16816 2015-03-11 12:38:02    Exp.ofs : 0
16816 2015-03-11 12:38:02    Flags   : 1
16816 2015-03-11 12:38:02    Port    : 21
16816 2015-03-11 12:38:02  Name      : ftps
16816 2015-03-11 12:38:02    Sendtext: quit\r\n
16816 2015-03-11 12:38:02    Sendlen : 6
16816 2015-03-11 12:38:02    Exp.text: 220
16816 2015-03-11 12:38:02    Exp.len : 3
16816 2015-03-11 12:38:02    Exp.ofs : 0
16816 2015-03-11 12:38:02    Flags   : 5
16816 2015-03-11 12:38:02    Port    : 990
16816 2015-03-11 12:38:02  Name      : ssh
16816 2015-03-11 12:38:02    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
16816 2015-03-11 12:38:02    Sendlen : 21
16816 2015-03-11 12:38:02    Exp.text: SSH
16816 2015-03-11 12:38:02    Exp.len : 3
16816 2015-03-11 12:38:02    Exp.ofs : 0
16816 2015-03-11 12:38:02    Flags   : 1
16816 2015-03-11 12:38:02    Port    : 22
16816 2015-03-11 12:38:02  Name      : ssh1
16816 2015-03-11 12:38:02    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
16816 2015-03-11 12:38:02    Sendlen : 21
16816 2015-03-11 12:38:02    Exp.text: SSH
16816 2015-03-11 12:38:02    Exp.len : 3
16816 2015-03-11 12:38:02    Exp.ofs : 0
16816 2015-03-11 12:38:02    Flags   : 1
16816 2015-03-11 12:38:02    Port    : 22
16816 2015-03-11 12:38:02  Name      : ssh2
16816 2015-03-11 12:38:02    Sendtext: SSH-2.0-OpenSSH_4.1\r\n
16816 2015-03-11 12:38:02    Sendlen : 21
16816 2015-03-11 12:38:02    Exp.text: SSH
16816 2015-03-11 12:38:02    Exp.len : 3
16816 2015-03-11 12:38:02    Exp.ofs : 0
16816 2015-03-11 12:38:02    Flags   : 1
16816 2015-03-11 12:38:02    Port    : 22
16816 2015-03-11 12:38:02  Name      : telnet
16816 2015-03-11 12:38:02 Abort (core dumped)
quoted from Jeremy Laidman


On 11 March 2015 at 12:02, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
I think you got that one nailed.
You might be overselling this finding a little.  But it's progress
nonetheless.

I added
  send ""
  expect ""
as you requested, and it cores again.
This time on the next entry, which is telnets
I run v4.3.10, and I can't get it to dump core the same way yours does,
when "--debug" is used.  Mine says "(null)" when there is no send or expect
defined.  The source code file containing the code that triggers the core
dump is identical to one from 4.3.18.  I've also tested it on a 4.3.18
installation (using the RPM package from Terabithia) but also won't fail.

Please can you try this:

printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" |
XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug

(adjust path to xymond_rrd and xymon username if required)

For me, this displays the debug output at least as far as when parsing the
protocols.cfg file.  If this also dumps core for you then we can work on
fixing the core dump without risking your production Xymon config.  You
could also build a binary and run it from an alternative location so that
the production binary isn't touched.

I realize this doesn't fix your problem for you.  But I'm optimistic that
the debug output might point to where your problem is.

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Wed, 11 Mar 2015 17:50:05 +1100 ·
Yes, apparently, "(null)" is what GNU's printf outputs when asked to print
a string pointer that is null.  The POSIX behaviour is not defined, and so
in some systems, it dumps core rather than detecting the illegal string.

This bit of code demonstrates:

#include <stdio.h>
int main() {
        printf("test: %s\n",NULL);
}

On Linux and FreeBSD, this prints out "test: (null)" but on Solaris 10 this
dumps core.

Fixing this is beyond my skill level.  But I would start looking at the
functions errprintf() and dbgprintf() to see if NULL strings can be
detected and prevented from being passed through to *printf.

But that might be somewhat tricky.  Alternatively, updating
dump_tcp_services() to do the same, is probably a much simpler task, but
doesn't safeguard any other code from the same fate.

J


On 11 March 2015 at 17:38, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Even simpler:

#include <stdio.h>
int main() {
        printf("test: %s\n",NULL);
}


On 11 March 2015 at 17:33, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 11 March 2015 at 15:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
quoted from Vernon Everett
wrote:
We have 4.3.12
And the command dumps core in the same way.
That's good.

Can you try this out?  Plop into a text file, eg test.c, then "make test"
and then "./test".

#include <stdio.h>
#include <stdarg.h>
void dbgprintf(const char *fmt, ...)
{
        va_list args;
        va_start(args, fmt);
        vfprintf(stdout, fmt, args);
        va_end(args);
        fflush(stdout);
}
int main() {
        dbgprintf("test: %s\n",NULL);
}

This is the essence of the Xymon code that prints the empty send/expect
strings.  But I've pared away all of the Xymon code to leave just the call
to vprintf() and whatever is needed to make it run.

On my systems I get the following output:

test: (null)

I'm guessing on your system you'll get a core dump.  If so, it would
suggest a difference in implementations of the standard library functions.
If that's the case, we simply (!) need to catch the attempts to print NULL
pointers, and substitute our own "(null)" string.

Indeed, I just tested this on a Solaris system and it dumped core.  On
Linux and FreeBSD systems, it printed "(null)".

J

list Vernon Everett · Wed, 11 Mar 2015 15:19:42 +0800 ·
[tmp]$ make CC=/usr/local/bin/gcc test
/usr/local/bin/gcc    -o test test.c
[tmp]$ ./test
Segmentation Fault (core dumped)

As you predicted.
quoted from Jeremy Laidman

On 11 March 2015 at 14:50, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Yes, apparently, "(null)" is what GNU's printf outputs when asked to print
a string pointer that is null.  The POSIX behaviour is not defined, and so
in some systems, it dumps core rather than detecting the illegal string.

This bit of code demonstrates:

#include <stdio.h>
int main() {
        printf("test: %s\n",NULL);
}

On Linux and FreeBSD, this prints out "test: (null)" but on Solaris 10
this dumps core.

Fixing this is beyond my skill level.  But I would start looking at the
functions errprintf() and dbgprintf() to see if NULL strings can be
detected and prevented from being passed through to *printf.

But that might be somewhat tricky.  Alternatively, updating
dump_tcp_services() to do the same, is probably a much simpler task, but
doesn't safeguard any other code from the same fate.

J


On 11 March 2015 at 17:38, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
Even simpler:

#include <stdio.h>
int main() {
        printf("test: %s\n",NULL);
}


On 11 March 2015 at 17:33, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
On 11 March 2015 at 15:43, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
We have 4.3.12
And the command dumps core in the same way.
That's good.

Can you try this out?  Plop into a text file, eg test.c, then "make
test" and then "./test".

#include <stdio.h>
#include <stdarg.h>
void dbgprintf(const char *fmt, ...)
{
        va_list args;
        va_start(args, fmt);
        vfprintf(stdout, fmt, args);
        va_end(args);
        fflush(stdout);
}
int main() {
        dbgprintf("test: %s\n",NULL);
}

This is the essence of the Xymon code that prints the empty send/expect
strings.  But I've pared away all of the Xymon code to leave just the call
to vprintf() and whatever is needed to make it run.

On my systems I get the following output:

test: (null)

I'm guessing on your system you'll get a core dump.  If so, it would
suggest a difference in implementations of the standard library functions.
If that's the case, we simply (!) need to catch the attempts to print NULL
pointers, and substitute our own "(null)" string.

Indeed, I just tested this on a Solaris system and it dumped core.  On
Linux and FreeBSD systems, it printed "(null)".

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list John Thurston · Wed, 11 Mar 2015 08:47:08 -0800 ·
quoted from Vernon Everett
On 3/10/2015 7:18 PM, Vernon Everett wrote:
Hi Jeremy

I think you got that one nailed.
I added
   send ""
   expect ""
as you requested, and it cores again.
This time on the next entry, which is telnets
[telnets]
    options ssl,banner,telnet
    port 992
Again, no send and expect.
Updated telnets, and it crapped out at ldap.
[ldap]
    port 389

[ldaps]
    options ssl
    port 636

[rsync]
    expect "@RSYNCD"
    options banner
    port 873
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.

Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
A good win, but still doesn't solve my original problem. :-(
I identified and shared this exact situation in August and September of last year:
http://lists.xymon.com/pipermail/xymon/2014-August/040130.html
  and
http://lists.xymon.com/pipermail/xymon/2014-September/040163.html

My C isn't good enough to offer a correction for the defect, but at least someone had now duplicated the problem. My solution has been not to enable the --debug option on rrd handler....and hope I don't have a problem I need to debug.
quoted from John Thurston
-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Jeremy Laidman · Thu, 12 Mar 2015 10:10:41 +1100 ·
quoted from John Thurston
On 12/03/2015 3:47 am, "John Thurston" <user-ce4d79d99bab@xymon.invalid> wrote:
I identified and shared this exact situation in August and September of
last year:
Oh, how embarrassment. I even contributed to that thread and didn't
remember. I must be getting old!
quoted from John Thurston
My C isn't good enough to offer a correction for the defect, but at least
someone had now duplicated the problem. My solution has been not to enable
the --debug option on rrd handler....and hope I don't have a problem I need
to debug.
OK, I'll work on a patch to stop these core dumps and try to get it into
the mainstream. I don't think we can make any more progress on the original
problem without a working debug mode.

J
list Jeremy Laidman · Thu, 12 Mar 2015 11:51:21 +1100 ·
quoted from Vernon Everett
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at least in
some places, an empty string is equivalent to an undefined string, as the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.
quoted from John Thurston

Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
 Here's my patch.  I'll push this into the dev list for proposed inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n", binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n", binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a GNU-based OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it make
RRD files and log file to a different location.  This shouldn't interfere
with your production Xymon.  Here's one I prepared earlier that works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644,
endpos -1, usedbytes=0, bufleft=1884603

J
list Japheth Cleaver · Fri, 13 Mar 2015 00:24:27 -0700 ·
quoted from Jeremy Laidman
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at least in
some places, an empty string is equivalent to an undefined string, as the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC though.

:-)
quoted from Jeremy Laidman
 Here's my patch.  I'll push this into the dev list for proposed inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n", binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n", binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it make
RRD files and log file to a different location.  This shouldn't interfere
with your production Xymon.  Here's one I prepared earlier that works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the output.


The patch below should fix the immediate issue triggered by debug mode...
letting us move on to the larger oddness. Unfortunately, I have a feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All binary */
list Vernon Everett · Fri, 20 Mar 2015 16:35:25 +0800 ·
Hi all

I was only back at the client today, and unfortunately have not managed to
get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the extra
values, but it's not happening in the test itself, despite the existence of
the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values
are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix, and
we can get more info, but I thought the extra entries in trends, but not in
the test was interesting.

Regards
Vernon
quoted from Japheth Cleaver


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect, because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at least in
some places, an empty string is equivalent to an undefined string, as the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC though.

:-)
quoted from Japheth Cleaver
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it
make
RRD files and log file to a different location.  This shouldn't interfere
with your production Xymon.  Here's one I prepared earlier that works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the output.


The patch below should fix the immediate issue triggered by debug mode...
letting us move on to the larger oddness. Unfortunately, I have a feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All binary
*/

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Fri, 20 Mar 2015 08:43:59 +0000 ·
Vernon

The power status page must refer to a different graph name in graphs.cfg
with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view and
compare the URLs.

J
quoted from Vernon Everett

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi all

I was only back at the client today, and unfortunately have not managed to
get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values
are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix, and
we can get more info, but I thought the extra entries in trends, but not in
the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at least
in
some places, an empty string is equivalent to an undefined string, as
the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC though.

:-)
quoted from Vernon Everett
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd, and
run it manually on the same data channel as the real one, but have it
make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that works
for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the output.


The patch below should fix the immediate issue triggered by debug mode...
letting us move on to the larger oddness. Unfortunately, I have a feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All binary
*/
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Sat, 21 Mar 2015 07:16:38 +0800 ·
Hi Jeremy

That thought occurred to me, but I checked.
There is only one [power] entry in the graphs.cfg file.
And I put it there for this particular test.

Would have made this one too easy if it was that. :-)

Regards
Vernon
quoted from Jeremy Laidman


On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
Vernon

The power status page must refer to a different graph name in graphs.cfg
with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view and
compare the URLs.

J

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi all

I was only back at the client today, and unfortunately have not managed
to get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values
are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix,
and we can get more info, but I thought the extra entries in trends, but
not in the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at least
in
some places, an empty string is equivalent to an undefined string, as
the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC though.

:-)
quoted from Jeremy Laidman
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd,
and
run it manually on the same data channel as the real one, but have it
make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that works
for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the output.


The patch below should fix the immediate issue triggered by debug mode...
letting us move on to the larger oddness. Unfortunately, I have a feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All
binary */
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Jeremy Laidman · Sat, 21 Mar 2015 09:48:08 +0000 ·
So the URLs are different? But both have service=power in the URLs?
quoted from Vernon Everett

On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi Jeremy

That thought occurred to me, but I checked.
There is only one [power] entry in the graphs.cfg file.
And I put it there for this particular test.

Would have made this one too easy if it was that. :-)

Regards
Vernon


On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
Vernon

The power status page must refer to a different graph name in graphs.cfg
with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view and
compare the URLs.

J

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi all

I was only back at the client today, and unfortunately have not managed
to get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values
are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix,
and we can get more info, but I thought the extra entries in trends, but
not in the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at
least in
some places, an empty string is equivalent to an undefined string, as
the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC though.

:-)
quoted from Vernon Everett
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a
GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd,
and
run it manually on the same data channel as the real one, but have it
make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that works
for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04:
'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05:
'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06:
'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the
output.


The patch below should fix the immediate issue triggered by debug
mode...
letting us move on to the larger oddness. Unfortunately, I have a
feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All
binary */
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Sat, 21 Mar 2015 18:31:03 +0800 ·
Very confused now.
In the test graph, showing the history graphs, the URL contains
service=ncv:power
And in the history graphs in the status, it's this.
service=power

That doesn't confuse me as much as what the graphs look like now.
Both the test and the trends graph now contain the spurious values.
Yesterday, they were only in the trends graph.

A need to get that debug output fixed.

Regards
Vernon
quoted from Jeremy Laidman


On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
So the URLs are different? But both have service=power in the URLs?

On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid> wrote:
Hi Jeremy

That thought occurred to me, but I checked.
There is only one [power] entry in the graphs.cfg file.
And I put it there for this particular test.

Would have made this one too easy if it was that. :-)

Regards
Vernon


On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
Vernon

The power status page must refer to a different graph name in graphs.cfg
with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view and
compare the URLs.

J

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi all

I was only back at the client today, and unfortunately have not managed
to get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg
values are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix,
and we can get more info, but I thought the extra entries in trends, but
not in the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at
least in
some places, an empty string is equivalent to an undefined string,
as the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to
define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC
though.

:-)
quoted from Jeremy Laidman
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a
GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of xymond_rrd,
and
run it manually on the same data channel as the real one, but have
it make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that
works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to
solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04:
'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05:
'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06:
'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the
output.


The patch below should fix the immediate issue triggered by debug
mode...
letting us move on to the larger oddness. Unfortunately, I have a
feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I
go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly
from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All
binary */
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Galen Johnson · Sat, 21 Mar 2015 10:40:29 -0400 ·
Stupid question but can't you just set up a test Xymon server and have the
client system point to both servers?  That way you can update the test
instance and do whatever you need to tweak it without impacting prod.

On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
quoted from Vernon Everett
wrote:
Very confused now.
In the test graph, showing the history graphs, the URL contains
service=ncv:power
And in the history graphs in the status, it's this.
service=power

That doesn't confuse me as much as what the graphs look like now.
Both the test and the trends graph now contain the spurious values.
Yesterday, they were only in the trends graph.

A need to get that debug output fixed.

Regards
Vernon


On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
So the URLs are different? But both have service=power in the URLs?

On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi Jeremy

That thought occurred to me, but I checked.
There is only one [power] entry in the graphs.cfg file.
And I put it there for this particular test.

Would have made this one too easy if it was that. :-)

Regards
Vernon


On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
Vernon

The power status page must refer to a different graph name in
graphs.cfg with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view and
compare the URLs.

J

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi all

I was only back at the client today, and unfortunately have not
managed to get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg
values are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid
lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging bug-fix,
and we can get more info, but I thought the extra entries in trends, but
not in the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid>
wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <user-b3f8dacb72c8@xymon.invalid

wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at
least in
some places, an empty string is equivalent to an undefined string,
as the
string length (shown in Sendlen in the debug output) is zero in both
cases.  So until a patch is in place, a work-around might be to
define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC
though.

:-)
quoted from Vernon Everett
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a
GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of
xymond_rrd, and
run it manually on the same data channel as the real one, but have
it make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that
works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to
solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04:
'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05:
'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06:
'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last few
days), I failed to take note of the SPARC-Enterprise-T2000 in the
output.


The patch below should fix the immediate issue triggered by debug
mode...
letting us move on to the larger oddness. Unfortunately, I have a
feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them, I
go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly
from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All
binary */
--
"Accept the challenges so that you can feel the exhilaration of
victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton

list Vernon Everett · Sun, 22 Mar 2015 08:19:35 +0800 ·
I was hoping to get approval to patch the production system, but change
control at $CLIENT is sclerotic on a good day.
Your suggestion is probably the way I will do it.

Back there on Wednesday. Watch this space.

Cheers
quoted from Galen Johnson
Vernon


On 21 March 2015 at 22:40, Galen Johnson <user-fc632e705d24@xymon.invalid> wrote:
Stupid question but can't you just set up a test Xymon server and have the
client system point to both servers?  That way you can update the test
instance and do whatever you need to tweak it without impacting prod.

On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Very confused now.
In the test graph, showing the history graphs, the URL contains
service=ncv:power
And in the history graphs in the status, it's this.
service=power

That doesn't confuse me as much as what the graphs look like now.
Both the test and the trends graph now contain the spurious values.
Yesterday, they were only in the trends graph.

A need to get that debug output fixed.

Regards
Vernon


On 21 March 2015 at 17:48, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
So the URLs are different? But both have service=power in the URLs?

On Sat, 21 Mar 2015 10:16 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi Jeremy

That thought occurred to me, but I checked.
There is only one [power] entry in the graphs.cfg file.
And I put it there for this particular test.

Would have made this one too easy if it was that. :-)

Regards
Vernon


On 20 March 2015 at 16:43, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:
Vernon

The power status page must refer to a different graph name in
graphs.cfg with a different FNPATTERN.

Click on the graphs images for each version to get the 4-graph view
and compare the URLs.

J

On Fri, 20 Mar 2015 19:35 Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Hi all

I was only back at the client today, and unfortunately have not
managed to get that patch in yet.
(As I mentioned before, it's a production system)

However, I did notice something really odd.
I have focused my attention on the trends graphs, where I get all the
extra values, but it's not happening in the test itself, despite the
existence of the additional rrd files.

Example.
I have something that plots the power usage of the PSUs on a NetApp
e-series.
There are 4 PSUs, output looks like this.

Total power drawn- 487 Watts
Number of trays- 2
Tray power input details-

   TRAY ID  POWER SUPPLY SERIAL NUMBER   INPUT POWER
   99       0                            145 Watts
   99       1                            151 Watts
   0        0                            99 Watts
   0        1                            92 Watts

All good. And I have a graph with 4 lines. Min, Max, Curr and Avg
values are all there. It looks beautiful.
But go look at the power graph in trends, and it's ugly.
Heaps of additional data lines with no entries. All values are NaN
And mixed in amongst the additional empty graphs, are the 4 valid
lines.

I look at the rrd files, and they are all there, even the bad ones.
Here's a few of them.
power,tcpListenDrop.rrd
power,tcpOutAck.rrd
power,tcpOutDataSegs.rrd
power,tcpOutRsts.rrd
power,tcpOutUrg.rrd
power,tcpOutWinProbe.rrd
power,tcpRetransSegs.rrd
power,tcpRtoMax.rrd
power,tcpRttUpdate.rrd
power,tcpTimKeepaliveProbe.rrd
power,tcpTimRetransDrop.rrd
power,Tray0_PSU0.rrd                  <--- Valid
power,Tray0_PSU1.rrd                  <--- Valid
power,Tray99_PSU0.rrd                 <--- Valid
power,Tray99_PSU1.rrd                 <--- Valid
power,trlogpool.rrd
power,UDP_udpInDatagrams.rrd
power,udpInCksumErrs.rrd
power,udpOutDatagrams.rrd
power,vnet.rrd

So I thought I would check my configs.
In xymonserver
From TEST2RRD= ,power=ncv,
From GRAPHS=  ,power::9,
And further down
SPLITNCV_power="*:GAUGE"

And in graphs.cfg
[power]
    FNPATTERN power,(.*).rrd
    TITLE Database Power Consumption Per Tray PSU
    YAXIS Watts
    -l 0
    DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

With luck I will get approval to recompile with the debugging
bug-fix, and we can get more info, but I thought the extra entries in
trends, but not in the test was interesting.

Regards
Vernon


On 13 March 2015 at 15:24, J.C. Cleaver <user-87556346d4af@xymon.invalid>
wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <
user-b3f8dacb72c8@xymon.invalid>
wrote:
About now, I am getting a little nervous adding send and expect,
because
unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable.  A read through the code suggests that at
least in
some places, an empty string is equivalent to an undefined string,
as the
string length (shown in Sendlen in the debug output) is zero in
both
cases.  So until a patch is in place, a work-around might be to
define
empty "send" and "expect" strings for those that have none.

Any suggestions?
I think we have some debug code update recommendations for JC
though.

:-)
quoted from Galen Johnson
 Here's my patch.  I'll push this into the dev list for proposed
inclusion
in a future release.

--- lib/netservices.c.orig      2012-07-25 01:48:41.000000000 +1000
+++ lib/netservices.c   2015-03-12 11:18:18.000000000 +1100
@@ -328,9 +328,9 @@
        dbgprintf("Service list dump\n");
        for (i=0; (svcinfo[i].svcname); i++) {
                dbgprintf(" Name      : %s\n", svcinfo[i].svcname);
-               dbgprintf("   Sendtext: %s\n",
binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen));
+               dbgprintf("   Sendtext: %s\n",
svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt,
svcinfo[i].sendlen):"[null]");
                dbgprintf("   Sendlen : %d\n", svcinfo[i].sendlen);
-               dbgprintf("   Exp.text: %s\n",
binview(svcinfo[i].exptext,
svcinfo[i].explen));
+               dbgprintf("   Exp.text: %s\n",
svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext,
svcinfo[i].explen):"[null]");
                dbgprintf("   Exp.len : %d\n", svcinfo[i].explen);
                dbgprintf("   Exp.ofs : %d\n", svcinfo[i].expofs);
                dbgprintf("   Flags   : %d\n", svcinfo[i].flags);

This produces "[null]" where we would have seen "(null)" on a
GNU-based
OS,
to differentiate between the two situations.

In the mean time, you could compile a special version of
xymond_rrd, and
run it manually on the same data channel as the real one, but have
it make
RRD files and log file to a different location.  This shouldn't
interfere
with your production Xymon.  Here's one I prepared earlier that
works for
me:

sudo -u xymon mkdir /tmp/my-rrd-data/
sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp;
/usr/lib/xymon/server/bin/xymond_channel --channel=data
--log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch
--rrddir=/tmp/my-rrd-data/ --debug'

This seems to show some really useful stuff that's relevant to
solving
your
problem.  Some sample debug lines:

15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message
165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
...
15306 2015-03-12 11:36:28 Creating rrd
/tmp/my-rrd-data//servername/vmstat.rrd
15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate'
15306 2015-03-12 11:36:28 RRD create param 01:
'/tmp/my-rrd-data//servername/vmstat.rrd'
15306 2015-03-12 11:36:28 RRD create param 02: '-s'
15306 2015-03-12 11:36:28 RRD create param 03: '300'
15306 2015-03-12 11:36:28 RRD create param 04:
'DS:cpu_r:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 05:
'DS:cpu_b:GAUGE:600:0:U'
15306 2015-03-12 11:36:28 RRD create param 06:
'DS:cpu_w:GAUGE:600:0:U'
...
15306 2015-03-12 11:39:42 Got 265 bytes
15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message
165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos
-1
15306 2015-03-12 11:39:42 Flushing
'/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending,
template
'sec'
15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos
216644,
endpos -1, usedbytes=0, bufleft=1884603

J
This is some excellent sleuthing! :)

As I was pouring through the thread (sorry, I've been out the last
few
days), I failed to take note of the SPARC-Enterprise-T2000 in the
output.


The patch below should fix the immediate issue triggered by debug
mode...
letting us move on to the larger oddness. Unfortunately, I have a
feeling
there are other occasions where we're relying on GNU's printf(NULL)
printing that out and thus might be caught by this. As I find them,
I go
ahead and work to put fixes in.

In the meantime, this will be in 4.3.19 and can be patched directly
from
below.


HTH,

-jc


--- lib/netservices.c   (revision 7598)
+++ lib/netservices.c   (working copy)
@@ -81,9 +81,9 @@
        unsigned char *inp, *outp;
        int i;

-       if (!buf) return NULL;
+       if (result) xfree(result);
+       if (!buf) { result = strdup("[null]"); return result; }

-       if (result) xfree(result);
        if (buf && (buflen == 0)) buflen = strlen(buf);
        result = (char *)malloc(4*buflen + 1);  /* Worst case: All
binary */
--
"Accept the challenges so that you can feel the exhilaration of
victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
--
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
list Vernon Everett · Wed, 25 Mar 2015 14:31:55 +0800 ·
Hi all

Patched and reinstalled.
But it didn't go quite as well as expected.
Lots of shit broke.
Then I remembered it overwrites most of the config files. Which will do
that.
There were a few minutes of panic there, until I remembered it makes
backups of everything. :-)

Copied the backup configs back, and it started to look good again. Whew!

And, it looks like we have debug data.

Now to catch it when it goes wrong.

Regards
Vernon


On 25 March 2015 at 13:07, Jeremy Laidman <user-71895fb2e44c@xymon.invalid> wrote:
On 25 March 2015 at 15:54, Vernon Everett <user-b3f8dacb72c8@xymon.invalid>
wrote:
Ah!
There are two patches for this.
Yours, and JC's.
I have been trying to apply JC's
Attempting yours fails too. :-(
Mine would possibly fail if JC's has already been applied, and vice versa.

Both patches do a similar thing, just at different places in the code
execution path.  Mine catches the null pointer and avoids calling
binview(), whereas JC's catches the null pointer inside binview().  I
suspect JC's patch will end up going into a future release of Xymon, so I'd
probably go with his.
quoted from Vernon Everett

J

-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton