Xymon Mailing List Archive

disk graph page limits total file systems

11 messages in this thread

list Schminke_Erik_D · Tue, 17 May 2016 13:26:49 -0500 ·
I'm having some trouble with the disk graph pages for hosts that have
numerous file systems reporting in.  The limit seems to be 85.  There are 5
file systems displayed on each graph and there are 17 graphs max on this
page.  If I view the trends page for that host, all of the missing
filesystems are graphed.  85 seems pretty arbitrary.

I've gone through and cleaned up the RRD files from $XYMON/data/rrd/$HOST
that are "stale".  Cleaning some of these out, which would otherwise not be
listed, permits the filesystems that were trimmed to then be listed on the
graph page.  Still, not all of the filesystems make it on the graphs.

I searched the mail archive, and didn't find a solution.  I found a
discussion of changing the [disk] section to [disk::10] for 10 filesystems
per graph... that prevented the graphs from being generated at all.  My
thinking was, perhaps if I increase the number of filesystems per graph, 17
graphs would be enough for this particular system.

In any event, I'm wondering if this is a bug or a configurable.  It
actually works fine in a previous version of xymon (v4.2.2 works,
v4.3.24 does not).  Thanks!

Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com
list Japheth Cleaver · Tue, 17 May 2016 14:53:37 -0700 ·
Erik,

85 isn't an intentional hard limit here. I've been scanning through the
showgraph code and it seems like the reallocation should be able to
continue as needed (whether not having a hard limit at all is a good idea
is a separate question...). There's a reference to 16 arguments to
rrd_graph, however that's a per-graph value and I don't believe it would
affect the number here.

A couple of next steps:
- Can you increase to disk::6 or 7 and see if there's a point where the
parsing of that number breaks?
- Does it consistently die at the same partition being graphed?
- Are there any errors coming out in /logs/ or the httpd error log, or
core files left?
- Are there any unusual file conditions in that directory?

I'd definitely suggest upgrading to a new version for security purposes,
but I don't think any fixes addressing this area specifically are present.

There were a lot of changes between 4.2.2 and 4.3.x, so it's hard to say
exactly what might be contributing there.


HTH,
-jc
list Schminke_Erik_D · Wed, 18 May 2016 09:28:16 -0500 ·
I could try upgrading.  At this point, it would be relatively easy since
this particular deployment isn't "in production" yet.

I, too, thought 85 was too arbitrary to have been an imposed, hard-coded
limit.  Especially given that the graphs are generated and displayed on the
trends page.

It does not always break at the same partition-- or filesystem.  But it
always breaks at the 85th partition.  If I list the rrd directory for a
host (ls $XYMON/data/rrd/$HOST/disk* | sed 's/[.]rrd//' | sort) the last
filesystem on the last graph will always be the 85th line of output.  If I
delete an rrd file (for a filesystem I really don't care about) and look at
the page again....  85th line.  Every time.
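
The same check can be scripted. The sketch below mirrors the ls | sed | sort
pipeline above and reports which filesystem lands in a given position. The
helper names are hypothetical, and Python's default sort may order names
differently from a locale-aware sort, so treat the result as approximate.

```python
import os

def disk_rrd_names(rrd_dir):
    """Return sorted disk RRD names (without the .rrd suffix) found in rrd_dir."""
    names = [
        entry[:-len(".rrd")]
        for entry in os.listdir(rrd_dir)
        if entry.startswith("disk") and entry.endswith(".rrd")
    ]
    return sorted(names)

def nth_disk_rrd(rrd_dir, n):
    """Return the n-th (1-based) disk RRD name, or None if there are fewer than n."""
    names = disk_rrd_names(rrd_dir)
    return names[n - 1] if 0 < n <= len(names) else None
```

Calling nth_disk_rrd on a host's RRD directory with n=85 should name the
last filesystem shown on the last graph, if the behaviour described here holds.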

I've found no core files and there are no errors to be found in any of the
xymon or httpd logs.  I'm willing to turn on more verbose logging to the
httpd server, just let me know how high you'd like me to turn it up.

I'm also willing to attempt a reconfiguration to show more filesystems per
graph, but I'm not 100% clear on how to go about that.  What I thought was
the correct way to do it was unsuccessful.

From my graphs.cfg:
[disk]
        FNPATTERN ^disk(.*).rrd
        TITLE Disk Utilization
        YAXIS % Full
        DEF:p@RRDIDX@=@RRDFN@:pct:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        -u 100
        -l 0
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n


Do I simply change "disk" to "disk::10"? ...because if I do that, no graphs
for disk are generated... i.e.:

[disk::10]
        FNPATTERN ^disk(.*).rrd
        TITLE Disk Utilization
        YAXIS % Full
        DEF:p@RRDIDX@=@RRDFN@:pct:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        -u 100
        -l 0
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

If I'm doing that wrong, let me know.  Thanks.


Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com


list Schminke_Erik_D · Wed, 18 May 2016 15:13:35 -0500 ·
Update: I upgraded v4.3.24 to v4.3.27.  This had no effect on my problem.
The 85 filesystem "limit" still exists.

Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com


list Schminke_Erik_D · Thu, 19 May 2016 07:44:47 -0500 ·
John,

Xymon graphs the reported data, but not all the graphs show up on the disk
page for a host that goes over 85 filesystems.  The graphs appear on the
"trends" page.  Just not the disk page.

I'm currently at version 4.3.27.  I would need someone familiar with the
code to review the relevant sections to see what differences exist between
these versions.

Thanks,

Erik

Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com


quoted from Schminke_Erik_D
From:	John Palys <user-6ab3068b00ee@xymon.invalid>
To:	user-15513f33c451@xymon.invalid
Date:	05/18/2016 05:17 PM
Subject:	Re: [Xymon] disk graph page limits total file systems


Ed,


We use xymon version 4.3.7 and have 1051 filesystems which graph just
fine.  No changes from base system other than the changes at the tail end
of ../etc/xymonserver.cfg

MAXMSG_STATUS="3145728"
MAXMSG_CLIENT="3145728"
MAXMSG_DATA="3145728"


[root@sagan ~]# df |wc -l
1051
[root@sagan ~]#


--

John Palys
Systems Administrator
School Pathways, Inc.
XXX-XXX-XXXX #2010  Office
list Schminke_Erik_D · Thu, 19 May 2016 08:16:34 -0500 ·
Ron,

Thanks for your reply... I find that unlikely for a number of reasons:

1) I'm seeing nothing in the xymon logs about messages being truncated
2) I can see the entire message on the xymon server, and the file systems
I'm expecting to see graphed are in the message but are not being
graphed... on the disk page
3) The file systems ARE graphed, but the graphs don't appear on the disk
page... all filesystems are appearing on the trends page.


JC (or anyone else)----

Could you suggest to me a way to run the portion that actually generates
that page in a debug mode so I can see what's happening?  As a command on a
terminal, for example?

Additional suggestions and questions are welcome!


Thanks,
Erik


Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com


quoted from Schminke_Erik_D
From:	Ron Cohen <user-f26e06d1e992@xymon.invalid>
To:	user-15513f33c451@xymon.invalid
Date:	05/18/2016 03:59 PM
Subject:	Re: [Xymon] disk graph page limits total file systems


Maybe it's to do with the sending msg size?

list Schminke_Erik_D · Mon, 23 May 2016 14:46:21 -0500 ·
Another update for this topic:

I added 100 file systems to a couple systems to see what would happen with
the graphs.  The target systems were different from the one that spawned
this topic.

When I added 100 file systems to a Linux (RHEL 6.6) system, all file
systems were reported/graphed on the disk page.
When I added 100 file systems to an AIX (v7.1) system, file systems were
truncated, although at a different point.  120 of 132 file systems were
represented.

I don't understand why they're so different.  Comparing the df portions of
the messages from each of the systems does not reveal any obvious
differences.

I'd really appreciate some suggestions for debugging.  What commands can I
run manually that the disk page is running internally?  I've looked at the
code, but haven't quite figured out what's going on in there; my C skills
are rubbish.

Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com
list Japheth Cleaver · Mon, 23 May 2016 20:32:35 -0700 ·

Hi Erik,

This actually helps a great deal, as it implies there's a distinction in
parsing code ... and potentially not an issue on the display side at all
(which I've been poring over with little success).

Can you confirm whether the RRD files themselves are being properly
updated for both the AIX and Linux systems? (It might help to disable
caching in xymond_rrd during this process, if your system has enough spare
I/O capacity.) In theory all partitions that are coming in should have
their .rrd files updated continually, but if there's a parsing issue then
that might explain one aspect of the failure.

Alternatively, can you try adding and removing partition values in the
client report and see if going above and below the 85-partition value
reliably enables the 86th?

It might be helpful to manually edit the xymonclient-$OS.sh script to grep
out (or include additional) lines of the 'df' output.
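
For experimenting offline with a captured 'df' section before touching the
client script, a filter along these lines may help. It is only a sketch: the
function name and the exclusion pattern are hypothetical, and the real client
script is shell, not Python.

```python
import re

def filter_df(df_text, exclude_pattern):
    """Keep the df header line and drop data lines matching exclude_pattern."""
    lines = df_text.splitlines()
    if not lines:
        return ""
    header, rows = lines[0], lines[1:]
    # Drop any data row where the pattern matches anywhere in the line.
    kept = [row for row in rows if not re.search(exclude_pattern, row)]
    return "\n".join([header] + kept)
```

Feeding the filtered text back through the same line counting makes it easier
to step the reported number of partitions up and down around a suspected limit.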

Can you also confirm that the remainder of the client report
(CPU/memory/etc.) is being handled OK, even on the AIX system?


So far I've been unable to duplicate this, but I was primarily testing on
x86_64 Linux VMs.


Regards,
-jc
list Schminke_Erik_D · Wed, 25 May 2016 13:40:43 -0500 ·
JC,

I think I'm starting to see a pattern emerge, and a theory develop, here.
Hope everyone is able to follow this... here goes:

I think there may potentially be a disconnect between how the disk page
determines how many filesystems SHOULD be graphed and the number of RRD
files that are available TO BE graphed.  I think the reason the trends page
seems to work OK is it just graphs all data that it has available without
condition.  It seems like the disk page determines 1) the number of
filesystems to graph and,  2) based on that number, the number of
filesystems per image.  These numbers seem to be determined BEFORE it
generates the HTML that produces the link HREFs and image SRCs.  It then
seems to produce just enough graphs to satisfy the predetermined number,
plus enough to satisfy the predetermined "multiple".

The predetermined number of filesystems seems to come from the number of
filesystems reported in the previous message from the client.   I believe
the assumption was made those numbers should always match.  And for the
most part they do.  It's not every day that sys admins remove filesystems from
their systems.  Until now, it may not have been so easy to spot.  It's a
little more obvious to me, being primarily an AIX administrator.  We have a
daily process that creates an "alt_disk_copy" of our rootvg so that we
always have a hot backup of the OS.  This process causes a lot of transient
filesystems to be created.  Those filesystems get reported and recorded
during the brief window that this process is running.  On closer
examination of my AIX systems, it is not just the one with 85 filesystems
getting truncated... it's all of them.

When I view HTML source on the disk page, I see that just ahead of the HTML
code that displays the graph images and links, there is an HTML comment
line: "<!-- linecount=x -->" Where x equals the number of filesystems that
were reported in the previous message from the client.  (Count number of
lines, excluding header, from [df] section.) I went through each of my
systems, Linux and AIX, and determined that to always be the case.  There
must also be some range at which it determines the number "y" that
determines how many filesystems to display on each graph.  It seems like
that number is x<80, y=4 and x>=80, y=5.  (If y changes to 6 at some point,
I haven't done enough testing to determine where that threshold is.)

The request to showgraph.cgi includes the parameters first=z and count=y.
If there are no more RRD files to graph, it stops and the graph shows fewer
filesystems than the count parameter.  But, if you have a situation where
you have more data available than the predetermined number of filesystems,
it will continue to graph them.

On the system that previously seemed limited to 85 file systems, I modified
the "hobbitclient-$os.sh" script and grepped out a certain number of file
systems.  After doing this, I had 77 filesystems reported.  That number was
reflected in the "linecount=" HTML comment, and I also began seeing 4
filesystems per graph (instead of 5, previously) and 20 graphs being
displayed (instead of 17, previously) for a total of 80 filesystems being
graphed.  It graphed 80 because it still had enough data from RRD files to
round out the last graph.  Also, the filesystems that were grepped out of
the message from the client, were still graphed.

I also went back and checked my Linux systems; the ones where I added 100
filesystems.  On those systems, I created enough filesystems to push past
that 85 filesystem "limit".  Since those all graphed successfully, I had
previously thought that it was the difference between AIX and Linux.  That
no longer seems to be the case.  Now that I have removed all of those test
file systems, and since it's only reporting 10 filesystems, only 10
filesystems are being graphed.  File systems like /, /boot, and /home are
graphed... but the test ones that I removed, are still being graphed, and
filesystems that you would expect to see at the end alphabetically,
(e.g. /usr, /var, /opt, /tmp, etc) are not displayed.

A lot of speculation, I realize that, but the theory seems to fit reality
in all cases.  I haven't examined the code to prove it out since, as I've
said before, my C skills are rubbish.  But if my theory proves to be true,
the suggestion for improvement that I would offer is, make sure that at
least every file system from the most recent message is represented, plus
any additional file systems that might have data available in the time
period requested; between "graph_start" and "graph_end".
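
The theory above can be written down as a small model. This is an
illustration only, not Xymon's code: the function names are hypothetical, the
4-versus-5 threshold is simply the behaviour observed in this thread, and the
0-based "first" offset is an assumption.

```python
import math

def per_graph(linecount):
    """Filesystems per graph, using the thresholds observed in this thread."""
    return 4 if linecount < 80 else 5

def displayed(linecount, rrd_names):
    """Return the RRD names the disk page would show under this theory."""
    count = per_graph(linecount)
    graphs = math.ceil(linecount / count)
    ordered = sorted(rrd_names)
    shown = []
    for g in range(graphs):
        # Each graph asks for a window (first, count) into the sorted RRD
        # list; the window is simply shorter when the list runs out.
        first = g * count
        shown.extend(ordered[first:first + count])
    return shown
```

With 77 reported filesystems and 100 RRD files this model shows 80 entries
across 20 graphs, and with 85 reported it shows 85 across 17, which matches
the numbers described above.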


Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN XXXXX
Phone: (XXX) XXX-XXXX
user-15513f33c451@xymon.invalid | www.hormelfoods.com


list Japheth Cleaver · Sun, 29 May 2016 14:20:27 -0700 ·
Hi Ed,

Apologies for the delay, there've been some RL issues getting in the way
here.


Thank you for the analysis below; I think you're near the issue here.
Looking at lib/htmllog.c:422 et seq, there's even a comment on the
possible issues with the line parsing logic. The storage-of-previous-info
might be a red herring in that ... I'm not seeing a way that actually gets
stored in the first place. On the other hand, the graphs *could* be being
affected by something similar: the HG_WITHOUT_STALE_RRDS value.

The line counting looks like it's "reasonable enough", but I could also
see complications from unusually-named or unusually-wrapped partitions
confusing it about the real number.

I don't have access to an AIX system at the moment, but is there a
POSIX-mode or guaranteed-no-line-wrap option for its 'df' command? If so,
the lack of it in $OS.sh is a problem.


Two other ways to test here:

1) Can you take an existing disk status report and reinject it,
including the HTML comment <!-- linecount=XX --> with the proper number in
XX? Per line 431, that value should be used instead of a figure calculated
at display time. (This seems like something xymond_client.c might/should
include at status-generation time, since we're already going through the
values anyway, but it's not at the moment. Probably should be added.)

2) Secondly, can you add '&nostale' to the RRD graph page loads? That
should ensure that partitions are *always* displayed even if the
underlying RRD file hasn't been updated recently.
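
For the first test, the comment could be generated from a captured 'df'
section roughly as follows. This is a sketch with hypothetical function names;
it counts non-empty lines after the header and does not try to handle
wrapped lines, which is exactly the case that may confuse the real counter.

```python
def df_linecount(df_section):
    """Count non-empty df lines, excluding the header line."""
    lines = [line for line in df_section.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)

def linecount_comment(df_section):
    """Format the HTML comment that carries the line count."""
    return "<!-- linecount=%d -->" % df_linecount(df_section)
```

The resulting string would be added to the status text before reinjecting it;
how the message is resent is left out here.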


HTH,
-jc

list Japheth Cleaver · Sun, 29 May 2016 14:22:50 -0700 ·
And further apologies, Erik, for the wrong name, which I caught right as I
was hitting 'Send'! :/

-jc