Trying to set up split client load over multiple servers.
list Thomas R. Brand
Sorry if this post gets a bit long, I've read thru the man pages and the
archives and have failed to reach understanding and need some help.
I have over 4300 Hobbit 4.2.0+all-patches clients currently reporting to
one server running Xymon 4.3.0-0.beta2. This server is also used for
other workloads, which normally do not produce much of a load but can at
times.
We are adding more clients to this configuration at a rate of about
400/week for a final total of about 7200 clients.
As of this week it seems that my Xymon server is unable to keep up with
the load and I suspect this is due to several reasons: the large number
of clients, file system type (ext3) used for the data/.. directories (,
the problem is.
I am trying to determine my best course to be able to handle the 7200+
clients ...
Some options I've considered:
A) simply split the clients over 2 or more independent Xymon servers
a. easiest to configure
b. lose 'single web page' overview
c. lose combined statistics/reporting
B) Split clients over multiple servers for data gathering/storage
and (how?) use one server to display a bb2.html page which combines all
non green from all 'data-gathering' servers
a. Is this even possible?
C) Some other method/configuration? Anyone on the list running this
many clients?
a. How did you set up your environment?
b. What size/performance/type of system are you using for your
Xymon server?
Some of my current symptoms:
The 'top' command on the server is indicating high I/O load (at times
%iowait > 60%).
The 'procs' column warns/alerts about missing processes; closer
examination shows that the client data received has been truncated,
usually somewhere in the 'ps' output section.
The bb2.html page sometimes shows many 'purple' clients; the clients in
purple change (e.g., a client goes purple for a while, 5-30+ minutes, then
we get another message processed and the client goes green again).
In /var/log/xymon/clientdata.log, I am seeing many (2212 yesterday
alone) messages like:
2010-05-27 11:38:09 hobbitd_client: Got message 55294, expected 55277
2010-05-27 12:08:12 Flushed 7 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 hobbitd_client: Got message 81190, expected 81140
2010-05-27 12:38:36 Flushed 16 stale messages for 0.0.0.0:0
2010-05-27 12:38:37 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 Flushed 26 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 hobbitd_client: Got message 107484, expected 107399
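The excerpt above can be tallied mechanically. The following is a minimal sketch, not part of Xymon: it assumes that the difference between the "Got" and "expected" numbers is the count of messages the worker skipped, and the names `GAP_RE`, `FLUSH_RE` and `summarize` are made up for illustration. The sample lines are copied from the post.

```python
import re

# Log lines copied verbatim from the clientdata.log excerpt in the post.
SAMPLE = """\
2010-05-27 11:38:09 hobbitd_client: Got message 55294, expected 55277
2010-05-27 12:08:12 Flushed 7 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 hobbitd_client: Got message 81190, expected 81140
2010-05-27 12:38:36 Flushed 16 stale messages for 0.0.0.0:0
2010-05-27 12:38:37 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 Flushed 26 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 hobbitd_client: Got message 107484, expected 107399
"""

# Hypothetical helper patterns; they match only the two line shapes shown above.
GAP_RE = re.compile(r"Got message (\d+), expected (\d+)")
FLUSH_RE = re.compile(r"Flushed (\d+) stale messages")


def summarize(lines):
    """Return (list of sequence gaps, total flushed stale messages)."""
    gaps = []
    flushed = 0
    for line in lines:
        gap = GAP_RE.search(line)
        if gap:
            # Assumption: got - expected = messages the worker never saw.
            gaps.append(int(gap.group(1)) - int(gap.group(2)))
            continue
        flush = FLUSH_RE.search(line)
        if flush:
            flushed += int(flush.group(1))
    return gaps, flushed


gaps, flushed = summarize(SAMPLE.splitlines())
print("sequence gaps:", gaps)
print("messages skipped:", sum(gaps))
print("stale messages flushed:", flushed)
```

In this excerpt the "Flushed" counts just before the second and third gap lines add up to those gaps (7 + 43 = 50 and 16 + 43 + 26 = 85), which fits the reading that the flushed messages are the ones the client worker then reports as missing. To check a whole day, the same function could be fed the lines of the real log file.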
The hobbitd page shows:
Statistics for Hobbit daemon
Up since 26-May-2010 14:55:38 (0 days, 22:00:02)
Incoming messages : 16721697
- status : 11146704
- combo : 1122112
Incoming messages/sec : 216 (average last 300 seconds)
The bbtest is taking 225 seconds to complete
PING test completed (4390 hosts)    5005592.360761    203.397989
TIME TOTAL                                            225.391564
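A quick back-of-the-envelope check of these figures, as a sketch only. The message counts, uptime and timings come from the post; the 300-second network test cycle and the linear scaling of ping time with host count are assumptions of mine, not something the post states.

```python
from datetime import timedelta

# Figures taken from the hobbitd statistics and bbtest output in the post.
uptime = timedelta(hours=22, minutes=0, seconds=2)
total_messages = 16_721_697
status_messages = 11_146_704
ping_seconds = 203.397989
bbtest_seconds = 225.391564
current_hosts = 4390
target_hosts = 7200

# Assumption: network tests are scheduled on a 300-second cycle.
CYCLE_SECONDS = 300

# Long-run average message rate since the daemon started.
avg_rate = total_messages / uptime.total_seconds()

# Share of all incoming messages that are plain status messages.
status_share = status_messages / total_messages

# Time left in the assumed cycle after one full bbtest run.
headroom = CYCLE_SECONDS - bbtest_seconds

# Crude assumption: ping time grows linearly with the number of hosts.
projected_ping = ping_seconds * target_hosts / current_hosts

print(f"average rate since start: {avg_rate:.1f} msg/sec")
print(f"status share of messages: {status_share:.1%}")
print(f"headroom in the cycle:    {headroom:.1f} sec")
print(f"projected ping time:      {projected_ping:.1f} sec")
```

The long-run average comes out at about 211 messages/sec, close to the 216 reported for the last 300 seconds, so the load looks sustained rather than a short burst. Under the linear-scaling assumption, the ping test alone would exceed a 300-second cycle at 7200 hosts, which is one more argument for spreading the network tests over more than one machine.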
Hoping the group-mind can help me out,
Thanks,
Tom
Thomas Brand
Disclaimer: 1) all opinions are my own, 2) I may be completely wrong, 3)
my advice is worth at least as much as what you are paying for it, or
your money cheerfully refunded.
CONFIDENTIALITY NOTICE: This communication and any attachments may
contain confidential and/or privileged information for the use of the
designated recipients named above. If you are not the intended
recipient, you are hereby notified that you have received this
communication in error and that any review, disclosure, dissemination,
distribution or copying of it or its contents is prohibited. If you
have received this communication in error, please notify the sender
immediately by telephone and destroy all copies of this communication
and any attachments.
list Josh Luthman
What kind of hardware are you on now?
--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX
“Success is not final, failure is not fatal: it is the courage to
continue that counts.”
--- Winston Churchill
list Olivier Beau
Hi,
> Some of my current symptoms: The 'top' command on the server is
> indicating high I/O load (at times %iowait > 60%).
Try changing the cache time of hobbitd_rrd from 30 minutes to 1 hour.
It's hardcoded in do_rrd.c; change
#define CACHESZ 6
to
#define CACHESZ 12
> The 'procs' column warns/alerts about missing processes; closer
> examination shows that the client data received has been truncated,
> usually somewhere in the 'ps' output section.
Try putting higher values for MAXLINE and MAXMSG* in hobbitserver.cfg.
> The bb2.html page sometimes shows many 'purple' clients; the clients in
> purple change (e.g., a client goes purple for a while, 5-30+ minutes, then
> we get another message processed and the client goes green again).
Any logs on the client itself? (Not being able to report to your
hobbitserver?)
Olivier
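The CACHESZ suggestion can be sanity-checked with a little arithmetic. This is a sketch under an assumption: that CACHESZ counts buffered updates per RRD file and that each file gets one update every 5 minutes. That interval is not stated in the thread, but it is the one that makes 6 correspond to 30 minutes and 12 to 1 hour. The function names are made up for illustration.

```python
# Assumption: each RRD file receives one update per 5-minute reporting cycle.
UPDATE_INTERVAL_SECONDS = 300


def cache_window_minutes(cachesz, interval_seconds=UPDATE_INTERVAL_SECONDS):
    """Minutes of updates held in memory before they are written to disk."""
    return cachesz * interval_seconds / 60


def flushes_per_hour(cachesz, interval_seconds=UPDATE_INTERVAL_SECONDS):
    """How often each RRD file is written per hour for a given cache size."""
    return 3600 / (cachesz * interval_seconds)


for cachesz in (6, 12):
    print(
        f"CACHESZ {cachesz}: "
        f"{cache_window_minutes(cachesz):.0f} min window, "
        f"{flushes_per_hour(cachesz):.0f} write(s) per file per hour"
    )
```

Under that assumption, doubling CACHESZ halves the number of disk writes per RRD file, which is why it may ease the %iowait symptom. The cost is that up to an hour of graph data sits only in memory and would be lost if hobbitd_rrd stops unexpectedly.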