Xymon Mailing List Archive

Scaling

12 messages in this thread

list Bruce Ferrell · Fri, 05 Apr 2013 10:57:04 -0700 ·
Hi all,

I've been doing systems monitoring for a very long time now... I was early on with BB, used HP OpenView back in the day, blah blah.

Anyway, recently I've been told that in very large installations (multiple thousands of devices) things like Zabbix are the only thing(s) that will do.

What are the group's thoughts on this?  What ARE the scaling limits of Xymon, and can they be overcome somehow?
list Larry Barber · Fri, 5 Apr 2013 13:54:50 -0500 ·
Well, I'm monitoring ~2000 hosts on a fairly modest box (eight 3 GHz cores, 8 GB
of memory). I'm also running quite a few CPU-intensive scripts on the same
box that could easily be moved to another host if needed. I do the network
testing on separate hosts in each of our major security zones, more for
reliability of the tests than to unload the main Xymon server. The
main server is not operating anywhere near its capacity: it's using less
than 10% of its (physical) memory and the load average tends to stay
around 1. I suspect that the box could handle 5000 hosts without too much
trouble, maybe more.

If you do have scaling problems there are some things you can do, though.
Move things like the network tests to separate hosts. You can also move the
alerting to a different host using xymonproxy. I've found that the limit
you're most likely to hit with Xymon is disk I/O; this can be
helped by moving the data directory to SAN.
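
As an illustration of splitting network tests across pollers: Xymon's hosts.cfg supports a `NET:location` tag, and a xymonnet instance started with the `XYMONNETWORK` environment variable set only tests hosts carrying the matching tag. The sketch below is not part of Xymon; it only mimics that selection on hosts.cfg-style input so you can preview which hosts a given poller would pick up. The function name, addresses, hostnames, and zone names are all made up for the example.

```shell
#!/bin/sh
# Illustrative only: list the hosts in hosts.cfg-style input (stdin) that
# carry the tag NET:<location>. xymonnet does the real selection itself
# when XYMONNETWORK is set; this just previews the result.
hosts_for_network() {
    awk -v tag="NET:$1" '
        $1 !~ /^[0-9]/ || NF < 2 { next }   # skip comments, page/group lines, blanks
        { for (i = 3; i <= NF; i++) if ($i == tag) { print $2; break } }
    '
}

# Demo with hypothetical hosts in two security zones.
hosts_for_network dmz <<'EOF'
10.0.0.1 web1.example.com # NET:dmz http://web1.example.com/
10.0.1.1 db1.example.com # NET:core conn
10.0.0.2 web2.example.com # NET:dmz conn
EOF
```

On a real installation you would feed it your own hosts.cfg, whose location depends on how Xymon was installed.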

Thanks,
Larry Barber
list Olivier Audry · Fri, 05 Apr 2013 22:33:41 +0200 ·
hello

15,000 devices here. For me the key is SSD :)

I plan to monitor 60,000 devices with Xymon. Only network devices.

We'll see the result.

oau
list Bruce White · Wed, 10 Apr 2013 10:51:21 -0500 ·
Over 1000 devices monitored here, and the only real issue is RRD keeping up. I have been told that an SSD for the RRD files will solve this issue.


 
Bruce White
Senior Enterprise Systems Engineer | Phone: X-XXX-XXX-XXXX | Fax: XXX-XXX-XXXX  | user-58f975e8bf9d@xymon.invalid | http://www.fellowes.com/
 
 
 
list Cami Sardinha · Wed, 10 Apr 2013 22:09:21 +0200 ·
quoted from Bruce White
On Wed, Apr 10, 2013 at 5:51 PM, White, Bruce <user-58f975e8bf9d@xymon.invalid> wrote:
Over 1000 devices monitored here, and the only real issue is RRD keeping up. I
have been told that an SSD for the RRD files will solve this issue.

~2000 hosts, and that will double or triple in the next few weeks. I really
don't see any I/O issues in the slightest.
6 x 15k RPM SCSI drives in RAID 5 on a Dell PowerEdge 2950 with 8 GB of
RAM, and the thing is snoring (LA: 0.25)

Regards,
Cami
list Japheth Cleaver · Thu, 11 Apr 2013 17:18:04 -0000 (UTC) ·
We're currently processing ~2K incoming messages a second on a single
xymond instance. This is a pretty beefy box, but it's also handling lots
of other concurrent monitoring tasks that we're slowly moving over to
xymon... including a non-fping-enabled Icinga install >.<

]# xymon localhost "xymondboard test=info fields=hostname" | wc -l
42459

(Not all of those are full hosts; some are application nodes with statuses
being generated server-side out of client-side jvm stats or the like.)
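
Since not every entry is a full host, the number of statuses can be a more telling size measure than the host count. A minimal sketch, assuming the `xymondboard` command is asked for `fields=hostname,testname,color` and returns one `|`-separated line per status; the `tally_colors` helper and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Tally statuses per color from "hostname|testname|color" lines on stdin.
tally_colors() {
    awk -F'|' 'NF >= 3 { count[$3]++ } END { for (c in count) print c, count[c] }' | sort
}

# Against a live server (assumes the xymon client binary is on PATH):
#   xymon localhost "xymondboard fields=hostname,testname,color" | tally_colors

# Self-contained demo with hypothetical statuses.
printf '%s\n' \
    'web1.example.com|conn|green' \
    'web1.example.com|http|red' \
    'db1.example.com|conn|green' | tally_colors
```

Summing the per-color counts gives the total number of statuses xymond is tracking.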


At these levels it's important to ensure you're using whatever NUMA
capabilities your system has properly, since message passing is basically
just shoveling incoming TCP data around within memory. Also, you might
want to tweak net.ipv4.ip_local_port_range and enable
net.ipv4.tcp_tw_reuse and/or net.ipv4.tcp_tw_recycle on Linux to eke more
simultaneous testing out of xymonnet.
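
A hedged sketch for checking these settings before changing anything. Each outbound test connection consumes one ephemeral port, so the size of `net.ipv4.ip_local_port_range` bounds how many can be open at once. Note that `net.ipv4.tcp_tw_recycle` was removed in Linux 4.12 and was known to break clients behind NAT, so the sketch leaves it out. The helper names are made up, and the wider range shown in the comment is only an example value.

```shell
#!/bin/sh
# Number of ports in an inclusive "low high" range, as printed by
# `sysctl -n net.ipv4.ip_local_port_range`.
port_count() {
    echo $(( $2 - $1 + 1 ))
}

# Map a sysctl key to its /proc/sys path (dots become slashes).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Show the current value of each key, if this kernel exposes it.
for key in net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse; do
    path=$(sysctl_path "$key")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    else
        printf '%s is not available on this system\n' "$key"
    fi
done

# Size of the common Linux default range, 32768 to 60999.
port_count 32768 60999

# To change the values (as root; example numbers, not a recommendation):
#   sysctl -w net.ipv4.ip_local_port_range="1024 65535"
#   sysctl -w net.ipv4.tcp_tw_reuse=1
```

Settings applied with `sysctl -w` do not survive a reboot; persist them through your distribution's sysctl configuration once you are happy with the result.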


One of the beauties of Xymon's architecture is the ability to cleanly
disconnect the components... Xymongen can run on some other box,
xymond_locator can be used to send rrd data off somewhere if IO becomes an
issue, xymonnet pollers can be distributed, and xymonproxy can be used as
needed to aggregate and smooth out incoming status reports, etc.

There are lots of different mechanisms for "scaling" efficiently depending
on your particular needs, but I'd bet that on decently modern server
hardware you'll probably want to scale for HA purposes long before you
actually /need/ the additional power.


HTH,

-jc
list Olivier Audry · Thu, 11 Apr 2013 19:29:37 +0200 ·
hello

I'm impressed by your 2k incoming messages. I only get 600, and we have
a lot of gaps in our trends.

I suspect that xymonproxy adds latency to the process, or that our huge,
legacy extra-rrd.pl does.

We don't have load or iowait.

I'm not sure whether it could be a network issue. So if you have an idea :)

oau
list Olivier Audry · Thu, 11 Apr 2013 20:40:14 +0200 ·
hello

Can you give us more information on your NUMA config?

As I understand it, I only see two nodes, one per physical CPU:

numactl --hardware
available: 2 nodes (0-1)
node 0 size: 12097 MB
node 0 free: 594 MB
node 1 size: 12120 MB
node 1 free: 12 MB
node distances:
node   0   1 
  0:  10  20 


even though I have 24 CPUs (multi-core and hyperthreading). Is that correct?

As far as I can see, my two nodes are full. Not good at all, I guess.

My policy is the default one. Perhaps you can advise a specific policy
for a Xymon setup?

 numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 


I'm looking into /proc/pid/numa_maps to find more info.

If you can help, that would be great :)

thx

oau

list Olivier Audry · Thu, 11 Apr 2013 21:40:40 +0200 ·
hello

As I understand it, I should run Xymon on a single node to improve memory
access latency. Right?

I will test this once I find the right command :)

oau

list Japheth Cleaver · Thu, 11 Apr 2013 20:12:40 -0000 (UTC) ·
On Thursday, 11 April 2013 at 20:40 +0200, Olivier AUDRY wrote:
quoted from Olivier Audry
hello

As I understand it, I should run Xymon on a single node to improve memory
access latency. Right?
--snip--
quoted from Olivier Audry
numactl --hardware
available: 2 nodes (0-1)
node 0 size: 12097 MB
node 0 free: 594 MB
node 1 size: 12120 MB
node 1 free: 12 MB
node distances:
node   0   1
  0:  10  20


even though I have 24 CPUs (multi-core and hyperthreading). Is that correct?
That seems odd; almost like hyperthreading is disabled? You should see
"node 0 cpus: ..." above each size. I'm running RHEL 6.4; it's possible
things have changed in that output over time if you're on a different
system.
quoted from Olivier Audry

As far as I can see, my two nodes are full. Not good at all, I guess.

My policy is the default one. Perhaps you can advise a specific policy
for a Xymon setup?

 numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1
Generally speaking, yeah, use numactl in front of xymonlaunch to ensure
the entire process tree gets assigned to a single node. But it really
depends on your workload (can everything fit in that node?) and what else
is going on on the box. If you have something which analyzes xymondata in
a large dump, then does heavy munging on it and sends it back, it might be
better to have that on a different node than (say) the xymond_* worker
modules.

'numastat -s -z -p xymon' is your friend

The RH Performance Tuning and Resource Management guides are definitely
useful reading as well. I'm sure there's plenty of cgroup stuff that could
be helpful if/when the time came, but there are only so many hours in the
day and there's other low-hanging fruit at the moment :)

I'd definitely start with running the 'numad' service and seeing what it
does over time; it really could be all that you need.
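
A rough sketch of the "can everything fit in that node?" check, using the `numactl --hardware` figures quoted earlier in this thread. The `numa_free_summary` helper is made up for illustration, and the xymonlaunch path and node number in the trailing comment are assumptions rather than a recommended setup.

```shell
#!/bin/sh
# Summarise per-node free memory from `numactl --hardware` output on stdin.
numa_free_summary() {
    awk '
        $1 == "node" && $3 == "size:" { size[$2] = $4 }
        $1 == "node" && $3 == "free:" && size[$2] > 0 {
            printf "node %s: %d of %d MB free (%d%%)\n", $2, $4, size[$2], int($4 * 100 / size[$2])
        }
    '
}

# Demo using the figures quoted in this thread.
numa_free_summary <<'EOF'
available: 2 nodes (0-1)
node 0 size: 12097 MB
node 0 free: 594 MB
node 1 size: 12120 MB
node 1 free: 12 MB
EOF

# On a live system:
#   numactl --hardware | numa_free_summary
#
# Pinning the whole Xymon process tree to one node could then look like
# this (hypothetical config path; pick the node after checking free memory):
#   numactl --cpunodebind=0 --membind=0 xymonlaunch --config=/etc/xymon/tasks.cfg
```

With both nodes nearly full, as in the quoted output, a strict `--membind` may be too aggressive; `numactl --preferred=0` is a softer option that still allows allocations to fall back to the other node.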

HTH,

-jc
list Olivier Audry · Thu, 11 Apr 2013 22:23:15 +0200 ·
Great, many thanks for your time. I will check this.
quoted from Japheth Cleaver
but there are only so many hours in the
day and there's other low-hanging fruit at the moment :)
so true :)
list Sean Clark · Tue, 16 Apr 2013 11:04:15 -0400 ·
[Sorry to respond so late, I am catching up on emails]


I monitor about 43,000 devices split across 8 instances.
It runs on ancient hardware: Sun X4200s with 2 CPUs and 8 GB RAM.

I split the RRDs off to a different host, and xymongen and the histfiles are
handled outside of stock Xymon.

The only issue I have run into (which I suspect will be fixed by beefier
hardware) is that once I get to around 5,000 hosts, if Xymon crashes, the
IPC/shared memory does not clean up right away, and it goes into a
continual restart loop. Henrik posted to the list earlier a way to
restart that kills all those things, so I haven't had issues since (still
tracking down what causes the crash).