Xymon Mailing List Archive

fping tuning

16 messages in this thread

list Eric E *hs Schwimmer · Wed, 26 Apr 2006 15:24:15 -0400 ·
We're monitoring 1420 IPs in hobbit, and it takes fping
~40 seconds to go through them all:

<snip>
[root at hobbit fping]# fping -i5 -b12 -f ips -r1 -t250 -B2 -q -s
    1430 targets
    1419 alive
      11 unreachable
       0 unknown addresses

      55 timeouts (waiting for response)
    1474 ICMP Echos sent
    1420 ICMP Echo Replies received
       0 other ICMP received

 0.05 ms (min round trip time)
 5.83 ms (avg round trip time)
 281 ms (max round trip time)
       40.704 sec (elapsed real time)
</snip>

Now, this seems a bit lengthy to me.  I mean, if the avg
round trip time is 5.83 ms, and there are 1430 hosts,
shouldn't the total time in transit for all hosts
be 8336 ms, or about 8 seconds... right?  Even when I remove the
hosts that aren't responding, the results are on par with
those above.

Our polling interval is once every 60 seconds (which we want 
to maintain, because we like to know ASAP when something drops
even one ping), so it's not a problem yet. We add hosts on a 
daily basis, however, so it will be a problem some time in 
the future and I'd like to fix it before it becomes a problem.

This machine is a dual 3GHz Xeon w/ 6GB of memory,
running Fedora Core 5.  I've flipped every bit in the kernel
parameters that I dare via sysctl, with little to no effect
on the poll time.  Does anybody out there have any
recommendations on a way to speed this up?

Regards,
-Eric Schwimmer
Network Engineer
UVA HSCS Network Engineering
list Henrik Størner · Wed, 26 Apr 2006 22:14:22 +0200 ·
quoted from Eric E *hs Schwimmer
On Wed, Apr 26, 2006 at 03:24:15PM -0400, Schwimmer, Eric E *HS wrote:
We're monitoring 1420 IPs in hobbit, and it takes fping
~40 seconds to go through them all:
Is that a number you get from the "bbtest" status or from running
fping by hand?

Are you doing other network tests in Hobbit than just ping?
Hobbit does the ping tests in parallel with the other tests.
quoted from Eric E *hs Schwimmer
<snip>
[root at hobbit fping]# fping -i5 -b12 -f ips -r1 -t250 -B2 -q -s
Are you using those parameters also on the FPING command in
hobbitserver.cfg? Or is it just for your testing ?
quoted from Eric E *hs Schwimmer
Now, this seems a bit lengthy to me.  I mean, if the avg
round trip time is 5.83 ms, and there are 1430 hosts, 
should the total time in transit for all hosts should
be 8336ms, or 8 seconds... right?
No, it should be less - because fping pings several hosts in
parallel.

You have "-i5" which causes a 5 ms delay between each ping.
So that's (5/1000)*1430 = 7.15 seconds where it does nothing.
The default setting is "-i25" - i.e. 5 times higher - which
would actually match your ~40 seconds nicely.
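
The arithmetic above can be written as a small sketch. The function name is
illustrative only (it is not part of fping or Hobbit), and the model covers
just the -i gap between packets; it ignores retries, timeouts and lookups.

```python
def idle_seconds(targets: int, interval_ms: float) -> float:
    """Time spent purely on the inter-packet gap for one pass over all targets."""
    return targets * interval_ms / 1000.0


# 1430 targets, as in the fping run quoted above; 25 ms is the default cited above.
for interval_ms in (5, 10, 25):
    print(f"-i{interval_ms}: {idle_seconds(1430, interval_ms):.2f} s")
```

With these inputs the gap alone accounts for 7.15 s at -i5 and 35.75 s at -i25.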

Don't forget that there is probably also some time spent doing 
ARP lookups for all of these IP's. Unless you have "testip"
on all of the entries in bb-hosts (or run bbtest-net with "--dns=ip"), 
you'll also spend some time on DNS lookups (hint: use a local
caching DNS server on the Hobbit server).
quoted from Eric E *hs Schwimmer
Even when I remove the hosts that aren't responding, the 
results on are par with those above.

Our polling interval is once every 60 seconds (which we want 
to maintain, because we like to know ASAP when something drops
even one ping), so it's not a problem yet. We add hosts on a 
daily basis, however, so it will be a problem some time in 
the future and I'd like to fix it before it becomes a problem.
Well, the good news is that it probably won't become a problem.
Because fping pings multiple hosts in parallel, the runtime
doesn't change very much when you add more hosts.

If it does become an issue, spread the load. Set up an extra server
to do half the network tests, and configure your bb-hosts file 
with "NET:net-a" and "NET:net-b" tags on the hosts. Then you
set BBLOCATION="net-a" on one box, and BBLOCATION="net-b" on the
other. Then they'll only test those hosts where the NET:... 
setting matches. Unless it's an OS limitation, you could probably
do that on a single box and just have two instances of the [bbnet]
task in hobbitlaunch.cfg - instead of running bbtest-net directly,
they would run a shell-script which sets the BBLOCATION environment 
just before running bbtest-net.


Regards,
Henrik
list Eric E *hs Schwimmer · Wed, 26 Apr 2006 17:49:45 -0400 ·
quoted from Henrik Størner
We're monitoring 1420 IPs in hobbit, and it takes fping
~40 seconds to go through them all:
Is that a number you get from the "bbtest" status or from running
fping by hand?
Both.  The values are fairly consistent, falling somewhere
in the 39-42 second range.
quoted from Henrik Størner
 
Are you doing other network tests in Hobbit than just ping?
Hobbit does the ping tests in parallel with the other tests.
We are doing other tests, but not many.  Here are the relevant
lines from our server's bbtest report:

TIME SPENT
Event                                            Starttime          Duration
TCP tests completed                      1146086231.293585          1.211963
PING test completed (1434 hosts)         1146086271.488185         40.194600
PING test results sent                   1146086271.523332          0.035147
TIME TOTAL                                                         41.549643
quoted from Henrik Størner
<snip>
[root at hobbit fping]# fping -i5 -b12 -f ips -r1 -t250 -B2 -q -s
Are you using those parameters also on the FPING command in
hobbitserver.cfg? Or is it just for your testing ?
This is just what I've been using for testing (the -f flag is
root only and wouldn't work very well when used from within
hobbit).  The value of my FPING envvar in hobbitserver.cfg
is "/usr/sbin/fping -i10 -b12".  However, the average difference
in polling time between the two is only 1 or 2 seconds.
quoted from Henrik Størner
 
Now, this seems a bit lengthy to me.  I mean, if the avg
round trip time is 5.83 ms, and there are 1430 hosts, 
should the total time in transit for all hosts should
be 8336ms, or 8 seconds... right?
No, it should be less - because fping pings several hosts in
parallel.

You have "-i5" which causes a 5 ms delay between each ping.
So that's (5/1000)*1430 = 7.15 seconds where it does nothing.
The default setting is "-i25" - i.e. 5 times higher - which
would actually match your ~40 seconds nicely.
Using the default delay interval (i.e. not specifying the -i flag
when calling fping) causes the test to take much longer, on
the order of 60 - 70 seconds.  However, values of 15 or less
passed to -i don't make much of a difference in polling time.
(FWIW, fping doesn't let you specify a value for -i less than
10 unless you are root.  I hacked the fping code to get around
this so I could run it under hobbit with -i1, but I saw no
difference in polling times using -i1 vs -i15).
quoted from Henrik Størner

Don't forget that there is probably also some time spent doing 
ARP lookups for all of these IP's. Unless you have "testip"
on all of the entries in bb-hosts (or run bbtest-net with 
"--dns=ip"), you'll also spend some time on DNS lookups 
(hint: use a local caching DNS server on the Hobbit server).
Yep, I have --dns=ip in the bbtest-net stanza of my hobbitlaunch.cfg
(that makes a BIG difference), so I don't think it's a DNS resolution
problem.  In the testing fping command above, the -f flag specifies
a file that lists all the IP addresses from my bb-hosts file,
with no DNS names included, so no lookups are involved there either.
I feel like it's some sort of concurrency issue within fping, since
I can reproduce this latency completely outside of hobbit.

As a complete aside, we run a caching server for things outside of hobbit,
and I've written a little script that monitors the bb-hosts file
(and all files included from bb-hosts), and when it detects any
changes, it writes a bind9 zone file to somewhere on disk.  It's
handy for making sure your bb-hosts is synced with your DNS.  If
anybody is interested in it, drop me a line (I'll have to 'pretty it
up' first) and I'll post it on my hobbit tools page for people to
use.
quoted from Henrik Størner
Even when I remove the hosts that aren't responding, the 
results on are par with those above.

Our polling interval is once every 60 seconds (which we want 
to maintain, because we like to know ASAP when something drops
even one ping), so it's not a problem yet. We add hosts on a 
daily basis, however, so it will be a problem some time in 
the future and I'd like to fix it before it becomes a problem.
Well, the good news is that it probably won't become a problem.
Because fping pings multiple hosts in parallel, the runtime
doesn't change very much when you add more hosts.
Ah, so you would think ;)  However, our graph in our bbtest column
says otherwise;  it has been climbing slowly but steadily since it
started graphing data.  You can also reproduce this by using
a newline delimited list of IP addresses in a file, like I did above,
and feeding it to fping.  As you increase the number of IPs in the
file, the poll time increases geometrically.  For instance,
when I poll 300 hosts:

<snip>
[root at hobbit fpingtest]# fping -i5 -b12 -f 300ips -r1 -t250 -B2 -q -s

     300 targets
     298 alive
       2 unreachable
       0 unknown addresses

      12 timeouts (waiting for response)
     310 ICMP Echos sent
     298 ICMP Echo Replies received
       0 other ICMP received

 0.24 ms (min round trip time)
 2.38 ms (avg round trip time)
 101 ms (max round trip time)
        8.856 sec (elapsed real time)
</snip>

vs when I poll 600 hosts:
<snip>
[root at hobbit fpingtest]# fping -i5 -b12 -f 600ips -r1 -t250 -B2 -q -s

     600 targets
     597 alive
       3 unreachable
       0 unknown addresses

      14 timeouts (waiting for response)
     611 ICMP Echos sent
     597 ICMP Echo Replies received
       0 other ICMP received

 0.21 ms (min round trip time)
 2.48 ms (avg round trip time)
 100 ms (max round trip time)
       16.144 sec (elapsed real time)
</snip>

You can see that the ping time roughly doubles.  This is bad :(
quoted from Henrik Størner
If it does become an issue, spread the load. Setup an extra server
to do half the network tests, and configure your bb-hosts file 
with "NET:net-a" and "NET:net-b" tags on the hosts. Then you
set BBLOCATION="net-a" on one box, and "BBLOCATION=net-b" on the
other. Then they'll only test those hosts where the NET:... 
setting matches. Unless it's an OS limitation, you could probably
do that on a single box and just have two instances of the [bbnet]
task in hobbitlaunch.cfg - instead of running bbtest-net directly,
they would run a shell-script which sets the BBLOCATION environment 
just before running bbtest-net.
I was thinking of doing something along these lines, however the
bb-hosts file is maintained mostly by the (non-unix-savvy) staff here,
using the bb-hostedit CGI script, and I'd rather not have them have to
keep track of which host needed which NET tag, etc.

I've tested the network capabilities of this box using iperf as well
as several concurrent ping floods, and it can send upwards of 10000
ICMP packets per second (with successful replies from another host on
the same 1000bT switch).

So this leads me to believe that it is a problem solely with fping;  if
they had a public forum or a mailing list, I'd be whining there instead
of here. :)  I can't say that I was expecting to find the 'magic bullet'
for this problem here, but I was hoping that there might be some fping
wizard out there with some magic bullets to spare.  Anywho, thanks for your
thoughts, Henrik.  I'll poke some more at the fping code and see if
I can figure out what's going on (I doubt it);  if not, I'll start
working towards hacking together a load balancing script that will
auto-add NET: tags to bb-hosts entries, or something along those lines.
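
A minimal sketch of what such a tagging script could look like, so that
nobody has to assign NET tags by hand. Everything here is an assumption for
illustration: the function name, the round-robin policy, and the simplified
view of bb-hosts (a host entry is "IP hostname", optionally followed by "#"
and tags). It is not an existing Hobbit tool.

```python
import re
from itertools import cycle

# Assumed, simplified shape of a bb-hosts host entry: "IP hostname [# tags]".
HOST_LINE = re.compile(r"^\s*\d{1,3}(\.\d{1,3}){3}\s+\S+")


def add_net_tags(lines, locations=("net-a", "net-b")):
    """Return the lines with a NET:<location> tag on every host entry lacking one."""
    next_location = cycle(locations)
    tagged = []
    for line in lines:
        text = line.rstrip("\n")
        # Pass through non-host lines and hosts that already carry a NET tag.
        if not HOST_LINE.match(text) or "NET:" in text:
            tagged.append(text)
            continue
        separator = " " if "#" in text else " # "
        tagged.append(f"{text.rstrip()}{separator}NET:{next(next_location)}")
    return tagged


sample = [
    "page core Core network",
    "10.0.0.1 router1 # testip",
    "10.0.0.2 router2",
    "10.0.0.3 router3 # NET:net-b",
]
for entry in add_net_tags(sample):
    print(entry)
```

Existing NET tags are left alone, so entries that were assigned by hand keep
their location.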

Thanks,
-Eric
list Greg L Hubbard · Wed, 26 Apr 2006 17:07:22 -0500 ·
Eric,

If it takes twice as long to ping 600 things as it does 300 things,
isn't that to be expected?  After all, you are pinging twice as many
items.  I don't think this is "geometric" but linear.  I don't know how
many parallel paths fping has, but I suspect it is far less than either
300 or 600, so some queuing is going to occur.
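
One way to check this is to divide the elapsed time by the number of targets
for each run posted in this thread. A rough sketch (the helper name is
illustrative; the figures are taken from the fping -s output above):

```python
# targets -> elapsed real time in seconds, from the fping runs in this thread
runs = {300: 8.856, 600: 16.144, 1430: 40.704}


def ms_per_target(targets: int, elapsed_s: float) -> float:
    """Average wall-clock cost per target in milliseconds."""
    return elapsed_s / targets * 1000.0


for targets, elapsed_s in runs.items():
    print(f"{targets:5d} targets: {ms_per_target(targets, elapsed_s):.1f} ms per target")
```

The per-target cost stays between roughly 27 and 30 ms in all three runs,
which is what linear scaling looks like, and is well above the 5 ms gap
requested with -i5.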

But it is hard to calculate, too, because some of the IP addresses do
not respond.  If a host responds, then the pinger is free to move to the
next one in the list.  Otherwise it has to go through the "time out and
retry" dance.  So a non-responsive host could cause 5 or 6 seconds in
delay until the pinger decides it is down and moves on.  Since Fping
runs things in parallel, it is the luck of the draw regarding which
stream might get bogged down.

What is your ping interval?  Pinging 1400 IP's in 40 seconds sounds
pretty good to me -- you have a lot of room for growth.  (Big Brother
can't do this without some help)  That is about 35 IPs per second.
Round down to 30 per second, multiply by 300, and you could possibly
monitor 9,000 IP's with this one server in a five minute window!  After
all, you only need to get to them all in the cycle before circling
around and hitting them again.  Some of the pay ware management systems
actually try to space out their activity through the polling cycle so
they don't hog the network themselves.
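
The capacity estimate is just rate times window. A small sketch (the function
name is illustrative) that also applies the same rate to the 60-second
polling interval mentioned earlier in the thread:

```python
def max_hosts(hosts_per_second: float, window_seconds: float) -> int:
    """Hosts that fit in one polling window at a steady ping rate."""
    return int(hosts_per_second * window_seconds)


print(max_hosts(30, 300))  # five-minute window
print(max_hosts(30, 60))   # 60-second window
```

At 30 hosts per second that is 9000 hosts in five minutes, but only 1800 in
a 60-second cycle.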

GLH
list Henrik Størner · Thu, 27 Apr 2006 07:39:53 +0200 ·
quoted from Greg L Hubbard
On Wed, Apr 26, 2006 at 05:49:45PM -0400, Schwimmer, Eric E *HS wrote:
So this leads me to believe that it is a problem solely with fping;  if
they had a public forum or a mailing list, I'd be whining there instead
of here. :)  I can't say that I was expecting to find the 'magic bullet'
for this problem here, but I was hoping that there might be some fping
wizard out there some magic bullets to spare.  Anywho, thanks for your
thoughts, Henrik.  I'll poke some more at the fping code and see if
I can figure out whats going on (I doubt it)
I haven't spent much time looking at the fping code, so I have no idea
how well it's been optimized. It might be an idea to compile fping 
with profiling enabled (i.e. add the "-g -pg" options to the compile- and
link-flags) and run it through your test. This generates a "gmon.out"
file which you can run through gprof like "gprof fping gmon.out" and
it will tell you how much time is spent in various parts of the code.
"gprof -l ..." will do it on a line-by-line basis.

One thing I learnt from profiling the Hobbit code is that it is very
easy to store things in arrays or linked lists, but it is also very
expensive to search through such lists. So I wonder if fping might be
storing the IP addresses it pings in an array, and scanning through that
array every time it receives a reply.
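
To illustrate the general point (not fping's actual code, which has not been
examined at this stage of the thread), here is a sketch of matching a reply
to its target by scanning a list versus looking it up in a prebuilt index.
All names and addresses are made up for the example.

```python
def find_by_scan(targets, address):
    """Linear search: cost grows with the number of targets, paid on every reply."""
    for position, candidate in enumerate(targets):
        if candidate == address:
            return position
    return -1


def build_index(targets):
    """Build the address -> position map once; each later lookup is a hash lookup."""
    return {address: position for position, address in enumerate(targets)}


targets = [f"10.0.{n // 256}.{n % 256}" for n in range(1430)]
index = build_index(targets)

reply_from = "10.0.5.149"
assert find_by_scan(targets, reply_from) == index[reply_from]
print(index[reply_from])
```

Both approaches give the same answer; the difference is that the scan
repeats its work for every reply, while the index is built once per run.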

I'll have a look at it sometime.


Regards,
Henrik
list Frédéric Mangeant · Thu, 27 Apr 2006 09:21:17 +0200 ·
quoted from Eric E *hs Schwimmer
Schwimmer, Eric E *HS a écrit :
We're monitoring 1420 IPs in hobbit, and it takes fping
~40 seconds to go through them all:

<snip>
[root at hobbit fping]# fping -i5 -b12 -f ips -r1 -t250 -B2 -q -s
    1430 targets
    1419 alive
      11 unreachable
       0 unknown addresses

      55 timeouts (waiting for response)
    1474 ICMP Echos sent
    1420 ICMP Echo Replies received
       0 other ICMP received

 0.05 ms (min round trip time)
 5.83 ms (avg round trip time)
 281 ms (max round trip time)
       40.704 sec (elapsed real time)
</snip>
  
Hi Eric

This won't help you much, but I'm monitoring 1733 hosts with Hobbit, on 
a dual Xeon 3.2 GHz with 4 GB running an up-to-date Gentoo Linux.
Hobbit takes between 15 and 30 seconds to ping 1632 hosts; sudo is used 
to run fping:


TIME SPENT
Event                                            Starttime          Duration
PING test completed (1632 hosts)         1146122348.804056         19.808170


Running fping by hand gives this :


# fping -i5 -b12 -f /tmp/ips.txt -r1 -t250 -B2 -q -s
[...]
30.999 sec (elapsed real time)


Lowering the -i, -r, -t values doesn't change anything...

The funny thing is that Hobbit runs sudo with -Ae, which is way slower 
when I run it by hand...

-- 

Frédéric Mangeant

Steria EDC Sophia-Antipolis
list Henrik Størner · Thu, 27 Apr 2006 14:43:38 +0200 ·
quoted from Henrik Størner
On Thu, Apr 27, 2006 at 07:39:53AM +0200, Henrik Stoerner wrote:
On Wed, Apr 26, 2006 at 05:49:45PM -0400, Schwimmer, Eric E *HS wrote:
So this leads me to believe that it is a problem solely with fping;  if
they had a public forum or a mailing list, I'd be whining there instead
of here. :)  I can't say that I was expecting to find the 'magic bullet'
for this problem here, but I was hoping that there might be some fping
wizard out there some magic bullets to spare.  Anywho, thanks for your
thoughts, Henrik.  I'll poke some more at the fping code and see if
I can figure out whats going on (I doubt it)
I haven't spent much time looking at the fping code, so I have no idea
how well it's been optimized. It might be an idea to compile fping 
with profiling enabled (i.e. add the "-g -pg" options to the compile- and
link-flags) and run it through your test. This generates a "gmon.out"
file which you can run through gprof like "gprof fping gmon.out" and
it will tell you how much time is spent in various parts of the code.
"gprof -l ..." will do it on a line-by-line basis.
I've had a look at the fping sources.

There aren't any really obvious reasons why it should take so long.
If you run it with "time", it also claims that the user- and system-time
are really low (I tried with ~1500 hosts), but the wall-clock time is
like 90 seconds (default options). Which kind of points at the code 
not really doing parallel pings.


I think I'll try some modifications to it over the week-end. If any of
it seems to improve it, I'll let you know.


Regards,
Henrik
list Eric E *hs Schwimmer · Thu, 27 Apr 2006 09:17:24 -0400 ·
quoted from Greg L Hubbard
If it takes twice as long to ping 600 things as it does 300 things,
isn't that to be expected?  After all, you are pinging twice as many
items.  
If you ping all the hosts in parallel (i.e. all ICMP echo requests are
sent near-simultaneously) then your test latency should really only be
limited by the speed of your CPU as well as that of your network, not
by how many hosts you are pinging.
I don't think this is "geometric" but linear.
My bad.  I was never good at math :)
quoted from Greg L Hubbard
But it is hard to calculate, too, because some of the IP addresses do
not respond.  If a host responds, then the pinger is free to 
move to the next one in the list.  Otherwise it has to go through the 
"time out and retry" dance.  So a non-responsive host could cause 5 or
6 seconds in delay until the pinger decides it is down and moves on.  
Since Fping runs things in parallel, it is the luck of the draw 
regarding which stream might get bogged down.
I originally thought this might be part of the problem, so I wrote a
script that went through the fping output and only included IP addresses
that had < 10ms response time, which was over 95% of the addresses from
the original list.  Running fping again on this new list didn't change
the behaviour much.  It was still taking ~35 seconds to poll 1300
devices.
quoted from Greg L Hubbard
What is your ping interval?  Pinging 1400 IP's in 40 seconds sounds
pretty good to me -- you have a lot of room for growth.  (Big Brother
can't do this without some help)  That is about 35 IPs per second.
Round down to 30 per second, multiply by 300, and you could possibly
monitor 9,000 IP's with this one server in a five minute 
window!  After all, you only need to get to them all in the cycle 
before circling around and hitting them again.  Some of the pay ware 
management systems actually try to space out their activity through 
the polling cycle so they don't hog the network themselves.
Our poll interval is 60 seconds; it's true that we do have room for
growth, but at the rate that we are adding devices, we won't be able
to grow much longer without exceeding the poll interval :)  I'm just
hoping for an 'easy' fix to fping that will make everything better.

-Eric
list Eric E *hs Schwimmer · Thu, 27 Apr 2006 09:37:37 -0400 ·

Hey that's pretty nifty.  However, after running gprof on the fping
gmon.out, I get a nice callgraph analysis and such, but all the times
are listed as 0.00.  Weird.

-Eric 
list Eric E *hs Schwimmer · Thu, 27 Apr 2006 09:54:26 -0400 ·
quoted from Frédéric Mangeant
Hi Eric

this won't help you much, but I'm monitoring 1733 hosts with Hobbit, on a dual Xeon 3.2 GHz with 4 Gb running an up-to-date Gentoo Linux.  Hobbit takes between 15 and 30 seconds to ping 1632 hosts;  sudo is used to run fping :


TIME SPENT
Event                                            Starttime          Duration
PING test completed (1632 hosts)         1146122348.804056         19.808170


Running fping by hand gives this :


# fping -i5 -b12 -f /tmp/ips.txt -r1 -t250 -B2 -q -s
[...]
30.999 sec (elapsed real time)


Lowering the -i, -r, -t values doesn't give anything...

The funny thing is that Hobbit runs sudo with -Ae, which is way slower when I run it by hand...
Weird.  Using the -Ae flag doesn't make a difference in time for me,
nor does using sudo (this is testing it by hand, not from within
hobbit).  Still, it seems like you are doing better than we are.
I wonder if this is a Fedora-specific peculiarity?  I'll try
installing fping on my wimpy Arch Linux desktop box and seeing
if I can glean anything conclusive (although the difference
in hardware is going to make this difficult).

Thanks for sharing!
-Eric
list Eric E *hs Schwimmer · Thu, 27 Apr 2006 10:03:30 -0400 ·
quoted from Henrik Størner
I've had a look at the fping sources.

There aren't any really obvious reasons why it should take so long.
If you run it with "time", it also claims that the user- and system-time
are really low (I tried with ~1500 hosts), but the wall-clock time is
like 90 seconds (default options). Which kind of points at the code not really doing parallel pings.


I think I'll try some modifications to it over the week-end. If any of
it seems to improve it, I'll let you know.
I gave the source a quick peruse before sending my initial email, and
my gut was telling me likewise.  Thanks for looking in to it :)

-Eric
list Henrik Størner · Thu, 27 Apr 2006 16:13:38 +0200 ·
quoted from Eric E *hs Schwimmer
On Thu, Apr 27, 2006 at 09:37:37AM -0400, Schwimmer, Eric E *HS wrote:

Hey that's pretty nifty.  However, after running gprof on the fping
gmon.out, I get a nice callgraph analysis and such, but all the times
are listed as 0.00.  Weird.
gprof only measures time spent in user-mode. I think fping spends most
of its time inside the select() system-call waiting for data - that's
why if you run it with "time" you'll see almost no user or system time
spent.
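
The effect is easy to reproduce in isolation. This is a sketch, not fping's
code: it blocks in select() on a socket that never receives data and compares
wall-clock time with CPU time. The function name and the 0.5 s timeout are
arbitrary choices for the example.

```python
import select
import socket
import time


def timed_wait(timeout_s: float):
    """Block in select() until the timeout; return (wall-clock, CPU) seconds."""
    reader, writer = socket.socketpair()
    try:
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        # Nothing is ever written to the pair, so select() waits out the timeout.
        select.select([reader], [], [], timeout_s)
        return time.perf_counter() - wall_start, time.process_time() - cpu_start
    finally:
        reader.close()
        writer.close()


wall, cpu = timed_wait(0.5)
print(f"wall-clock {wall:.3f} s, CPU {cpu:.3f} s")
```

The wall-clock figure comes out near the timeout while the CPU figure stays
close to zero, the same pattern that "time" reports for a process that mostly
waits.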


Henrik
list Henrik Størner · Mon, 1 May 2006 16:37:04 +0200 ·
quoted from Eric E *hs Schwimmer
On Thu, Apr 27, 2006 at 10:03:30AM -0400, Schwimmer, Eric E *HS wrote:
I've had a look at the fping sources.

There aren't any really obvious reasons why it should take so long.
If you run it with "time", it also claims that the user- and 
system-time are really low (I tried with ~1500 hosts), but the 
wall-clock time is like 90 seconds (default options). Which kind of 
points at the code not really doing parallel pings.

I think I'll try some modifications to it over the week-end. If any of
it seems to improve it, I'll let you know.
I gave the source a quick peruse before sending my initial email, and
my gut was telling me likewise.  Thanks for looking into it :)
Just an update on this:

As some have noticed - because it didn't compile - I have cooked up a
"hobbitping" utility to replace fping. On my system, fping took about
90 seconds to do a full sweep of the hosts. hobbitping takes 17 seconds,
of which 15 are a static 3 x 5 second delay while the non-responding
hosts time out.
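
A rough way to see why a parallel sweep is bounded by the timeout of the dead hosts rather than by the number of hosts is the simulation below. It is only a sketch: it sends no ICMP at all, the host names and round-trip times are invented, and the 3 rounds with a fixed timeout are an assumption modelled on the "3 x 5 seconds" figure, not hobbitping's actual logic.

```python
import asyncio
import time

# Simulated round-trip times in seconds; None marks a host that never
# answers. All values are invented for illustration.
SIMULATED_RTT = {
    "host-a": 0.010,
    "host-b": 0.020,
    "host-c": None,
    "host-d": 0.015,
    "host-e": None,
}
TIMEOUT = 0.05  # per-round timeout, scaled down from 5 seconds
ROUNDS = 3      # attempts before a host is declared down


async def probe(host: str) -> bool:
    """Pretend to ping one host: wait for its reply or for the timeout."""
    rtt = SIMULATED_RTT[host]
    if rtt is None or rtt > TIMEOUT:
        await asyncio.sleep(TIMEOUT)
        return False
    await asyncio.sleep(rtt)
    return True


async def probe_with_retries(host: str) -> bool:
    """Retry a host up to ROUNDS times; stop at the first reply."""
    for _ in range(ROUNDS):
        if await probe(host):
            return True
    return False


async def sweep_sequential(hosts):
    """One host after another: total time is the sum of all waits."""
    results = {}
    for host in hosts:
        results[host] = await probe_with_retries(host)
    return results


async def sweep_parallel(hosts):
    """All hosts at once: total time is that of the slowest host."""
    answers = await asyncio.gather(*(probe_with_retries(h) for h in hosts))
    return dict(zip(hosts, answers))


def timed(sweep, hosts):
    """Run a sweep and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(sweep(hosts))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    hosts = list(SIMULATED_RTT)
    for sweep in (sweep_sequential, sweep_parallel):
        results, elapsed = timed(sweep, hosts)
        print(f"{sweep.__name__}: {elapsed:.3f}s {results}")
```

In the parallel case the elapsed time stays near ROUNDS x TIMEOUT no matter how many hosts are added, which is the same shape as 15 of the 17 seconds being the static timeout delay.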

Eric was so kind as to confirm that it works on his system as well.

So it's goodbye fping - hello hobbitping.


Regards,
Henrik
list Greg L Hubbard · Mon, 1 May 2006 09:43:42 -0500 ·
Perhaps it should be "wcping" as in "water-cooled ping"!

GLH
list Frédéric Mangeant · Tue, 02 May 2006 10:00:25 +0200 ·
quoted from Greg L Hubbard
Henrik Stoerner wrote:
On Thu, Apr 27, 2006 at 10:03:30AM -0400, Schwimmer, Eric E *HS wrote:
  

Just an update on this:

As some have noticed - because it didn't compile - I have cooked up a
"hobbitping" utility to replace fping. On my system, fping took about
90 seconds to do a full sweep of the hosts. hobbitping takes 17 seconds,
of which 15 are a static 3 x 5 second delay while the non-responding
hosts time out.

Eric was so kind as to confirm that it works on his system as well.

So it's goodbye fping - hello hobbitping.
  
Hi Henrik

Many thanks for writing hobbitping! It works fine on my test system
(Gentoo Linux x86, glibc 2.3.6, GCC 3.4.6).

Is it / will it be possible to run it on top of a 4.1.2p1 installation?


-- 

Frédéric Mangeant

Steria EDC Sophia-Antipolis
list Henrik Størner · Tue, 2 May 2006 14:34:56 +0200 ·
quoted from Frédéric Mangeant
On Tue, May 02, 2006 at 10:00:25AM +0200, Frédéric Mangeant wrote:
Many thanks for writing hobbitping! It works fine on my test system
(Gentoo Linux x86, glibc 2.3.6, GCC 3.4.6).

Is it / will it be possible to run it on top of a 4.1.2p1 installation?
Shouldn't be a problem, since it's a very simple stand-alone tool.
Just build it - you've already done that - and copy over the hobbitping
binary to your 4.1.2 installation. Then change the FPING setting in
hobbitserver.cfg to run hobbitping instead of fping.

Make sure you have hobbitping installed suid-root.
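
One way to check the suid-root requirement after copying the binary is sketched below. The path in the example is hypothetical, since the install location depends on the local setup, and the helper names are made up for this illustration; the check only reads the file's mode bits and owner.

```python
import os
import stat


def suid_status(path: str):
    """Return (has_suid_bit, owner_uid) for the file at path."""
    st = os.stat(path)
    return bool(st.st_mode & stat.S_ISUID), st.st_uid


def is_suid_root(path: str) -> bool:
    """True if the file has the setuid bit set and is owned by root."""
    has_suid, owner = suid_status(path)
    return has_suid and owner == 0


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever hobbitping was copied.
    binary = "/usr/local/hobbit/server/bin/hobbitping"
    if os.path.exists(binary):
        print(binary, "suid-root:", is_suid_root(binary))
    else:
        print(binary, "not found")
```

If the check reports False, the usual cause is that the ownership or setuid bit was lost when the file was copied from the build directory.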


Regards,
Henrik