WAN performance/monitoring

5 messages in this thread

list Adam Goryachev · Thu, 05 Jun 2014 10:29:19 +1000 ·

Hi All,

First some background, then sharing some scripts I've written/used, and finally asking for some advice please.

Some time ago I was having a LAN issue (dropped packets), so I wrote a small script to measure and quantify the problem. (If you can't see the problem, you can't fix it, and you can't prove it is fixed afterwards.)

All the script did was use fping to ping a group of IPs once per second; every minute it would log the date/time plus one line for each IP that had one or more dropped packets. This worked nicely for the above purpose, allowing me to easily pinpoint the common machines experiencing the problem, and then eventually solve it.

Now I'd like to extend the script to cover my WAN connections, but I also need more information, and don't want to re-invent the wheel. So, I'm looking for suggestions on how to implement what I need, and/or other products that already do this.

Specifically, I now want to record at least the following data into RRDs for later viewing:
1) Maximum ping time per minute
2) Average ping time per minute
3) Minimum ping time per minute
4) Packet loss per minute

The first three could be handled either by having my script calculate the values and record those three values per minute, or by recording 60 values per minute and letting RRD do the calculation. One thing that does happen is drift: the processing time of my script takes a fraction of a second, so I won't really get a value for every single second. That is probably overkill anyway; if I can get a value for 99% of seconds, I should get a clear picture of my links, their performance, and any issues.
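
A minimal sketch of the first option, calculating the per-minute values in the script itself. It assumes the fping -C summary format of "host : t1 t2 ..." with "-" standing for a lost packet; the function name summarize_fping is hypothetical, and "U" is used as a placeholder when every packet was lost:

```shell
#!/bin/bash
# summarize_fping: read "fping -C N -q" summary lines on stdin, e.g.
#   x.x.x.10 : 1.23 0.98 - 1.10
# and print "host min avg max loss%" for each host.
summarize_fping() {
    awk '$2 == ":" {
        host = $1; n = 0; lost = 0; sum = 0; min = ""; max = ""
        for (i = 3; i <= NF; i++) {
            if ($i == "-") { lost++; continue }   # "-" marks a lost packet
            v = $i + 0; n++; sum += v
            if (min == "" || v < min) min = v
            if (max == "" || v > max) max = v
        }
        if (n + lost == 0) next                   # no samples on this line
        if (n == 0) {                             # every packet was lost
            printf "%s U U U 100.0\n", host
            next
        }
        printf "%s %.2f %.2f %.2f %.1f\n", host, min, sum / n, max, 100 * lost / (n + lost)
    }'
}
```

It could be fed straight from the probe, for example: fping -C 60 -q $HOSTLIST 2>&1 | summarize_fping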

The second part of this question is: what thresholds do you use in xymon to alarm on the above four values? What is acceptable, what is marginal, and what is downright awful? In my case the connections are used for RDP (Windows Remote Desktop).

BTW, the script doesn't currently integrate with xymon, which is still doing its own standard network ping monitoring, but obviously this is a lot more intense/detailed, and I'd like to integrate the results (to get alerting/history/etc).

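This is the current script, which /etc/rc.local starts at boot with "nohup /usr/local/bin/pingmon.sh >> /var/log/pingmon.log":

#!/bin/bash
# Hosts to probe, one per line.
HOSTLIST="x.x.x.10
x.x.y.254
x.x.z.254"

# Ping every host 60 times, then log a timestamp plus one line for each
# host that lost at least one packet.
function doping
{
         START=$(date '+%Y%m%d-%H:%M:%S')
         # fping prints the -C summary on stderr; a lost packet shows as "-".
         # HOSTLIST is left unquoted on purpose so it splits into one argument per host.
         result=$(fping -C 60 -q ${HOSTLIST} 2>&1)

         if echo "${result}" | grep -q -- -
         then
                 # The timestamp contains a "-", so it passes the filter too
                 echo -en "${START}\n${result}" | grep -- -
         else
                 echo "${START}"
         fi
}
while /bin/true
do
         doping >> /var/log/pingmon.log
done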

I also wrote a report generator in perl which was supposed to parse the log file and generate a summary/report. I've attached that script here, but I can't claim that it is bug free. It also hard-codes some business parameters (i.e., business hours/days/etc.); search for XXXX to find most things you will want to change.

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

list Andy Smith · Thu, 05 Jun 2014 07:13:10 +0100 ·

Adam Goryachev wrote:

Hi All,

First some background, then sharing some scripts I've written/used, and finally asking for some advice please.

Some time ago I was having a LAN issue (dropped packets) which I wrote a small script to measure, and quantify the problem. (If you can't see the problem, you can't fix it, and you can't prove it is fixed afterwards).

All the script did was use fping to ping a group of IP's once per second, then every minute it would record a log of the date/time plus one line for each IP that had one or more dropped packets. This worked nicely for the above purpose, allowing me to easily pinpoint the common machines experiencing the problem, and then eventually solve it.

Now I'd like to extend the script to cover my WAN connections, but I also need more information, and don't want to re-invent the wheel. So, I'm looking for suggestions on how to implement what I need, and/or other products that already do this.

Have you looked at smokeping :-

http://oss.oetiker.ch/smokeping/

This has its own presentation and alerting mechanisms, but we have a Xymon extension similar to https://wiki.xymonton.org/doku.php/monitors:bbsmokeping which integrates into a Xymon page so we can manage alerting and history.

-- 
Andy

list Olivier Audry · Thu, 05 Jun 2014 09:44:08 +0200 ·

Hello,

If you use Cisco devices, you can look at IP SLA and use the
following template.

oau


list Adam Goryachev · Fri, 06 Jun 2014 01:55:36 +1000 ·

On 05/06/14 16:13, Andy Smith wrote:

Adam Goryachev wrote:

Hi All,

First some background, then sharing some scripts I've written/used,
and finally asking for some advice please.

Some time ago I was having a LAN issue (dropped packets) which I wrote
a small script to measure, and quantify the problem. (If you can't see
the problem, you can't fix it, and you can't prove it is fixed
afterwards).

All the script did was use fping to ping a group of IP's once per
second, then every minute it would record a log of the date/time plus
one line for each IP that had one or more dropped packets. This worked
nicely for the above purpose, allowing me to easily pinpoint the
common machines experiencing the problem, and then eventually solve it.

Now I'd like to extend the script to cover my WAN connections, but I
also need more information, and don't want to re-invent the wheel. So,
I'm looking for suggestions on how to implement what I need, and/or
other products that already do this.

Have you looked at smokeping :-

http://oss.oetiker.ch/smokeping/

This has its own presentation and alerting mechanisms, but we have a
Xymon extension similar to
https://wiki.xymonton.org/doku.php/monitors:bbsmokeping which integrates
into a Xymon page so we can manage alerting and history.

Thank you for the suggestion, it does look useful. However, similar to
MRTG (and perhaps I haven't looked enough), it gives a great overview but
not a sufficient level of detail to "see" transient errors.

In any case, I've modified my current script (and kept backwards
compatibility for the old log file format). It is definitely a lot
slower, but thanks to an off-list tip from someone it will now start the
test at the beginning of every minute, so processing time "doesn't
matter" as long as overall it completes in less than one minute.

I've also added some very basic xymon integration. I think the following
improvements could be made:
1) Look up the IP address using some xymon tool to get the hostname
2) Ask xymon for a list of IPs to test (perhaps using a new tag in the
bb-hosts file)
3) Use a better method to get the xymon environment, possibly even get
xymon to start the script with xymonlaunch like a normal ext script
4) Tidy up the code/optimise to improve efficiency. I make a few calls
to bc for floating point comparison/calculation, but there is probably a
better solution for this
5) There is probably a better way to configure the red/yellow/green
levels within xymon instead of hardcoding them in the script. I'm not
sure my version of xymon supports all the new features from the current
release (I'm still on Debian stable, which is 4.3.0~beta2.dfsg-9.1; as an
aside, it would be nice if a newer version could be uploaded to testing
for the next release).
6) Use xymon to create the rrd files and graphs of the various values
(max/avg/min/loss). Probably seeing a graph with the first 3, and a
second graph with the loss value would provide a good idea of how well
the link is going.
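
The floating point comparison and the red/yellow/green levels mentioned
above could be sketched as follows, using awk instead of bc. This is only
a sketch: the threshold values are placeholders rather than recommendations,
the test name "wanping" and both function names are hypothetical, and it
assumes the XYMON and XYMSRV environment variables set by current Xymon
releases (older installs used BB and BBDISP instead).

```shell
#!/bin/bash
# Placeholder thresholds, for illustration only; tune them for your links.
WARN_AVG_MS=100
CRIT_AVG_MS=250
WARN_LOSS_PCT=1
CRIT_LOSS_PCT=5

# link_colour AVG_MS LOSS_PCT: print green, yellow or red.
# awk does the floating point comparisons, so no call to bc is needed.
link_colour() {
    awk -v avg="$1" -v loss="$2" \
        -v wa="$WARN_AVG_MS" -v ca="$CRIT_AVG_MS" \
        -v wl="$WARN_LOSS_PCT" -v cl="$CRIT_LOSS_PCT" 'BEGIN {
            if (avg + 0 >= ca + 0 || loss + 0 >= cl + 0) print "red"
            else if (avg + 0 >= wa + 0 || loss + 0 >= wl + 0) print "yellow"
            else print "green"
        }'
}

# send_status HOST MIN AVG MAX LOSS: report one host to the Xymon server.
send_status() {
    colour=$(link_colour "$3" "$5")
    # Xymon expects dots in the host name to be written as commas.
    $XYMON "$XYMSRV" "status $(echo "$1" | tr . ,).wanping $colour $(date)
min=$2 ms avg=$3 ms max=$4 ms loss=$5 %"
}
```

For example, send_status x.x.x.10 0.98 1.10 1.23 25.0 would report red
with these placeholder thresholds, because the loss is above 5%.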

If anyone has any suggestions or ideas, I'd be happy to hear them.
One thing I'm not sure of, but want to achieve, is to keep the
right amount of data so I can go back to the WAN supplier and say "link
X was not performing satisfactorily at time abc" (e.g., latency too high,
or packet loss too high). At the moment the only way I get that is
from the text log files.
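
One way to keep per-minute detail outside the text logs is a standalone
RRD whose first archive stores every one-minute sample unconsolidated.
A minimal sketch, assuming plain rrdtool rather than Xymon's own RRD
handling; the helper name rrd_create_cmd, the data source names and the
retention periods are all hypothetical choices:

```shell
#!/bin/bash
# rrd_create_cmd FILE DAYS: print an rrdtool command that keeps DAYS days of
# raw one-minute samples, plus one year of hourly maxima for the long view.
rrd_create_cmd() {
    rows=$(( $2 * 1440 ))      # 1440 one-minute rows per day
    echo "rrdtool create $1 --step 60" \
         "DS:min:GAUGE:120:0:U DS:avg:GAUGE:120:0:U" \
         "DS:max:GAUGE:120:0:U DS:loss:GAUGE:120:0:100" \
         "RRA:AVERAGE:0.5:1:$rows RRA:MAX:0.5:60:8760"
}
```

Running the printed command creates the file, and each minute's values
could then be stored with something like
rrdtool update wan.rrd N:0.98:1.10:1.23:25.0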

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

list Jeremy Laidman · Fri, 6 Jun 2014 10:27:22 +1000 ·

Whoops, forgot to CC this to the list.  I hate it when that happens.  So in
case it helps someone else, my off-list email is below.

And just for the record, I still reckon Smokeping is the go.  I see no
reason why it wouldn't detect a lot of transient errors, more likely if you
adjust the parameters for step and ping count.  There's no way of
guaranteeing that you would see a transient error, unless you happen to be
sending every packet received by the device you're testing!  Instead you
should be monitoring the error rates on the switch port and the device NIC,
probably using SNMP.

====

Adam


On Thursday, 5 June 2014, Adam Goryachev <user-92fd6827f6ae@xymon.invalid> wrote:

Specifically, I now want to record at least the following data into RRD's
for later viewing:
1) Maximum ping time per minute
2) Average ping time per minute
3) Minimum ping time per minute
4) Packet loss per minute


Seems to me that this is exactly what Smokeping can provide for you.  Have
a look at the demo site, drill down to a single device, and have a look at
the graphs.  eg:
http://oss.oetiker.ch/smokeping-demo/?target=Customers.OP.octopus

One thing that does happen is obviously drift, ie, the processing time of
my script will take a fraction of a second, so I won't really get a value
for every single second


One way to overcome this is to run the probe in background, so that it
doesn't really matter how long it takes (as long as you're not accumulating
processes over time). Like this:

#!/bin/sh
doping() {
  ...
}
while true; do
  SECONDS=`date +%s` # in case not bash
  sleep `expr 60 - $SECONDS % 60`
  doping >> /var/log/pingmon.log &
done

This runs the subroutine doping in the background, but first waits
however long it needs until the clock ticks over for the next minute.  You
would always have the subroutine run at the start of the minute.
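
The wait in the loop above is plain modular arithmetic on the epoch time.
A minimal sketch that makes the calculation easy to check in isolation
(the helper name secs_to_next_minute is hypothetical):

```shell
#!/bin/sh
# secs_to_next_minute EPOCH: seconds from EPOCH until the next minute boundary.
# Exactly on a boundary it returns 60, so the loop never sleeps for 0 seconds.
secs_to_next_minute() {
    echo $(( 60 - $1 % 60 ))
}

# Typical use inside the loop:
#   sleep "$(secs_to_next_minute "$(date +%s)")"
```

For instance, 59 seconds past the minute gives a 1 second wait, and 1
second past the minute gives a 59 second wait.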

Cheers
Jeremy
