server fails to receive all of client message

5 messages in this thread

list Adam Goryachev · Mon, 05 May 2008 13:31:41 +1000 ·

I have a hobbit server which has been running for a long time quite
nicely. Recently, I noticed it was consuming approx 8G worth of data per
month (from all the remote clients reporting their data). This was
costing quite a bit of money (we pay per MB), so I modified all the
clients to report using a different IP (and hence a provider which has much
cheaper rates).

Anyway, the problem is that since approximately then, a number of client
reports have not been completely received. Sometimes some of the ps output is
truncated, sometimes the ports section is truncated, etc. This leads to
false positive alerts (i.e., procs goes red because some monitored procs
appear not to be running, since they were listed after the truncation point).

I've increased the timeout on hobbitd (--timeout=60) but this
doesn't seem to have helped. The only common factors among the clients
which have this problem are:

1) Most of them are running bbproxy and passing status messages from a
number of clients.
2) The rest of them are on very slow connections, or frequently very
busy connections.

Around the same time I actually 'fixed' bbproxy on the remote sites;
prior to this the clients were reporting directly to both hobbit servers.

I've looked for an option to stop bbproxy from 'caching and combining'
multiple clients into a single connection, but this doesn't seem to be
possible. I don't seem to get any logs/alerts from hobbit when this happens.

Can anyone suggest where I should look, what I can do to try and resolve
this?

(My main problem is that I've started ignoring the late night SMS
notifications, and I'm sure I will end up missing something important
because of that).

Running hobbit version 4.2.0 from package 4.2.0-1 on the server.

Thanks,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

list Adam Goryachev · Tue, 06 May 2008 01:46:36 +1000 ·

▸ quoted from Adam Goryachev

I have made some 'progress' of sorts.

I've increased the MAX values as I was getting some "Oversize ...
truncated" messages in my log file. I then went home thinking "Great, I
managed to solve this one thing today at least". Except that I started
getting messages again a few hours later.

So after further investigation, I've decided I really can't work out
what is happening, and why it isn't working. I've enabled debug output
from bbproxy, but I don't really know what it all means.

I can see that if I set bbproxy to only forward messages to 127.0.0.1
the local hobbit server gets all the data correctly. If I add the remote
server, then some things don't work properly. Since it is likely all a
big jumbled mess by now, I'll post a few sections of config files, and
hopefully someone will notice my stupid mistake (or multiple mistakes)...

I have a network 10.x.x.x which has a hobbit server at 10.30.10.9. All
client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are
Windows PCs using the BB Windows client, one is a Linux hobbit-client,
and of course 10.30.10.9 is itself a hobbit client, plus a couple of old ext
scripts using the old BB env). I think all this is working fine, since
nothing goes randomly purple/red.

10.30.10.9 is behind NAT but has complete access to the internet.

I have a remote server behind a NAT router which has port 1984
forwarded to it. It is receiving reports from around 20 other hobbit
client machines perfectly, so I don't suspect the NAT router/hobbit
config itself.

Some config from 10.30.10.9:

hobbitserver.cfg:
BBSERVERIP="127.0.0.1"
BBDISP="127.0.0.1"
BBDISPLAYS=""
MAXLINE="32768"

hobbitclient.cfg
BBDISP="10.30.10.9"
BBDISPLAYS=""
BB="$BBHOME/bin/bb --debug --timeout=60"
MAXLINE="32768"

hobbitlaunch.cfg
[hobbitd]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
--restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
--checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
--admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs
--listen=127.0.0.1


[bbproxy]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD $BBHOME/bin/bbproxy --hobbitd
--bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9
--report=$MACHINE.bbproxy --no-daemon --timeout=30
--pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
        CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1
--listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30
--pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
        LOGFILE $BBSERVERLOGS/bbproxy.log

[hobbitclient]
        ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
        NEEDS hobbitd
        CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
        LOGFILE $BBSERVERLOGS/hobbitclient.log
        INTERVAL 5m


On the remote hobbit server with the public IP I have:
hobbitserver.cfg
BBSERVERIP="192.168.2.6"
BBDISP="192.168.2.6"
BBDISPLAYS=""
MAXLINE="32768"
MAXMSG_STATUS="1024"
MAXMSG_CLIENT="1024"
MAXMSG_DATA="512"

hobbitlaunch.cfg
[hobbitd]
        HEARTBEAT
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
--restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
--checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
--admin-senders=127.0.0.1,$BBSERVERIP
--maint-senders=127.0.0.1,$BBSERVERIP -www-senders=127.0.0.1,$BBSERVERIP
--store-clientlogs=!msgs --timeout=60

Any suggestions as to what is going wrong would be really appreciated.

BTW, bbnet tests from the 10.30.10.9 host are not submitted to the
bbproxy at all because of the BBDISP setting in the hobbitserver.cfg,
but if I change this to point to 10.30.10.9 then it seems to break the
web interface. I'm not really too concerned about this right now though....

Thanks for any tips/pointers/etc

Regards,
Adam


list Adam Goryachev · Fri, 09 May 2008 15:22:20 +1000 ·

▸ quoted from Adam Goryachev

I've made some more possible progress, but I still don't really know how to
approach this problem or try to solve it.

Basically, I used tcpdump to capture all traffic sent to port 1984 on my
local server. I then used wireshark to analyse the capture and find the
specific stream of packets that led to hobbit raising a red alert due
to a truncated client report.

It now seems to point toward some sort of transport 'problem' in that I
get a number of 'errors' such as "TCP Previous segment lost" and "TCP
Dup ACK" and "TCP Retransmission" and the final packet is a "RST" which
I assume is when you would normally get a "Connection reset by peer"
type error.

I would love to publish the trace, but don't know how to obfuscate its
contents to conceal some of the details (i.e., the contents of the hobbit
client status that was being reported).

However, I do have the following questions:
1) If the connection died due to an error, why does hobbit still use the
contents of what it received? (Is it a case of it being better to know half
the information than none, or can hobbit not tell the difference between a
connection closed due to an error and a connection closed at the end of
the transfer?)

2) From what I know, TCP is meant to be fairly robust in the face of
lost packets and other errors. The fact that I am seeing these sorts of
failures makes me worry that my network is unhappy in some way. Yet,
from a user experience point of view, everything seems to be working
perfectly (web browsing, ssh connections, etc.).
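
On question 1, here is a minimal Python sketch of what a receiver can and cannot tell at the socket level. It is not hobbitd's code; it assumes a protocol in which the sender marks the end of a message simply by closing the connection. The function name read_message and the sample status text are made up for illustration.

```python
import socket


def read_message(conn, bufsize=4096):
    """Read from conn until the peer closes; return (data, clean_close)."""
    chunks = []
    while True:
        try:
            chunk = conn.recv(bufsize)
        except (ConnectionResetError, socket.timeout):
            # The connection was reset (RST) or went quiet for too long:
            # whatever arrived so far may be only part of the message.
            return b"".join(chunks), False
        if not chunk:
            # recv() returning b"" means the peer closed normally (FIN).
            return b"".join(chunks), True
        chunks.append(chunk)


def demo_clean_close():
    """Send a short (hypothetical) message over a local socket pair."""
    sender, receiver = socket.socketpair()
    sender.sendall(b"status host.procs green\n")
    sender.close()
    try:
        return read_message(receiver)
    finally:
        receiver.close()
```

A reset can be told apart from a normal close this way, but a normal close does not prove the message is complete: a sender that gives up part-way and closes its socket normally looks exactly like a sender that finished. Whether buffered data is still delivered before a reset is reported also varies between operating systems.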

BTW, the network connection is quite busy at the times when these
errors happen, because remote backups are running then. Could
that be the cause of the problem?

Any comments, suggestions, etc, would be greatly appreciated.

Regards,

Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

list Rodolfo Pilas · Thu, 22 May 2008 11:37:05 -0300 ·

Adam, take a look at:

http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22

▸ quoted from Adam Goryachev



list Adam Goryachev · Tue, 16 Dec 2008 16:54:03 +1100 ·

▸ quoted from Adam Goryachev

Rodolfo Pilas wrote:

Adam, take a look at:

http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22

I've checked that, and it does not seem to be the problem.

All the original info is below. As a quick recap: my hobbit server doesn't
receive the complete client data, and so procs (and sometimes ports) go
red (since the data is missing, the procs are not found).

Currently, the bbproxy server (which is running a hobbit on 127.0.0.1)
shows a green for procs for itself (the bbclient running locally) which
means the bbclient passed a message to 10.30.10.9 (bbproxy) which passed
the message to 127.0.0.1 (hobbit) which displayed it correctly.

The same bbproxy failed to send that message to the remote hobbit
server, which has the following as the "Client Data" when I click at the
bottom of the red procs page:

client mail,servername,com,au.linux linux
[date]
Tue Dec 16 15:56:14 EST 2008
[uname]
Linux mail 2.6.18-6-686 i686
[osversion]
Debian 4.0
[uptime]
 15:56:14 up 11 days,  6:53,  1 user,  load average: 0.22, 0.39, 0.43
[who]
userag   pts/0        Dec 16 13:29 (123.123.123.12.static.net.au)
[df]
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/md0               9775120   7792092   1983028      80% /
/dev/md1             146002196  23881268 122120928      17% /home
[mount]
/dev/md0 on / type reiserfs (rw,notail)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /home type reiserfs (rw)
//ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro)
//ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw)
[free]
             total       used       free     shared    buffers     cached
Mem:       3112548    2735760     376788          0     451828    1188956
-/+ buffers/cache:    1094976    2017572
Swap:      1012072         88    1011984
[ifconfig]
eth0      Link encap:Ethernet  HWaddr 00:13:20:5F:EC:F3
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:795753357 (758.8 MiB)  TX bytes:4174004073 (3.8 GiB)

eth1      Link encap:Ethernet  HWaddr 00:09:5B:1A:16:26
          inet addr:10.30.10.9  Bcast:10.30.15.255  Mask:255.255.240.0
          inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7
          collisions:0 txqueuelen:1000
          RX bytes:771706320 (735.9 MiB)  TX bytes:2584475027 (2.4 GiB)
          Interrupt:74 Base address:0xc000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3978669335 (3.7 GiB)  TX bytes:3978669335 (3.7 GiB)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:10.30.99.1  P-t-P:10.30.99.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:767727 errors:0 dropped:0 overruns:0 frame:0
          TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:57714836 (55.0 MiB)  TX bytes:210654878 (200.8 MiB)

[route]
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.30.99.2      0.0.0.0         255.255.255.255 UH        0 0          0 tun0
10.30.99.0      10.30.99.2      255.255.255.0   UG        0 0          0 tun0
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
10.30.0.0       0.0.0.0         255.255.240.0   U         0 0          0 eth1
10.30.0.0       10.30.10.254    255.255.0.0     UG        0 0          0 eth1
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0 eth0
[netstat]
Ip:
    40747731 total packets received
    138429 with invalid addresses
    1633704 forwarded
    1 with unknown protocol
    0 incoming packets discarded
    38950970 incoming packets delivered
    50359204 requests sent out
    15 outgoing packets dropped
    1 fragments dropped after timeout
    1831 reassemblies required
    915 packets reassembled ok
    1 packet reassembles failed
Icmp:
    1657334 ICMP messages received
    3902 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 1566573
        timeout in transit: 54
        redirects: 123
        echo requests: 26880
        echo replies: 61402
    139635 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 112755
        echo replies: 26880
Tcp:
    1669446 active connections openings
    1247657 passive connection openings
    361872 failed connection attempts
    132167 connection resets received
    96 connections established
    33096061 segments received
    40931113 segments send out
    715004 segments retransmited
    0 bad segments received.
    125842 resets sent
Udp:
    4173585 packets received
    23298 packets to unknown port received.
    62 packet receive errors
    6868647 packets sent
TcpExt:
    482 resets received for embryonic SYN_RECV sockets
    268 packets pruned from receive queue because of socket buffer overrun
    14 ICMP packets dropped because they were out-of-window
    1084404 TCP sockets finished time wait in fast timer
    2236 time wait sockets recycled by time stamp
    354 packets rejects in established connections because of timestamp
    366541 delayed acks sent
    179 delayed acks further delayed because of locked socket
    Quick ack mode was activated 111492 times
    2510467 packets directly queued to recvmsg prequeue.
    3003729 of bytes directly received from backlog
    380000212 of bytes directly received from prequeue
    7194390 packet headers predicted
    724319 packets header predicted and directly queued to user
    8998070 acknowledgments not containing data received
    5065326 predicted acknowledgments
    1472 times recovered from packet loss due to fast retransmit
    42465 times recovered from packet loss due to SACK data
    161 bad SACKs received
    Detected reordering 165 times using FACK
    Detected reordering 181 times using SACK
    Detected reordering 444 times using reno fast retransmit
    Detected reordering 3197 times using time stamp
    1530 congestion windows fully recovered
    17437 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 155
    73943 congestion windows recovered after partial ack
    29880 TCP data loss events
    TCPLostRetransmit: 42
    541 timeouts after reno fast retransmit
    60499 timeouts after SACK recovery
    11423 timeouts in loss state
    84462 fast retransmits
    6777 forward retransmits
    45329 retransmits in slow start
    261506 other TCP timeouts
    TCPRenoRecoveryFail: 203
    14386 sack retransmits failed
    2738 times receiver scheduled too late for direct processing
    13320 packets collapsed in receive queue due to low socket buffer
    89544 DSACKs sent for old packets
    1294 DSACKs sent for out of order packets
    21665 DSACKs received
    446 DSACKs for out of order packets received
    34120 connections reset due to unexpected data
    3110 connections reset due to early user close
    34312 connections aborted due to timeout
[ports]
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:20000           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:1984          0.0.0.0:*               LISTEN
tcp        0      0 10.30.10.9:1984         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:20002           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:55555           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:37              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:389             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21000           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:873             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:9               0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:139             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:13              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:783           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.30.10.9:8080         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:113             0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN
tcp        0      0 10.30.10.9:1080         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5432            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:445             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:14238           0.0.0.0:*               LISTEN
tcp        0      0 192.168.1.2:56464       123.123.32.80:80        ESTABLISHED
tcp        0      0 192.168.1.2:51389       123.123.176.189:80      ESTABLISHED
tcp        0      0 192.168.1.2:52865       123.123.28.123:80       ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1028        TIME_WAIT
tcp        0      0 192.168.1.2:40408       123.213.88.51:80        ESTABLISHED
tcp        0      0 192.168.1.2:53207       123.123.50.28:443       ESTABLISHED
tcp        0      0 10.30.10.9:33905        10.30.10.1:139          ESTABLISHED
tcp        0      0 192.168.1.2:49885       123.123.28.124:80       ESTABLISHED
tcp        0      0 192.168.1.2:53071       123.123.176.176:80      ESTABLISHED
tcp        0      0 127.0.0.1:5432          127.0.0.1:53017         ESTABLISHED
tcp        0      0 192.168.1.2:25          123.123.132.183:57226   TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2826        ESTABLISHED
tcp        0      0 127.0.0.1:54646         127.0.0.1:1984          TIME_WAIT
tcp        0      0 127.0.0.1:54645         127.0.0.1:1984          TIME_WAIT
tcp        0      0 127.0.0.1:54655         127.0.0.1:1984          TIME_WAIT
tcp        0      0 127.0.0.1:54639         127.0.0.1:1984          TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2829        ESTABLISHED
tcp        0      0 127.0.0.1:54664         127.0.0.1:1984          TIME_WAIT
tcp        0      0 127.0.0.1:54665         127.0.0.1:1984          TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2830        ESTABLISHED
tcp        0      0 192.168.1.2:41020       123.123.38.94:80        ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1050        ESTABLISHED
tcp        0      0 192.168.1.2:35851       123.123.88.59:80        TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.16:1051        ESTABLISHED
tcp        0      0 192.168.1.2:47514       123.123.88.51:80        ESTABLISHED
tcp        0      0 192.168.1.2:37393       123.123.176.45:443      ESTABLISHED
tcp        0      0 10.30.10.9:49509        10.30.10.9:995          ESTABLISHED
tcp        0      0 192.168.1.2:38569       123.123.15.124:80       ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1068        ESTABLISHED
tcp        0   1640 192.168.1.2:40677       123.213.176.45:443      ESTABLISHED
tcp        0      1 192.168.1.2:59642       123.123.176.176:80      LAST_ACK
tcp        0      0 10.30.10.9:8080         10.30.10.16:1066        ESTABLISHED
tcp        0      0 192.168.1.2:46183       123.123.88.51:80        ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1067        ESTABLISHED
tcp        0   1640 192.168.1.2:32966       123.123.176.45:443      ESTABLISHED
tcp        0      0 192.168.1.2:41246       123.123.88.56:80        ESTABLISHED
tcp        0      0 192.168.1.2:35872       123.123.38.94:80        TIME_WAIT
tcp        0      0 10.30.10.9:139          10.30.10.1:3133         ESTABLISHED
tcp        0      0 10.30.10.9:47825        10.30.10.1:139          TIME_WAIT
tcp        0      0 10.30.10.9:47828        10.30.10.1:139          TIME_WAIT
tcp        0      0 192.168.1.2:33758       123.123.38.94:80        TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.17:3132        ESTABLISHED
tcp

Note, this is clearly truncated mid-line/mid-report!
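
A rough way to spot this kind of truncation automatically is to check which [section] headers made it into the report. The sketch below is only an illustration in Python, not part of hobbit: the function names are invented, and the list of expected sections is an assumption based on the headers visible in the report above plus [ps], which the procs test depends on.

```python
import re

# Assumed list of sections a complete report should contain; adjust per client.
EXPECTED_SECTIONS = ["date", "uname", "uptime", "df", "free",
                     "netstat", "ports", "ps"]

# A section header is a line consisting only of a name in square brackets.
SECTION_RE = re.compile(r"^\[([A-Za-z0-9_:/.-]+)\]\s*$")


def sections_seen(report):
    """Return the section names found in a client report, in order."""
    names = []
    for line in report.splitlines():
        match = SECTION_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def missing_sections(report, expected=EXPECTED_SECTIONS):
    """Return the expected section names that never appear in the report."""
    seen = set(sections_seen(report))
    return [name for name in expected if name not in seen]


def looks_truncated(report, expected=EXPECTED_SECTIONS):
    """True if at least one expected section is absent."""
    return bool(missing_sections(report, expected))
```

For the report above, which stops inside [ports], missing_sections would list ps, since no [ps] header was ever received. A report cut in the middle of its final section would still pass this check, so it only catches the coarser cases.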

So the procs test on the remote hobbit shows red, sends alerts, etc.
The interesting thing to note this time is that the actual procs report
showed that most of the procs were found, along with some (truncated) ps
output data. This means hobbit actually had more data than is
displayed on the Client Data page.
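
As a separate, rough check on link quality, the Tcp counters in the [netstat] section of the report above can be turned into a retransmission ratio. The Python sketch below only does the arithmetic; the helper names are invented, and the counters are host-wide totals since boot across all interfaces, so the result says nothing about any single connection.

```python
import re

# Three lines copied from the [netstat] Tcp section above (spelling as printed).
NETSTAT_EXCERPT = """\
    33096061 segments received
    40931113 segments send out
    715004 segments retransmited
"""


def tcp_counter(text, label):
    """Return the integer that precedes the given label on its own line."""
    pattern = r"^\s*(\d+) " + re.escape(label) + r"\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError("counter not found: " + label)
    return int(match.group(1))


def retransmit_ratio(text):
    """Fraction of sent TCP segments that were retransmissions."""
    sent = tcp_counter(text, "segments send out")
    retransmitted = tcp_counter(text, "segments retransmited")
    return retransmitted / sent


ratio = retransmit_ratio(NETSTAT_EXCERPT)
```

With the figures above this comes to roughly 1.7% of sent segments. Comparing the same ratio between a quiet period and the backup window would be more informative than the cumulative total.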

If anyone can advise how to resolve this, I would be exceptionally keen
to hear about it.

An upgrade to the bandwidth at one end is scheduled in two weeks' time,
which may help this site, but I am still seeing the same problem with
other hosts in other locations.

Two additional questions:
1) Can some sort of checksum/verification marker be added to the end of
the client data so the server knows that if it doesn't see the end
marker to discard the entire message?

2) Compression of the client data?
Never mind, I'll post this as a separate email so it can get a little
better visibility...
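
To make question 1 concrete, here is a minimal Python sketch of an end-of-message trailer. Nothing in the thread indicates that hobbit supports anything like this; the marker text "[end-of-report]", the choice of SHA-256, and the names seal and unseal are all hypothetical.

```python
import hashlib

# Hypothetical trailer line; not part of any existing hobbit message format.
MARKER_PREFIX = "[end-of-report] sha256="


def seal(report):
    """Append a trailer line carrying a SHA-256 digest of the report body."""
    body = report if report.endswith("\n") else report + "\n"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + MARKER_PREFIX + digest + "\n"


def unseal(message):
    """Return the report body if the trailer is present and valid, else None."""
    if not message.endswith("\n"):
        return None  # cut off mid-line
    body, sep, trailer = message[:-1].rpartition("\n")
    if not sep or not trailer.startswith(MARKER_PREFIX):
        return None  # the last line is not the trailer
    body += "\n"
    expected = trailer[len(MARKER_PREFIX):]
    actual = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body if actual == expected else None
```

A receiver using this scheme would drop any message for which unseal returns None, so a report cut off part-way would leave the previous status in place instead of producing a red procs alert. Both ends would need to agree on the trailer, which is why it would have to be a protocol change rather than a local workaround.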

Thanks,
Adam

▸ quoted from Adam Goryachev



--
Adam Goryachev
Website Managers


Ph: +XX X XXXX XXXX                            user-eaec2ffb4cbc@xymon.invalid
Fax: +XX X XXXX XXXX                            www.websitemanagers.com.au
