server fails to receive all of client message
list Adam Goryachev
I have a hobbit server which has been running quite nicely for a long time. Recently I noticed it was consuming approximately 8G worth of data per month (from all the remote clients reporting their data). This was costing quite a bit of money (we pay per MB), so I modified all the clients to report to a different IP (and hence a provider with much cheaper rates).

The problem is that, approximately since then, a number of client reports are not completely received. Sometimes part of the ps output is truncated, sometimes the ports section is truncated, etc. This leads to false positive alerts (i.e., procs goes red because some monitored procs appear not to be running, since they were listed after the truncation point).

I've increased the timeout on hobbitd (--timeout=60) but this doesn't seem to have helped.

The only common factors between the clients which have this problem are:
1) Most of them are running bbproxy and passing status messages from a number of clients.
2) The rest of them are on very slow connections, or frequently very busy connections.

Around the same time I actually 'fixed' bbproxy on the remote sites; prior to this the clients were reporting directly to both hobbit servers. I've looked for an option to stop bbproxy from 'caching and combining' multiple clients into a single connection, but this doesn't seem to be possible.

I don't seem to get any logs/alerts from hobbit when this happens. Can anyone suggest where I should look, or what I can do to try and resolve this? (My main problem is that I've started ignoring the late night SMS notifications, and I'm sure I will end up missing something important because of that.)

Running hobbit version 4.2.0 from package 4.2.0-1 on the server.

Thanks,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
list Adam Goryachev
Adam Goryachev wrote:
> Anyway, the problem is that approximately since then, a number of
> client reports are not completely received. Sometimes some of the ps
> output is truncated, sometimes the ports section is truncated, etc.
> This leads to false positive alerts (i.e., procs goes red because some
> monitored procs are not running since they were after the truncated
> section). I've increased the timeout on the hobbitd (--timeout=60) but
> this doesn't seem to have helped. The only common factors between the
> clients which have this problem are:
> 1) Most of them are running bbproxy and passing status messages from a
> number of clients.
> 2) The rest of them are on very slow connections, or frequently very
> busy connections.
I have made some 'progress' of sorts.
I've increased the MAX values as I was getting some "Oversize ...
truncated" messages in my log file. I then went home thinking "Great, I
managed to solve this one thing today at least". Except, I started
getting messages a few hours later.
So after further investigation, I've decided I really can't work out
what is happening, and why it isn't working. I've enabled debug output
from bbproxy, but I don't really know what it all means.
I can see that if I set bbproxy to only forward messages to 127.0.0.1
the local hobbit server gets all the data correctly. If I add the remote
server, then some things don't work properly. Since it is likely all a
big jumbled mess by now, I'll post a few sections of config files, and
hopefully someone will notice my stupid mistake (or multiple mistakes)...
I have a network 10.x.x.x which has a hobbit server at 10.30.10.9. All
client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are
Windows PCs using the BB Windows client), one is a Linux hobbit-client,
and of course 10.30.10.9 is itself a hobbit client (plus a couple of old
ext scripts using the old BB env). I think all this is working fine,
since nothing goes randomly purple/red.
10.30.10.9 is behind NAT but has complete access to the internet.
I have a remote server behind a NAT router which has port 1984
forwarded to it. It is receiving reports from around 20 other hobbit
client machines perfectly, so I don't suspect the NAT router/hobbit
config itself.
Some config from 10.30.10.9:
hobbitserver.cfg:
BBSERVERIP="127.0.0.1"
BBDISP="127.0.0.1"
BBDISPLAYS=""
MAXLINE="32768"
hobbitclient.cfg
BBDISP="10.30.10.9"
BBDISPLAYS=""
BB="$BBHOME/bin/bb --debug --timeout=60"
MAXLINE="32768"
hobbitlaunch.cfg
[hobbitd]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
    --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
    --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
    --admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs
    --listen=127.0.0.1
[bbproxy]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD $BBHOME/bin/bbproxy --hobbitd
    --bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9
    --report=$MACHINE.bbproxy --no-daemon --timeout=30
    --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1
    --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30
    --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
LOGFILE $BBSERVERLOGS/bbproxy.log
[hobbitclient]
ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
NEEDS hobbitd
CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
LOGFILE $BBSERVERLOGS/hobbitclient.log
INTERVAL 5m
On the remote hobbit server with the public IP I have:
hobbitserver.cfg
BBSERVERIP="192.168.2.6"
BBDISP="192.168.2.6"
BBDISPLAYS=""
MAXLINE="32768"
MAXMSG_STATUS="1024"
MAXMSG_CLIENT="1024"
MAXMSG_DATA="512"
hobbitlaunch.cfg
[hobbitd]
HEARTBEAT
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
    --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
    --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
    --admin-senders=127.0.0.1,$BBSERVERIP
    --maint-senders=127.0.0.1,$BBSERVERIP -www-senders=127.0.0.1,$BBSERVERIP
    --store-clientlogs=!msgs --timeout=60
Any suggestions as to what is going wrong would be really appreciated.
BTW, bbnet tests from the 10.30.10.9 host are not submitted to the
bbproxy at all because of the BBDISP setting in the hobbitserver.cfg,
but if I change this to point to 10.30.10.9 then it seems to break the
web interface. I'm not really too concerned about this right now though....
Thanks for any tips/pointers/etc
Regards,
Adam
list Adam Goryachev
I've made some more possible progress, but I still don't really know how to approach this problem or try to solve it.

Basically, I used tcpdump to capture all traffic sent to port 1984 on my local server. I then used wireshark to analyse the data and find the specific stream of packets that led to hobbit raising a red alert due to a truncated client report.

It now seems to point toward some sort of transport 'problem', in that I get a number of 'errors' such as "TCP Previous segment lost", "TCP Dup ACK" and "TCP Retransmission", and the final packet is a "RST", which I assume is when you would normally get a "Connection reset by peer" type error.

I would love to publish the trace, but don't know how to obfuscate its contents to conceal some of the details (i.e., the contents of the hobbit client status that was being reported).

However, I do have the following questions:
1) If the connection died due to an error, why does hobbit still use the contents of what it received? (Is this a case of "better to know half the information than none", or can we not tell the difference between a connection closed due to an error and a connection closed at the end of transmission?)
2) From what I know, TCP is meant to be fairly robust in the face of lost packets and other errors. The fact that I am seeing these sorts of failures makes me worry that my network must be unhappy in some way. Yet from a user experience point of view everything seems to be working perfectly, i.e., web browsing, ssh connections, etc.

BTW, the network connection is quite busy during the times when these errors happen, due to remote backups being done at those times. Could that be the cause of the problem?

Any comments, suggestions, etc. would be greatly appreciated.

Regards,
Adam
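On question 1, a receiver can in principle tell the two cases apart: an orderly close shows up as a zero-length read, while a reset surfaces as an error from the socket call. The Python sketch below illustrates only that distinction. It is not how hobbitd is implemented; the function name and the idea of flagging partial data are assumptions made for illustration.

```python
import socket

def read_until_close(conn, bufsize=4096):
    """Read from conn until the peer closes.

    Returns (data, complete). complete is False when the connection
    ended with an error (reset or timeout) rather than an orderly close.
    """
    chunks = []
    while True:
        try:
            chunk = conn.recv(bufsize)
        except (ConnectionResetError, socket.timeout):
            # RST or timeout: whatever arrived so far is only a fragment.
            return b"".join(chunks), False
        if not chunk:
            # Zero-length read: the peer sent FIN, i.e. closed normally.
            return b"".join(chunks), True
        chunks.append(chunk)
```

A caller could then drop the message whenever complete is False. Note that a sender or proxy which gives up and closes normally in the middle of a message still looks like a clean close, so this check alone cannot catch every truncation; that would need a length or end marker inside the message itself.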
list Rodolfo Pilas
Adam, take a look at: http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22
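To check whether the size limits are involved at all, note that the log line quoted in that FAQ title carries both the received size and the limit, apparently in bytes. A minimal Python sketch for pulling those numbers out of a log line follows. The regular expression is modelled only on the single example in the FAQ title, and the assumption that the MAXMSG_* settings are expressed in kB comes from the hobbitserver.cfg man page as I read it, so treat both as unverified.

```python
import math
import re

# Pattern modelled on the one example line in the FAQ title; other
# Hobbit versions may word the message differently (assumption).
OVERSIZE_RE = re.compile(
    r"Oversize (?P<kind>\w+) msg from (?P<sender>\S+) for (?P<target>\S+) "
    r"truncated \(n=(?P<n>\d+), limit=(?P<limit>\d+)\)"
)

def parse_oversize(line):
    """Return details of an 'Oversize ... truncated' log line, or None."""
    m = OVERSIZE_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["n"] = int(info["n"])          # size of the message received
    info["limit"] = int(info["limit"])  # limit that was applied
    return info

def suggested_maxmsg_kb(n_bytes, headroom=1.25):
    """Smallest whole-kB value that fits n_bytes plus some headroom.

    Assumes MAXMSG_* values are given in kB (hypothetical helper).
    """
    return math.ceil(n_bytes * headroom / 1024)

sample = ("Oversize status msg from 192.168.1.31 for test.my.com:ports "
          "truncated (n=508634, limit=262144)")
info = parse_oversize(sample)
print(info["kind"], info["n"], info["limit"] // 1024,
      suggested_maxmsg_kb(info["n"]))
```

If no such line appears in hobbitd.log around the time of a truncated report, the MAXMSG_* limits are probably not what is cutting the message short.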
list Adam Goryachev
Rodolfo Pilas wrote:
> Adam, take a look at:
> http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22
I've checked that, and it does not seem to be the problem.
A quick recap: my hobbit server doesn't receive the complete client
data, and so procs (and sometimes ports) go red (since the data is
missing, the procs are not found).
Currently, the bbproxy server (which is running a hobbit on 127.0.0.1)
shows a green for procs for itself (the bbclient running locally) which
means the bbclient passed a message to 10.30.10.9 (bbproxy) which passed
the message to 127.0.0.1 (hobbit) which displayed it correctly.
The same bbproxy failed to send that message to the remote hobbit
server, which has the following as the "Client Data" when I click at the
bottom of the red procs page:
client mail,servername,com,au.linux linux
[date]
Tue Dec 16 15:56:14 EST 2008
[uname]
Linux mail 2.6.18-6-686 i686
[osversion]
Debian 4.0
[uptime]
15:56:14 up 11 days, 6:53, 1 user, load average: 0.22, 0.39, 0.43
[who]
userag pts/0 Dec 16 13:29 (123.123.123.12.static.net.au)
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/md0 9775120 7792092 1983028 80% /
/dev/md1 146002196 23881268 122120928 17% /home
[mount]
/dev/md0 on / type reiserfs (rw,notail)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /home type reiserfs (rw)
//ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro)
//ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw)
[free]
total used free shared buffers cached
Mem: 3112548 2735760 376788 0 451828 1188956
-/+ buffers/cache: 1094976 2017572
Swap: 1012072 88 1011984
[ifconfig]
eth0 Link encap:Ethernet HWaddr 00:13:20:5F:EC:F3
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0
TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:795753357 (758.8 MiB) TX bytes:4174004073 (3.8 GiB)
eth1 Link encap:Ethernet HWaddr 00:09:5B:1A:16:26
inet addr:10.30.10.9 Bcast:10.30.15.255 Mask:255.255.240.0
inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0
TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7
collisions:0 txqueuelen:1000
RX bytes:771706320 (735.9 MiB) TX bytes:2584475027 (2.4 GiB)
Interrupt:74 Base address:0xc000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0
TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3978669335 (3.7 GiB) TX bytes:3978669335 (3.7 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.30.99.1 P-t-P:10.30.99.2 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:767727 errors:0 dropped:0 overruns:0 frame:0
TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:57714836 (55.0 MiB) TX bytes:210654878 (200.8 MiB)
[route]
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.30.99.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
10.30.99.0 10.30.99.2 255.255.255.0 UG 0 0 0 tun0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.30.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth1
10.30.0.0 10.30.10.254 255.255.0.0 UG 0 0 0 eth1
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
[netstat]
Ip:
40747731 total packets received
138429 with invalid addresses
1633704 forwarded
1 with unknown protocol
0 incoming packets discarded
38950970 incoming packets delivered
50359204 requests sent out
15 outgoing packets dropped
1 fragments dropped after timeout
1831 reassemblies required
915 packets reassembled ok
1 packet reassembles failed
Icmp:
1657334 ICMP messages received
3902 input ICMP message failed.
ICMP input histogram:
destination unreachable: 1566573
timeout in transit: 54
redirects: 123
echo requests: 26880
echo replies: 61402
139635 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 112755
echo replies: 26880
Tcp:
1669446 active connections openings
1247657 passive connection openings
361872 failed connection attempts
132167 connection resets received
96 connections established
33096061 segments received
40931113 segments send out
715004 segments retransmited
0 bad segments received.
125842 resets sent
Udp:
4173585 packets received
23298 packets to unknown port received.
62 packet receive errors
6868647 packets sent
TcpExt:
482 resets received for embryonic SYN_RECV sockets
268 packets pruned from receive queue because of socket buffer overrun
14 ICMP packets dropped because they were out-of-window
1084404 TCP sockets finished time wait in fast timer
2236 time wait sockets recycled by time stamp
354 packets rejects in established connections because of timestamp
366541 delayed acks sent
179 delayed acks further delayed because of locked socket
Quick ack mode was activated 111492 times
2510467 packets directly queued to recvmsg prequeue.
3003729 of bytes directly received from backlog
380000212 of bytes directly received from prequeue
7194390 packet headers predicted
724319 packets header predicted and directly queued to user
8998070 acknowledgments not containing data received
5065326 predicted acknowledgments
1472 times recovered from packet loss due to fast retransmit
42465 times recovered from packet loss due to SACK data
161 bad SACKs received
Detected reordering 165 times using FACK
Detected reordering 181 times using SACK
Detected reordering 444 times using reno fast retransmit
Detected reordering 3197 times using time stamp
1530 congestion windows fully recovered
17437 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 155
73943 congestion windows recovered after partial ack
29880 TCP data loss events
TCPLostRetransmit: 42
541 timeouts after reno fast retransmit
60499 timeouts after SACK recovery
11423 timeouts in loss state
84462 fast retransmits
6777 forward retransmits
45329 retransmits in slow start
261506 other TCP timeouts
TCPRenoRecoveryFail: 203
14386 sack retransmits failed
2738 times receiver scheduled too late for direct processing
13320 packets collapsed in receive queue due to low socket buffer
89544 DSACKs sent for old packets
1294 DSACKs sent for out of order packets
21665 DSACKs received
446 DSACKs for out of order packets received
34120 connections reset due to unexpected data
3110 connections reset due to early user close
34312 connections aborted due to timeout
[ports]
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:1984 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1984 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:20002 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:55555 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:37 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:389 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:9 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:139 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:13 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:8080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:113 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:445 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:14238 0.0.0.0:* LISTEN
tcp 0 0 192.168.1.2:56464 123.123.32.80:80 ESTABLISHED
tcp 0 0 192.168.1.2:51389 123.123.176.189:80 ESTABLISHED
tcp 0 0 192.168.1.2:52865 123.123.28.123:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1028 TIME_WAIT
tcp 0 0 192.168.1.2:40408 123.213.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:53207 123.123.50.28:443 ESTABLISHED
tcp 0 0 10.30.10.9:33905 10.30.10.1:139 ESTABLISHED
tcp 0 0 192.168.1.2:49885 123.123.28.124:80 ESTABLISHED
tcp 0 0 192.168.1.2:53071 123.123.176.176:80 ESTABLISHED
tcp 0 0 127.0.0.1:5432 127.0.0.1:53017 ESTABLISHED
tcp 0 0 192.168.1.2:25 123.123.132.183:57226 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2826 ESTABLISHED
tcp 0 0 127.0.0.1:54646 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54645 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54655 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54639 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2829 ESTABLISHED
tcp 0 0 127.0.0.1:54664 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54665 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2830 ESTABLISHED
tcp 0 0 192.168.1.2:41020 123.123.38.94:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1050 ESTABLISHED
tcp 0 0 192.168.1.2:35851 123.123.88.59:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.16:1051 ESTABLISHED
tcp 0 0 192.168.1.2:47514 123.123.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:37393 123.123.176.45:443 ESTABLISHED
tcp 0 0 10.30.10.9:49509 10.30.10.9:995 ESTABLISHED
tcp 0 0 192.168.1.2:38569 123.123.15.124:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1068 ESTABLISHED
tcp 0 1640 192.168.1.2:40677 123.213.176.45:443 ESTABLISHED
tcp 0 1 192.168.1.2:59642 123.123.176.176:80 LAST_ACK
tcp 0 0 10.30.10.9:8080 10.30.10.16:1066 ESTABLISHED
tcp 0 0 192.168.1.2:46183 123.123.88.51:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1067 ESTABLISHED
tcp 0 1640 192.168.1.2:32966 123.123.176.45:443 ESTABLISHED
tcp 0 0 192.168.1.2:41246 123.123.88.56:80 ESTABLISHED
tcp 0 0 192.168.1.2:35872 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:139 10.30.10.1:3133 ESTABLISHED
tcp 0 0 10.30.10.9:47825 10.30.10.1:139 TIME_WAIT
tcp 0 0 10.30.10.9:47828 10.30.10.1:139 TIME_WAIT
tcp 0 0 192.168.1.2:33758 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.17:3132 ESTABLISHED
tcp
Note, this is clearly truncated mid-line/mid-report!
So, the procs test on the remote hobbit shows red, and alerts/etc...
The interesting thing to note this time is that the actual procs report
showed most of the procs were actually found with some (truncated) ps
output data. Which means, hobbit actually had more data than is
displayed on the Client Data page....
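The client data is a sequence of [section] headers, each followed by that section's output, so a truncated report can be spotted by checking which sections actually arrived. A rough Python sketch follows. It assumes the section layout seen in the dump above and assumes that a [ps] section normally follows [ports]; the list of required sections is a parameter because the real set depends on the client.

```python
def split_sections(client_msg):
    """Map section name -> list of lines, in order of appearance."""
    sections = {}
    current = None
    for line in client_msg.splitlines():
        if line.startswith("[") and line.endswith("]"):
            # A line such as "[ports]" opens a new section.
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

def missing_sections(client_msg, required=("ports", "ps")):
    """Return the required section names that never appeared."""
    present = split_sections(client_msg)
    return [name for name in required if name not in present]

sample = (
    "client mail,servername,com,au.linux linux\n"
    "[date]\n"
    "Tue Dec 16 15:56:14 EST 2008\n"
    "[ports]\n"
    "tcp 0 0 10.30.10.9:8080 10.30.10.17:3132 ESTABLISHED\n"
    "tcp"
)
print(missing_sections(sample))
```

For the truncated report above, the ps section would be reported as missing, which is the situation that turns procs red. A section that is present but cut short (like ports here) is not detected by this check.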
If anyone can advise how to resolve this, I would be exceptionally keen
to hear about it.
An upgrade to the bandwidth at one end is scheduled in two weeks' time,
which may help this site, but I am still seeing the same problem with
other hosts in other locations.
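The [netstat] counters in the client data above give a rough measure of how lossy that host's connections have been since boot. The short Python calculation below uses figures copied from the Tcp: and TcpExt: blocks. The counters cover all interfaces and all traffic on the host, not just the Hobbit reports, so the result is only indicative.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counters copied from the Tcp: and TcpExt: blocks of the dump above.
segments_sent = 40931113         # "segments send out"
segments_retransmitted = 715004  # "segments retransmited"
active_opens = 1669446           # "active connections openings"
passive_opens = 1247657          # "passive connection openings"
aborted_on_timeout = 34312       # "connections aborted due to timeout"

retransmit_pct = pct(segments_retransmitted, segments_sent)
abort_pct = pct(aborted_on_timeout, active_opens + passive_opens)

print(f"retransmitted segments: {retransmit_pct:.2f}%")
print(f"connections aborted on timeout: {abort_pct:.2f}%")
```

This works out to about 1.75% of segments retransmitted and about 1.18% of connections aborted on timeout, which would be consistent with a congested link, though it does not prove where the loss occurs.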
Two additional questions:
1) Can some sort of checksum/verification marker be added to the end of
the client data so the server knows that if it doesn't see the end
marker to discard the entire message?
2) Compression of the client data?
Never mind, I'll post this as a separate email so it can get a little
better visibility...
Thanks,
Adam
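Both ideas from the two questions above can be illustrated outside Hobbit. The Python sketch below is purely hypothetical: the frame layout, the function names and the use of zlib and CRC32 are choices made for illustration, not anything the Hobbit protocol defines. The sender compresses the client data and prefixes it with its length and a checksum; the receiver rejects anything short or corrupt instead of acting on a fragment.

```python
import struct
import zlib

# Hypothetical frame header: compressed length, CRC32 of compressed payload.
HEADER = struct.Struct("!II")

def pack_report(text):
    """Compress a client report and prefix it with length and checksum."""
    payload = zlib.compress(text.encode("utf-8"))
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def unpack_report(frame):
    """Return the original text, or None if the frame is truncated or corrupt."""
    if len(frame) < HEADER.size:
        return None
    length, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return zlib.decompress(payload).decode("utf-8")

report = "[ports]\n" + "tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN\n" * 50
frame = pack_report(report)
print(len(report), len(frame))          # size before and after framing
print(unpack_report(frame) == report)   # intact frame round-trips
print(unpack_report(frame[:-5]))        # truncated frame is rejected
```

The length prefix alone is enough to detect truncation; the checksum additionally guards against corruption, and the compression addresses the per-MB cost, since client reports are repetitive text.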