Xymon Mailing List Archive

bbtest-net to hobbitd problem

3 messages in this thread

Olivier Beau · Mon, 1 Aug 2005 13:36:03 +0200
Hi Henrik,

I'm having a problem:
- bbtest-net reports one or two "Whoops ! bb failed to send message - timeout" in
the report
- which is causing a bunch of net tests to go purple (pretty embarrassing..)
- I tried to play with BBMAXMSGSPERCOMBO and BBSLEEPBETWEENMSGS, but it doesn't seem
to have any effect...
- once in a while bbtest-net does report everything fine to hobbitd, without any
changes on the server


here's the output from bbtest-net --debug where the whoops happens:
2005-08-01 13:15:24 Recipient listed as '127.0.0.1'
2005-08-01 13:15:24 Standard BB protocol on port 1985
2005-08-01 13:15:24 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:24 Connect status is 0
2005-08-01 13:15:24 Sent 65532 bytes
2005-08-01 13:15:24 Sent 81921 bytes
2005-08-01 13:15:24 Sent 49152 bytes
2005-08-01 13:15:29 Whoops ! bb failed to send message - timeout


it looks like bbtest-net actually connected to hobbitd !
-> could bbtest-net re-open a connection and resend the affected statuses when a
whoops happens ?


Later in the bbtest-net log I see this, which is different; I suppose
bbtest-net had its connection closed on the first try:
2005-08-01 13:15:29 Recipient listed as '127.0.0.1'
2005-08-01 13:15:29 Standard BB protocol on port 1985
2005-08-01 13:15:29 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:34 Timeout while talking to bbd at 127.0.0.1:1985 - retrying
2005-08-01 13:15:35 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:35 Connect status is 0
2005-08-01 13:15:35 Sent 466 bytes
2005-08-01 13:15:35 Closing connection


Any idea of what could be going on in hobbitd ?
(my understanding is that hobbitd kind of drops heavy status connections..)


--
Olivier Beau
Henrik Størner · Mon, 1 Aug 2005 14:36:33 +0200
Hi Olivier,

what version of Hobbit ? And what OS/hardware are you running on ?
quoted from Olivier Beau

On Mon, Aug 01, 2005 at 01:36:03PM +0200, Olivier Beau wrote:
I'm having a problem:
- bbtest-net reports one or two "Whoops ! bb failed to send message - timeout" in
the report
- which is causing a bunch of net tests to go purple (pretty embarrassing..)
- I tried to play with BBMAXMSGSPERCOMBO and BBSLEEPBETWEENMSGS, but it doesn't seem
to have any effect...
- once in a while bbtest-net does report everything fine to hobbitd, without any
changes on the server

here's the output from bbtest-net --debug where the whoops happens:
2005-08-01 13:15:24 Recipient listed as '127.0.0.1'
2005-08-01 13:15:24 Standard BB protocol on port 1985
2005-08-01 13:15:24 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:24 Connect status is 0
2005-08-01 13:15:24 Sent 65532 bytes
2005-08-01 13:15:24 Sent 81921 bytes
2005-08-01 13:15:24 Sent 49152 bytes
2005-08-01 13:15:29 Whoops ! bb failed to send message - timeout
Is there an equivalent number of "Bogus/Timeout" messages reported in
the Hobbit servers' "hobbitd" status column ? Are there any unusual
messages in the hobbitd.log file ?


The timeout that bbtest-net hits is a 5 second timeout which is the
default one used whenever a message is sent off to the Hobbit daemon.
The 5 secs was chosen back when bbtest-net was sending to the Big
Brother daemon, and considering the fact that Hobbit can generate much
larger messages it might be worth a try to increase that timeout
somewhat. Unfortunately, that one is set at compile-time and cannot be
changed easily - so could you try editing the lib/sendmsg.h file and
change the line
    #define BBTALK_TIMEOUT 5
to
    #define BBTALK_TIMEOUT 15
Then run "make clean; make" and as root "make install" to build and
install the tools with the new timeout setting.
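
To illustrate what a send timeout of this kind means in practice, here is a minimal sketch in C. It is not Hobbit's actual lib/sendmsg code: the names write_with_timeout and TALK_TIMEOUT are hypothetical, and the use of select() is an assumption chosen for illustration. The function gives up if the socket does not become writable within the timeout, which is the sort of condition that could produce a "failed to send message - timeout" report when the receiver stops reading.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a compile-time timeout such as BBTALK_TIMEOUT. */
#define TALK_TIMEOUT 5 /* seconds */

/*
 * Write len bytes to fd, waiting at most timeout_secs for the descriptor to
 * become writable before each write(). The descriptor is assumed to be
 * non-blocking, so a single write() cannot stall past the timeout.
 * Returns 0 on success, -1 on error, -2 on timeout.
 */
int write_with_timeout(int fd, const char *buf, size_t len, int timeout_secs)
{
    assert(buf != NULL);
    size_t done = 0;

    while (done < len) {
        fd_set wfds;
        struct timeval tv;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;

        int n = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (n == 0)
            return -2;              /* peer did not drain data in time */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, wait again */
            return -1;
        }

        ssize_t w = write(fd, buf + done, len - done);
        if (w < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;           /* transient, go back to waiting */
            return -1;
        }
        done += (size_t)w;
    }
    return 0;
}
```

In this sketch the timeout applies to each wait separately, so a large message sent in several chunks only fails when one individual wait exceeds the limit.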

Also, on the Hobbit server it might be necessary to up the timeout on
the receiver side - so add a "--timeout=30" to the hobbitd command in
~hobbit/server/etc/hobbitlaunch.cfg
quoted from Olivier Beau
it looks like bbtest-net actually connected to hobbitd !
-> could bbtest-net re-open a connection and resend the affected statuses when a
whoops happens ?
It's tricky. Basically these timeouts should not happen (especially not
when we're connecting to "localhost"), so I'd rather try and figure out 
why they happen.
quoted from Olivier Beau
Later in the bbtest-net log I see this, which is different; I suppose
bbtest-net had its connection closed on the first try:
2005-08-01 13:15:29 Recipient listed as '127.0.0.1'
2005-08-01 13:15:29 Standard BB protocol on port 1985
2005-08-01 13:15:29 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:34 Timeout while talking to bbd at 127.0.0.1:1985 - retrying
2005-08-01 13:15:35 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:35 Connect status is 0
2005-08-01 13:15:35 Sent 466 bytes
2005-08-01 13:15:35 Closing connection
Yes, this is a situation where the first connection attempt fails. This
is retried and the second connection attempt succeeds and sends the
message.
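
The retry behaviour visible in that log can be sketched roughly as follows. This is an illustrative pattern only, not the bbtest-net source: try_with_retries, attempt_fn and flaky_attempt are hypothetical names, and flaky_attempt merely simulates a connection attempt that fails a given number of times before succeeding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, non-zero on failure. */
typedef int (*attempt_fn)(void *ctx);

/*
 * Call attempt() up to max_tries times and stop at the first success.
 * Returns the number of attempts used on success, or -1 if all failed.
 */
int try_with_retries(attempt_fn attempt, void *ctx, int max_tries)
{
    assert(attempt != NULL);
    for (int i = 1; i <= max_tries; i++) {
        if (attempt(ctx) == 0)
            return i;
    }
    return -1;
}

/*
 * Demo stand-in for a connection attempt: fails while the counter that
 * ctx points to is positive, decrementing it each time, then succeeds.
 */
int flaky_attempt(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return 1;
    }
    return 0;
}
```

With one simulated failure and two allowed tries, the second attempt succeeds, which mirrors the "retrying" line followed by "Connect status is 0" in the log. Note that this only retries the connection phase; resending data after a partially transmitted message is the harder case discussed above.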
quoted from Olivier Beau
Any idea of what could be going on in hobbitd ?
(my understanding is that hobbitd kind of drops heavy status connections..)
Not really. hobbitd is a single-threaded application that is designed to
do as little disk I/O as possible - the only real disk I/O it performs
is to read the bb-hosts file - and instead handle everything in memory.
5 seconds is a very long time; you can do a lot of cpu- and memory-bound
activity during that time - *if* the hobbitd process is scheduled to
run. I have seen some situations where a broken disk driver would cause
the entire box to freeze up for several seconds at a time, and hobbitd
doesn't like that at all ... 


Henrik
Olivier Beau · Mon, 1 Aug 2005 15:41:47 +0200
what version of Hobbit ? And what OS/hardware are you running on ?
Version 4.1.1, Red Hat 3.0 on a fairly good Compaq server (2x 3 GHz Intel CPUs)
quoted from Henrik Størner

Is there an equivalent number of "Bogus/Timeout" messages reported in
the Hobbit servers' "hobbitd" status column ? 
No,
I had 1 hobbitd report with "Bogus/Timeout =1" this morning
and over 50 bbtest-net reports with 1 or 2 whoops..

Are there any unusual messages in the hobbitd.log file ?
nothing in hobbitd.log
quoted from Henrik Størner

The timeout that bbtest-net hits is a 5 second timeout which is the
default one used whenever a message is sent off to the Hobbit daemon.
The 5 secs was chosen back when bbtest-net was sending to the Big
Brother daemon, and considering the fact that Hobbit can generate much
larger messages it might be worth a try to increase that timeout
somewhat. Unfortunately, that one is set at compile-time and cannot be
changed easily - so could you try editing the lib/sendmsg.h file and
change the line
    #define BBTALK_TIMEOUT 5
to
    #define BBTALK_TIMEOUT 15
Then run "make clean; make" and as root "make install" to build and
install the tools with the new timeout setting.

Also, on the Hobbit server it might be necessary to up the timeout on
the receiver side - so add a "--timeout=30" to the hobbitd command in
~hobbit/server/etc/hobbitlaunch.cfg
OK, I've changed those to what you recommended (15 and 30).
So far, bbtest-net doesn't whoops anymore.
quoted from Henrik Størner

it looks like bbtest-net actually connected to hobbitd !
-> could bbtest-net re-open a connection and resend the affected statuses when a
whoops happens ?
It's tricky. Basically these timeouts should not happen (especially not
when we're connecting to "localhost"), so I'd rather try and figure out why they happen.
Yes, I understand and agree with you.
Let me know if I can do anything on this.


One thing that seems to take pretty long in my bbtest-net report is "Test results
transmitted":


Statistics:
 Hosts total           :     1629
 Hosts with no tests   :        0
 Total test count      :     4511
 Status messages       :     4851
 Alert status msgs     :        0
 Transmissions         :      522

TIME SPENT
Event                                            Starttime          Duration
bbtest-net startup                       1122897713.037280                 -
Service definitions loaded               1122897713.040386          0.003106
Tests loaded                             1122897713.568623          0.528237
DNS lookups completed                    1122897723.673199         10.104576
Test engine setup completed              1122897723.737976          0.064777
TCP tests completed                      1122897747.000639         23.262663
PING test completed (1569 hosts)         1122897792.655792         45.655153
PING test results sent                   1122897795.920521          3.264729
Test result collection completed         1122897795.921481          0.000960
LDAP test engine setup completed         1122897795.921485          0.000004
LDAP tests executed                      1122897795.921487          0.000002
LDAP tests result collection completed   1122897795.921488          0.000001
NTP tests executed                       1122897796.143392          0.221904
DIG tests executed                       1122897796.399747          0.256355
NSLOOKUP tests executed                  1122897796.534172          0.134425
Test results transmitted                 1122897824.069917         27.535745
bbtest-net completed                     1122897824.074708          0.004791
TIME TOTAL                                                        111.037428
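
As a rough worked example of reading this table: each Duration is the step's Starttime minus the previous step's Starttime, so "Test results transmitted" took 1122897824.069917 - 1122897796.534172 = 27.535745 s, about a quarter of the 111 s total. Assuming that whole step was spent sending the 522 transmissions listed in the statistics (an assumption, the report does not say so), that averages roughly 53 ms per transmission. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Duration of a step: its timestamp minus the previous step's timestamp. */
double step_duration(double prev_ts, double this_ts)
{
    return this_ts - prev_ts;
}

/* Average time per item, e.g. seconds per transmission. */
double avg_per_item(double total_secs, int count)
{
    assert(count > 0);
    return total_secs / count;
}
```

For instance, avg_per_item(step_duration(1122897796.534172, 1122897824.069917), 522) comes out at about 0.0528 s.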


--
Olivier Beau