Xymon Mailing List Archive

Hobbit server crashing

Henrik Størner
Thu, 9 Oct 2008 13:17:48 +0000 (UTC)
Message-Id: <gcl09s$f4n$user-e356fad9864f@xymon.invalid>

In <user-1ee6c893bc5e@xymon.invalid> "Everett, Vernon" <user-9da1a1882f49@xymon.invalid> writes:
My Hobbit server crashed and died.
This happened before, a few months ago, and I shrugged it off - sometimes
sh1t happens.
Then it happened last week again. This time I was concerned.
Now it has just happened again, about 40 minutes ago.
I tried to restart hobbit, without much luck, then I walked away, put my son
into bed, and then tried again.
This time it worked.
The logs never showed anything conclusive, but maybe I just don't know what
I am looking for.
The symptoms were the same all three times.
All "passive" server based tests go purple.
By passive server based, I mean conn, http, content, ssh, ftp, ftps, etc.
The tests that do not rely on a client.
bbd and bbtest also went purple.
All client based tests were unaffected. Graphing worked as normal. And 
alerts were being sent out.

Your description sounds very much as if the only thing that stopped was
the network tests (bbtest-net). Since the client-side tests are updating,
network tests go purple and alerts go out, I think that is where the
problem is. "bbtest" going purple also points in this direction.

Next time it happens, see if there's a "bbtest-net" process running (and possibly
a "hobbitping" or "fping" process as well); if there is, kill it with "kill -6"
(SIGABRT) to make it dump core. Then do the usual stuff of getting a stack trace
from the core file ( http://www.hswn.dk/hobbit/help/known-issues.html#bugreport )
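
A minimal sketch of that step, assuming a system that provides pgrep. The
function name abort_hung is hypothetical and only for illustration; the process
names are the ones mentioned above. Whether a core file actually gets written
depends on the core size limit ("ulimit -c") the process was started with.

```shell
# Sketch only: send SIGABRT (signal 6) to every process with the given
# name so that it aborts and, if core dumps are enabled, dumps core.
# abort_hung is a hypothetical helper name, not part of Hobbit.
abort_hung() {
    name=$1
    # pgrep -x matches the exact process name; it exits non-zero on no match
    pids=$(pgrep -x "$name") || { echo "no $name process found"; return 1; }
    for pid in $pids; do
        echo "sending SIGABRT to $name (pid $pid)"
        kill -6 "$pid"
    done
}
```

It would be called once per candidate, for example "abort_hung bbtest-net",
then "abort_hung hobbitping" and "abort_hung fping".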

Are you running bbtest-net with the "--no-ares" option? If so, a hung or slow
DNS server can make your network tests run very slowly.
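
One rough way to check for a slow resolver is to time a lookup by hand. The
sketch below assumes "getent hosts" goes through the same system resolver
that bbtest-net uses with "--no-ares"; that may not hold on every setup. The
function name time_lookup is hypothetical.

```shell
# Sketch only: time one hostname lookup through the system resolver.
# time_lookup is a hypothetical helper name; "getent hosts" is assumed
# to be available and to follow the normal resolver configuration.
time_lookup() {
    host=$1
    start=$(date +%s)
    if getent hosts "$host" >/dev/null 2>&1; then
        result=resolved
    else
        result=failed
    fi
    end=$(date +%s)
    # Whole-second resolution is enough to spot a hung DNS server
    echo "$host $result in $((end - start))s"
}
```

Running it against a few of the hosts in your network tests should show
whether lookups take seconds rather than returning almost immediately.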


Henrik