Xymon Mailing List Archive

xymon.com and the mailinglists back online

3 messages in this thread

list Henrik Størner · Mon, 27 Aug 2012 17:19:55 +0200 ·
Hi,

the xymon.com server had a minor disk "hiccup" last Saturday. Unfortunately, this triggered a kernel panic and things went pretty bad after that - eventually causing the whole server to die last Monday Aug. 20th.

Unfortunately, I was 4000 km away - and even though I did manage to get an SSH session opened to the server, all attempts to reboot it remotely just gave me a "Bus error".

My apologies for the inconvenience. It was a classic case of Murphy's Law: everything will go bad, at the worst possible time.

I expect the mails submitted to the mailing list will show up over the next 24 hours or so, once the various mailservers retry their connection to xymon.com.


Regards,
Henrik
list Josh Luthman · Mon, 27 Aug 2012 11:21:32 -0400 ·
Thanks for the notice!

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX
quoted from Henrik Størner


On Mon, Aug 27, 2012 at 11:19 AM, Henrik Størner <user-ce4a2c883f75@xymon.invalid> wrote:
Hi,

the xymon.com server had a minor disk "hiccup" last Saturday.
Unfortunately, this triggered a kernel panic and things went pretty bad
after that - eventually causing the whole server to die last Monday Aug.
20th.

Unfortunately, I was 4000 km away - and even though I did manage to get an
SSH session opened to the server, all attempts to reboot it remotely just
gave me a "Bus error".

My apologies for the inconvenience, it was a classic case of Murphy's Law
that everything will go bad, at the worst possible time.

I expect the mails submitted to the mailing list will show up over the
next 24 hours or so, once the various mailservers retry their connection to
xymon.com


Regards,
Henrik
list Henrik Størner · Tue, 28 Aug 2012 13:30:22 +0200 ·
quoted from Henrik Størner
On 27-08-2012 17:19, Henrik Størner wrote:
the xymon.com server had a minor disk "hiccup" last Saturday.
Unfortunately, this triggered a kernel panic and things went pretty bad
after that - eventually causing the whole server to die last Monday Aug.
20th.

Turns out it was more than just a hiccup: I was bitten by a firmware bug in my Crucial M4 SSD, described at http://forum.crucial.com/t5/Solid-State-Drives-SSD/Firmware-Update-Notifications/m-p/80282#M24370

"an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point."

If any of you have Crucial M4 SSDs in use, I'd recommend checking the firmware version ASAP: it should be version 0309 or 000F. Running "smartctl -a" against the drive on Linux will show it.
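
For anyone with more than a couple of drives to check, here is a rough Python sketch of the same check. It is only an illustration, not a tested tool: it assumes smartctl is installed and run with sufficient privileges, that the drive reports the usual ATA "Device Model:" and "Firmware Version:" lines, and that attribute 9 (Power_On_Hours) carries the hour count as the first number of its raw value. The function names and the /dev/sda default are made up for the example.

```python
import re
import subprocess
import sys

# Firmware versions named in the message above as the ones to look for.
FIXED_VERSIONS = {"0309", "000F"}


def parse_smartctl(output):
    """Extract model, firmware version and power-on hours from `smartctl -a` text.

    Assumes the usual ATA layout. Fields that cannot be found stay None.
    """
    info = {"model": None, "firmware": None, "power_on_hours": None}
    for line in output.splitlines():
        if line.startswith("Device Model:"):
            info["model"] = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware Version:"):
            info["firmware"] = line.split(":", 1)[1].strip()
        else:
            # Attribute rows have ten columns; the raw value starts at the tenth.
            fields = line.split()
            if len(fields) >= 10 and fields[0] == "9" and fields[1] == "Power_On_Hours":
                match = re.match(r"\d+", fields[9])
                if match:
                    info["power_on_hours"] = int(match.group())
    return info


def firmware_ok(info):
    """True if the reported firmware is one of the versions listed above."""
    return info.get("firmware") in FIXED_VERSIONS


if __name__ == "__main__":
    # Hypothetical default device; pass the real one as the first argument.
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    # smartctl's exit status is a bitmask, so a non-zero code is not treated as fatal here.
    result = subprocess.run(["smartctl", "-a", device], capture_output=True, text=True)
    info = parse_smartctl(result.stdout)
    print("Model:          ", info["model"])
    print("Firmware:       ", info["firmware"])
    print("Power-on hours: ", info["power_on_hours"])
    if not firmware_ok(info):
        print("Firmware is not 0309 or 000F; if this is a Crucial M4, check for an update.")
```

The script does not try to recognise the drive model itself, so the final warning only means something if the disk really is a Crucial M4.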


Regards,
Henrik