Xymon Mailing List Archive

hobbitlaunch segfault / timewarp happend again

list Sebastian Auriol
Thu, 13 Nov 2008 19:30:12 -0000
Message-Id: <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAAC4AAAAAAAAAvrTCuIsz1xGxNWiNbQAAAAEAUR1BVCzq0xGwsQCgzFqsgwAAAAGTZwAAEAAAADbHGk2SnXpAuOr40Qr9fPMBAAAAAA==@syntec.co.uk>

Hi Henrik,

I suffered from the problem referred to at the start of this thread
(originally reported at http://www.hswn.dk/hobbiton/2008/01/msg00570.html),
except that it hit the server hobbitlaunch rather than the hobbit-client
hobbitlaunch. It happened when the UK changed from BST to GMT on Oct 26
(core dump at 01:22). The hobbit server runs k9linux and receives NTP
broadcasts; the clock went back by one hour, hobbit crashed, and
hobbitlaunch did not bring it back. Someone actually drove over a hundred
miles to reset the server on the Sunday morning (although it could have been
done remotely).

There was another core dump at the same time (01:22)...

# gdb /home/hobbit/server/bin/hobbitd /home/hobbit/server/core.17286
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db
library "/lib/tls/libthread_db.so.1".

Core was generated by `hobbitd --pidfile=/var/log/hobbit/hobbitd.pid
--restart=/usr/local/hobbit/serve'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpcre.so.0...done.
Loaded symbols for /lib/libpcre.so.0
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libssl.so.4...done.
Loaded symbols for /lib/libssl.so.4
Reading symbols from /lib/libcrypto.so.4...done.
Loaded symbols for /lib/libcrypto.so.4
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /usr/lib/libgssapi_krb5.so.2...done.
Loaded symbols for /usr/lib/libgssapi_krb5.so.2
Reading symbols from /usr/lib/libkrb5.so.3...done.
Loaded symbols for /usr/lib/libkrb5.so.3
Reading symbols from /lib/libcom_err.so.2...done.
Loaded symbols for /lib/libcom_err.so.2
Reading symbols from /usr/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/lib/libk5crypto.so.3
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  errprintf (fmt=0x8064f8c "Time warp detected: Adjusting returned clock
by %d seconds\n") at errormsg.c:42
42              time_t now = getcurrenttime(NULL);
(gdb)
(gdb)
(gdb)
(gdb) bt
#0  errprintf (fmt=0x8064f8c "Time warp detected: Adjusting returned clock
by %d seconds\n") at errormsg.c:42
#1  0x0805ddf7 in getcurrenttime (retparm=0x0) at timefunc.c:73
#2  0x08058097 in errprintf (fmt=0x8064f8c "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#3  0x0805ddf7 in getcurrenttime (retparm=0x0) at timefunc.c:73
#4  0x08058097 in errprintf (fmt=0x8064f8c "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#5  0x0805ddf7 in getcurrenttime (retparm=0x0) at timefunc.c:73
#6  0x08058097 in errprintf (fmt=0x8064f8c "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#7  0x0805ddf7 in getcurrenttime (retparm=0x0) at timefunc.c:73
#8  0x08058097 in errprintf (fmt=0x8064f8c "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#9  0x0805ddf7 in getcurrenttime (retparm=0x0) at timefunc.c:73

Etc.  It continues repeating those two lines for a very long time in the
backtrace.

FYI, here is the backtrace for the core dump similar to the referenced
thread:

# gdb /home/hobbit/server/bin/hobbitlaunch /home/hobbit/server/core.12456
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db
library "/lib/tls/libthread_db.so.1".

Core was generated by `/home/hobbit/server/bin/hobbitlaunch
--config=/home/hobbit/server/etc/hobbitlau'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  errprintf (fmt=0x80558cc "Time warp detected: Adjusting returned clock
by %d seconds\n") at errormsg.c:42
42              time_t now = getcurrenttime(NULL);
(gdb)
(gdb)
(gdb) bt
#0  errprintf (fmt=0x80558cc "Time warp detected: Adjusting returned clock
by %d seconds\n") at errormsg.c:42
#1  0x0804e8d3 in getcurrenttime (retparm=0x0) at timefunc.c:73
#2  0x0804b833 in errprintf (fmt=0x80558cc "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#3  0x0804e8d3 in getcurrenttime (retparm=0x0) at timefunc.c:73
#4  0x0804b833 in errprintf (fmt=0x80558cc "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#5  0x0804e8d3 in getcurrenttime (retparm=0x0) at timefunc.c:73
#6  0x0804b833 in errprintf (fmt=0x80558cc "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#7  0x0804e8d3 in getcurrenttime (retparm=0x0) at timefunc.c:73
#8  0x0804b833 in errprintf (fmt=0x80558cc "Time warp detected: Adjusting
returned clock by %d seconds\n") at errormsg.c:42
#9  0x0804e8d3 in getcurrenttime (retparm=0x0) at timefunc.c:73

Etc.  It continues repeating those two lines for a very long time in the
backtrace.

Could you please check the patch that Darin Dugan made and sent to the list
on the 10th April 2008 (http://www.hswn.dk/hobbiton/2008/04/msg00136.html),
and after any changes (or fixes to your original patch) commit to svn trunk
and 4.2 branches and incorporate into 4.2.1?  I would like to have a Henrik
certified patch (TM) for this!  ;) Your original patch is at
http://www.hswn.dk/hobbiton/2008/01/msg00581.html for reference.  Also, my
core dumps are from a May 22nd snapshot of 4.3 so I suppose you never merged
your first patch - my diff suggests not.

Many thanks,

SebA

-----Original Message-----
From: Alexander Keller [mailto:user-1c695ed38cb2@xymon.invalid] 
Sent: 13 April 2008 19:52
To: Dugan, Darin D [EIT]
Subject: Re: [hobbit] hobbitlaunch segfault / timewarp happend again

Hi,

Looks great. I applied your patch on a test system. So far it works
perfectly for me.

It would be great if Henrik could apply your patch.

Thanks!
 Alexander

I recently brought up a new client that has trouble keeping accurate
time...so I began encountering this time warp issue. As pointed out by
Henrik in January, it is definitely an infinite loop where errprintf()
calls getcurrenttime() to get its timestamp.

The attached patch modifies the functions in errormsg.c to use time()
instead of getcurrenttime(). That avoids any recursion-infinite loop
problems, and logs or prints errors with the system's actual time
instead of a Hobbit-adjusted-for-sanity time. In the absence of accurate
time, I think it would be best to log in the system's time so that you
can correlate Hobbit logs with other system logs.

Working for me so far, but use at your own risk. Comments?

Cheers.
Darin Dugan
user-b33a1547d27a@xymon.invalid
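
For illustration only, the idea Darin describes (timestamping log lines with
time() rather than getcurrenttime()) could be sketched roughly as below. This
is not the attached patch and not the Hobbit source: the function name
log_with_system_time(), the timestamp format, and the choice to write into a
caller-supplied buffer are all assumptions made to keep the example small.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical stand-in for errprintf(); not the actual Hobbit code.
 * The timestamp comes straight from time(), so formatting a log line
 * can never re-enter the time-warp-adjusting clock function.
 * Returns the number of characters written, or -1 if the buffer is too small.
 */
static int log_with_system_time(char *out, size_t outsz, const char *fmt, ...)
{
    char stamp[32];
    time_t now = time(NULL);              /* raw system clock, no adjustment */
    struct tm *tm = localtime(&now);
    va_list args;
    int n, m;

    assert(out != NULL && outsz > 0);

    /* Fall back to a fixed marker if the time cannot be formatted. */
    if (tm == NULL || strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", tm) == 0)
        strcpy(stamp, "unknown-time");

    n = snprintf(out, outsz, "%s ", stamp);
    if (n < 0 || (size_t)n >= outsz) return -1;

    va_start(args, fmt);
    m = vsnprintf(out + n, outsz - (size_t)n, fmt, args);
    va_end(args);
    if (m < 0 || (size_t)m >= outsz - (size_t)n) return -1;

    return n + m;
}
```

Because the timestamp is the unadjusted system time, a line produced this way
can be matched against other system logs, which is the trade-off Darin argues
for above.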
-----Original Message-----
From: Alexander Keller [mailto:user-1c695ed38cb2@xymon.invalid] 
Sent: Friday, March 21, 2008 10:22 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] hobbitlaunch segfault / timewarp happend again

Hi,

Unfortunately nobody answered my posting, so I did a quick'n dirty
hack to prevent timewarp segfaults in hobbitlaunch.

Just comment out the errprintf statement in lib/timefunc.c:

  if (timewarphappened) {
          /*
           * Tell the world about it.
           * Must do this AFTER changing timewarp and lastresult,
           * or we will start an endless loop triggering a stack
           * overflow because errprintf() calls getcurrenttime().
           */
          /*
           * **** prevent segfault: do not log time warp. ****
           * errprintf("Time warp detected: Adjusting returned clock by %d seconds\n", timewarp);
           */
  }

This is not a real solution, but it works for me. Maybe there is
somebody out there who can fix this issue properly.

Regards
 Alexander
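
The comment in that snippet says the warp must be logged only after timewarp
and lastresult have been updated. A minimal sketch of why that ordering
matters is given below. It is an assumption-laden model, not the timefunc.c
code: adjusted_time(), report_warp(), and fake_clock are hypothetical names,
and fake_clock replaces time(NULL) so that the backward step can be triggered
on demand. The backtraces in this thread show that the affected builds still
recursed, so the real code evidently did not keep this invariant on every
path.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical names throughout; this only models the idea. */
static time_t fake_clock = 100000;   /* stands in for time(NULL) */
static time_t lastresult = 0;        /* last value handed to a caller */
static time_t timewarp = 0;          /* correction added to the raw clock */
static int warp_reports = 0;         /* number of warps that were logged */
static int depth = 0, max_depth = 0; /* nesting bookkeeping for the demo */

static time_t adjusted_time(void);

/* Models a logger that timestamps with the adjusted clock. */
static void report_warp(time_t seconds)
{
    (void)seconds;
    (void)adjusted_time();           /* nested call, as in the backtrace */
    warp_reports++;
}

static time_t adjusted_time(void)
{
    time_t now, correction = 0;

    depth++;
    if (depth > max_depth) max_depth = depth;
    assert(depth < 8);               /* trips long before a stack overflow */

    now = fake_clock + timewarp;
    if (lastresult != 0 && now < lastresult) {
        /* Clock went backwards: absorb the step into the correction. */
        correction = lastresult - now;
        timewarp += correction;
        now = lastresult;
    }
    lastresult = now;

    /*
     * State is consistent from here on, so the nested call made by the
     * logger sees no warp and returns without logging again.
     */
    if (correction != 0) report_warp(correction);

    depth--;
    return now;
}
```

With the state updated first, a one-hour backward step produces exactly one
report and one level of nesting. Moving the report_warp() call above the
updates makes every nested call detect the same warp again, which is the
endless errprintf()/getcurrenttime() chain seen in the core dumps.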

Hi Henrik,

In January I reported a segfault with hobbitlaunch/timefunc.c. You quickly
provided a patch...

Now I'm having a new error - see core dump:

/opt/hobbit/client# gdb bin/hobbitlaunch core
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db
library "/lib/tls/i686/cmov/libthread_db.so.1".

Core was generated by `./bin/hobbitlaunch --config=./etc/clientlaunch.cfg --log=./logs/clientlaunch.lo'.
Program terminated with signal 11, Segmentation fault.

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/i686/cmov/libc.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
42              time_t now = getcurrenttime(NULL);
(gdb) bt
#0  errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
#1  0x0804f125 in getcurrenttime (retparm=0x0) at timefunc.c:73
#2  0x0804b9e0 in errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
#3  0x0804f125 in getcurrenttime (retparm=0x0) at timefunc.c:73
#4  0x0804b9e0 in errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
#5  0x0804f125 in getcurrenttime (retparm=0x0) at timefunc.c:73
#6  0x0804b9e0 in errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
#7  0x0804f125 in getcurrenttime (retparm=0x0) at timefunc.c:73
#8  0x0804b9e0 in errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
#9  0x0804f125 in getcurrenttime (retparm=0x0) at timefunc.c:73
#10 0x0804b9e0 in errprintf (fmt=0x6b <Address 0x6b out of bounds>) at errormsg.c:42
[...]

I can reproduce the error with "ntpdate" using a misconfigured ntp server
(2 min in the past):

1. start hobbit client "runclient.sh start"
2. sync time with "ntpdate <misconfigured-time-server>"
3. get a core dump

Regards,
 Alexander