Xymon Mailing List Archive

hobbitd_client core dumps

10 messages in this thread

list Brian Lynch · Thu, 11 Aug 2005 12:04:21 -0700 ·
Since my upgrade to 4.1.1, I've had a problem with the hobbitd_client crashing at least 3-4 times a day. The core files are generated in hobbit/server/tmp and the process is restarted. An alert is also sent under the test name 'hobbitd_client'. Here is the stack trace from the latest core file. Please note that the server name has been masked after the fact. An interesting side note is that it always seems to dump on the same client server. Note that the client is running the new Hobbit software. 
Also, I recently made a change to increase the max message size to 800,000 bytes. 
[root@sac-pmon-01 tmp]# gdb ../bin/hobbitd_client core.19313
GNU gdb Red Hat Linux (6.1post-1.20040607.41rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
#1 0x0000003b1a82fc8e in abort () from /lib64/tls/libc.so.6
#2 0x000000000040c9a3 in sigsegv_handler (signum=19313) at sig.c:57
#3 <signal handler called>
#4 0x0000003b1a86eab0 in strchr () from /lib64/tls/libc.so.6
#5 0x00000000004045bb in handle_solaris_client (
hostname=0x513a8c "wal-ddbs-01.x.x.x.com", hinfo=0x6e5370, sender=0x3d <Address 0x3d out of bounds>, timestamp=4252624, clientdata=0x0) at solaris.c:62
#6 0x0000000000405079 in main (argc=5323443, argv=0x7fffffffd348) at hobbitd_client.c:807
(gdb) 

[root@sac-pmon-01 tmp]# gdb ../bin/hobbitd_client core.11307
GNU gdb Red Hat Linux (6.1post-1.20040607.41rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
#1 0x0000003b1a82fc8e in abort () from /lib64/tls/libc.so.6
#2 0x000000000040c9a3 in sigsegv_handler (signum=11307) at sig.c:57
#3 <signal handler called>
#4 0x0000003b1a86eab0 in strchr () from /lib64/tls/libc.so.6
#5 0x00000000004045bb in handle_solaris_client (
hostname=0x513a8c "wal-ddbs-01.x.x.x.com", hinfo=0x6d8f70, sender=0x3d <Address 0x3d out of bounds>, timestamp=0, clientdata=0x0) at solaris.c:62
#6 0x0000000000405079 in main (argc=5323443, argv=0x7fffffffd348) at hobbitd_client.c:807
(gdb) 
[root@sac-pmon-01 tmp]# gdb ../bin/hobbitd_client core.10241
GNU gdb Red Hat Linux (6.1post-1.20040607.41rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003b1a82e4dd in raise () from /lib64/tls/libc.so.6
#1 0x0000003b1a82fc8e in abort () from /lib64/tls/libc.so.6
#2 0x000000000040c9a3 in sigsegv_handler (signum=10241) at sig.c:57
#3 <signal handler called>
#4 0x0000003b1a86eab0 in strchr () from /lib64/tls/libc.so.6
#5 0x00000000004045bb in handle_solaris_client (
hostname=0x513a8c "wal-ddbs-01.x.x.x.com", hinfo=0x6e1b90, sender=0x3d <Address 0x3d out of bounds>, timestamp=-64, clientdata=0x0) at solaris.c:62
#6 0x0000000000405079 in main (argc=5323443, argv=0x7fffffffd348) at hobbitd_client.c:807
(gdb)
list David Gore · Thu, 11 Aug 2005 19:42:52 +0000 ·
I am having a similar problem.  I am currently running the latest snapshot.  I cannot remember how far back the problem goes.  I was going to grab a hobbitd_client core trace, but hobbitd is coring too, overwriting the hobbitd_client core.

~David
list Henrik Størner · Thu, 11 Aug 2005 23:14:18 +0200 ·
On Thu, Aug 11, 2005 at 12:04:21PM -0700, Brian Lynch wrote:
> Since my upgrade to 4.1.1, I've had a problem with the hobbitd_client crashing at least 3-4 times a day. The core files are generated in hobbit/server/tmp and the process is restarted. An alert is also sent under the test name 'hobbitd_client'. Here is the stack trace from the latest core file. Please note that the server name has been masked after the fact. An interesting side note is that it always seems to dump on the same client server. Note that the client is running the new Hobbit software.

Ouch, you've uncovered an embarrassing bit of sloppy programming.
Forgetting to verify your pointers before using them is bad for
stability. A new snapshot will be done in an hour's time, with a set of
fixes for this.

> Also, I recently made a change to increase the max message size to 800,000 bytes.

No problem. The current snapshot has bumped the max. size for a
client message to 1 MB, which I think should be adequate for most
systems.


Henrik
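
The backtraces earlier in the thread end in strchr() called from handle_solaris_client(), with clientdata=0x0 among the arguments. The following is a minimal sketch of the kind of pointer check Henrik describes. It is not the code from the actual snapshot; the helper name first_line_end() is hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the Hobbit source): return a pointer to the
 * end of the first line in buf, or NULL when buf itself is NULL.
 * Passing NULL to strchr() is undefined behaviour and typically ends in
 * SIGSEGV, so the pointer is verified before it is used.
 */
const char *first_line_end(const char *buf)
{
    const char *eol;

    if (buf == NULL) return NULL;   /* e.g. a section missing from the message */

    eol = strchr(buf, '\n');
    if (eol == NULL) eol = buf + strlen(buf);   /* last line without a newline */
    return eol;
}
```

A caller would treat a NULL result as "section not present" and skip the report instead of scanning further.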
list Brian Lynch · Thu, 11 Aug 2005 15:10:29 -0700 ·
Thanks, Henrik! 
- Brian
list Brian Lynch · Thu, 11 Aug 2005 16:18:48 -0700 ·
Henrik,
Just launched the snapshot from 8/12/05 00:00 and the client max size is ringing up at 256KB. It looks like the SHAREDBUFSZ_STD is being used for the standard BB 'status' messages. Is that correct behavior? 
- Brian
list Henrik Størner · Fri, 12 Aug 2005 07:48:43 +0200 ·
On Thu, Aug 11, 2005 at 04:18:48PM -0700, Brian Lynch wrote:
> Henrik,
> Just launched the snapshot from 8/12/05 00:00 and the client max size is
> ringing up at 256KB. It looks like the SHAREDBUFSZ_STD is being used for the
> standard BB 'status' messages. Is that correct behavior?

Yes, the "status" channel has a 256 KB buffer. But the "client" channel,
which is the one used for data fed to us by the Hobbit client, is 1 MB.

The "client" channel is bigger, because it needs to handle large
ps-listings combined with all of the other client output (top,
df etc.)

Do you really have a cpu-, disk-, or procs-column (or any other
individual status) that needs more than 256 KB data in it?


Regards,
Henrik
list Brian Lynch · Fri, 12 Aug 2005 09:28:30 -0700 ·
Henrik,
We are running the Solaris sar script from deadcat that dumps anywhere from 200K to 900K of data via the 'status' channel. And thanks for the fix yesterday; the hobbitd_client channel has not core dumped since I put the latest snapshot in place.

Best,
Brian
list Henrik Størner · Sat, 13 Aug 2005 18:10:58 +0200 ·
On Fri, Aug 12, 2005 at 09:28:30AM -0700, Brian Lynch wrote:
> We are running the Solaris sar script from deadcat that dumps anywhere from
> 200K to 900K data via the 'status' channel.

Yikes! That's a lot more than I thought would go in a status message.

I've now made these settings configurable in hobbitserver.cfg,
instead of having to change the source and re-compile. The default
for the status channel is still 256 kB, but you just add
  MAXMSG_STATUS="1024"
to your hobbitserver.cfg, and it will use that instead.
> the hobbitd_client channel has not core dumped since I put the
> latest snapshot in place.
Good to know.


Thanks,
Henrik
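
For readers wondering how a setting like MAXMSG_STATUS="1024" could turn into a buffer size, here is a rough sketch. It assumes the value is given in kB (as the 256 kB default and the example above suggest) and that it reaches the program as a plain string, for instance through the environment. The function name bufsz_from_setting() is hypothetical and is not taken from the Hobbit source.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a size setting given in kB into bytes.
 * Falls back to default_kb when the setting is missing, empty, not a plain
 * positive decimal number, or too large to express in bytes.
 */
size_t bufsz_from_setting(const char *val, unsigned long default_kb)
{
    unsigned long kb = default_kb;

    /* Require a leading digit so that "", "-5" and " 12" are rejected. */
    if (val != NULL && isdigit((unsigned char)val[0])) {
        char *end = NULL;
        unsigned long parsed;

        errno = 0;
        parsed = strtoul(val, &end, 10);
        if (errno == 0 && *end == '\0' && parsed > 0 &&
            parsed <= (size_t)-1 / 1024) {
            kb = parsed;
        }
    }
    return (size_t)kb * 1024;
}
```

Under these assumptions a caller could write bufsz_from_setting(getenv("MAXMSG_STATUS"), 256), which keeps the 256 kB default whenever the variable is absent or malformed.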
list Frédéric Mangeant · Tue, 16 Aug 2005 10:30:59 +0200 ·
Henrik Stoerner wrote:
> I've now made these settings configurable in hobbitserver.cfg,
> instead of having to change the source and re-compile. The default
> for the status channel is still 256 kB, but you just add
>   MAXMSG_STATUS="1024"
> to your hobbitserver.cfg, and it will use that instead.

Hi Henrik

Thanks for making all these sizes configurable. However, the MAXLINE
variable still appears in the default hobbitserver.cfg (with a value of
32768). Is it still used?

-- 

Frédéric Mangeant

Steria EDC Sophia-Antipolis
list Brian Lynch · Tue, 16 Aug 2005 10:24:31 -0700 ·
Henrik,
I spoke too soon on the core dump. I just had another one go this morning. Here is the stack trace (with the domain commented out). 

Cheers,
Brian

[root@sac-pmon-02 tmp]# gdb ../bin/hobbitd_client core.29614
GNU gdb Red Hat Linux (6.1post-1.20040607.41rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libpcre.so.0...done.
Loaded symbols for /usr/local/lib/libpcre.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2

#0 0x000000340232e4dd in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x000000340232e4dd in raise () from /lib64/tls/libc.so.6
#1 0x000000340232fc8e in abort () from /lib64/tls/libc.so.6
#2 0x000000000040d723 in sigsegv_handler (signum=29614) at sig.c:57
#3 <signal handler called>
#4 0x0000000000402c86 in unix_disk_report (
hostname=0x514aca "wal-dapp-02.x.x.x.com", hinfo=0x72ab40, fromline=0x7fffffffca70 "\nStatus message received from 10.3.3.146\n", timestr=0x514b26 "Tue Aug 16 15:38:43 GMT 2005", capahdr=0x40f036 "Capacity", mnthdr=0x40f02e "Mounted", dfstr=0x514cc6 "Filesystem", ' ' <repeats 11 times>, "1024-blocks Used Available Capacity Mounted on\n/dev/md/dsk/d0", ' ' <repeats 11 times>, "3010671 1515862 1434596 52% /\n/dev/md/dsk/d6", ' ' <repeats 11 times>, "1988887 1219506 709"...) at hobbitd_client.c:299
#5 0x000000000040478b in handle_solaris_client (
hostname=0x514aca "wal-dapp-02.x.x.x.com", hinfo=0x72ab40, sender=0x2aaaaabca268 "\2009@\0024", timestamp=0, clientdata=0x0) at solaris.c:52
#6 0x0000000000405e2a in main (argc=5327601, argv=0x7fffffffd258)
at hobbitd_client.c:827
(gdb) 
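
The trace alone does not show which pointer was bad inside unix_disk_report(). Since its arguments include the df output (dfstr) and the header names "Capacity" and "Mounted", a defensive way to locate a header column might look like the sketch below. The function find_df_column() is hypothetical, is not taken from hobbitd_client.c, and makes no claim about the real cause of this crash.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the zero-based index of the whitespace-separated
 * column in the FIRST line of dfstr whose name starts with hdr.
 * Returns -1 when either pointer is NULL, hdr is empty, or no column matches,
 * so the caller can bail out instead of dereferencing a bad pointer.
 */
int find_df_column(const char *dfstr, const char *hdr)
{
    const char *p, *eol;
    size_t hdrlen;
    int col = 0;

    if (dfstr == NULL || hdr == NULL || *hdr == '\0') return -1;

    hdrlen = strlen(hdr);
    eol = strchr(dfstr, '\n');                     /* only scan the header line */
    if (eol == NULL) eol = dfstr + strlen(dfstr);

    p = dfstr;
    while (p < eol) {
        size_t toklen = 0;

        while (p < eol && (*p == ' ' || *p == '\t')) p++;   /* skip blanks */
        if (p == eol) break;

        while (p + toklen < eol && p[toklen] != ' ' && p[toklen] != '\t') toklen++;

        if (toklen >= hdrlen && strncmp(p, hdr, hdrlen) == 0) return col;

        p += toklen;
        col++;
    }
    return -1;
}
```

With the header line shown in the trace, "Capacity" would be column 4 and "Mounted" column 5; a result of -1 signals that the report should be skipped for that message.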