Xymon Mailing List Archive

Hobbitd crashing

9 messages in this thread

list Elizabeth Schwartz · Sat, 21 May 2011 01:58:05 -0400 ·
I was playing around with the hobbit-clients.cfg file, trying to create a
LOG rule to ignore this alert:
May 21 00:54:21 redirect1-bo3.dl2.example.com monit[2029]: [ID 111343
daemon.error] 'gmond-sample.xml' timestamp test failed for
/usr/local/Ganglia/logs/gmond-sample.xml

I *think* the rule that put it into conniptions was
HOST=%redirect.*bo3.dl2.example.com
LOG /var/adm/messages COLOR=yellow IGNORE=%(repeated|gmond|monit|puppetd)

Also, I am experiencing something I've seen a few other times this
week - a service that is not reporting, that was signed out, stays
blue even when signed back in. I can't get rid of the xymond_client
blue. Where is blue status stored? (it does not appear as blue on the
enable/disable page but I have a blue dot on the host page and a blue
report when I drill in)

[xymon@netmon2 server]$ gdb bin/hobbitd_client tmp/core.24453
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /u1/xymon/server/bin/hobbitd_client...done.
Reading symbols from /lib64/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpcre.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0  0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003833430265 in raise () from /lib64/libc.so.6
#1  0x0000003833431d10 in abort () from /lib64/libc.so.6
#2  0x0000000000427133 in sigsegv_handler (signum=<value optimized
out>) at sig.c:57
#3  <signal handler called>
#4  0x00000000004179f6 in scan_log (hinfo=0x1679440,
    classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b
"/var/adm/messages",
    logdata=0x2b9863aee45e "May 21 00:57:25
redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21
00:57:35 redirect2-bo3.dl2.example.com monit[10418]: [ID 111343
daemon.error] 'gmond-sample.xml' timestamp test fa"...,
    section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
#5  0x0000000000408d0a in msgs_report (
    hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com",
    clientclass=0x2b9863ae507e "sunos", os=<value optimized out>,
hinfo=0x1679440,
    fromline=0x7fff00bf2c50 "\nStatus message received from 10.200.32.51\n",
    timestr=0x2b9863ae50be "Sat May 21 01:11:24 EDT 2011", msgsstr=0x0)
    at xymond_client.c:1221
#6  0x000000000040fd6a in handle_solaris_client (
    hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com",
    clienttype=0x2b9863ae507e "sunos", os=OS_SOLARIS, hinfo=0x1679440,
    sender=<value optimized out>, timestamp=<value optimized out>,
    clientdata=0x2b9863ae5085 "client
redirect2-bo3,dl2,example,com.sunos sunos\n[date") at
client/solaris.c:69
#7  0x0000000000411e5f in main (argc=<value optimized out>, argv=0x7fff00bf3368)
    at xymond_client.c:2199
list Henrik Størner · Sat, 21 May 2011 09:00:32 +0200 ·
Hi Elizabeth,
quoted from Elizabeth Schwartz
I was playing around with hobbit-clients.cfg [...]

Which version of Xymon is this? Since you're referring to
hobbit-clients.cfg and hobbitd_client, I assume it is 4.2.something, but
that doesn't match some of the line numbers.

So I'll assume it's 4.3.something - the interesting line hasn't changed 
between the 4.3.x releases:
quoted from Elizabeth Schwartz
[xymon@netmon2 server]$ gdb bin/hobbitd_client tmp/core.24453
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
#2  0x0000000000427133 in sigsegv_handler (signum=<value optimized
out>) at sig.c:57
#3  <signal handler called>
#4  0x00000000004179f6 in scan_log (hinfo=0x1679440,
     classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b
"/var/adm/messages",
     logdata=0x2b9863aee45e "May 21 00:57:25
redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21
00:57:35 redirect2-bo3.dl2.example.com monit[10418]: [ID 111343
daemon.error] 'gmond-sample.xml' timestamp test fa"...,
     section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
#5  0x0000000000408d0a in msgs_report (
     hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com",
     clientclass=0x2b9863ae507e "sunos", os=<value optimized out>,
hinfo=0x1679440,
     fromline=0x7fff00bf2c50 "\nStatus message received from 10.200.32.51\n",
     timestr=0x2b9863ae50be "Sat May 21 01:11:24 EDT 2011", msgsstr=0x0)
     at xymond_client.c:1221

Looking at xymond/client_config.c, line 2491 reads:

    /* Next, check for a match anywhere in the data*/
    if (!patternmatch(logdata, rule->rule.log.matchexp->pattern,
                      rule->rule.log.matchexp->exp)) continue;

So I'd like to know a bit more about the state of some of those 
variables. Could you go back into gdb and then instead of getting the 
callstack, run these three commands:

    p rule
    p *rule
    p *(rule->rule.log.matchexp)

If I'm unlucky, the "rule" variable will have been optimized out....
quoted from Elizabeth Schwartz
Also, I am experiencing something I've seen a few other times this
week - a service that is not reporting, that was signed out, stays
blue even when signed back in.

A blue status won't change to another color until it gets a status
update (red, yellow or green).

quoted from Elizabeth Schwartz
I can't get rid of the xymond_client blue.

The xymond_client status shows up because you had a crash of the
xymond_client module. Use
    xymon 127.0.0.1 "drop YOURXYMONSERVER xymond_client"
to get rid of it.


Regards,
Henrik
list Elizabeth Schwartz · Sat, 21 May 2011 13:23:34 -0400 ·
Thanks Henrik!
I'm running 4.3.2.

(I tried sending a green status to get rid of the blue, and after some
hours it turned purple, waking us up again.)
quoted from Henrik Størner
  p rule
  p *rule
  p *(rule->rule.log.matchexp)

Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0  0x0000003833430265 in raise () from /lib64/libc.so.6

(gdb) p rule
No symbol "rule" in current context.
(gdb) p *rule
No symbol "rule" in current context.
(gdb) p *(rule->rule.log.matchexp)
No symbol "rule" in current context.

(I can recompile with other flags if you point me to it)

thanks much
Betsy
list Elizabeth Schwartz · Sat, 21 May 2011 13:59:06 -0400 ·
PS: just to be clear, doing the *drop* did work. I'd tried the green
status last night.

On Sat, May 21, 2011 at 1:23 PM, Elizabeth Schwartz
<user-c61747246f66@xymon.invalid> wrote:
quoted from Elizabeth Schwartz
Thanks Henrik!
I'm running 4.3.2

(tried sending a green status to get rid of the blue, and after some
hours it turned purple, waking us up again)
list Henrik Størner · Sat, 21 May 2011 22:39:59 +0200 ·
quoted from Elizabeth Schwartz
   p rule
   p *rule
   p *(rule->rule.log.matchexp)
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0  0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) p rule
No symbol "rule" in current context.

Doh, sorry. You have to do a "fr 4" first to select that stack frame.

Again, please?


Thanks,
Henrik
list Elizabeth Schwartz · Sat, 21 May 2011 21:13:02 -0400 ·
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0  0x0000003833430265 in raise () from /lib64/libc.so.6

(gdb) fr 4
#4  0x00000000004179f6 in scan_log (hinfo=0x1679440,
    classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b
"/var/adm/messages",
    logdata=0x2b9863aee45e "May 21 00:57:25
redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21
00:57:35 redirect2-bo3.dl2.e-dialog.com monit[10418]: [ID 111343
daemon.error] 'gmond-sample.xml' timestamp test fa"...,
    section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
2491                    if (!patternmatch(logdata,
rule->rule.log.matchexp->pattern, rule->rule.log.matchexp->exp))
continue;
(gdb) p rule
$1 = (c_rule_t *) 0x168ad90
(gdb) p *rule
$2 = {hostexp = 0x168a7c0, exhostexp = 0x0, pageexp = 0x0, expageexp = 0x0,
  classexp = 0x0, exclassexp = 0x0, timespec = 0x0, statustext = 0x0,
  rrdidstr = 0x0, groups = 0x0, ruletype = C_LOG, cfid = 435, flags = 0,
  next = 0x168afa0, rule = {load = {warnlevel = 4.27369127e-38,
paniclevel = 0},
    uptime = {recentlimit = 23637648, ancientlimit = 0}, clock = {
      maxdiff = 23637648}, disk = {fsexp = 0x168ae90, warnlevel = 0,
paniclevel = 0,
      abswarn = 23637712, abspanic = 0, dmin = 4, dmax = 0, dcount =
0, color = 0,
      ignored = 0}, inode = {fsexp = 0x168ae90, warnlevel = 0, paniclevel = 0,
      abswarn = 23637712, abspanic = 0, imin = 4, imax = 0, icount =
0, color = 0,
      ignored = 0}, mem = {memtype = 23637648, warnlevel = 0, paniclevel = 0},
    zos_mem = {zos_memtype = 23637648, warnlevel = 0, paniclevel = 0},
zvse_vsize = {
      warnlevel = 23637648, paniclevel = 0}, zvse_getvis = {partid = 0x168ae90,
      warnlevel = 0, paniclevel = 0, anywarnlevel = 0, anypaniclevel =
0}, cics = {
      applid = 0x168ae90, dsawarnlevel = 0, dsapaniclevel = 0,
edsawarnlevel = 0,
      edsapaniclevel = 0}, asid = {asidtype = 23637648, warnlevel = 0,
      paniclevel = 0}, proc = {procexp = 0x168ae90, pmin = 0, pmax =
0, pcount = 0,
      color = 0}, log = {logfile = 0x168ae90, matchexp = 0x0, matchone = 0x0,
      ignoreexp = 0x168aed0, color = 4}, fcheck = {filename =
0x168ae90, color = 0,
      ftype = 0, minsize = 0, maxsize = 23637712, eqlsize = 4, minlinks = 0,
      maxlinks = 0, eqllinks = 0, fmode = 0, ownerid = 0, groupid = 0,
      ownerstr = 0x0, groupstr = 0x0, minctimedif = 0, maxctimedif = 0,
      ctimeeql = 0, minmtimedif = 0, maxmtimedif = 0, mtimeeql = 0,
minatimedif = 0,
      maxatimedif = 0, atimeeql = 0, md5hash = 0x0, sha1hash = 0x0,
      rmd160hash = 0x0}, dcheck = {filename = 0x168ae90, color = 0,
maxsize = 0,
      minsize = 23637712}, port = {localexp = 0x168ae90, exlocalexp = 0x0,
      remoteexp = 0x0, exremoteexp = 0x168aed0, stateexp = 0x4,
exstateexp = 0x0,
      pmin = 0, pmax = 0, pcount = 0, color = 0}, svc = {svcexp = 0x168ae90,
      stateexp = 0x0, startupexp = 0x0, svcname = 0x168aed0 "\360\256h\001",
      startup = 0x4 <Address 0x4 out of bounds>, state = 0x0, scount = 0,
      color = 0}, paging = {warnlevel = 23637648, paniclevel = 0}, mibval = {
      mibvalexp = 0x168ae90, keyexp = 0x0, color = 0, minval =
23637712, maxval = 4,
      matchexp = 0x0, havetree = 0, valdeftree = 0x0}, rrdds = {rrdkey
= 0x168ae90,
      rrdds = 0x0, column = 0x0, color = 23637712,
      limitval = 1.9762625833649862e-323, limitval2 = 0}, mqqueue = {
      qmgrname = 0x168ae90, qname = 0x0, warnlen = 0, critlen = 0,
      warnage = 23637712, critage = 0}, mqchannel = {qmgrname = 0x168ae90,
      chnname = 0x0, warnstates = 0x0, alertstates = 0x168aed0}}}
(gdb) p *(rule->rule.log.matchexp)
Cannot access memory at address 0x0
(gdb)
list Henrik Størner · Sun, 22 May 2011 15:49:01 +0200 ·
Hi Elizabeth,
quoted from Elizabeth Schwartz
(gdb) p rule
$1 = (c_rule_t *) 0x168ad90

OK.

quoted from Elizabeth Schwartz
(gdb) p *rule
[snip]
       log = {logfile = 0x168ae90, matchexp = 0x0, matchone = 0x0,
       ignoreexp = 0x168aed0, color = 4},
[snip]
(gdb) p *(rule->rule.log.matchexp)
Cannot access memory at address 0x0

Definitely not OK. The LOG check comes without any expression to match the log data against ("matchexp" is a NULL pointer). That explains why it crashes when we try to use the expression in line 2491:

    if (!patternmatch(logdata, rule->rule.log.matchexp->pattern,
             rule->rule.log.matchexp->exp)) continue;

Now, the "matchexp" setting is built from the regex in the LOG statement. If this turns out to be an invalid regex, it should log a line in the xymond_client logfile like

    pcre compile 'your-pattern-here' failed (offset N): <error message>

So could you check if there's such a message in your log?

Still, it is inconvenient that xymond_client crashes because of a configuration error. I'll look into improving that.


Regards,
Henrik
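
The crash diagnosed above can be illustrated with a minimal C sketch, in which a LOG rule whose match expression is NULL is skipped instead of dereferenced. The type and function names (`exprlist_t`, `log_rule_t`, `pattern_found`, `rule_matches`) are hypothetical stand-ins, not Xymon's actual definitions. Only the field names `matchexp`, `ignoreexp` and `pattern` are taken from the gdb output and source line quoted in this thread, and a plain substring search stands in for the PCRE match. This is not necessarily how the problem was addressed in Xymon itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for Xymon's types; only the field names
   matchexp, ignoreexp and pattern come from the thread. */
typedef struct {
    const char *pattern;            /* source text of the expression */
} exprlist_t;

typedef struct {
    const exprlist_t *matchexp;     /* NULL when the LOG rule has no match pattern */
    const exprlist_t *ignoreexp;    /* NULL when the LOG rule has no IGNORE pattern */
} log_rule_t;

/* Stand-in matcher: substring search instead of a PCRE match. */
static int pattern_found(const char *data, const exprlist_t *exp)
{
    return strstr(data, exp->pattern) != NULL;
}

/* Returns 1 if the rule's match expression is found in logdata, 0 otherwise.
   A rule without a match expression is treated as "no match" rather than
   dereferencing the NULL pointer. */
int rule_matches(const char *logdata, const log_rule_t *rule)
{
    assert(logdata != NULL && rule != NULL);
    if (rule->matchexp == NULL) return 0;   /* guard against the NULL seen in gdb */
    return pattern_found(logdata, rule->matchexp);
}
```

With this guard, a rule that carries only an IGNORE pattern simply never matches, so the mistake shows up as a rule that does nothing rather than as a crashed module.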
list Henrik Størner · Sun, 22 May 2011 16:05:33 +0200 ·
Hi,
quoted from Elizabeth Schwartz
I was playing around with hobbit-clients.cfg file trying to create a
LOG rule to ignore this alert:
May 21 00:54:21 redirect1-bo3.dl2.example.com monit[2029]: [ID 111343
daemon.error] 'gmond-sample.xml' timestamp test failed for
/usr/local/Ganglia/logs/gmond-sample.xml

I *think* the rule that put it into conniptions was
HOST=%redirect.*bo3.dl2.example.com
LOG /var/adm/messages COLOR=yellow IGNORE=%(repeated|gmond|monit|puppetd)

This would trigger it, because there is no match pattern, only an
ignore pattern.

If you want to match all lines, use something like

    LOG /var/adm/messages %. COLOR=yellow IGNORE=%...


Regards,
Henrik
list Elizabeth Schwartz · Sun, 22 May 2011 11:56:06 -0400 ·
Thank you! I've fixed the offending test. Not knowing how to make it
stop being blue/purple was the biggest problem, which you've also
answered.

quoted from Henrik Størner
 pcre compile 'your-pattern-here' failed (offset N): <error message>

No errors with pcre anywhere in the log directory (possibly because it
crashed first).
I don't have a xymond_client log; I do have a hobbitclient.log, but it's empty.

Ideally what I want to do is this:
     Ignore *any* errors from lpr
     Go yellow on any level of error from puppet or gmond
     Go yellow on any other "warning" message
     Go red on any other "error" message

I think I need to use "greedier" regexps for the match that include
white space - am I understanding correctly that if I match on %ERROR
then only the string "error" gets passed to the IGNORE statement?

thanks Betsy
PS I would enjoy seeing other people's LOG tests if anyone has good ones
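
On the question of what the IGNORE pattern sees, here is a sketch of line-by-line filtering. It rests on an assumption that this thread does not confirm: that both the match pattern and the IGNORE pattern are tested against the whole log line, so a line picked up by the match pattern is dropped whenever the ignore pattern matches anywhere in it. The function name `count_reported_lines` is hypothetical, and POSIX extended regexes (`regcomp`/`regexec`) stand in for the PCRE patterns Xymon uses, so syntax details may differ.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Count the lines in logdata that match `match` and do not match `ignore`
   (ignore may be NULL). Both patterns are tested against the whole line.
   Case-insensitive matching is an assumption of this sketch.
   Returns -1 if a pattern fails to compile or memory runs out. */
int count_reported_lines(const char *logdata, const char *match, const char *ignore)
{
    regex_t mre, ire;
    int count = 0;

    assert(logdata != NULL && match != NULL);
    if (regcomp(&mre, match, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0) return -1;
    if (ignore && regcomp(&ire, ignore, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0) {
        regfree(&mre);
        return -1;
    }

    /* Work on a private copy so lines can be terminated in place. */
    char *copy = malloc(strlen(logdata) + 1);
    if (copy == NULL) {
        regfree(&mre);
        if (ignore) regfree(&ire);
        return -1;
    }
    strcpy(copy, logdata);

    for (char *line = copy; line != NULL; ) {
        char *eol = strchr(line, '\n');
        if (eol) *eol = '\0';
        int matched = (regexec(&mre, line, 0, NULL, 0) == 0);
        int ignored = (ignore && regexec(&ire, line, 0, NULL, 0) == 0);
        if (matched && !ignored) count++;
        line = eol ? eol + 1 : NULL;
    }

    free(copy);
    regfree(&mre);
    if (ignore) regfree(&ire);
    return count;
}
```

Under that assumption, a short match pattern such as `error` would not limit what the ignore pattern can see, because the ignore test runs on the full line that contained the match.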