RRD crashing high availability hobbit

6 messages in this thread

list E-mail j.sansford · Thu, 20 Aug 2009 11:06:30 +0100 ·

Hi again all,

I need some help configuring/debugging why our hobbit servers are crashing (due to rrd, which I shall explain shortly) and how to get around this. We have 3 hobbit servers with proxies, however I will simplify this explanation with just 2 hobbits and no proxies (as we discovered the same thing happens).

Detail of theoretical setup:

1) 2 datacentres. Each datacentre contains a single hobbit server instance.
2) Each client reports to their local datacentre hobbit server.
3) Each hobbit server is configured such that they know about the other hobbit (through BBDISPLAYS).

The issue is that for what looks like most server-side tests, such as vmstat, we are getting feedback loops between the hobbit servers.

For instance: A hobbit server in DC1 tests a client in DC1 using vmstat. The client reports back to the hobbit in DC1, and that hobbit then also reports this data to the hobbit in DC2. The hobbit in DC2, however, is configured to report to DC1 and so bounces the message back (I think). Therefore the server tries to update the rrd twice within a second, resulting in errors. Eventually this will crash the server. An example of the rrd error messages:

2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644 when last update time is 1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
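
The "minimum one second step" text in these messages reflects a rule in RRDtool: an update is rejected unless its timestamp is strictly later than the file's last update time. A minimal sketch of that comparison follows; `rrd_update_allowed` is a hypothetical helper name used only for illustration, not part of Hobbit or RRDtool.

```shell
# rrd_update_allowed LAST NEW
# Hypothetical helper: succeeds only when NEW is strictly later than LAST,
# mirroring the rule behind the "minimum one second step" error.
rrd_update_allowed() {
    [ "$2" -gt "$1" ]
}
```

With the values from the first log line, `rrd_update_allowed 1250762644 1250762644` fails, which matches the rejected update. On a live system the last update time of a file can be read with `rrdtool last FILE.rrd`.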

My question is - how can we stop this happening? Also, why is this happening? Is there a way we can disable rrd graphing on one server so just one hobbit server handles the graphing?

I hope that makes sense. If you need further clarification please let me know.

Cheers
James

list Buchan Milne · Thu, 20 Aug 2009 13:55:40 +0100 ·

▸ quoted from E-mail j.sansford

On Thursday, 20 August 2009 11:06:30 user-c15424b7e83a@xymon.invalid wrote:

Hi again all,

I need some help configuring/debugging why our hobbit servers are crashing
(due to rrd, which I shall explain shortly) and how to get around this. We
have 3 hobbit servers with proxies, however I will simplify this
explanation with just 2 hobbits and no proxies (as we discovered the same
thing happens).

Detail of theoretical setup:

1) 2 datacentres. Each datacentre contains a single hobbit server instance.
2) Each client reports to their local datacentre hobbit server.
3) Each hobbit server is configured such that they know about the other
hobbit (through BBDISPLAYS).


The issue is that for what looks like most server side tests, such as
vmstat etc, that we are getting feedback loops between the hobbit servers.

For instance: A hobbit server in DC1 tests a client in DC1 using vmstat.
The client reports back to hobbit in DC1 and hobbit then also reports this
data to the hobbit in DC2. The hobbit in DC2 however is configured to
report to DC1 and so bounces the message back (i think). Therefore the
server tries to update the rrd twice within a second resulting in errors.
Eventually this will crash the server.

How did you determine that this is what is "crashing" the server?

▸ quoted from E-mail j.sansford

An example of the rrd error
messages:

2009-08-20 11:04:04 RRD error updating
/export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762644 when last update time is
1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)

I have a number of setups where messages like this are common, due to running 
network tests and SNMP polling at intervals smaller than 5 minutes (without 
adjusting all the RRD files to cater to this), and I have not seen hobbit 
"crash" due to this.

What is the behaviour you see when it "crashes the server" ? Does hobbitd_rrd 
die and leave a status message? Or, does something else occur? Does the server 
reboot? Does the OS hang? How often does this occur?

My question is - how can we stop this happening?

You would first need to tell us what is happening ...

▸ quoted from E-mail j.sansford

Also, why is this
happening? Is there a way we can disable rrd graphing on one server so just
one hobbit server handles the graphing?

I hope that makes sense. If you need further clarification please let me
know.


If hobbitd or hobbitd_rrd or some other process actually crashes, you should 
be able to get a core file, from which you can get a backtrace (e.g. with gdb), 
which would allow someone to see why it is crashing, and possibly fix it.
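
As a sketch of the gdb step mentioned above: gdb can print a backtrace non-interactively when given the binary and the core file. The function name `core_backtrace` is hypothetical, and both paths are supplied by the caller since the thread does not say where the binary or the core file live.

```shell
# core_backtrace BINARY COREFILE
# Prints a backtrace from a core file using gdb in batch mode.
# Returns 1 without calling gdb if either file cannot be read.
core_backtrace() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "core_backtrace: cannot read binary or core file" >&2
        return 1
    fi
    gdb -batch -ex bt "$1" "$2"
}
```

The backtrace is only as detailed as the binary's symbols allow; a stripped binary will show addresses rather than function names.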

Regards,
Buchan

list E-mail j.sansford · Thu, 20 Aug 2009 17:33:44 +0100 ·

Hi Buchan,

We get a core dump, running a pstack gives the following info:

core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
 fed28a17 _lwp_kill (1, 6) + 7
 fecd1d63 raise    (6) + 1f
 fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
 08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
 08054044 main     (2, 804613c, 8046148) + 4dc
 080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80


Note that as of 5.30pm today rrd-status.log is 127MB and full of errors, spanning 607625 lines (this is just for today; we roll the logs each night). This seems abnormally large to me, and I think this is what eventually crashes the server.
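
With a log of that size, a per-file count can show which RRD files are receiving the rejected updates. The following is a rough sketch, assuming every error line has the layout shown earlier in this thread (date, time, "RRD error updating", then the file path); `rrd_error_summary` is a hypothetical name.

```shell
# rrd_error_summary < LOGFILE
# Counts "RRD error updating" lines per RRD file, busiest file first.
# Assumes the line layout seen in this thread:
#   DATE TIME RRD error updating /path/to/file.rrd from IP: message
rrd_error_summary() {
    awk '$3 == "RRD" && $4 == "error" && $5 == "updating" { n[$6]++ }
         END { for (f in n) print n[f], f }' | sort -k1,1nr -k2
}
```

For example, `rrd_error_summary < rrd-status.log | head` would list the ten noisiest files. Lines that do not match the assumed layout are ignored.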

Hope this helps. I will try and take a deeper look at the logs next time it happens...it seems to happen around once or twice a week.

Cheers
James.


list David Baldwin · Fri, 21 Aug 2009 09:42:59 +1000 ·

▸ quoted from E-mail j.sansford

user-c15424b7e83a@xymon.invalid wrote:

Hi Buchan,

We get a core dump, running a pstack gives the following info:

core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
 fed28a17 _lwp_kill (1, 6) + 7
 fecd1d63 raise    (6) + 1f
 fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
 08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
 08054044 main     (2, 804613c, 8046148) + 4dc
 080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80

It looks like you are running the netapp extratest, which, from what I
can see in hobbitd/do_rrd.c, is what handles the xtstats column reported
by netapp.pl. That is just from a cursory glance at the code; I don't
use it myself. You really need to look at the C code to check that it's
doing the right thing. You have two choices: the quick fix is to disable
just that test in netapp.pl; the other option is to work out what format
the data should be in and fix the test.

In 4.2.3, for example, the do_devmon.c RRD code doesn't actually
implement what is documented, and I use a perl script with --extra-script
instead.

Various RRD handlers are in hobbitd/rrd/do_*.c
Looking at the code for xstrdup in lib/memory.c as below you should
check your logs - it's probably getting called with a NULL pointer
(unlikely you're out of memory), but the logs should tell you.

char *xstrdup(const char *s)
{
        char *result;

        if (s == NULL) {
                errprintf("xstrdup: Cannot dup NULL string\n");
                abort();
        }

        result = strdup(s);
        if (result == NULL) {
                errprintf("xstrdup: Out of memory\n");
                abort();
        }

#ifdef MEMORY_DEBUG
        add_to_memlist(result, strlen(result)+1);
#endif

        return result;
}
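
Since the two abort paths in xstrdup print different messages, the log can tell them apart. A small sketch of that check follows; it assumes the messages end up in the log file hobbitd_rrd writes to (rrd-status.log in this thread), and `xstrdup_abort_reason` is a hypothetical name.

```shell
# xstrdup_abort_reason LOGFILE
# Reports which of the two xstrdup abort messages appears in LOGFILE.
# Output is redirected instead of using grep -q, which older greps lack.
xstrdup_abort_reason() {
    if grep 'xstrdup: Cannot dup NULL string' "$1" >/dev/null 2>&1; then
        echo "null-pointer"
    elif grep 'xstrdup: Out of memory' "$1" >/dev/null 2>&1; then
        echo "out-of-memory"
    else
        echo "not-found"
    fi
}
```

A result of `not-found` only means neither message is in that particular file; the messages may have gone to a different log.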

▸ quoted from E-mail j.sansford

Note that as of 5.30pm today the logs for rrd-status.log is 127MB full of errors, which span over 607625 lines (this is just for today, we roll the logs each night). This seems abnormally large to me and I think eventually this is crashing the server. 

Hope this helps. I will try and take a deeper look at the logs next time it happens...it seems to happen around once or twice a week.

Cheers
James.

---- Buchan Milne <user-9b139aff4dec@xymon.invalid> wrote:

On Thursday, 20 August 2009 11:06:30 user-c15424b7e83a@xymon.invalid wrote:

Hi again all,

I need some help configuring/debugging why our hobbit servers are crashing
(due to rrd, which I shall explain shortly) and how to get around this. We
have 3 hobbit servers with proxies, however I will simplify this
explanation with just 2 hobbits and no proxies (as we discovered the same
thing happens).

Detail of theoretical setup:

1) 2 datacentres. Each datacentre contains a single hobbit server instance.
2) Each client reports to their local datacentre hobbit server.
3) Each hobbit server is configured such that they know about the other
hobbit (through BBDISPLAYS).


The issue is that for what looks like most server side tests, such as
vmstat etc, that we are getting feedback loops between the hobbit servers.

For instance: A hobbit server in DC1 tests a client in DC1 using vmstat.
The client reports back to hobbit in DC1 and hobbit then also reports this
data to the hobbit in DC2. The hobbit in DC2 however is configured to
report to DC1 and so bounces the message back (i think). Therefore the
server tries to update the rrd twice within a second resulting in errors.
Eventually this will crash the server.

How did you determine that this is what is "crashing" the server?

An example of the rrd error
messages:

2009-08-20 11:04:04 RRD error updating
/export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762644 when last update time is
1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)

I have a number of setups where messages like this are common, due to running 
network tests and SNMP polling at intervals smaller than 5 minutes (without 
adjusting all the RRD files to cater to this), and I have not seen hobbit 
"crash" due to this.

These kinds of messages can also be caused by duplicate keys in the
RRD reporting. You need to look at how the RRD data is generated to get
to the bottom of them. Sometimes the duplicates are within one test;
sometimes multiple tests report the same thing, or report too frequently
(such as your possible loops). It is unlikely that this will crash
hobbitd_rrd, though.

For example, I had this on MacOSX for ifstat. By default it uses
'netstat -ibn', which produces multiple lines for the same interface.
I changed that in hobbitclient-darwin.sh to
'netstat -ibn | egrep -v "^lo|^vmnet|<Link"'. Note that I had to filter
out the vmnet interfaces, since netstat -i limits the interface name to
5 characters and there are actually vmnet1 and vmnet8 :( Luckily I don't
really care about those.

bash-3.2# netstat -ibn | egrep -v "^lo|^vmnet|<Link"
Name  Mtu   Network       Address            Ipkts Ierrs     Ibytes    Opkts Oerrs     Obytes  Coll
en0   1500  fe80::21f:f fe80:6::21f:f3ff:  7709215     - 2307390372 23616260     - 32787591390     -
en0   1500  10.1/16       10.1.75.6        7709215     - 2307390372 23616260     - 32787591390     -
en2   1500  fe80::201:2 fe80:9::201:23ff:        0     -          0        0     -     781938     -
en2   1500  10.37.129/24  10.37.129.2            0     -          0        0     -     781938     -
en3   1500  fe80::210:3 fe80:a::210:32ff:        0     -          0        0     -     792748     -
en3   1500  10.211.55/24  10.211.55.2            0     -          0        0     -     792748     -
bash-3.2# netstat -ibn
Name  Mtu   Network       Address            Ipkts Ierrs     Ibytes    Opkts Oerrs     Obytes  Coll
lo0   16384 <Link#1>                        196623     0   20477947   196620     0   20477947     0
lo0   16384 fe80::1%lo0 fe80:1::1           196623     -   20477947   196620     -   20477947     -
lo0   16384 127           127.0.0.1         196623     -   20477947   196620     -   20477947     -
lo0   16384 ::1/128     ::1                 196623     -   20477947   196620     -   20477947     -
gif0* 1280  <Link#2>                             0     0          0        0     0          0     0
stf0* 1280  <Link#3>                             0     0          0        0     0          0     0
en1   1500  <Link#4>    00:1f:5b:c3:ec:35        0     0          0        0     0          0     0
fw0   4078  <Link#5>    00:1f:f3:ff:fe:71:5e:18        0     0          0        0     0        346     0
en0   1500  <Link#6>    00:1f:f3:5c:32:e6  7709242     0 2307393391 23616262     0 32787591586     0
en0   1500  fe80::21f:f fe80:6::21f:f3ff:  7709242     - 2307393391 23616262     - 32787591586     -
en0   1500  10.1/16       10.1.75.6        7709242     - 2307393391 23616262     - 32787591586     -
vmnet 1500  <Link#7>    00:50:56:c0:00:08        0     0          0        0     0          0     0
vmnet 1500  192.168.149   192.168.149.1          0     -          0        0     -          0     -
vmnet 1500  <Link#8>    00:50:56:c0:00:01        0     0          0        0     0          0     0
vmnet 1500  172.16.189/24 172.16.189.1           0     -          0        0     -          0     -
en2   1500  <Link#9>    00:01:23:45:67:89        0     0          0        0     0     781938     0
en2   1500  fe80::201:2 fe80:9::201:23ff:        0     -          0        0     -     781938     -
en2   1500  10.37.129/24  10.37.129.2            0     -          0        0     -     781938     -
en3   1500  <Link#10>   00:10:32:54:76:98        0     0          0        0     0     792748     0
en3   1500  fe80::210:3 fe80:a::210:32ff:        0     -          0        0     -     792748     -
en3   1500  10.211.55/24  10.211.55.2            0     -          0        0     -     792748     -
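
The repeated interface names in the output above are the duplicate keys in question. A quick way to spot them is to count how often each name appears. This is a sketch that assumes unwrapped netstat output with one header line; `duplicate_interfaces` is a hypothetical name.

```shell
# duplicate_interfaces < NETSTAT_OUTPUT
# Lists interface names that appear on more than one line of
# `netstat -ibn` output (the header line is skipped), with line counts.
duplicate_interfaces() {
    awk 'NR > 1 { n[$1]++ }
         END { for (i in n) if (n[i] > 1) print i, n[i] }' | sort
}
```

Running `netstat -ibn | duplicate_interfaces` before and after applying a filter shows which names would still be reported more than once.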

▸ quoted from E-mail j.sansford

What is the behaviour you see when it "crashes the server" ? Does hobbitd_rrd 
die and leave a status message? Or, does something else occur? Does the server 
reboot? Does the OS hang? How often does this occur?

My question is - how can we stop this happening?

You would first need to tell us what is happening ...

Also, why is this
happening? Is there a way we can disable rrd graphing on one server so just
one hobbit server handles the graphing?

I hope that makes sense. If you need further clarification please let me
know.

If hobbitd or hobbitd_rrd or some other process actually crashes, you should 
be able to get a core file, from which you can get a backtrace (e.g. with gdb), 
which would allow someone to see why it is crashing, and possibly fix it.

Regards,
Buchan

--


David Baldwin - IT Unit
Australian Sports Commission          www.ausport.gov.au
Tel 02 62147830 Fax 02 62141830       PO Box 176 Belconnen ACT 2616
user-cbbf693f2c89@xymon.invalid          Leverrier Street Bruce ACT 2617



list Buchan Milne · Fri, 21 Aug 2009 14:27:06 +0100 ·

▸ quoted from David Baldwin

On Friday, 21 August 2009 00:42:59 David Baldwin wrote:

user-c15424b7e83a@xymon.invalid wrote:

Hi Buchan,

We get a core dump, running a pstack gives the following info:

core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
 fed28a17 _lwp_kill (1, 6) + 7
 fecd1d63 raise    (6) + 1f
 fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
 08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
 08054044 main     (2, 804613c, 8046148) + 4dc
 080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80


OK, so it crashed in do_netapp_extratest_rrd from hobbitd/rrd/do_netapp.c . 
I'm not familiar with pstack, but it looks like this may be from a stripped 
binary (or, you may be able to get more information from pstack).

If pstack can't show the values, then you may want to consider running 
hobbitd_rrd with the --debug flag, which should result in some logging of what 
it has received just before it crashes.

▸ quoted from David Baldwin

That looks like you are running extratest for a netapp which from what I
can see in hobbitd/do_rrd.c is what handles the xtstats column reported
by netapp.pl - just from a cursory glance at the code - I don't use it
myself. You really need to look at the C code to check it's doing the
right thing. You have 2 choices - quick fix is to disable just that test
in netapp.pl - other option is to work out what format it should be and
fix the test.

In 4.2.3 for example, the do_devmon.c RRD code doesn't actually
implement what is documented

What is not implemented?

Where do you see this documented?

There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3 
branch, devmon svn). I am not aware of any other requests or bugs filed on the 
devmon rrd collector.

and I use a perl script with --extra-script
instead

Is this the one shipped with devmon, or would you like to contribute a better 
one?

▸ quoted from David Baldwin

Various RRD handlers are in hobbitd/rrd/do_*.c
Looking at the code for xstrdup in lib/memory.c as below you should
check your logs - it's probably getting called with a NULL pointer
(unlikely you're out of memory), but the logs should tell you.

char *xstrdup(const char *s)
{
        char *result;

        if (s == NULL) {
                errprintf("xstrdup: Cannot dup NULL string\n");
                abort();
        }

        result = strdup(s);
        if (result == NULL) {
                errprintf("xstrdup: Out of memory\n");
                abort();
        }

#ifdef MEMORY_DEBUG
        add_to_memlist(result, strlen(result)+1);
#endif

        return result;
}

xstrdup is called twice in do_netapp_extratest_rrd, but seeing the string that 
it's aborting on would help narrow it down. If you can provide the status 
message that made hobbitd_rrd crash (retrieve it using: bb localhost 
'hobbitdlog hostname.testname') it can be used to reproduce this by someone 
trying to fix the bug.
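
A small wrapper around the bb command above can save the message to a file for attaching to a bug report. This is only a sketch: `fetch_status_log` is a hypothetical name, and the `BB` variable is an assumption added here so the path to the bb client can be overridden.

```shell
# fetch_status_log HOST TEST OUTFILE
# Saves the status message the local server holds for HOST.TEST.
# BB may be set to the path of the bb client; it defaults to "bb" on PATH.
fetch_status_log() {
    "${BB:-bb}" localhost "hobbitdlog $1.$2" > "$3"
}
```

For a hypothetical filer named filer1, `fetch_status_log filer1 xtstats /tmp/filer1.xtstats.txt` would capture the xtstats message.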

▸ quoted from David Baldwin

Note that as of 5.30pm today the logs for rrd-status.log is 127MB full of
errors, which span over 607625 lines (this is just for today, we roll the
logs each night). This seems abnormally large to me and I think
eventually this is crashing the server.

It is still unlikely that this has anything to do with hobbitd_rrd crashing.

Regards,
Buchan

list Francesco Duranti · Fri, 21 Aug 2009 17:26:18 +0200 ·

Hi,
I see that the problem is in the creation of the RRD for xtstats for NetApp filers.
Can you check which version of the netapp.pl package you have installed? Have you applied the latest patch included in the hobbit_perl_client distribution to the Hobbit 4.2.3 server?

The latest version of Hobbit_perl_client (v1.21) includes a correction in the netapp.pl code and a patch, to be applied to a clean 4.2.3, that should solve a hobbitd_rrd crash in the xtstats function caused by the different kinds of data sent by different storage software versions.

If your hobbitd_rrd still crashes after applying the patch, can you run hobbitd_rrd with --debug as suggested and try to extract the xtstats data that makes the server crash (or send me the last 5-6 minutes of those logs), so I can analyze what the module is receiving and what is going wrong?

Thanks
Francesco

