Xymon no longer sending alerts
list Colin Coe
Hi all

Our Xymon server has recently stopped sending alert emails. This server is also running Postfix and is our mail relay.

From alert.log all I see is:

    2024-01-31 02:17:39.813610 Whoops ! Failed to send message (Select(2) failed)
    2024-01-31 02:17:39.829027 -> Select failure while sending to Xymon daemon at 10.10.10.10:1984
    2024-01-31 02:17:39.829032 -> Recipient '10.10.10.10', timeout 50
    2024-01-31 02:17:39.829037 -> 1st line: 'config hosts.cfg'
    2024-01-31 02:17:39.829042 Cannot load hosts.cfg from xymond: Select(2) failed
    2024-01-31 02:17:39.829049 Failed to load from xymond, reverting to file-load
    2024-01-31 02:22:40.932828 Whoops ! Failed to send message (Select(2) failed)
    2024-01-31 02:22:40.932863 -> Select failure while sending to Xymon daemon at 10.10.10.10:1984
    2024-01-31 02:22:40.932867 -> Recipient '10.10.10.10', timeout 50
    2024-01-31 02:22:40.932871 -> 1st line: 'config hosts.cfg'
    2024-01-31 02:22:40.932876 Cannot load hosts.cfg from xymond: Select(2) failed
    2024-01-31 02:22:40.932881 Failed to load from xymond, reverting to file-load

And notifications.log is zero bytes in size.

I added "--debug" to the "[alert]" section of /etc/xymon/tasks.cfg and, while the verbosity increased, there was no indication of why alerts are not being sent.

Any clues how I can debug this?

Thanks
list Kris Springer
Looks like your hosts.cfg file has an error in it. This happens to me sometimes if I'm not careful, especially when I'm using SSH and scrolling with my mouse wheel; for some reason it adds weird characters to the file.

Kris Springer
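One way to check for the stray characters Kris describes is to scan each config file for anything outside printable ASCII. The following is a minimal Python sketch, not a Xymon tool; the function name `find_odd_chars` is hypothetical, and treating every non-ASCII or control character (other than tab) as suspicious is an assumption about what "weird characters" look like. It does not follow hosts.cfg include directives, so each included file has to be passed on the command line.

```python
import sys
import unicodedata


def find_odd_chars(text: str):
    """Yield (line_no, column, name) for each character outside printable
    ASCII, ignoring tabs. Carriage returns and terminal escape sequences
    (e.g. ESC from mouse-wheel input) are flagged too."""
    # split("\n") rather than splitlines(), so a stray "\r" is reported
    # instead of being silently treated as a line break.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                continue
            if ord(ch) < 32 or ord(ch) > 126:
                # Control characters have no Unicode name, so fall back to U+XXXX.
                yield line_no, col, unicodedata.name(ch, f"U+{ord(ch):04X}")


if __name__ == "__main__":
    # Usage (the hosts.cfg path is an assumed location):
    #   python3 find_odd_chars.py /etc/xymon/hosts.cfg [included files...]
    for path in sys.argv[1:]:
        # errors="replace" turns invalid UTF-8 bytes into U+FFFD, which is flagged.
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line_no, col, name in find_odd_chars(fh.read()):
                print(f"{path}:{line_no}:{col}: {name}")
```

Each hit is printed as `file:line:column: NAME`, so an escape character left behind by a terminal shows up as `U+001B` and a non-breaking space as `NO-BREAK SPACE`.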
list Jeremy Laidman
Hi Colin

From the logs, it appears that xymond_alert is unable to communicate with your Xymon server on 10.10.10.10:1984. It seems to be trying to fetch the hosts.cfg file contents via the BB protocol by sending a "config hosts.cfg" command to xymond, but xymond is not responding.

The select() system call monitors a file handle or socket for activity, in this case most likely the TCP socket to 10.10.10.10:1984. The timeout means that select() didn't see a response within the expected time. This suggests that the TCP connection was established correctly (xymond is listening, and the IP and port are likely correct) and that xymond_alert sent the request for hosts.cfg, but no response came back. It might be worth checking xymond.log for messages that correspond to the timestamps of the errors from xymond_alert.

I'm not convinced this is the reason you're not getting alert emails. If xymond_alert can't get hosts.cfg via a BB message, it should fall back to reading it directly from the filesystem (which is what the "reverting to file-load" line says) and carry on. So the messages you're seeing might be a red herring, although I wouldn't expect them to show up on a normally operating Xymon installation. Having said that, my own Xymon installation shows those log messages, yet I've no reason to think our alerting is broken, so perhaps they can simply be ignored.

It might be worth taking a look at the man page for xymond_alert and having a go at the --test, --trace and --dump-config options.

In case it's not obvious, I'm really not sure what the problem could be; I'm just throwing out some ideas in case something helps.

J
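To make the select() reasoning concrete, here is a minimal Python sketch of a client sending "config hosts.cfg" to xymond and waiting for the reply with select(). It is not Xymon's own code: it assumes the request/reply pattern of the stock `xymon` client (connect to port 1984, send the message, half-close the socket, then read until the server closes the connection), and the function name `xymon_query` is hypothetical. If select() returns with nothing readable before the timeout, the sketch raises `TimeoutError`, which corresponds to the "Select(2) failed" case in alert.log.

```python
import select
import socket


def xymon_query(host: str, port: int, message: str, timeout: float) -> str:
    """Send one message to a xymond-style server and return the full reply.

    Raises TimeoutError if any single select() wait sees no data within
    `timeout` seconds.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(message.encode())
        # Half-close the socket so the server knows the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            readable, _, _ = select.select([sock], [], [], timeout)
            if not readable:
                # Connection is up and the request was sent, but no reply came back.
                raise TimeoutError(f"no reply from {host}:{port} within {timeout}s")
            data = sock.recv(4096)
            if not data:  # server closed its side: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")
```

For example, `xymon_query("10.10.10.10", 1984, "config hosts.cfg", 30.0)` uses the address from the log above with an arbitrary 30-second timeout; the log's "timeout 50" does not state its unit, so that value is not reused here. Calling it in a loop would show whether the server answers only intermittently.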
list Colin Coe
Hi Kris

I've been through the main hosts.cfg and all the included host snippets but found no "funny" characters.

Thanks
Xymon mailing user-d459c9d661b6@xymon.invalid
list Colin Coe
Hi Jeremy

Running the following gives me the expected result, so the server is responding, at least sometimes:

    xymon 127.0.0.1 "config hosts.cfg"

Is this a worry: "Discarding timed-out partial msg from 127.0.0.1"? Getting lots of these...

I've added --trace to xymond_alert and will go through that.
Thanks
list Colin Coe
Hi all

This is resolved. It was a stupid error in alerts.cfg...

Thanks for the suggestions
list Jeremy Laidman
Great news Colin. Have the "Select(2)" messages gone away? Can you share the nature of the error in alerts.cfg, so I know what to look for when I do the same in future?
J
list Colin Coe
The "Select" messages are still there.
The faulty alerts.cfg config was:
HOST=%(...|GS\d\d\d\d)
IGNORE
but should have been:
HOST=%(...GS\d\d\d\d)
IGNORE
so basically everything was being ignored
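The effect of the stray `|` can be reproduced with any PCRE-style regex engine. The Python sketch below takes the `...` literally as regex (three arbitrary characters), which is an assumption, since it could also be Colin's abbreviation of a longer pattern. It further assumes that Xymon applies HOST=% patterns as an unanchored search, as `re.search` does. Under those assumptions the broken pattern has two alternatives, `...` and `GS\d\d\d\d`, and the first matches any hostname of three or more characters, so the IGNORE rule swallows almost every host. The host names are made up for illustration.

```python
import re

# The two HOST=% patterns from alerts.cfg, without the leading "%"
# (which marks the value as a regex).
BROKEN = r"(...|GS\d\d\d\d)"  # alternation: "any 3 chars" OR "GS + 4 digits"
FIXED = r"(...GS\d\d\d\d)"    # "any 3 chars" followed by "GS + 4 digits"


def matches(pattern: str, hostname: str) -> bool:
    # Assumption: unanchored matching, like re.search (not re.fullmatch).
    return re.search(pattern, hostname) is not None


# Hypothetical host names, for illustration only.
for host in ["web01", "mailrelay", "ABCGS1234"]:
    print(f"{host:10} broken={matches(BROKEN, host)} fixed={matches(FIXED, host)}")
```

With the broken pattern, `web01` and `mailrelay` match through the `...` branch; with the fixed pattern only `ABCGS1234` matches, which is the intended scope of the IGNORE rule.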