
XYMON Proxy Issue

Andy Smith
Tue, 20 May 2014 07:28:46 +0100
Message-Id: <user-a78c805794b5@xymon.invalid>

Andy Smith wrote:
Hi,

In February, Gautier reported this issue with xymonproxy on Solaris :-

http://lists.xymon.com/pipermail/xymon/2014-February/039160.html

This week I came to upgrade an installation of 4.2.3 on Solaris 9 
and hit exactly the same issue as Gautier, but this time on 
the latest 4.3.17 code :-

2014-05-04 13:05:36 xymonproxy version 4.3.17 starting
2014-05-04 13:20:41 Listening on 0.0.0.0:1984
2014-05-04 13:20:41 Sending to Xymon server(s) xx.xx.xx.xx:1984
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 select() failed: Invalid argument
2014-05-04 13:20:41 Too many select failures, aborting
2014-05-04 13:20:46 xymonproxy version 4.3.17 starting

I do not see the connections in TIME_WAIT, just the proxy constantly 
restarting every 15 minutes.  Here is the truss output as it 
falls over :-

poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206937
poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206938
poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206939
poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206940
poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206941
poll(0xFFBFF208, 1, 1000)                       = 0
time()                                          = 1399206942
poll(0xFFBFF208, 1, 1000)                       = 1
accept(3, 0x0003AC60, 0xFFBFF310, 1)            = 4
fcntl(4, F_SETFL, 0x00000080)                   = 0
time()                                          = 1399206942
poll(0xFFBFF200, 2, 1000)                       = 1
read(4, " s t a t u s + 4 5   c s".., 8185)     = 140
time()                                          = 1399206942
poll(0xFFBFF200, 2, 1000)                       = 1
read(4, 0x00038CE2, 8045)                       = 0
time()                                          = 1399206942
shutdown(4, 2, 1)                               = 0
close(4)                                        = 0
poll(0xFFBFF208, 1, 1000)                       = 1
accept(3, 0x0003ACD0, 0xFFBFF310, 1)            = 4
fcntl(4, F_SETFL, 0x00000080)                   = 0
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " s e l e c t ( )   f a i".., 34)      = 34
time()                                          = 1399206942
write(2, " 2 0 1 4 - 0 5 - 0 4   1".., 19)      = 19
write(2, "  ", 1)                               = 1
write(2, " T o o   m a n y   s e l".., 35)      = 35
_exit(1)

So, a question to Gautier: are you using Solaris 9, and have you managed 
to resolve this?

Another question to the rest of the list: this is actually the only 
proxy I have on Solaris, all the others are on Red Hat.  Is anyone else 
using xymonproxy on Solaris and, if so, what version?  For the time 
being I am running the old bbproxy until I get this fixed; the rest 
of 4.3.17 seems to be working OK.

I have done a bit more digging around.  Firstly, if I regress to r#7368 
(4.3.13) then xymonproxy on Solaris is stable.  Of course this just hides 
the problem, and it might be a factor in Gautier's performance issue.

If I modify the code for 4.3.17 to remove the exit after 5 select() 
failures and add in some further debugging, I can observe that on 
Solaris 9 at least :-

- every 900 seconds, select() fails
- select() continues to fail for 2 seconds, then succeeds and the proxy 
  continues as normal
- during these 2 seconds there are no further calls to poll(), but 
  somewhere in the region of 50,000 calls to time()
- the values for the selecttmo structure and maxfd are reasonable, so 
  the invalid argument must be one of the fdread or fdwrite structures
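
The kind of extra debugging described above could be sketched roughly as 
follows.  This is only an illustration, not the xymonproxy code and not 
the attached patch: select_arg_problem() and checked_select() are 
hypothetical names, and the checks cover only the EINVAL conditions that 
POSIX documents for select() (nfds outside 0..FD_SETSIZE, or an invalid 
timeout).  A given platform may reject other values as well.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper, not part of Xymon: name the first argument that
 * POSIX allows select() to reject with EINVAL, or return NULL if nfds
 * and the timeout both look sane. */
static const char *select_arg_problem(int nfds, const struct timeval *tmo)
{
    if (nfds < 0 || nfds > FD_SETSIZE)
        return "nfds outside 0..FD_SETSIZE";
    if (tmo != NULL) {
        if (tmo->tv_sec < 0)
            return "negative tv_sec";
        if (tmo->tv_usec < 0 || tmo->tv_usec >= 1000000)
            return "tv_usec outside 0..999999";
    }
    return NULL;
}

/* Hypothetical wrapper: call select() and, on EINVAL, log the arguments
 * that were passed in, so the failing one can be identified instead of
 * just counting failures. */
static int checked_select(int nfds, fd_set *rd, fd_set *wr, struct timeval *tmo)
{
    struct timeval asked = { 0, 0 };
    int n;

    /* Keep a copy: some systems modify the timeout during the call. */
    if (tmo != NULL)
        asked = *tmo;

    n = select(nfds, rd, wr, NULL, tmo);

    if (n < 0 && errno == EINVAL) {
        int saved = errno;  /* fprintf() may change errno */
        const char *why = select_arg_problem(nfds, tmo ? &asked : NULL);

        fprintf(stderr,
                "select() EINVAL: nfds=%d tv_sec=%ld tv_usec=%ld (%s)\n",
                nfds,
                tmo ? (long)asked.tv_sec : -1L,
                tmo ? (long)asked.tv_usec : -1L,
                why ? why : "nfds and timeout look valid");
        errno = saved;
    }
    return n;
}
```

If a wrapper like this reports that nfds and the timeout look valid, that 
points towards the fd_set arguments or a platform-specific limit, which 
is the conclusion drawn in the last bullet above.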

I am continuing to collect information, but am still not sure whether I 
am looking at a Solaris 9 issue or whether this affects later Solaris 
versions.

This issue affected Solaris 10 as well.  The attached patch resolves all 
my xymonproxy stability problems on Solaris platforms.  I believe the 
patch is relevant to other platforms too; it is just that select() on 
other platforms is more tolerant.

-- 
Andy