All issues moved to NagiosEnterprises Github
|ID||Category||Severity||Reproducibility||Date Submitted||Last Update|
|0000644||[NSCA] Other / Unknown||major||always||2014-09-30 19:59||2016-09-30 13:02|
|Summary||0000644: NSCA close/POLLNVAL/accept bug causes hang|
I believe I've found a significant bug in the nsca daemon. Under heavy connection load, we see it hang nsca permanently every few days and, more frequently, freeze it until the remote end disconnects.
In summary: nsca closes client connection file descriptors on error or EOF and relies on poll() returning POLLNVAL to clean up after them (in particular, to remove their handlers and take them out of the poll set). If accept() returns the recently closed FD in the same loop iteration, the wrong handler will be run on it.
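To make the failure mode concrete, here is a minimal sketch of the underlying kernel behavior. It is not nsca code: it is written in Python for brevity, uses pipes in place of client sockets, and the function names are made up for illustration. It assumes a POSIX system where `select.poll` is available. The first function shows that polling a closed descriptor yields POLLNVAL; the second shows that if the same number is handed out again before poll() runs, POLLNVAL never appears, so cleanup that depends on it is skipped.

```python
import os
import select

def poll_after_close():
    """Close a registered descriptor, then poll: POLLNVAL is reported."""
    r, w = os.pipe()
    p = select.poll()
    p.register(r, select.POLLIN)
    os.close(r)                 # closed, but still listed in the poll set
    events = p.poll(0)          # 0 ms timeout, so this never blocks
    os.close(w)
    return r, events

def poll_after_close_and_reuse():
    """Same, but the descriptor number is reused before poll() runs."""
    r, w = os.pipe()
    p = select.poll()
    p.register(r, select.POLLIN)
    os.close(r)
    # POSIX hands out the lowest free number, so r2 normally equals r.
    r2, w2 = os.pipe()
    events = p.poll(0)          # the number is valid again: no POLLNVAL
    for fd in (w, r2, w2):
        os.close(fd)
    return r, r2, events
```

In the second case the poll set entry silently refers to a different object than the one it was registered for, which is the situation described above when accept() returns the just-closed number.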
This can result in calling recv() on an FD with no data to read and O_NONBLOCK not set.
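The difference between the two socket modes can be sketched as follows. This is an illustration only, in Python rather than nsca's C, and `recv_without_data` is a hypothetical helper. Because a truly blocking recv() with no data would never return, the blocking case is approximated with a short timeout; that timeout is an assumption of the demo, not something nsca does.

```python
import socket

def recv_without_data(nonblocking):
    """Call recv() on a connected socket whose peer has sent nothing."""
    a, b = socket.socketpair()
    try:
        if nonblocking:
            a.setblocking(False)   # equivalent to setting O_NONBLOCK
        else:
            a.settimeout(0.05)     # stand-in for "blocks until the peer acts"
        try:
            a.recv(1)
            return "data"
        except BlockingIOError:
            return "would block"   # returns at once; the loop can move on
        except socket.timeout:
            return "blocked"       # without the timeout this would hang
    finally:
        a.close()
        b.close()
```

With O_NONBLOCK set, the call fails immediately and the daemon keeps servicing other clients; without it, a single-process daemon sits in recv() until the remote sends data or disconnects, which matches the freeze described in the report.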
Attached (I hope) is a minimal-change patch that fixes the behavior. It's not the ideal fix, which would be to redesign nsca so that it manages the handlers explicitly and drops the one-shot behavior where the loop clears them on every pass (which results in amusing comments such as "DO NOT REMOVE! 01/29/2007 single process daemon will fail if this is removed").
This patch is against 2.7.2, but it should apply cleanly to the current r2763. It also includes a fix for another problem, one I don't think can actually be triggered: the poll loop could theoretically run a handler for a just-added FD, because npfds++ is executed inside the loop whose condition tests against < npfds.
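The loop-bound problem is a general pattern and can be sketched independently of nsca. The code below is an illustration in Python with hypothetical names (`entries`, `is_listener`, "new-client"); it is not taken from the patch. One version re-reads the bound on every pass, so an entry appended during the loop is visited in the same pass; the other snapshots the count first.

```python
def visit_live_bound(entries, is_listener):
    """Bound is re-read each pass: entries added mid-loop get visited too."""
    visited = []
    i = 0
    while i < len(entries):
        visited.append(entries[i])
        if is_listener(entries[i]):
            entries.append("new-client")   # grows the set being iterated
        i += 1
    return visited

def visit_snapshot_bound(entries, is_listener):
    """Bound is captured once: new entries wait for the next pass."""
    visited = []
    count = len(entries)                   # snapshot before the loop starts
    for i in range(count):
        visited.append(entries[i])
        if is_listener(entries[i]):
            entries.append("new-client")
    return visited
```

Called with `["listener", "client-1"]` and a predicate matching `"listener"`, the first version also visits `"new-client"`, although no poll result exists for it yet, while the second visits only the two original entries.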
I'm happy to go through the buggy code path in more detail if anyone wants; it was quite tricky to track down.
|Tags||No tags attached.|
|Attached Files||nsca-2.7.2-uom.patch (1,313 bytes) 2014-09-30 19:59|
I use NSCA pretty extensively and have had this same issue, among others, with NSCA. I've brought up quite a few of these issues with the Nagios people on their forums, but I have not seen any forward movement.
I think they would prefer people use NRDP. In any case my fork of NSCA is here with a good number of fixes that we have been running in production for a few months now without issue:
Hi, I have encountered this issue on 2.7.2, applied your patch and haven't seen the freeze so far.
PS: For those having the problem on the same kind of system as mine, the patch applied successfully on ClusterStor version 1.5.0 SU10.
|Used the supplied patch. Fix is currently in the nsca-2-9-2RC1 branch at https://github.com/NagiosEnterprises/nsca/tree/nsca-2-9-2RC1 via commit https://github.com/NagiosEnterprises/nsca/commit/98dc4795edaf2fd5c5fe123e53a8c9b3a2b0585c|
|2014-09-30 19:59||mib||New Issue|
|2014-09-30 19:59||mib||File Added: nsca-2.7.2-uom.patch|
|2014-10-13 14:57||awiddersheim||Note Added: 0001319|
|2015-09-25 12:14||jfrickson||Status||new => assigned|
|2015-09-25 12:14||jfrickson||Assigned To||=> nagios_staff|
|2015-11-18 16:53||jpdionne||Note Added: 0001514|
|2016-09-30 13:01||jfrickson||Assigned To||nagios_staff => jfrickson|
|2016-09-30 13:02||jfrickson||Note Added: 0001667|
|2016-09-30 13:02||jfrickson||Status||assigned => resolved|
|2016-09-30 13:02||jfrickson||Resolution||open => fixed|