Nagios Bug and Feature Tracker

All issues moved to NagiosEnterprises Github

Viewing Issue Simple Details
ID 0000644
Category [NSCA] Other / Unknown
Severity major
Reproducibility always
Date Submitted 2014-09-30 19:59
Last Update 2016-09-30 13:02
Reporter mib
View Status public
Assigned To jfrickson
Priority normal
Resolution fixed
Status resolved
Product Version
Summary 0000644: NSCA close/POLLNVAL/accept bug causes hang
Description I believe I've found a significant bug in the nsca daemon. Under heavy connection load it hangs nsca (forever) every few days and, more frequently, freezes it until the remote end disconnects.

In summary: nsca closes client connection file descriptors on error or EOF and relies on poll() returning POLLNVAL to clean up after them (in particular, to remove their handlers and take them out of the poll set). If accept() returns the recently closed FD in the same loop, the wrong handler will be run on it.

This can result in calling recv() on an FD with no data to read and O_NONBLOCK not set.
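
The descriptor-reuse hazard described above can be illustrated with a small standalone sketch. It is not taken from the nsca source: `demo_fd_reuse`, `poll_revents` and `struct reuse_result` are hypothetical names, and `dup()` stands in for `accept()` since both hand out the lowest free descriptor number in a single-threaded process.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the revents poll() reports for one fd, without blocking. */
static short poll_revents(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    if (poll(&p, 1, 0) < 0)
        return 0;
    return p.revents;
}

struct reuse_result {
    int same_number;       /* new fd received the closed fd's number */
    int nval_before_reuse; /* POLLNVAL seen while the number was closed */
    int nval_after_reuse;  /* POLLNVAL seen after the number was reused */
};

/* Assumes a single-threaded process, so nothing else opens fds meanwhile. */
struct reuse_result demo_fd_reuse(void)
{
    struct reuse_result r = { 0, 0, 0 };
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    int stale = sv[0];
    close(stale); /* connection closed, but imagine it is still in the poll set */
    r.nval_before_reuse = (poll_revents(stale) & POLLNVAL) != 0;

    int fresh = dup(sv[1]); /* stand-in for accept(): takes the lowest free number */
    r.same_number = (fresh == stale);
    r.nval_after_reuse = (poll_revents(stale) & POLLNVAL) != 0;

    if (fresh >= 0)
        close(fresh);
    close(sv[1]);
    return r;
}
```

Once the number has been reused, poll() no longer reports POLLNVAL for it, so any cleanup that waits for POLLNVAL never runs and the old handler stays attached to what is now a different connection.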

Attached (I hope) is a minimal-change patch that fixes the behavior. It's not the best patch, which would be to redesign nsca to manage the handlers and remove the one-shot behavior where the loop clears them each time (which results in amusing comments such as "DO NOT REMOVE! 01/29/2007 single process daemon will fail if this is removed").
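
As a rough sketch of the redesign mentioned above (not the attached patch, and not nsca's actual data structures), a persistent table could keep each handler next to its pollfd entry and drop both in the same step that closes the descriptor. All names here (`conn_table`, `conn_add`, `conn_find`, `conn_close`, `MAX_CONN`) are hypothetical.

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_CONN 64

typedef void (*fd_handler)(int fd, void *data);

struct conn_table {
    struct pollfd pfds[MAX_CONN];  /* passed to poll() as-is */
    fd_handler handlers[MAX_CONN]; /* handlers[i] belongs to pfds[i] */
    void *data[MAX_CONN];
    int n;
};

/* Register fd with its handler; returns the slot index, or -1 when full. */
int conn_add(struct conn_table *t, int fd, fd_handler h, void *data)
{
    if (t->n >= MAX_CONN)
        return -1;
    t->pfds[t->n].fd = fd;
    t->pfds[t->n].events = POLLIN;
    t->pfds[t->n].revents = 0;
    t->handlers[t->n] = h;
    t->data[t->n] = data;
    return t->n++;
}

/* Return the slot index for fd, or -1 if it is not registered. */
int conn_find(const struct conn_table *t, int fd)
{
    for (int i = 0; i < t->n; i++)
        if (t->pfds[i].fd == fd)
            return i;
    return -1;
}

/* Close fd and remove its slot together, so a reused descriptor number
 * can never match a stale handler. Returns 0 on success, -1 if unknown. */
int conn_close(struct conn_table *t, int fd)
{
    int i = conn_find(t, fd);
    if (i < 0)
        return -1;
    close(fd);
    t->n--;
    t->pfds[i] = t->pfds[t->n]; /* move the last entry into the gap */
    t->handlers[i] = t->handlers[t->n];
    t->data[i] = t->data[t->n];
    return 0;
}
```

Because removal swaps the last entry into the freed slot, a dispatch loop built on this table would need to re-examine slot `i` after a close, or defer removals until the pass is finished.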

This patch is against 2.7.2 but it should apply cleanly to the current r2763. It includes a fix to another problem I don't think can actually be triggered, where the poll loop could theoretically run a handler for a just-added FD because npfds++ is run inside the loop which tests against <npfds.
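
The loop-bound concern above can be sketched in isolation. This is an illustration under assumptions, not nsca code: `struct poll_set`, `handle_and_register` and both pass functions are hypothetical, and new slots are deliberately marked ready to make the effect visible, which the report suggests does not happen in practice.

```c
#define MAX_FDS 8

struct poll_set {
    int ready[MAX_FDS]; /* stand-in for pollfd.revents */
    int npfds;          /* number of entries in use */
};

/* Hypothetical handler: handling an event appends a new entry. */
static void handle_and_register(struct poll_set *ps)
{
    if (ps->npfds < MAX_FDS) {
        ps->ready[ps->npfds] = 1; /* assumption: models stale revents in the new slot */
        ps->npfds++;
    }
}

/* Bound re-read on every iteration: entries added during the pass are visited. */
int run_pass_live_bound(struct poll_set *ps)
{
    int handled = 0;
    for (int i = 0; i < ps->npfds; i++) {
        if (ps->ready[i]) {
            ps->ready[i] = 0;
            handle_and_register(ps);
            handled++;
        }
    }
    return handled;
}

/* Bound captured before the pass: only pre-existing entries are visited. */
int run_pass_snapshot(struct poll_set *ps)
{
    int handled = 0;
    int n = ps->npfds;
    for (int i = 0; i < n; i++) {
        if (ps->ready[i]) {
            ps->ready[i] = 0;
            handle_and_register(ps);
            handled++;
        }
    }
    return handled;
}
```

Starting from two ready entries, the live-bound pass keeps running into the slots it has just added, while the snapshot pass handles only the two original entries and leaves the new ones for the next poll() call.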

I'm happy to go through the buggy code path in more detail if anyone wants; it was quite tricky to track down.
Additional Information
Tags No tags attached.
OS
OS Version
Attached Files nsca-2.7.2-uom.patch (1,313 bytes) 2014-09-30 19:59

Notes
(0001319)
awiddersheim (reporter)
2014-10-13 14:57

I use NSCA pretty extensively and have had this same issue, among others, with NSCA. I've brought up quite a few of these issues to the Nagios people on their forums but I have not seen any forward movement.

http://support.nagios.com/forum/viewtopic.php?f=35&t=25233

I think they would prefer people use NRDP. In any case my fork of NSCA is here with a good number of fixes that we have been running in production for a few months now without issue:

https://github.com/awiddersheim/nsca-aw
(0001514)
jpdionne (reporter)
2015-11-18 16:53

Hi, I have encountered this issue on 2.7.2, applied your patch and haven't seen the freeze so far.

PS: for those having the problem on the same kind of system as me, the patch applied successfully to ClusterStor version 1.5.0 SU10.
(0001667)
jfrickson (manager)
2016-09-30 13:02

Used the supplied patch. Fix is currently in the nsca-2-9-2RC1 branch at https://github.com/NagiosEnterprises/nsca/tree/nsca-2-9-2RC1 via commit https://github.com/NagiosEnterprises/nsca/commit/98dc4795edaf2fd5c5fe123e53a8c9b3a2b0585c

Issue History
Date Modified Username Field Change
2014-09-30 19:59 mib New Issue
2014-09-30 19:59 mib File Added: nsca-2.7.2-uom.patch
2014-10-13 14:57 awiddersheim Note Added: 0001319
2015-09-25 12:14 jfrickson Status new => assigned
2015-09-25 12:14 jfrickson Assigned To => nagios_staff
2015-11-18 16:53 jpdionne Note Added: 0001514
2016-09-30 13:01 jfrickson Assigned To nagios_staff => jfrickson
2016-09-30 13:02 jfrickson Note Added: 0001667
2016-09-30 13:02 jfrickson Status assigned => resolved
2016-09-30 13:02 jfrickson Resolution open => fixed

