Nagios Bug and Feature Tracker

ID 0000331
Category [Nagios Core] Checks
Severity major
Reproducibility sometimes
Date Submitted 2012-05-08 05:03
Last Update 2013-02-05 05:40
Reporter sbaynes
View Status public
Assigned To estanley
Priority normal
Resolution open
Status assigned
Product Version
Summary 0000331: Long passive checks truncated/split
Description Long passive check results (typically 4k-6k bytes) appeared to be truncated. A closer look at the log file shows that they are being split into two commands in the command pipe.
Additional Information The issue is that Nagios opens the command pipe with non-blocking input but does not allow for the possibility that the fgets used to read the pipe might not return a complete line at a time.
Checking whether an end of line has been read, retaining partial lines, and appending subsequent reads to them fixes this. A patch that we are deploying internally to do this is attached.
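The idea of retaining partial lines can be sketched as follows. This is only an illustration in Python, not the attached patch (which is C and is not reproduced here); the class name `LineAssembler`, the host and service names, and the timestamps are all made up for the example, and it assumes a single writer on the pipe.

```python
from typing import List


class LineAssembler:
    """Accumulate chunks from a non-blocking read and emit only complete lines."""

    def __init__(self) -> None:
        # Bytes read so far that have not yet been terminated by a newline.
        self._partial = b""

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add a chunk and return every newline-terminated line now available."""
        data = self._partial + chunk
        # Everything before the last newline is complete; the tail is kept
        # for the next call (it is empty if the chunk ended with a newline).
        *complete, self._partial = data.split(b"\n")
        return complete


assembler = LineAssembler()
# First read returns only the start of a command (hypothetical content).
first = assembler.feed(b"[1336453380] PROCESS_SERVICE_CHECK_RESULT;web1;disk;0;OK - first ha")
# Second read delivers the rest of it plus the start of another command.
second = assembler.feed(b"lf, second half\n[1336453381] PROCESS_")
```

Here `first` is empty because no newline has been seen yet, and `second` contains the single reassembled command; the trailing fragment is held back until its newline arrives.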
Tags No tags attached.
Nagios Version 3.3.1
OS Linux
OS Version 2.6.32.26+drm33.12-3.-smp
Attached Files commandpipereadtrunc.patch (3,179 bytes) 2012-05-08 05:03

-  Notes
(0000474)
ageric (administrator)
2012-06-07 07:33

While I'm sure the patch works most of the time, it's really not a stable long-term fix.

If several processes try to write at the same time and one of them fills up the pipe with a half-written command, it's entirely possible that Nagios will read half the command from one process and then continue to read an entirely different command from a different process, leading to one command being missed entirely.

In my opinion, that's worse than not getting all 4K of a service status output line passed onto Nagios.

Fortunately, the upcoming I/O broker may have a solution for this, since it will enable Nagios to have a Unix domain socket, where much larger data loads are possible and data can be read from multiple sources at once. That removes the need to work out whether a command without a newline is continued by the next line read, or whether that next line is just a new command whose sending process happened to be scheduled first by the kernel.

In the meantime, checking for '[' followed by a timestamp would be a good way to ensure that we don't miss any incoming commands, although without careful consideration, such a patch might make a local DoS attack possible.
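A rough sketch of that check, in Python for illustration only. It assumes the usual external command layout of a bracketed epoch timestamp followed by an upper-case command name; the function name and the sample lines are hypothetical, and this is not code from Nagios.

```python
import re

# A new external command is assumed to start with "[<epoch seconds>] COMMAND_NAME".
COMMAND_START = re.compile(rb"^\[\d+\] [A-Z0-9_]+")


def looks_like_command_start(line: bytes) -> bool:
    """Return True if the line begins like a fresh external command."""
    return COMMAND_START.match(line) is not None


# A line that starts a command versus the tail of a split one (made-up data).
fresh = looks_like_command_start(b"[1336453380] PROCESS_SERVICE_CHECK_RESULT;web1;disk;0;OK")
tail = looks_like_command_start(b"lf of some long plugin output")
```

A reader could use such a test to decide whether the next line read continues a retained partial line or begins a new command. Any real implementation would presumably also need to cap how much unterminated data it retains, which this sketch does not do.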
(0000664)
estanley (administrator)
2013-01-12 13:02

I was able to send a 6k passive service check result to Nagios core 3.4.4 and it was not truncated. The input buffer for commands is 8k (and has been since at least 3.3.1), so that makes sense. If you have further information about how to reproduce this issue, please add it to the ticket. Otherwise, I think this should be closed as not a bug (at least not for the size buffer claimed).
(0000666)
sbaynes (reporter)
2013-01-14 10:59

The problem is when multiple requests come in faster than they can be read from the buffer. 2*6k=12k which is > 8k. I guess the second gets split into two writes and if the first of those is then read before the second gets added to the buffer then it goes wrong.

So you will need to send bursts of multiple 6k results to the server to see the problem. Not just a single result.
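A burst like that could be generated with something along these lines. This is a sketch under assumptions, not a tested reproduction: the command file path must be supplied by the caller, the host and service names and the timestamp are invented, and the 8192 figure is the input buffer size quoted earlier in this ticket.

```python
import os
from typing import List


def build_check_result(timestamp: int, host: str, service: str,
                       code: int, output: str) -> bytes:
    """Format one passive service check result as an external command line."""
    line = f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"
    return line.encode()


def write_burst(command_file: str, commands: List[bytes]) -> int:
    """Write the commands back to back and return the number of bytes written."""
    fd = os.open(command_file, os.O_WRONLY)
    try:
        # One write call per command, with no pause between them.
        return sum(os.write(fd, cmd) for cmd in commands)
    finally:
        os.close(fd)


# Four results of roughly 6k each (hypothetical host, service and timestamp).
burst = [build_check_result(1358161140 + i, "web1", "disk", 0, "x" * 6000)
         for i in range(4)]
burst_size = sum(len(cmd) for cmd in burst)
exceeds_input_buffer = burst_size > 8192
```

Calling `write_burst` with the path of the Nagios command file and `burst` would then push all four results at once, so that more than 8k is pending before the reader can drain it.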

I have never tried extracting this into a simple setup outside our full setup. I guess I could - but it will take me time to set up some infrastructure.
(0000673)
estanley (administrator)
2013-02-05 05:40

External commands are read from the input buffer very quickly, as quickly as the thread can run in most cases. From there they go into a circular buffer that has a configurable number of slots (4096 by default), each of unlimited size. The only case where there is a delay is when the slots are full, and in that case the insert is retried every 250 ms.

I agree with Andreas that the patch provided has ordering problems, so I don't believe it is the correct approach.

I would first ensure that your command_check_interval is short enough that you can empty the circular buffer quickly. If you're receiving a large number of these large responses in short order, please also verify that external_command_buffer_slots is high enough to hold them. If that fails, try increasing the input buffer size, MAX_EXTERNAL_COMMAND_LENGTH in common.h, to see if that solves the issue. If it does, I believe a patch to increase that, or possibly make it a configuration parameter, would be a better approach.
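As an illustration of the first two settings, a nagios.cfg fragment might look like the following. The values are examples only, not recommendations; -1 asks Nagios to check for external commands as often as possible, and the slot count shown is simply double the default mentioned above.

```
# nagios.cfg (illustrative values, adjust for your own load)
command_check_interval=-1
external_command_buffer_slots=8192
```

MAX_EXTERNAL_COMMAND_LENGTH is a compile-time constant, so changing it means editing common.h and rebuilding Nagios.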

- Issue History
Date Modified Username Field Change
2012-05-08 05:03 sbaynes New Issue
2012-05-08 05:03 sbaynes File Added: commandpipereadtrunc.patch
2012-05-08 05:03 sbaynes Nagios Version => 3.3.1
2012-05-08 05:03 sbaynes OS => Linux
2012-05-08 05:03 sbaynes OS Version => 2.6.32.26+drm33.12-3.-smp
2012-05-15 09:25 bfek18 Issue Monitored: bfek18
2012-06-07 07:33 ageric Note Added: 0000474
2012-09-18 10:11 ageric Category Passive Checks => Checks
2013-01-12 12:58 estanley Status new => assigned
2013-01-12 12:58 estanley Assigned To => estanley
2013-01-12 13:02 estanley Note Added: 0000664
2013-01-14 10:59 sbaynes Note Added: 0000666
2013-02-05 05:40 estanley Note Added: 0000673

