Nagios Bug and Feature Tracker
Bug and Feature Tracker

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0000128 [Nagios Core] Checks minor have not tried 2010-02-18 07:00 2011-01-11 10:56
Reporter cameleon078 View Status public  
Assigned To egalstad
Priority normal Resolution fixed  
Status resolved   Product Version
Summary 0000128: Service in soft state after hard state without recovery
Description Hello,

Here is the history of service CPU status on host test-server (extract of nagios.log in attachement) :

CRITICAL;SOFT;1
CRITICAL;SOFT;2
CRITICAL;HARD;3
CRITICAL;SOFT;1

=> Why I don't have after CRITICAL;HARD;3 and before CRITICAL;SOFT;1 : OK;HARD;3.

I think the host test-server was down and when it's come back the status of service changed from hard to soft ...

Thanks.
Additional Information Here is a piece of code of checks.c in Nagios :

/* ADDED IF STATEMENT 01-17-05 EG */
/* 01-17-05: Services in hard problem states before hosts went down would sometimes come back as soft problem states after */
/* the hosts recovered. This caused problems, so hopefully this will fix it */
if(temp_service->state_type==SOFT_STATE)
  temp_service->current_attempt=1;
}

Is this workaround works fine ?
Tags No tags attached.
Nagios Version 3.2.0
OS Linux CentOS
OS Version 5.4
Attached Files txt file icon extract_nagios_log_01.txt [^] (1,130 bytes) 2010-02-18 07:00

- Relationships

-  Notes
(0000139)
kohts (reporter)
2010-03-02 10:23

The problem comes from this code (base/checks.c):

------
/* put service into a hard state without attempting check retries and don't send out notifications about it */
temp_service->host_problem_at_last_check=TRUE;
temp_service->state_type=HARD_STATE;
temp_service->last_hard_state=temp_service->current_state;
temp_service->current_attempt=1;
------

The code is executed whenever
service is not OK and host is not OK,
and it always updates service retries to 1.

If host goes up while service is still not OK
service becomes SOFT which totally breaks logic.

The code seems to be very old, at least
it is present in first version available in cvs:
http://nagios.cvs.sourceforge.net/viewvc/nagios/nagios/base/checks.c?revision=1.1&view=markup [^]

It seems to be totally unneeded in current version,
because HARD/SOFT transitions are managed
by the code below, independently of host state.

So the only valid meaning of this code can be
host_problem_at_last_check flag management,
everything else should be deleted.

Hence the patch against 3.2.0:

--- checks.c.orig 2009-08-11 20:56:39.000000000 +0400
+++ checks.c 2010-03-02 18:41:42.000000000 +0300
@@ -1412,9 +1412,10 @@

                        /* put service into a hard state without attempting check retries and don't send out notifications about it */
                        temp_service->host_problem_at_last_check=TRUE;
- temp_service->state_type=HARD_STATE;
- temp_service->last_hard_state=temp_service->current_state;
- temp_service->current_attempt=1;
+
+// temp_service->state_type=HARD_STATE;
+// temp_service->last_hard_state=temp_service->current_state;
+// temp_service->current_attempt=1;
                        }
(0000140)
kohts (reporter)
2010-03-02 10:33

Here are nagios.log and debug.log for the described scenario,
host fws1-041.yandex.ru, service unispace.

nagios.log:
-----------
pechora:~# cat /var/log/nagios3/nagios.log | /usr/local/bin/read_nagios_log.pl | grep fws1-041
Tue Mar 2 00:00:00 2010 CURRENT HOST STATE: fws1-041.yandex.ru;UP;HARD;1;FPING OK - fws1-041.yandex.ru (loss=0%, rta=0.430000 ms)
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;COMP;OK;HARD;1;HTTP OK HTTP/1.1 200 Ok - 11783 bytes in 0.004 seconds
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;META;OK;HARD;1;OK
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;gmirror;OK;HARD;1;OK
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;hw_errs;OK;HARD;1;Ok
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;smart_temp;OK;HARD;1;ok
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;ssh;OK;HARD;1;SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 1.99)
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;unispace;OK;HARD;1;ok
Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;watchdog;OK;HARD;1;Ok
Tue Mar 2 03:33:37 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;0% free on /db.
Tue Mar 2 03:33:37 2010 GLOBAL SERVICE EVENT HANDLER: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;golem_event_handler
Tue Mar 2 03:37:45 2010 HOST ALERT: fws1-041.yandex.ru;DOWN;SOFT;1;FPING CRITICAL - fws1-041.yandex.ru (loss=100% )
Tue Mar 2 03:37:45 2010 GLOBAL HOST EVENT HANDLER: fws1-041.yandex.ru;DOWN;SOFT;1;golem_event_handler
Tue Mar 2 03:38:29 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;HARD;1;0% free on /db.
Tue Mar 2 03:38:35 2010 HOST ALERT: fws1-041.yandex.ru;UP;SOFT;2;FPING OK - fws1-041.yandex.ru (loss=0%, rta=0.450000 ms)
Tue Mar 2 03:38:35 2010 GLOBAL HOST EVENT HANDLER: fws1-041.yandex.ru;UP;SOFT;2;golem_event_handler
Tue Mar 2 03:43:29 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;0% free on /db.
Tue Mar 2 03:43:29 2010 GLOBAL SERVICE EVENT HANDLER: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;golem_event_handler
Tue Mar 2 03:48:30 2010 SERVICE NOTIFICATION: search;fws1-041.yandex.ru;unispace;CRITICAL;notify-by-email;0% free on /db.
Tue Mar 2 04:03:28 2010 SERVICE NOTIFICATION: search;fws1-041.yandex.ru;unispace;CRITICAL;notify-by-email;0% free on /db.

excerpts from debug.log:
------------------------
Tue Mar 2 03:33:37 2010 (176849) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:33:37 2010 (176853) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:33:37 2010 (176856) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK:
 Yes, RETURN CODE: 2, OUTPUT: 0% free on /db.
Tue Mar 2 03:33:37 2010 (176865) [016.2] [pid=18915] Parsing check output...
Tue Mar 2 03:33:37 2010 (176870) [016.2] [pid=18915] Short Output: 0% free on /db.
Tue Mar 2 03:33:37 2010 (176873) [016.2] [pid=18915] Long Output: NULL
Tue Mar 2 03:33:37 2010 (176876) [016.2] [pid=18915] Perf Data: NULL
Tue Mar 2 03:33:37 2010 (176880) [016.2] [pid=18915] ST: HARD CA: 1 MA: 2 CS: 2 LS: 0 LHS: 0
Tue Mar 2 03:33:37 2010 (176883) [016.2] [pid=18915] Service has changed state since last check!
Tue Mar 2 03:33:37 2010 (176887) [016.1] [pid=18915] Service is in a non-OK state!
Tue Mar 2 03:33:37 2010 (176890) [016.1] [pid=18915] Host is currently UP, so we'll recheck its state to make sure...
Tue Mar 2 03:33:37 2010 (176893) [016.0] [pid=18915] ** Running async check of host 'fws1-041.yandex.ru'...
Tue Mar 2 03:33:37 2010 (176898) [016.0] [pid=18915] Checking host 'fws1-041.yandex.ru'...
Tue Mar 2 03:33:37 2010 (176902) [016.2] [pid=18915] Adjusting check attempt number for host 'fws1-041.yandex.ru': current attempt=1/5, state=0, state type=1
Tue Mar 2 03:33:37 2010 (176905) [016.2] [pid=18915] New check attempt number = 1
Tue Mar 2 03:33:37 2010 (176922) [2320.2] [pid=18915] Raw Command Input: $USER1$/check_fping $HOSTADDRESS$ 20 60 10.0 100.0 -p 5
Tue Mar 2 03:33:37 2010 (176927) [2320.2] [pid=18915] Expanded Command Output: $USER1$/check_fping $HOSTADDRESS$ 20 60 10.0 100.0 -p 5
Tue Mar 2 03:33:37 2010 (177370) [016.1] [pid=18915] Check result output will be written to '/tmp/checkGvQ1t9' (fd=6)
Tue Mar 2 03:33:37 2010 (183186) [016.2] [pid=18915] Host check is executing in child process (pid=24156)
Tue Mar 2 03:33:37 2010 (183284) [016.1] [pid=18915] Current/Max Attempt(s): 1/2
Tue Mar 2 03:33:37 2010 (183292) [016.1] [pid=18915] Host is UP, so we'll retry the service check...
Tue Mar 2 03:33:37 2010 (183408) [016.1] [pid=18915] Running global event handler for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:33:37 2010 (183421) [2320.2] [pid=18915] Raw Command Input: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:33:37 2010 (183429) [2320.2] [pid=18915] Expanded Command Output: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:33:37 2010 (183455) [016.2] [pid=18915] Raw global service event handler command line: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:33:37 2010 (183463) [016.2] [pid=18915] Processed global service event handler command line: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:33:37 2010 (235680) [016.2] [pid=24156] Moving temp check result file '/tmp/checkGvQ1t9' to queue file '/var/lib/nagios3/spool/checkresults/cWDb97f'...
Tue Mar 2 03:33:37 2010 (237716) [016.1] [pid=18915] Rescheduling next check of service at Tue Mar 2 03:35:49 2010
Tue Mar 2 03:33:37 2010 (237751) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:33:37 2010 (237757) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:33:37 2010 (237763) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
Tue Mar 2 03:33:37 2010 (237806) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/ctD5CkR'

Tue Mar 2 03:38:29 2010 (050029) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:38:29 2010 (050034) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:38:29 2010 (050037) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK:
 Yes, RETURN CODE: 2, OUTPUT: 0% free on /db.
Tue Mar 2 03:38:29 2010 (050061) [016.2] [pid=18915] Parsing check output...
Tue Mar 2 03:38:29 2010 (050067) [016.2] [pid=18915] Short Output: 0% free on /db.
Tue Mar 2 03:38:29 2010 (050070) [016.2] [pid=18915] Long Output: NULL
Tue Mar 2 03:38:29 2010 (050074) [016.2] [pid=18915] Perf Data: NULL
Tue Mar 2 03:38:29 2010 (050078) [016.2] [pid=18915] ST: SOFT CA: 2 MA: 2 CS: 2 LS: 2 LHS: 0
Tue Mar 2 03:38:29 2010 (050082) [016.2] [pid=18915] Service had a HARD STATE CHANGE!!
Tue Mar 2 03:38:29 2010 (050111) [016.1] [pid=18915] Service is in a non-OK state!
Tue Mar 2 03:38:29 2010 (050116) [016.1] [pid=18915] Host is currently DOWN/UNREACHABLE.
Tue Mar 2 03:38:29 2010 (050119) [016.1] [pid=18915] Assuming host is in same state as before...
Tue Mar 2 03:38:29 2010 (050151) [016.1] [pid=18915] Current/Max Attempt(s): 1/2
Tue Mar 2 03:38:29 2010 (050156) [016.1] [pid=18915] Host isn't UP, so we won't retry the service check...
Tue Mar 2 03:38:29 2010 (050612) [016.1] [pid=18915] Rescheduling next check of service at Tue Mar 2 03:40:49 2010
Tue Mar 2 03:38:29 2010 (050622) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:38:29 2010 (050627) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=6.25, PSC=6.25%
Tue Mar 2 03:38:29 2010 (050635) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:38:29 2010 (050638) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=6.25, PSC=6.25%
Tue Mar 2 03:38:29 2010 (050654) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/cv9dTg7'

Tue Mar 2 03:43:29 2010 (444940) [016.2] [pid=18915] Found a check result (#6235) to handle...
Tue Mar 2 03:43:29 2010 (444945) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:43:29 2010 (444949) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:43:29 2010 (444952) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK:
 Yes, RETURN CODE: 2, OUTPUT: 0% free on /db.
Tue Mar 2 03:43:29 2010 (444960) [016.2] [pid=18915] Parsing check output...
Tue Mar 2 03:43:29 2010 (444964) [016.2] [pid=18915] Short Output: 0% free on /db.
Tue Mar 2 03:43:29 2010 (444967) [016.2] [pid=18915] Long Output: NULL
Tue Mar 2 03:43:29 2010 (444970) [016.2] [pid=18915] Perf Data: NULL
Tue Mar 2 03:43:29 2010 (444973) [016.2] [pid=18915] ST: HARD CA: 1 MA: 2 CS: 2 LS: 2 LHS: 2
Tue Mar 2 03:43:29 2010 (444977) [016.1] [pid=18915] Service is in a non-OK state!
Tue Mar 2 03:43:29 2010 (444980) [016.1] [pid=18915] Host is currently UP, so we'll recheck its state to make sure...
Tue Mar 2 03:43:29 2010 (444983) [016.1] [pid=18915] * Using last known host state: 0
Tue Mar 2 03:43:29 2010 (444986) [016.1] [pid=18915] Current/Max Attempt(s): 1/2
Tue Mar 2 03:43:29 2010 (444989) [016.1] [pid=18915] Host is UP, so we'll retry the service check...
Tue Mar 2 03:43:29 2010 (445669) [016.1] [pid=18915] Running global event handler for service 'unispace' on host 'fws1-041.yandex.ru'...
Tue Mar 2 03:43:29 2010 (445680) [2320.2] [pid=18915] Raw Command Input: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:43:29 2010 (445687) [2320.2] [pid=18915] Expanded Command Output: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:43:29 2010 (445691) [016.2] [pid=18915] Raw global service event handler command line: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:43:29 2010 (445697) [016.2] [pid=18915] Processed global service event handler command line: /usr/local/bin/golem_queue_event.sh
Tue Mar 2 03:43:29 2010 (495029) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:43:29 2010 (495050) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping...
Tue Mar 2 03:43:29 2010 (495056) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=12.11, PSC=12.11%
Tue Mar 2 03:43:29 2010 (495098) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/cci2LHf'
(0000141)
kohts (reporter)
2010-03-02 10:57

Ah, forgot to mention, worst consequence
is that event handlers are not executed
in some cases, for example when service
goes HARD CRITICAL, SOFT CRITICAL
(after host being down and up) and then
SOFT OK.

So I can't reliably propagate events
to external system.

The same is true for notifications.
(0000142)
kohts (reporter)
2010-03-05 07:31

Forget about all my comments, they were wrong.

> CRITICAL;SOFT;1
> CRITICAL;SOFT;2
> CRITICAL;HARD;3
> CRITICAL;SOFT;1

The above cited state transition happens
when after CRITICAL;HARD;3 host goes down
and there comes another CRITICAL for the service.

It resets attempts to 1 which after host goes up
means SOFT. Now if service does not recover
it becomes HARD (CRITICAL) again after retries
or it becomes OK (SOFT).

You should think differently about SOFT/HARD states.
(0000143)
kohts (reporter)
2010-03-05 07:31
edited on: 2010-03-05 07:32

I think this ticket should be closed.

(0000145)
kohts (reporter)
2010-03-05 07:40

My problem is described and fixed in this issue:
http://tracker.nagios.org/view.php?id=130 [^]
(0000155)
cameleon078 (reporter)
2010-03-22 03:47
edited on: 2010-03-22 03:47

I don't think this ticket should be closed because it's a problem more important that described here. It concerns event handler but notifications too !!

(0000156)
cameleon078 (reporter)
2010-03-22 09:21

I propose this patch :

*** checks.c 2010-03-22 15:19:04.000000000 +0100
--- nagios-3.2.1/base/checks.c 2009-08-11 18:56:39.000000000 +0200
***************
*** 1460,1468 ****
                                log_debug_info(DEBUGL_CHECKS,1,"Host is UP, so we'll retry the service check...\n");

                                /* this is a soft state */
! if (temp_service->current_attempt < temp_service->max_attempts) {
! temp_service->state_type=SOFT_STATE;
! }

                                /* log the service check retry */
                                log_service_event(temp_service);
--- 1460,1466 ----
                                log_debug_info(DEBUGL_CHECKS,1,"Host is UP, so we'll retry the service check...\n");

                                /* this is a soft state */
! temp_service->state_type=SOFT_STATE;

                                /* log the service check retry */
                                log_service_event(temp_service);
(0000197)
ndimmock (reporter)
2010-07-05 05:46
edited on: 2010-07-05 06:49

I am seeing the same problem in version 3.2.1: if a service goes CRITICAL HARD but the host remains UP, then the service retry counter gets reset to 1 which causes a state change to CRITICAL SOFT then back to CRITICAL HARD when the retry limit is exceeded again.

This is causing notifications to be resent every 4 minutes instead of the 60 I have configured via notification_interval and is not the behaviour I observe in version 2.12.

EDIT: Correction, the problem only seems to exist if use_large_installation_tweaks=1.

(0000203)
egalstad (administrator)
2010-08-04 21:35

Applied to CVS - thanks!
(0000264)
tonvoon (administrator)
2011-01-11 10:56

I've made a change to checks.c to fix a problem caused by this. More details at http://bit.ly/gZUzlR [^]

- Issue History
Date Modified Username Field Change
2010-02-18 07:00 cameleon078 New Issue
2010-02-18 07:00 cameleon078 File Added: extract_nagios_log_01.txt
2010-02-18 07:00 cameleon078 Nagios Version => 3.2.0
2010-02-18 07:00 cameleon078 OS => Linux CentOS
2010-02-18 07:00 cameleon078 OS Version => 5.4
2010-03-02 10:23 kohts Note Added: 0000139
2010-03-02 10:33 kohts Note Added: 0000140
2010-03-02 10:57 kohts Note Added: 0000141
2010-03-05 07:31 kohts Note Added: 0000142
2010-03-05 07:31 kohts Note Added: 0000143
2010-03-05 07:32 kohts Note Edited: 0000143
2010-03-05 07:40 kohts Note Added: 0000145
2010-03-22 03:47 cameleon078 Note Added: 0000155
2010-03-22 03:47 cameleon078 Note Edited: 0000155
2010-03-22 09:21 cameleon078 Note Added: 0000156
2010-07-05 05:46 ndimmock Note Added: 0000197
2010-07-05 05:47 ndimmock Issue Monitored: ndimmock
2010-07-05 06:49 ndimmock Note Edited: 0000197
2010-08-04 21:35 egalstad Note Added: 0000203
2010-08-04 21:35 egalstad Status new => resolved
2010-08-04 21:35 egalstad Resolution open => fixed
2010-08-04 21:35 egalstad Assigned To => egalstad
2011-01-11 10:56 tonvoon Note Added: 0000264
2011-01-12 04:10 kohts Issue Monitored: kohts


Mantis 1.1.7[^]
Copyright © 2000 - 2008 Mantis Group
Powered by Mantis Bugtracker