| Anonymous | Login | Signup for a new account | 2013-05-24 02:24 EDT |
| Main | My View | View Issues | Change Log | Roadmap |
| Viewing Issue Simple Details [ Jump to Notes ] | [ View Advanced ] [ Issue History ] [ Print ] | ||||||
| ID | Category | Severity | Reproducibility | Date Submitted | Last Update | ||
| 0000128 | [Nagios Core] Checks | minor | have not tried | 2010-02-18 07:00 | 2011-01-11 10:56 | ||
| Reporter | cameleon078 | View Status | public | ||||
| Assigned To | egalstad | ||||||
| Priority | normal | Resolution | fixed | ||||
| Status | resolved | Product Version | |||||
| Summary | 0000128: Service in soft state after hard state without recovery | ||||||
| Description |
Hello, Here is the history of service CPU status on host test-server (extract of nagios.log in attachement) : CRITICAL;SOFT;1 CRITICAL;SOFT;2 CRITICAL;HARD;3 CRITICAL;SOFT;1 => Why I don't have after CRITICAL;HARD;3 and before CRITICAL;SOFT;1 : OK;HARD;3. I think the host test-server was down and when it's come back the status of service changed from hard to soft ... Thanks. |
||||||
| Additional Information |
Here is a piece of code of checks.c in Nagios : /* ADDED IF STATEMENT 01-17-05 EG */ /* 01-17-05: Services in hard problem states before hosts went down would sometimes come back as soft problem states after */ /* the hosts recovered. This caused problems, so hopefully this will fix it */ if(temp_service->state_type==SOFT_STATE) temp_service->current_attempt=1; } Is this workaround works fine ? |
||||||
| Tags | No tags attached. | ||||||
| Nagios Version | 3.2.0 | ||||||
| OS | Linux CentOS | ||||||
| OS Version | 5.4 | ||||||
| Attached Files |
|
||||||
|
|
|||||||
Notes |
|
|
(0000139) kohts (reporter) 2010-03-02 10:23 |
The problem comes from this code (base/checks.c): ------ /* put service into a hard state without attempting check retries and don't send out notifications about it */ temp_service->host_problem_at_last_check=TRUE; temp_service->state_type=HARD_STATE; temp_service->last_hard_state=temp_service->current_state; temp_service->current_attempt=1; ------ The code is executed whenever service is not OK and host is not OK, and it always updates service retries to 1. If host goes up while service is still not OK service becomes SOFT which totally breaks logic. The code seems to be very old, at least it is present in first version available in cvs: http://nagios.cvs.sourceforge.net/viewvc/nagios/nagios/base/checks.c?revision=1.1&view=markup [^] It seems to be totally unneeded in current version, because HARD/SOFT transitions are managed by the code below, independently of host state. So the only valid meaning of this code can be host_problem_at_last_check flag management, everything else should be deleted. Hence the patch against 3.2.0: --- checks.c.orig 2009-08-11 20:56:39.000000000 +0400 +++ checks.c 2010-03-02 18:41:42.000000000 +0300 @@ -1412,9 +1412,10 @@ /* put service into a hard state without attempting check retries and don't send out notifications about it */ temp_service->host_problem_at_last_check=TRUE; - temp_service->state_type=HARD_STATE; - temp_service->last_hard_state=temp_service->current_state; - temp_service->current_attempt=1; + +// temp_service->state_type=HARD_STATE; +// temp_service->last_hard_state=temp_service->current_state; +// temp_service->current_attempt=1; } |
|
(0000140) kohts (reporter) 2010-03-02 10:33 |
Here are nagios.log and debug.log for the described scenario, host fws1-041.yandex.ru, service unispace. nagios.log: ----------- pechora:~# cat /var/log/nagios3/nagios.log | /usr/local/bin/read_nagios_log.pl | grep fws1-041 Tue Mar 2 00:00:00 2010 CURRENT HOST STATE: fws1-041.yandex.ru;UP;HARD;1;FPING OK - fws1-041.yandex.ru (loss=0%, rta=0.430000 ms) Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;COMP;OK;HARD;1;HTTP OK HTTP/1.1 200 Ok - 11783 bytes in 0.004 seconds Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;META;OK;HARD;1;OK Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;gmirror;OK;HARD;1;OK Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;hw_errs;OK;HARD;1;Ok Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;smart_temp;OK;HARD;1;ok Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;ssh;OK;HARD;1;SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 1.99) Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;unispace;OK;HARD;1;ok Tue Mar 2 00:00:00 2010 CURRENT SERVICE STATE: fws1-041.yandex.ru;watchdog;OK;HARD;1;Ok Tue Mar 2 03:33:37 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;0% free on /db. Tue Mar 2 03:33:37 2010 GLOBAL SERVICE EVENT HANDLER: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;golem_event_handler Tue Mar 2 03:37:45 2010 HOST ALERT: fws1-041.yandex.ru;DOWN;SOFT;1;FPING CRITICAL - fws1-041.yandex.ru (loss=100% ) Tue Mar 2 03:37:45 2010 GLOBAL HOST EVENT HANDLER: fws1-041.yandex.ru;DOWN;SOFT;1;golem_event_handler Tue Mar 2 03:38:29 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;HARD;1;0% free on /db. Tue Mar 2 03:38:35 2010 HOST ALERT: fws1-041.yandex.ru;UP;SOFT;2;FPING OK - fws1-041.yandex.ru (loss=0%, rta=0.450000 ms) Tue Mar 2 03:38:35 2010 GLOBAL HOST EVENT HANDLER: fws1-041.yandex.ru;UP;SOFT;2;golem_event_handler Tue Mar 2 03:43:29 2010 SERVICE ALERT: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;0% free on /db. Tue Mar 2 03:43:29 2010 GLOBAL SERVICE EVENT HANDLER: fws1-041.yandex.ru;unispace;CRITICAL;SOFT;1;golem_event_handler Tue Mar 2 03:48:30 2010 SERVICE NOTIFICATION: search;fws1-041.yandex.ru;unispace;CRITICAL;notify-by-email;0% free on /db. Tue Mar 2 04:03:28 2010 SERVICE NOTIFICATION: search;fws1-041.yandex.ru;unispace;CRITICAL;notify-by-email;0% free on /db. excerpts from debug.log: ------------------------ Tue Mar 2 03:33:37 2010 (176849) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:33:37 2010 (176853) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:33:37 2010 (176856) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: 0% free on /db. Tue Mar 2 03:33:37 2010 (176865) [016.2] [pid=18915] Parsing check output... Tue Mar 2 03:33:37 2010 (176870) [016.2] [pid=18915] Short Output: 0% free on /db. Tue Mar 2 03:33:37 2010 (176873) [016.2] [pid=18915] Long Output: NULL Tue Mar 2 03:33:37 2010 (176876) [016.2] [pid=18915] Perf Data: NULL Tue Mar 2 03:33:37 2010 (176880) [016.2] [pid=18915] ST: HARD CA: 1 MA: 2 CS: 2 LS: 0 LHS: 0 Tue Mar 2 03:33:37 2010 (176883) [016.2] [pid=18915] Service has changed state since last check! Tue Mar 2 03:33:37 2010 (176887) [016.1] [pid=18915] Service is in a non-OK state! Tue Mar 2 03:33:37 2010 (176890) [016.1] [pid=18915] Host is currently UP, so we'll recheck its state to make sure... Tue Mar 2 03:33:37 2010 (176893) [016.0] [pid=18915] ** Running async check of host 'fws1-041.yandex.ru'... Tue Mar 2 03:33:37 2010 (176898) [016.0] [pid=18915] Checking host 'fws1-041.yandex.ru'... Tue Mar 2 03:33:37 2010 (176902) [016.2] [pid=18915] Adjusting check attempt number for host 'fws1-041.yandex.ru': current attempt=1/5, state=0, state type=1 Tue Mar 2 03:33:37 2010 (176905) [016.2] [pid=18915] New check attempt number = 1 Tue Mar 2 03:33:37 2010 (176922) [2320.2] [pid=18915] Raw Command Input: $USER1$/check_fping $HOSTADDRESS$ 20 60 10.0 100.0 -p 5 Tue Mar 2 03:33:37 2010 (176927) [2320.2] [pid=18915] Expanded Command Output: $USER1$/check_fping $HOSTADDRESS$ 20 60 10.0 100.0 -p 5 Tue Mar 2 03:33:37 2010 (177370) [016.1] [pid=18915] Check result output will be written to '/tmp/checkGvQ1t9' (fd=6) Tue Mar 2 03:33:37 2010 (183186) [016.2] [pid=18915] Host check is executing in child process (pid=24156) Tue Mar 2 03:33:37 2010 (183284) [016.1] [pid=18915] Current/Max Attempt(s): 1/2 Tue Mar 2 03:33:37 2010 (183292) [016.1] [pid=18915] Host is UP, so we'll retry the service check... Tue Mar 2 03:33:37 2010 (183408) [016.1] [pid=18915] Running global event handler for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:33:37 2010 (183421) [2320.2] [pid=18915] Raw Command Input: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:33:37 2010 (183429) [2320.2] [pid=18915] Expanded Command Output: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:33:37 2010 (183455) [016.2] [pid=18915] Raw global service event handler command line: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:33:37 2010 (183463) [016.2] [pid=18915] Processed global service event handler command line: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:33:37 2010 (235680) [016.2] [pid=24156] Moving temp check result file '/tmp/checkGvQ1t9' to queue file '/var/lib/nagios3/spool/checkresults/cWDb97f'... Tue Mar 2 03:33:37 2010 (237716) [016.1] [pid=18915] Rescheduling next check of service at Tue Mar 2 03:35:49 2010 Tue Mar 2 03:33:37 2010 (237751) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:33:37 2010 (237757) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:33:37 2010 (237763) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00% Tue Mar 2 03:33:37 2010 (237806) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/ctD5CkR' Tue Mar 2 03:38:29 2010 (050029) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:38:29 2010 (050034) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:38:29 2010 (050037) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: 0% free on /db. Tue Mar 2 03:38:29 2010 (050061) [016.2] [pid=18915] Parsing check output... Tue Mar 2 03:38:29 2010 (050067) [016.2] [pid=18915] Short Output: 0% free on /db. Tue Mar 2 03:38:29 2010 (050070) [016.2] [pid=18915] Long Output: NULL Tue Mar 2 03:38:29 2010 (050074) [016.2] [pid=18915] Perf Data: NULL Tue Mar 2 03:38:29 2010 (050078) [016.2] [pid=18915] ST: SOFT CA: 2 MA: 2 CS: 2 LS: 2 LHS: 0 Tue Mar 2 03:38:29 2010 (050082) [016.2] [pid=18915] Service had a HARD STATE CHANGE!! Tue Mar 2 03:38:29 2010 (050111) [016.1] [pid=18915] Service is in a non-OK state! Tue Mar 2 03:38:29 2010 (050116) [016.1] [pid=18915] Host is currently DOWN/UNREACHABLE. Tue Mar 2 03:38:29 2010 (050119) [016.1] [pid=18915] Assuming host is in same state as before... Tue Mar 2 03:38:29 2010 (050151) [016.1] [pid=18915] Current/Max Attempt(s): 1/2 Tue Mar 2 03:38:29 2010 (050156) [016.1] [pid=18915] Host isn't UP, so we won't retry the service check... Tue Mar 2 03:38:29 2010 (050612) [016.1] [pid=18915] Rescheduling next check of service at Tue Mar 2 03:40:49 2010 Tue Mar 2 03:38:29 2010 (050622) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:38:29 2010 (050627) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=6.25, PSC=6.25% Tue Mar 2 03:38:29 2010 (050635) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:38:29 2010 (050638) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=6.25, PSC=6.25% Tue Mar 2 03:38:29 2010 (050654) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/cv9dTg7' Tue Mar 2 03:43:29 2010 (444940) [016.2] [pid=18915] Found a check result (#6235) to handle... Tue Mar 2 03:43:29 2010 (444945) [016.1] [pid=18915] Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:43:29 2010 (444949) [016.0] [pid=18915] ** Handling check result for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:43:29 2010 (444952) [016.1] [pid=18915] HOST: fws1-041.yandex.ru, SERVICE: unispace, CHECK TYPE: Passive, OPTIONS: 0, SCHEDULED: No, RESCHEDULE: No, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: 0% free on /db. Tue Mar 2 03:43:29 2010 (444960) [016.2] [pid=18915] Parsing check output... Tue Mar 2 03:43:29 2010 (444964) [016.2] [pid=18915] Short Output: 0% free on /db. Tue Mar 2 03:43:29 2010 (444967) [016.2] [pid=18915] Long Output: NULL Tue Mar 2 03:43:29 2010 (444970) [016.2] [pid=18915] Perf Data: NULL Tue Mar 2 03:43:29 2010 (444973) [016.2] [pid=18915] ST: HARD CA: 1 MA: 2 CS: 2 LS: 2 LHS: 2 Tue Mar 2 03:43:29 2010 (444977) [016.1] [pid=18915] Service is in a non-OK state! Tue Mar 2 03:43:29 2010 (444980) [016.1] [pid=18915] Host is currently UP, so we'll recheck its state to make sure... Tue Mar 2 03:43:29 2010 (444983) [016.1] [pid=18915] * Using last known host state: 0 Tue Mar 2 03:43:29 2010 (444986) [016.1] [pid=18915] Current/Max Attempt(s): 1/2 Tue Mar 2 03:43:29 2010 (444989) [016.1] [pid=18915] Host is UP, so we'll retry the service check... Tue Mar 2 03:43:29 2010 (445669) [016.1] [pid=18915] Running global event handler for service 'unispace' on host 'fws1-041.yandex.ru'... Tue Mar 2 03:43:29 2010 (445680) [2320.2] [pid=18915] Raw Command Input: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:43:29 2010 (445687) [2320.2] [pid=18915] Expanded Command Output: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:43:29 2010 (445691) [016.2] [pid=18915] Raw global service event handler command line: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:43:29 2010 (445697) [016.2] [pid=18915] Processed global service event handler command line: /usr/local/bin/golem_queue_event.sh Tue Mar 2 03:43:29 2010 (495029) [016.1] [pid=18915] Checking service 'unispace' on host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:43:29 2010 (495050) [016.1] [pid=18915] Checking host 'fws1-041.yandex.ru' for flapping... Tue Mar 2 03:43:29 2010 (495056) [016.2] [pid=18915] LFT=5.00, HFT=20.00, CPC=12.11, PSC=12.11% Tue Mar 2 03:43:29 2010 (495098) [016.1] [pid=18915] Deleted check result file '/var/lib/nagios3/spool/checkresults/cci2LHf' |
|
(0000141) kohts (reporter) 2010-03-02 10:57 |
Ah, forgot to mention, worst consequence is that event handlers are not executed in some cases, for example when service goes HARD CRITICAL, SOFT CRITICAL (after host being down and up) and then SOFT OK. So I can't reliably propagate events to external system. The same is true for notifications. |
|
(0000142) kohts (reporter) 2010-03-05 07:31 |
Forget about all my comments, they were wrong. > CRITICAL;SOFT;1 > CRITICAL;SOFT;2 > CRITICAL;HARD;3 > CRITICAL;SOFT;1 The above cited state transition happens when after CRITICAL;HARD;3 host goes down and there comes another CRITICAL for the service. It resets attempts to 1 which after host goes up means SOFT. Now if service does not recover it becomes HARD (CRITICAL) again after retries or it becomes OK (SOFT). You should think differently about SOFT/HARD states. |
|
(0000143) kohts (reporter) 2010-03-05 07:31 edited on: 2010-03-05 07:32 |
I think this ticket should be closed. |
|
(0000145) kohts (reporter) 2010-03-05 07:40 |
My problem is described and fixed in this issue: http://tracker.nagios.org/view.php?id=130 [^] |
|
(0000155) cameleon078 (reporter) 2010-03-22 03:47 edited on: 2010-03-22 03:47 |
I don't think this ticket should be closed because it's a problem more important that described here. It concerns event handler but notifications too !! |
|
(0000156) cameleon078 (reporter) 2010-03-22 09:21 |
I propose this patch : *** checks.c 2010-03-22 15:19:04.000000000 +0100 --- nagios-3.2.1/base/checks.c 2009-08-11 18:56:39.000000000 +0200 *************** *** 1460,1468 **** log_debug_info(DEBUGL_CHECKS,1,"Host is UP, so we'll retry the service check...\n"); /* this is a soft state */ ! if (temp_service->current_attempt < temp_service->max_attempts) { ! temp_service->state_type=SOFT_STATE; ! } /* log the service check retry */ log_service_event(temp_service); --- 1460,1466 ---- log_debug_info(DEBUGL_CHECKS,1,"Host is UP, so we'll retry the service check...\n"); /* this is a soft state */ ! temp_service->state_type=SOFT_STATE; /* log the service check retry */ log_service_event(temp_service); |
|
(0000197) ndimmock (reporter) 2010-07-05 05:46 edited on: 2010-07-05 06:49 |
I am seeing the same problem in version 3.2.1: if a service goes CRITICAL HARD but the host remains UP, then the service retry counter gets reset to 1 which causes a state change to CRITICAL SOFT then back to CRITICAL HARD when the retry limit is exceeded again. This is causing notifications to be resent every 4 minutes instead of the 60 I have configured via notification_interval and is not the behaviour I observe in version 2.12. EDIT: Correction, the problem only seems to exist if use_large_installation_tweaks=1. |
|
(0000203) egalstad (administrator) 2010-08-04 21:35 |
Applied to CVS - thanks! |
|
(0000264) tonvoon (administrator) 2011-01-11 10:56 |
I've made a change to checks.c to fix a problem caused by this. More details at http://bit.ly/gZUzlR [^] |
| Mantis 1.1.7[^] Copyright © 2000 - 2008 Mantis Group |