| Anonymous | Login | Signup for a new account | 2013-05-19 14:02 EDT |
| Main | My View | View Issues | Change Log | Roadmap |
| Viewing Issue Simple Details [ Jump to Notes ] | [ View Advanced ] [ Issue History ] [ Print ] | ||||||
| ID | Category | Severity | Reproducibility | Date Submitted | Last Update | ||
| 0000031 | [Nagios Core] Check Scheduling | minor | always | 2009-06-25 05:25 | 2012-09-05 09:07 | ||
| Reporter | hsmade | View Status | public | ||||
| Assigned To | ageric | ||||||
| Priority | normal | Resolution | fixed | ||||
| Status | resolved | Product Version | |||||
| Summary | 0000031: 8 to midnight timeperiod schedules check for next year | ||||||
| Description |
I have a timeperiod config like this: define timeperiod{ timeperiod_name after_processing alias After processing sunday 08:00-23:55 monday 08:00-23:55 tuesday 08:00-23:55 wednesday 08:00-23:55 thursday 08:00-23:55 friday 08:00-23:55 saturday 08:00-23:55 } I have a check that uses this time period. When the last check for that day is executed (just before 23:55), the next scheduled check time becomes just after 8 o' clock the next day plus a year! Last Check Time: 06-22-2009 23:49:44 Check Type: ACTIVE Next Scheduled Check: 06-23-2010 08:04:39 |
||||||
| Additional Information | |||||||
| Tags | No tags attached. | ||||||
| Nagios Version | 3.0.6 | ||||||
| OS | Linux | ||||||
| OS Version | Debian Lenny | ||||||
| Attached Files | |||||||
|
|
|||||||
Notes |
|
|
(0000056) hsmade (reporter) 2009-07-02 04:53 |
I could reproduce this for nagios version 3.1.2 as well, just that the service check time is set to the next possible check time + 1 week. |
|
(0000057) hsmade (reporter) 2009-07-03 03:07 edited on: 2009-07-03 03:08 |
I found out that the rescheduling is actually done correctly (8am next day), but when it wants to check it reschedules for next week without executing the check. get_next_valid_time is called with time 8:04:53 and comes back with this same time stating PREF TIME IS VALID (-DTEST_TIMEPERIODS_B). |
|
(0000064) hsmade (reporter) 2009-07-06 02:41 edited on: 2009-07-07 02:07 |
I found the reason for this. The service is scheduled for the next morning after the last check for the day as it should. It's executed a couple of seconds early however, so run_async_service_check returns 'ERROR' as that's outside of the timeperiod. time_is_valid is then set to FALSE. Next run_scheduled_service_check tries to reschedule with a preferred time of +5 mins, which is inside the timeperiod and thus accepted. So next_valid_time=preferred_time and time_is_valid=FALSE now, which leads to a delay in scheduling of a week (in 3.1.2, in 3.0.6 a year). # checks.c#run_scheduled_service_check: result=run_async_service_check(svc,check_options,latency,TRUE,TRUE,&time_is_valid,&preferred_time); # this results in result=ERROR and time_is_valid=FALSE as it's called just before 07:59:54 if(result==ERROR){ if(svc->should_be_scheduled==TRUE){ time(¤t_time); if(current_time>=preferred_time) # true preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length)); # preferred_time is now 08:04:54 get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr); if(time_is_valid==FALSE && next_valid_time==preferred_time){ # the preferred_time was accepted by get_next_valid_time, as it falls in the timeperiod. # But time_is_valid was still set to FALSE, # so the check scheduling is delayed with a week I think the time_is_valid that's tested here should be set by get_next_valid_time, or not tested at all. Apart from that, of course it's wrong to execute the check before it's scheduled time. |
|
(0000081) hirenp (reporter) 2009-08-08 20:01 |
I am not able to reproduce nagios running a check a few seconds early as you describe above. I'm using -HEAD with very simple configs, and have not been able to cause nagios to schdule a check a week away. if you have simple configs that can reproduce this on -HEAD, please let me know. |
|
(0000082) hsmade (reporter) 2009-08-10 12:42 |
Try this: service may run from 8am to 23pm, then rewind the clock for some minutes after 23pm. After 23pm, nagios will reschedule the service for 8am the next morning. When rewinding the clock, nagios will compensate and thus reschedule the check for 8am minus 'some minutes' and hence execution fails and a reschedule is done for +1 week. |
|
(0000083) hirenp (reporter) 2009-08-10 12:57 |
ah okay, skewing the time could cause the check to be run a few minutes early, that makes a lot more sense. please test the diff sent to the development list today to see if this helps with this issue, if you're not on the list I could send you the patch. the diff is a workaround that should solve this issue. |
|
(0000090) eriksornes (reporter) 2009-09-02 05:10 |
This is a pretty serious issue for us. I think the severity should be raised. I can not trust that the nagios-check actually run. |
|
(0000091) hsmade (reporter) 2009-09-02 05:14 |
A workaround is to schedule your NTP updates every hour or even more often. |
|
(0000092) edgarwahn (reporter) 2009-09-02 07:12 |
You should cease using ntpdate and switch to ntpd, ntpd "drifts" your clock (making seconds a little bit longer or shorter until time fits). On Debian try openntpd (set DAEMON_OPTS="-s" in /etc/defaults/openntpd). |
|
(0000093) hsmade (reporter) 2009-09-02 07:20 |
Thanks for the tip! |
|
(0000170) ageric (administrator) 2010-04-26 16:27 |
Is this still an issue with 3.2.1? Some fixes to take care of this went in just before 3.2.0, I think, and another one is in cvs. If this isn't fixed in later versions, we'll need to take care of that. Otherwise, I'll close this issue in two weeks time. |
|
(0000196) rik (reporter) 2010-06-24 09:39 |
Please do not close this issue. I can confirm buggy behaviour in Nagios Core 3.2.1. We need a weekly check, fridays between 16:00-17:00. This is our define timeperiod { timeperiod_name fri1h1600 alias friday afternoon friday 16:00-17:00 } After first definition of a new service (with check_interval 15), it is scheduled to run exactly at 16:00 next friday. After that run, it gets rescheduled to run at 16:15 , 16:30, 16:45 and 17:00. Then another scheduled run at 17:15. At 17:15, Nagios decides not to actually run the plugin, but reschedule the check instead. The next run is then scheduled at 17:30, a week from now. Which happens to be the next friday like we needed, but not within the desired timeframe, so the check does not actually run then either. nagios.log reports: "[timestamp] Warning: Check of service debug weekly 15 on host testhost could not be rescheduled properly. Scheduling check for next week..." Compared to the behaviour of 3.0.6, this makes little improvement. The log entry at least makes the error detectable. Here is a possible workaround in 3.2.1. Mileage may vary with different versions. For weekly checks on Thursdays between 16:10-16:20: define timeperiod { timeperiod_name weeklyxx alias has funny way of scheduling a first run! thursday 1 16:10-16:20 thursday 2 16:10-16:20 thursday 3 16:10-16:20 thursday 4 16:10-16:20 thursday 5 16:10-16:20 } define timeperiod { timeperiod_name weeklyxxx alias shorter and more reliable 2010-06-17 / 7 16:10-16:20 } Both will reschedule as expected: next week, same day, same time (16:10). Please do not close this issue. We consider this behaviour to be a bug. |
|
(0000259) terracor (reporter) 2010-11-22 03:28 |
I can confirm this behaviour in Nagios 3.2.1 too. The workaround with a shorter timeperiod definition does not work here. Any other hints, to get around this behaviour? This bug is really nasty, cause I can't rely on the checks any longer :( |
|
(0000260) pabel (reporter) 2010-11-26 03:31 |
Is there any chance the Severity of this can be raised as this is a serious problem for us. We are running 3.2.3. |
|
(0000262) terracor (reporter) 2010-12-27 07:35 |
It would be really cool, if somebody could take care of this issue or at least raise the severity. |
|
(0000266) pabel (reporter) 2011-01-25 06:06 |
We have hit the same bug with notifications of passive checks and timeperiods. I have traced both back to the "check_time_against_period" function. |
|
(0000267) qha (reporter) 2011-01-31 08:36 |
I too would like to see an increase in severity of this issue. |
|
(0000268) eyandow (reporter) 2011-02-08 10:02 |
I would like this issue to be fixed or at least escalated as well. We are running an NTPD server and are having a remote host checking the time on that server so correct time isn't an issue. This is a pretty serious bug and lots of people are having it. |
|
(0000275) Bernd (reporter) 2011-03-12 04:46 |
There has been some discussions about scheduling issues quite a time ago. Since I believe those discussions are related to this issue I'm posting the links I remember pointing to the nagios-devel mailing list archive. http://thread.gmane.org/gmane.network.nagios.devel/6170 [^] http://thread.gmane.org/gmane.network.nagios.devel/6158 [^] http://thread.gmane.org/gmane.network.nagios.devel/6067 [^] http://thread.gmane.org/gmane.network.nagios.devel/5530 [^] About ntpd: "Running ntpd won't help at all. The timeshifts happen because the systemclock is running too fast. It doesn't really matter if you adjust it back in little ticks or in one big bang once a day." (by Sascha Runschke, article 5540) |
|
(0000296) ageric (administrator) 2011-05-10 10:03 |
Fixes have been added to Nagios after 3.2.3 which might fix this issue. If any of you experiencing this could test the latest svn code (it's quite stable), I'd be most grateful. Trying (and failing) to reproduce this is frustrating, to say the least. |
|
(0000302) pabel (reporter) 2011-05-11 03:31 |
I checked out and built trunk from svn. We have checks that should run once per day in a 1 hour period. They did not fire with the new build. I can test as much as required until this is resolved if you need a test platform. |
|
(0000323) murf (reporter) 2011-08-14 15:31 edited on: 2011-08-20 18:16 |
Is there any progress on this? I have recently discovered my daily scheduled checks are not working, as they would appear to be getting rescheduled outside their timeperiods. In the interim I am restarting Nagios daily without retaining the scheduling queue, which seems to get the checks scheduled correctly once at least. --- Actually this is not the case. What seems to be happening is that the scheduled jobs are getting rescheduled, which is moving them outside of their timeperiods. I don't understand why these checks would be rescheduled. |
|
(0000363) dnsmichi (reporter) 2011-10-22 13:27 |
it's pretty hard to reproduce, so if you happen to have some debug logs, please provide them. nagios.cfg debug_level=16 debug_verbosity=2 max_debug_file_size=100000000 |
|
(0000365) murf (reporter) 2011-10-25 03:29 edited on: 2011-10-26 04:15 |
we tracked down the problem to a problem with NTP servers, so the time was getting adjusted on the nagios server and it would then reschedule the checks. The resceduling part of the code from that entry poiunt did not check that it was rescheduling into a valid timeperiod for the check. We sorted out the NTP issues and everything has been working fine since. Finding that the system was "compensating" for the time changes was what we saw in the debug. |
|
(0000429) chrisgpayne (reporter) 2012-05-15 10:12 |
We are seeing a similar problem (Opsview Community, using "Nagios® Core⢠3.2.3"). Other 24x7 checks on this host work fine, but we have a check scheduled to run Mondays 12:05-20:00, that hangs every week. I can manually schedule a check for 12:05 Monday, and during the active time, the check runs fine. However, at the end of the check interval, it schedules the next check to 20:10 (outside of the window) the following week, with this error in the logs: [1337047500] Warning: Check of service 'XXXX' on host 'yyyy' could not be rescheduled properly. Scheduling check for next week... And the check keeps getting pushed back , I believe 5 minutes every week with the same error, never actually running. The service description is: define service { host_name yyyy service_description XXXX check_command check_nrpe!-H $HOSTADDRESS$ -p 15666 -c checkXXXX normal_check_interval 5 retry_check_interval 1 check_period mondaynoonfiveeight max_check_attempts 3 flap_detection_enabled 1 notifications_enabled 1 notification_period 24x7 notification_interval 60 notification_options c,r,u,w contact_groups hostgroup41_servicegroup45,keyword_12,keyword_11 notes IMPORTED: use service-global } inherited from: define service{ name service-global active_checks_enabled 1 ; Active service checks are enabled passive_checks_enabled 1 ; Passive service checks are enabled/accepted obsess_over_service 0 ; We should obsess over this service (if necessary) check_freshness 0 ; Default is to NOT check service 'freshness' notifications_enabled 1 ; Service notifications are enabled event_handler_enabled 1 ; Service event handler is enabled but ignored unless event_handler set flap_detection_enabled 1 ; Flap detection is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain non-status information across program restarts is_volatile 0 check_period 24x7 register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE! } and the interval is defined: define timeperiod{ timeperiod_name mondaynoonfiveeight alias Monday Five after Noon to Eight pm monday 12:05-20:00 } The server is properly synced to NTP, so I do not think this is the issue. I don't think I have misconfigured anything, and this is imported from a nagios 2.12 installation where everything had worked as expected. Does anyone have any suggestions? |
|
(0000575) ageric (administrator) 2012-09-05 09:07 |
commit 4c108b3ee07574f45a518dcc249f8da7103f5669 Author: Andreas Ericsson <ae@op5.se> Date: Wed Sep 5 13:06:42 2012 +0000 core: Avoid scheduling stuff a week into the future One pretty major issue with scheduling one week into the future is that by next week the time will be exactly the same as it is now, so if timeperiods are defined in a "normal" manner we'll be in the exact same situation then as we are now, and we'll just keep on scheduling stuff far into the future. Another issue is that it's unlikely that checks are actually supposed to run that far into the future, and we can therefore assume that things have gone haywire beyond the usual manners. The most sensible (for a given value of sensible) thing we can do is to use the preferred time to schedule the check and see then if we can get a sane timeslot for the check to run. That will cause us to run into the event more frequently than necessary, but it's not that great a chore and it will at least prevent checks from going awol in scheduling limbo for any extended period of time. As a nice bonus, it also lets me close issue 31 for good and all, as users now have a way to tweak the amount of time into the future we'll schedule the checks when we fail to schedule them due to some bugs, which is a Good Thing(tm) indeed and lets them work around future problems by using some suitable number for check interval. So long as it's shorter than the configured time period slot, the check is guaranteed to run at least once. Signed-off-by: Andreas Ericsson <ae@op5.se> git-svn-id: https://nagios.svn.sourceforge.net/svnroot/nagios/nagioscore/trunk@2152 [^] 5f96b256-904b-4d8d-8c98-d829582c6739 |
| Mantis 1.1.7[^] Copyright © 2000 - 2008 Mantis Group |