Nagios Bug and Feature Tracker
Bug and Feature Tracker

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0000031 [Nagios Core] Check Scheduling minor always 2009-06-25 05:25 2012-09-05 09:07
Reporter hsmade View Status public  
Assigned To ageric
Priority normal Resolution fixed  
Status resolved   Product Version
Summary 0000031: 8 to midnight timeperiod schedules check for next year
Description I have a timeperiod config like this:
define timeperiod{
        timeperiod_name after_processing
        alias After processing
        sunday 08:00-23:55
        monday 08:00-23:55
        tuesday 08:00-23:55
        wednesday 08:00-23:55
        thursday 08:00-23:55
        friday 08:00-23:55
        saturday 08:00-23:55
        }

I have a check that uses this time period. When the last check for that day is executed (just before 23:55), the next scheduled check time becomes just after 8 o' clock the next day plus a year!

Last Check Time: 06-22-2009 23:49:44
Check Type: ACTIVE
Next Scheduled Check: 06-23-2010 08:04:39

Additional Information
Tags No tags attached.
Nagios Version 3.0.6
OS Linux
OS Version Debian Lenny
Attached Files

- Relationships

-  Notes
(0000056)
hsmade (reporter)
2009-07-02 04:53

I could reproduce this for nagios version 3.1.2 as well, just that the service check time is set to the next possible check time + 1 week.
(0000057)
hsmade (reporter)
2009-07-03 03:07
edited on: 2009-07-03 03:08

I found out that the rescheduling is actually done correctly (8am next day), but when it wants to check it reschedules for next week without executing the check.

get_next_valid_time is called with time 8:04:53 and comes back with this same time stating PREF TIME IS VALID (-DTEST_TIMEPERIODS_B).

(0000064)
hsmade (reporter)
2009-07-06 02:41
edited on: 2009-07-07 02:07

I found the reason for this. The service is scheduled for the next morning after the last check for the day as it should. It's executed a couple of seconds early however, so run_async_service_check returns 'ERROR' as that's outside of the timeperiod. time_is_valid is then set to FALSE.
Next run_scheduled_service_check tries to reschedule with a preferred time of +5 mins, which is inside the timeperiod and thus accepted. So next_valid_time=preferred_time and time_is_valid=FALSE now, which leads to a delay in scheduling of a week (in 3.1.2, in 3.0.6 a year).

# checks.c#run_scheduled_service_check:
result=run_async_service_check(svc,check_options,latency,TRUE,TRUE,&time_is_valid,&preferred_time);
# this results in result=ERROR and time_is_valid=FALSE as it's called just before 07:59:54

if(result==ERROR){
 if(svc->should_be_scheduled==TRUE){
  time(&current_time);
  if(current_time>=preferred_time) # true
   preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
   # preferred_time is now 08:04:54
   get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);
   if(time_is_valid==FALSE && next_valid_time==preferred_time){
    # the preferred_time was accepted by get_next_valid_time, as it falls in the timeperiod.
    # But time_is_valid was still set to FALSE,
    # so the check scheduling is delayed with a week

I think the time_is_valid that's tested here should be set by get_next_valid_time, or not tested at all. Apart from that, of course it's wrong to execute the check before it's scheduled time.

(0000081)
hirenp (reporter)
2009-08-08 20:01

I am not able to reproduce nagios running a check a few seconds early as you describe above. I'm using -HEAD with very simple configs, and have not been able to cause nagios to schdule a check a week away. if you have simple configs that can reproduce this on -HEAD, please let me know.
(0000082)
hsmade (reporter)
2009-08-10 12:42

Try this: service may run from 8am to 23pm, then rewind the clock for some minutes after 23pm. After 23pm, nagios will reschedule the service for 8am the next morning. When rewinding the clock, nagios will compensate and thus reschedule the check for 8am minus 'some minutes' and hence execution fails and a reschedule is done for +1 week.
(0000083)
hirenp (reporter)
2009-08-10 12:57

ah okay, skewing the time could cause the check to be run a few minutes early, that makes a lot more sense. please test the diff sent to the development list today to see if this helps with this issue, if you're not on the list I could send you the patch. the diff is a workaround that should solve this issue.
(0000090)
eriksornes (reporter)
2009-09-02 05:10

This is a pretty serious issue for us. I think the severity should be raised.
I can not trust that the nagios-check actually run.
(0000091)
hsmade (reporter)
2009-09-02 05:14

A workaround is to schedule your NTP updates every hour or even more often.
(0000092)
edgarwahn (reporter)
2009-09-02 07:12

You should cease using ntpdate and switch to ntpd, ntpd "drifts" your clock (making seconds a little bit longer or shorter until time fits).
On Debian try openntpd (set DAEMON_OPTS="-s" in /etc/defaults/openntpd).
(0000093)
hsmade (reporter)
2009-09-02 07:20

Thanks for the tip!
(0000170)
ageric (administrator)
2010-04-26 16:27

Is this still an issue with 3.2.1? Some fixes to take care of this went in just before 3.2.0, I think, and another one is in cvs.

If this isn't fixed in later versions, we'll need to take care of that. Otherwise, I'll close this issue in two weeks time.
(0000196)
rik (reporter)
2010-06-24 09:39

Please do not close this issue.

I can confirm buggy behaviour in Nagios Core 3.2.1. We need a weekly check, fridays between 16:00-17:00. This is our
define timeperiod {
   timeperiod_name fri1h1600
   alias friday afternoon
   friday 16:00-17:00
}

After first definition of a new service (with check_interval 15), it is scheduled to run exactly at 16:00 next friday. After that run, it gets rescheduled to run at 16:15 , 16:30, 16:45 and 17:00. Then another scheduled run at 17:15. At 17:15, Nagios decides not to actually run the plugin, but reschedule the check instead. The next run is then scheduled at 17:30, a week from now. Which happens to be the next friday like we needed, but not within the desired timeframe, so the check does not actually run then either.

nagios.log reports:
"[timestamp] Warning: Check of service debug weekly 15 on host testhost could not be rescheduled properly. Scheduling check for next week..."

Compared to the behaviour of 3.0.6, this makes little improvement. The log entry at least makes the error detectable.

Here is a possible workaround in 3.2.1. Mileage may vary with different versions. For weekly checks on Thursdays between 16:10-16:20:

define timeperiod {
   timeperiod_name weeklyxx
   alias has funny way of scheduling a first run!
   thursday 1 16:10-16:20
   thursday 2 16:10-16:20
   thursday 3 16:10-16:20
   thursday 4 16:10-16:20
   thursday 5 16:10-16:20
}

define timeperiod {
   timeperiod_name weeklyxxx
   alias shorter and more reliable
   2010-06-17 / 7 16:10-16:20
}

Both will reschedule as expected: next week, same day, same time (16:10).

Please do not close this issue. We consider this behaviour to be a bug.
(0000259)
terracor (reporter)
2010-11-22 03:28

I can confirm this behaviour in Nagios 3.2.1 too. The workaround with a shorter timeperiod definition does not work here. Any other hints, to get around this behaviour? This bug is really nasty, cause I can't rely on the checks any longer :(
(0000260)
pabel (reporter)
2010-11-26 03:31

Is there any chance the Severity of this can be raised as this is a serious problem for us. We are running 3.2.3.
(0000262)
terracor (reporter)
2010-12-27 07:35

It would be really cool, if somebody could take care of this issue or at least raise the severity.
(0000266)
pabel (reporter)
2011-01-25 06:06

We have hit the same bug with notifications of passive checks and timeperiods. I have traced both back to the "check_time_against_period" function.
(0000267)
qha (reporter)
2011-01-31 08:36

I too would like to see an increase in severity of this issue.
(0000268)
eyandow (reporter)
2011-02-08 10:02

I would like this issue to be fixed or at least escalated as well. We are running an NTPD server and are having a remote host checking the time on that server so correct time isn't an issue. This is a pretty serious bug and lots of people are having it.
(0000275)
Bernd (reporter)
2011-03-12 04:46

There has been some discussions about scheduling issues quite a time ago. Since I believe those discussions are related to this issue I'm posting the links I remember pointing to the nagios-devel mailing list archive.

http://thread.gmane.org/gmane.network.nagios.devel/6170 [^]
http://thread.gmane.org/gmane.network.nagios.devel/6158 [^]
http://thread.gmane.org/gmane.network.nagios.devel/6067 [^]
http://thread.gmane.org/gmane.network.nagios.devel/5530 [^]

About ntpd: "Running ntpd won't help at all. The timeshifts happen because the systemclock is running too fast. It doesn't really matter if you
adjust it back in little ticks or in one big bang once a day." (by Sascha Runschke, article 5540)
(0000296)
ageric (administrator)
2011-05-10 10:03

Fixes have been added to Nagios after 3.2.3 which might fix this issue. If any of you experiencing this could test the latest svn code (it's quite stable), I'd be most grateful.

Trying (and failing) to reproduce this is frustrating, to say the least.
(0000302)
pabel (reporter)
2011-05-11 03:31

I checked out and built trunk from svn. We have checks that should run once per day in a 1 hour period. They did not fire with the new build.

I can test as much as required until this is resolved if you need a test platform.
(0000323)
murf (reporter)
2011-08-14 15:31
edited on: 2011-08-20 18:16

Is there any progress on this?

I have recently discovered my daily scheduled checks are not working, as they would appear to be getting rescheduled outside their timeperiods.

In the interim I am restarting Nagios daily without retaining the scheduling queue, which seems to get the checks scheduled correctly once at least.

---

Actually this is not the case. What seems to be happening is that the scheduled jobs are getting rescheduled, which is moving them outside of their timeperiods.

I don't understand why these checks would be rescheduled.

(0000363)
dnsmichi (reporter)
2011-10-22 13:27

it's pretty hard to reproduce, so if you happen to have some debug logs, please provide them.

nagios.cfg

debug_level=16
debug_verbosity=2
max_debug_file_size=100000000
(0000365)
murf (reporter)
2011-10-25 03:29
edited on: 2011-10-26 04:15

we tracked down the problem to a problem with NTP servers, so the time was getting adjusted on the nagios server and it would then reschedule the checks. The resceduling part of the code from that entry poiunt did not check that it was rescheduling into a valid timeperiod for the check.

We sorted out the NTP issues and everything has been working fine since.

Finding that the system was "compensating" for the time changes was what we saw in the debug.

(0000429)
chrisgpayne (reporter)
2012-05-15 10:12

We are seeing a similar problem (Opsview Community, using "NagiosĀ® Coreā„¢ 3.2.3").

Other 24x7 checks on this host work fine, but we have a check scheduled to run Mondays 12:05-20:00, that hangs every week. I can manually schedule a check for 12:05 Monday, and during the active time, the check runs fine.

However, at the end of the check interval, it schedules the next check to 20:10 (outside of the window) the following week, with this error in the logs:

[1337047500] Warning: Check of service 'XXXX' on host 'yyyy' could not be rescheduled properly. Scheduling check for next week...

And the check keeps getting pushed back , I believe 5 minutes every week with the same error, never actually running.

The service description is:

define service {
        host_name yyyy
        service_description XXXX
        check_command check_nrpe!-H $HOSTADDRESS$ -p 15666 -c checkXXXX
        normal_check_interval 5
        retry_check_interval 1
        check_period mondaynoonfiveeight
        max_check_attempts 3
        flap_detection_enabled 1
        notifications_enabled 1
        notification_period 24x7
        notification_interval 60
        notification_options c,r,u,w
        contact_groups hostgroup41_servicegroup45,keyword_12,keyword_11
        notes IMPORTED:
        use service-global
}
inherited from:

define service{
        name service-global
        active_checks_enabled 1 ; Active service checks are enabled
        passive_checks_enabled 1 ; Passive service checks are enabled/accepted
        obsess_over_service 0 ; We should obsess over this service (if necessary)
        check_freshness 0 ; Default is to NOT check service 'freshness'
        notifications_enabled 1 ; Service notifications are enabled
        event_handler_enabled 1 ; Service event handler is enabled but ignored unless event_handler set
        flap_detection_enabled 1 ; Flap detection is enabled
        process_perf_data 1 ; Process performance data
        retain_status_information 1 ; Retain status information across program restarts
        retain_nonstatus_information 1 ; Retain non-status information across program restarts
        is_volatile 0
        check_period 24x7
        register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

and the interval is defined:

define timeperiod{
        timeperiod_name mondaynoonfiveeight
        alias Monday Five after Noon to Eight pm
        monday 12:05-20:00
        }

The server is properly synced to NTP, so I do not think this is the issue. I don't think I have misconfigured anything, and this is imported from a nagios 2.12 installation where everything had worked as expected.

Does anyone have any suggestions?
(0000575)
ageric (administrator)
2012-09-05 09:07

commit 4c108b3ee07574f45a518dcc249f8da7103f5669
Author: Andreas Ericsson <ae@op5.se>
Date: Wed Sep 5 13:06:42 2012 +0000

    core: Avoid scheduling stuff a week into the future
    
    One pretty major issue with scheduling one week into the future is
    that by next week the time will be exactly the same as it is now,
    so if timeperiods are defined in a "normal" manner we'll be in the
    exact same situation then as we are now, and we'll just keep on
    scheduling stuff far into the future.
    
    Another issue is that it's unlikely that checks are actually
    supposed to run that far into the future, and we can therefore
    assume that things have gone haywire beyond the usual manners.
    
    The most sensible (for a given value of sensible) thing we can do
    is to use the preferred time to schedule the check and see then
    if we can get a sane timeslot for the check to run. That will
    cause us to run into the event more frequently than necessary,
    but it's not that great a chore and it will at least prevent
    checks from going awol in scheduling limbo for any extended period
    of time.
    
    As a nice bonus, it also lets me close issue 31 for good and all,
    as users now have a way to tweak the amount of time into the
    future we'll schedule the checks when we fail to schedule them
    due to some bugs, which is a Good Thing(tm) indeed and lets them
    work around future problems by using some suitable number for
    check interval. So long as it's shorter than the configured time
    period slot, the check is guaranteed to run at least once.
    
    Signed-off-by: Andreas Ericsson <ae@op5.se>
    
    git-svn-id: https://nagios.svn.sourceforge.net/svnroot/nagios/nagioscore/trunk@2152 [^] 5f96b256-904b-4d8d-8c98-d829582c6739

- Issue History
Date Modified Username Field Change
2009-06-25 05:25 hsmade New Issue
2009-06-25 05:25 hsmade Nagios Version => 3.0.6
2009-06-25 05:25 hsmade OS => Linux
2009-06-25 05:25 hsmade OS Version => Debian Lenny
2009-06-25 06:37 hsmade Issue Monitored: hsmade
2009-06-30 10:25 guest Note Added: 0000044
2009-07-02 04:53 hsmade Note Added: 0000056
2009-07-03 03:07 hsmade Note Added: 0000057
2009-07-03 03:08 hsmade Note Edited: 0000057
2009-07-06 01:42 hsmade Note Added: 0000063
2009-07-06 01:43 hsmade Note Edited: 0000063
2009-07-06 01:48 hsmade Note Deleted: 0000063
2009-07-06 02:41 hsmade Note Added: 0000064
2009-07-06 02:43 hsmade Note Edited: 0000064
2009-07-07 02:07 hsmade Note Edited: 0000064
2009-07-31 06:57 eriksornes Issue Monitored: eriksornes
2009-08-08 20:01 hirenp Note Added: 0000081
2009-08-10 12:42 hsmade Note Added: 0000082
2009-08-10 12:57 hirenp Note Added: 0000083
2009-09-02 05:10 eriksornes Note Added: 0000090
2009-09-02 05:14 hsmade Note Added: 0000091
2009-09-02 07:12 edgarwahn Note Added: 0000092
2009-09-02 07:20 hsmade Note Added: 0000093
2009-11-23 09:45 guest Note Edited: 0000044
2009-11-23 09:46 guest Note Edited: 0000044
2010-04-26 16:27 ageric Note Added: 0000170
2010-04-26 16:27 ageric Status new => feedback
2010-06-24 09:39 rik Note Added: 0000196
2010-11-22 03:28 terracor Note Added: 0000259
2010-11-22 03:29 terracor Issue Monitored: terracor
2010-11-26 03:03 pabel Issue Monitored: pabel
2010-11-26 03:31 pabel Note Added: 0000260
2010-12-27 07:35 terracor Note Added: 0000262
2011-01-25 06:06 pabel Note Added: 0000266
2011-01-31 08:36 qha Note Added: 0000267
2011-01-31 08:42 qha Issue Monitored: qha
2011-02-08 10:02 eyandow Note Added: 0000268
2011-03-12 04:46 Bernd Note Added: 0000275
2011-05-10 10:03 ageric Note Added: 0000296
2011-05-11 03:31 pabel Note Added: 0000302
2011-08-08 03:27 guest Note Deleted: 0000044
2011-08-14 15:31 murf Note Added: 0000323
2011-08-14 20:24 murf Note Edited: 0000323
2011-08-20 18:16 murf Note Edited: 0000323
2011-10-22 13:27 dnsmichi Note Added: 0000363
2011-10-24 05:03 rik Issue Monitored: rik
2011-10-24 05:03 rik Issue End Monitor: rik
2011-10-25 03:29 murf Note Added: 0000365
2011-10-26 04:15 murf Note Edited: 0000365
2012-05-15 10:12 chrisgpayne Note Added: 0000429
2012-05-15 11:09 terracor Issue End Monitor: terracor
2012-09-05 09:06 ageric Status feedback => assigned
2012-09-05 09:06 ageric Assigned To => ageric
2012-09-05 09:07 ageric Note Added: 0000575
2012-09-05 09:07 ageric Status assigned => resolved
2012-09-05 09:07 ageric Resolution open => fixed


Mantis 1.1.7[^]
Copyright © 2000 - 2008 Mantis Group
Powered by Mantis Bugtracker