[BUG] Monitor jobs do not reschedule when flipping the SWEEPER_ENABLED setting #333

qreshi · 2021-01-07T02:23:15Z

When the SWEEPER_ENABLED setting is set to false, all Alerting jobs are descheduled as expected. However, once this setting is set back to true, the expectation is that existing enabled Monitors would be rescheduled but this is currently not the case.

The issue seems to be in the sweep() logic, which is indirectly invoked from the enable() method when the SWEEPER_ENABLED setting is set to true. Currently, any job owned by the shard in question is skipped if the newVersion being passed in is less than or equal to the currentVersion. So when the sweeper is re-enabled, every job whose version did not change in the meantime is skipped over and never rescheduled.

Here is a snippet of the code in question:

alerting/core/src/main/kotlin/com/amazon/opendistroforelasticsearch/alerting/core/JobSweeper.kt

Lines 348 to 361 in 8daf9e6

    
    private fun sweep(
        shardId: ShardId,
        jobId: JobId,
        newVersion: JobVersion,
        job: ScheduledJob?,
        failedToParse: Boolean = false
    ) {
        sweptJobs.getOrPut(shardId) { ConcurrentHashMap() }
            // Use [compute] to update atomically in case another thread concurrently indexes/deletes the same job
            .compute(jobId) { _, currentVersion ->
                if (newVersion <= (currentVersion ?: Versions.NOT_FOUND)) {
                    logger.debug("Skipping job $jobId, $newVersion <= $currentVersion")
                    return@compute currentVersion
                }

A fix would need to handle the unchanged-version case specifically when sweep() is reached through the enable() code path after the jobs were descheduled, so that the scheduling logic still runs for them. The exception must stay limited to that path: skipping the version check unconditionally would cause existing jobs to be repeatedly descheduled and rescheduled during the routine background runs, which should be avoided.
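To make the idea concrete, here is a minimal, self-contained sketch of one way this could look. It is not the plugin's actual code: `SweeperSketch`, the `force` parameter, the `scheduled` set, and the simplified `JobId`/`JobVersion` aliases are all hypothetical, and the sketch assumes (as the behavior described above implies) that the recorded job versions survive a disable.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-ins for the plugin's real types.
typealias JobId = String
typealias JobVersion = Long

const val NOT_FOUND: JobVersion = -1L

class SweeperSketch {
    // Last version seen per job (assumed to be kept across disable/enable).
    private val sweptJobs = ConcurrentHashMap<JobId, JobVersion>()

    // Stand-in for the scheduler: the set of jobs currently scheduled.
    val scheduled: MutableSet<JobId> = ConcurrentHashMap.newKeySet<JobId>()

    // `force` is an assumed extra parameter that only the enable() path would set.
    fun sweep(jobId: JobId, newVersion: JobVersion, force: Boolean = false) {
        sweptJobs.compute(jobId) { _, currentVersion ->
            val unchanged = newVersion <= (currentVersion ?: NOT_FOUND)
            if (unchanged && !force) {
                // Routine background sweep: leave unchanged jobs alone.
                return@compute currentVersion
            }
            scheduled.add(jobId)
            maxOf(newVersion, currentVersion ?: NOT_FOUND)
        }
    }

    // Deschedules everything but keeps the recorded versions.
    fun disable() {
        scheduled.clear()
    }
}
```

With this shape, a routine sweep still skips unchanged jobs, while a sweep issued from the re-enable path can pass `force = true` once to reschedule them.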


pkriete · 2021-02-11T00:00:07Z

We would love to see this fixed soon. We ran into this issue on AWS ES after toggling opendistro.scheduled_jobs.enabled to false during an outage. Setting it back to true results in a completely opaque failure mode where everything looks like it's running but nothing happens. We have customer-facing alerts, so the ability to turn them off when they would otherwise be incorrect is critical for us.

For anyone else that might come across this before a patch goes out, bumping the version number on the alerting documents will restart your alerts:

POST .opendistro-alerting-config/_update_by_query

qreshi · 2022-02-18T16:32:20Z

Closing in favor of opensearch-project/alerting#89

qreshi added the bug Something isn't working label Jan 7, 2021

adityaj1107 mentioned this issue Jun 2, 2021

[BUG] Monitor jobs do not reschedule when flipping the SWEEPER_ENABLED setting opensearch-project/alerting#89

Closed

qreshi closed this as completed Feb 18, 2022
