Long running delayed_job jobs stay locked after a restart on Heroku

遥遥无期 asked 2020-12-01 05:00 · 6 answers · 2172 views

When a Heroku worker is restarted (either on command or as the result of a deploy), Heroku sends SIGTERM to the worker process. In the case of delayed_job, a long-running job that is still working when the subsequent SIGKILL arrives is left locked in the queue, and no other worker will pick it up.

6 Answers
  • 2020-12-01 05:14

    TLDR:

    Put this at the top of your job method:

    begin
      term_now = false
      old_term_handler = trap 'TERM' do
        term_now = true
        old_term_handler.call
      end
    

    AND

    Make sure this is called at least once every ten seconds:

      if term_now
        puts 'told to terminate'
        return true
      end
    

    AND

    At the end of your method, put this:

    ensure
      trap 'TERM', old_term_handler
    end
    

    Explanation:

    I was having the same problem and came upon this Heroku article.

    The job contained an outer loop, so I followed the article and added a trap('TERM') and exit. However delayed_job picks that up as failed with SystemExit and marks the task as failed.

    With SIGTERM now caught by our trap, the worker's own handler never runs; the worker immediately restarts the job and then gets SIGKILL a few seconds later. Back to square one.

    I tried a few alternatives to exit:

    • A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.

    • Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.

    My final solution is the one given at the top of my answer; it comprises three parts:

    1. Before we start the potentially long job we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true.

      But we must also grab the old_term_handler which the delayed job worker code set (which is returned by trap) and remember to call it.

    2. We still must ensure that we return control to Delayed::Worker with sufficient time for it to clean up and shut down, so we should check term_now at least every (just under) ten seconds and return if it is true.

      You can either return true or return false depending on whether you want the job to be considered successful or not.

    3. Finally, it is vital to remove your handler and restore the original Delayed::Worker one when you have finished. If you fail to do this you will keep a dangling reference to the one we added, which can result in a memory leak if you add another one on top of that (for example, when the worker starts this job again).
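    Putting the three parts above together, a runnable sketch (class, item names, and the loop body are illustrative; the respond_to? guard handles the case where the previous handler is a string such as 'DEFAULT' rather than a proc):

    ```ruby
    class LongJob
      def initialize(items)
        @items = items
      end

      def perform
        term_now = false
        # Part 1: install our own TERM handler, remembering the one the
        # delayed_job worker installed so we can chain to it.
        old_term_handler = Signal.trap('TERM') do
          term_now = true
          old_term_handler.call if old_term_handler.respond_to?(:call)
        end

        @items.each do |item|
          # Part 2: check the flag at least every ten seconds of work and
          # hand control back to the worker so it can shut down cleanly.
          if term_now
            puts 'told to terminate'
            return true # or false, to mark the job unsuccessful
          end
          process(item)
        end
        true
      ensure
        # Part 3: restore the worker's original handler.
        Signal.trap('TERM', old_term_handler) if old_term_handler
      end

      private

      def process(item)
        item * 2 # stand-in for one chunk of real work
      end
    end
    ```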

  • 2020-12-01 05:18

    New to the site, so can't comment on Dave's post, and need to add a new answer.

    The issue I have with Dave's approach is that my tasks are long (minutes, up to 8 hours) and not repetitive at all, so I can't ensure something is called every 10 seconds. Also, I have tried Dave's answer, and the job is always removed from the queue regardless of whether I return true or false. I am unclear on how to keep the job on the queue.

    See this pull request. I think it may work for me. Please feel free to comment on it and support the pull request.

    I am currently experimenting with trapping the signal and then rescuing the resulting exit... No luck so far.
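    For what it's worth, one known way to keep a job on the queue is to raise instead of returning: delayed_job then treats the run as failed, unlocks the job, and retries it later (until max_attempts is reached). A sketch, with JobInterrupted as an assumed custom error class:

    ```ruby
    # Hypothetical error class; any StandardError subclass works.
    class JobInterrupted < StandardError; end

    # Call this from the job's checkpoint instead of `return true`.
    # Raising makes delayed_job record the failure in last_error,
    # unlock the job, and reschedule it for a retry.
    def check_term!(term_now)
      raise JobInterrupted, 'received SIGTERM' if term_now
    end
    ```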

  • 2020-12-01 05:20

    That is what max_run_time is for: once max_run_time has elapsed since the job was locked, other processes can acquire the lock.

    See this discussion on Google Groups.
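    For reference, max_run_time is set in an initializer; the duration below is only an example, not a recommendation (it must be longer than your longest job):

    ```ruby
    # config/initializers/delayed_job_config.rb
    # After this much time, a job's lock is considered stale and
    # other workers may pick the job up again.
    Delayed::Worker.max_run_time = 10.minutes
    ```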

  • 2020-12-01 05:33

    I use a state machine to track the progress of jobs, and make the process idempotent so I can call perform on a given job/object multiple times and be confident it won't re-apply a destructive action. Then update the rake task/delayed_job to release the lock on TERM.

    When the process restarts it will continue as intended.
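    A minimal sketch of that idea, with hand-rolled state tracking rather than a state-machine gem (class and step names are illustrative; a real job would persist the completed state, e.g. in a database column):

    ```ruby
    # A job whose steps record completion, so re-running perform
    # after a restart skips work that already happened (idempotent).
    class ImportJob
      STEPS = [:download, :transform, :upload].freeze

      attr_reader :calls

      def initialize
        @completed = []      # a real job would persist this state
        @calls = Hash.new(0) # counts step executions, to show idempotence
      end

      def perform
        STEPS.each do |step|
          next if @completed.include?(step) # skip steps a previous run finished
          send(step)
          @completed << step # mark done only after the step succeeds
        end
        true
      end

      private

      def download;  @calls[:download]  += 1; end
      def transform; @calls[:transform] += 1; end
      def upload;    @calls[:upload]    += 1; end
    end
    ```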

  • 2020-12-01 05:34

    Abort Job Cleanly on SIGTERM

    A much better solution is now built into delayed_job. Configure the worker to raise an exception on TERM signals by adding this to an initializer:

    Delayed::Worker.raise_signal_exceptions = :term
    

    With that setting, the job cleans up properly and exits before Heroku issues the final KILL signal intended for non-cooperating processes:

    You may need to raise exceptions on SIGTERM signals: Delayed::Worker.raise_signal_exceptions = :term will cause the worker to raise a SignalException, causing the running job to abort and be unlocked, which makes the job available to other workers. The default for this option is false.

    Possible values for raise_signal_exceptions are:

    • false - No exceptions will be raised (Default)
    • :term - Will only raise an exception on TERM signals but INT will wait for the current job to finish.
    • true - Will raise an exception on TERM and INT

    Available since Version 3.0.5.

    See this commit where it was introduced.

  • 2020-12-01 05:37

    I ended up having to do this in a few places, so I created a module that I stick in lib/, and then run ExitOnTermSignal.execute { long_running_task } from inside my delayed job's perform block.

    # Exits whatever is currently running when a SIGTERM is received. Needed since
    # Delayed::Job traps TERM, so it does not clean up a job properly if the
    # process receives a SIGTERM then SIGKILL, as happens on Heroku.
    module ExitOnTermSignal
      def self.execute(&block)
        original_term_handler = Signal.trap 'TERM' do
          original_term_handler.call
          # Easiest way to kill job immediately and having DJ mark it as failed:
          exit
        end
    
        begin
          yield
        ensure
          Signal.trap 'TERM', original_term_handler
        end
      end
    end
    