Issues defining cron jobs in Procfile (Heroku) using APScheduler for a Django project

Question

I am having trouble scheduling a cron job that scrapes a website and stores the results in the database as instances of a model (Movie).

The problem seems to be that the model gets loaded before the Procfile is executed.
How should I create a cron job that runs in the background and stores the scraped information in the database? Here is my code:

Procfile:

    web: python manage.py runserver 0.0.0.0:$PORT
    scheduler: python cinemas/scheduler.py

scheduler.py:

# More code above
from cinemas.models import Movie
from apscheduler.schedulers.blocking import BlockingScheduler
sched = BlockingScheduler()

# APScheduler's cron trigger takes `minute` (singular), not `minutes`
@sched.scheduled_job('cron', day_of_week='mon-fri', hour=0, minute=26)
def get_movies_playing_now():
  global url_movies_playing_now
  Movie.objects.all().delete()
  while(url_movies_playing_now):
    title = []
    description = []
    # Create a BeautifulSoup object from the page at the current URL
    s = requests.get(url_movies_playing_now, headers=headers)
    soup = bs4.BeautifulSoup(s.text, "html.parser")
    movies = soup.find_all('ul', class_='w462')[0]

    #Find Movie's title
    for movie_title in movies.find_all('h3'):
        title.append(movie_title.text)
    #Find Movie's description
    for movie_description in soup.find_all('ul',
                                           class_='w462')[0].find_all('p'):
        description.append(movie_description.text.replace(" [More]","."))

    for t, d in zip(title, description):
        m = Movie(movie_title=t, movie_description=d)
        m.save()

    #Go to the next page to find more movies
    paging = soup.find( class_='pagenating').find_all('a', class_=lambda x:
                                                      x != "inactive")
    href = ""
    for p in paging:
        if "next" in p.text.lower():
            href = p['href']
    url_movies_playing_now = href

sched.start()
# More code below

cinemas/models.py:

from django.db import models

#Create your models here.

class Movie(models.Model):
    movie_title = models.CharField(max_length=200)
    movie_description = models.CharField(max_length=20200)

This is the error I get when the job runs:

    2016-11-17T17:57:06.074914+00:00 app[scheduler.1]: Traceback (most recent call last):
    2016-11-17T17:57:06.074931+00:00 app[scheduler.1]:   File "cinemas/scheduler.py", line 2, in <module>
    2016-11-17T17:57:06.075058+00:00 app[scheduler.1]:     import cineplex
    2016-11-17T17:57:06.075060+00:00 app[scheduler.1]:   File "/app/cinemas/cineplex.py", line 1, in <module>
    2016-11-17T17:57:06.075173+00:00 app[scheduler.1]:     from cinemas.models import Movie
    2016-11-17T17:57:06.075196+00:00 app[scheduler.1]:   File "/app/cinemas/models.py", line 5, in <module>
    2016-11-17T17:57:06.075295+00:00 app[scheduler.1]:     class Movie(models.Model):
    2016-11-17T17:57:06.075297+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/db/models/base.py", line 105, in __new__
    2016-11-17T17:57:06.075414+00:00 app[scheduler.1]:     app_config = apps.get_containing_app_config(module)
    2016-11-17T17:57:06.075440+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/apps/registry.py", line 237, in get_containing_app_config
    2016-11-17T17:57:06.075585+00:00 app[scheduler.1]:     self.check_apps_ready()
    2016-11-17T17:57:06.075586+00:00 app[scheduler.1]:   File "/app/.heroku/python/lib/python3.5/site-packages/django/apps/registry.py", line 124, in check_apps_ready
    2016-11-17T17:57:06.075703+00:00 app[scheduler.1]:     raise AppRegistryNotReady("Apps aren't loaded yet.")
    2016-11-17T17:57:06.075726+00:00 app[scheduler.1]: django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.

The cron job works fine if I do not use any model objects. How can I run this job every day with model objects without it failing?

Thanks

Answer 1:

That's because you can't just import Django packages, models, etc. in a standalone script.
To work properly, Django's internals require initialization, which is normally triggered from manage.py.

Rather than trying to re-create all that myself, I always write long-running, non-web processes as custom management commands.

For example, if your app is cinemas, you would:

Create ./cinemas/management/commands/scheduler.py.
In that file, create a subclass of django.core.management.base.BaseCommand (the subclass must be named Command).
In that class, override handle(). In your case, that's where you'd call sched.start().
Your Procfile would then have scheduler: python manage.py scheduler

Hope that helps.

Answer 2:

You can solve the problem by adding the following lines to the top of your scheduler.py, before any model imports:

import django
django.setup()

The Django documentation says:

If you’re using components of Django “standalone” – for example, writing a Python script which loads some Django templates and renders them, or uses the ORM to fetch some data – there’s one more step you’ll need in addition to configuring settings.

After you’ve either set DJANGO_SETTINGS_MODULE or called configure(), you’ll need to call django.setup() to load your settings and populate Django’s application registry. For example:
import django
from django.conf import settings
from myapp import myapp_defaults

settings.configure(default_settings=myapp_defaults, DEBUG=True)
django.setup()

# Now this script or any imported module can use any part of Django it needs.
from myapp import models

I set DJANGO_SETTINGS_MODULE as a config variable, so I didn't need to set it in my scheduler script.

Source: https://stackoverflow.com/questions/40662013/issues-defining-cron-jobs-in-procfile-heroku-using-apscheduler-for-django-proj

Tags: python, django, heroku, cron, apscheduler