Python Headless Browser for GAE

旧巷少年郎 2020-12-05 16:28

I'm trying to use Angular.js client-side with webapp2 on Google App Engine.

In order to solve the SEO issues the idea was to use a headless browser to run the javas

2 answers
  • 2020-12-05 17:01

    This can now be done on App Engine Flex with a custom runtime, so I'm adding this answer since this question is the first thing to pop up on Google.

    I based this custom runtime on my other GAE Flex microservice, which uses the pre-built Python runtime.

    Project Structure:

    webdrivers/
    - geckodriver
    app.yaml
    Dockerfile
    main.py
    requirements.txt
    

    app.yaml:

    service: my-app-engine-service-name
    runtime: custom
    env: flex
    entrypoint: gunicorn -b :$PORT main:app --timeout 180
    

    Dockerfile:

    FROM gcr.io/google-appengine/python
    RUN apt-get update && apt-get install -y xvfb firefox
    LABEL python_version=python
    RUN virtualenv --no-download /env -p python
    ENV VIRTUAL_ENV /env
    ENV PATH /env/bin:$PATH
    ADD requirements.txt /app/
    RUN pip install -r requirements.txt
    ADD . /app/
    CMD exec gunicorn -b :$PORT main:app --timeout 180
    

    requirements.txt:

    Flask==0.12.2
    gunicorn==19.7.1
    selenium==3.13.0
    pyvirtualdisplay==0.2.1
    

    main.py:

    import os
    import traceback
    
    from flask import Flask, jsonify
    from selenium import webdriver
    from pyvirtualdisplay import Display
    
    app = Flask(__name__)
    
    # Add the webdrivers to the path
    os.environ['PATH'] += ':'+os.path.dirname(os.path.realpath(__file__))+"/webdrivers"
    
    @app.route('/')
    def hello():
        return 'Hello!!'
    
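    @app.route('/test/', methods=['GET'])
    def go_headless():
        display = None
        driver = None
        try:
            # Firefox needs an X display; Xvfb provides a virtual one
            display = Display(visible=0, size=(1024, 768))
            display.start()
            driver = webdriver.Firefox()
            driver.get("http://www.python.org")
            page_source = driver.page_source
            return jsonify({'success': True, 'result': page_source[:500]})
        except Exception as e:
            print(traceback.format_exc())
            return jsonify({'success': False, 'msg': str(e)})
        finally:
            # Always release the browser and the virtual display, even on errors
            if driver is not None:
                driver.quit()
            if display is not None:
                display.stop()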
    
    if __name__ == '__main__':
        app.run(host='127.0.0.1', port=8080, debug=True)
    

    Download geckodriver from here (linux 64):

    https://github.com/mozilla/geckodriver/releases

    Other notes:

    • Be mindful of the versions of geckodriver, Firefox, and Selenium you are using, as the combination can be finicky. A mismatch gives this error: WebDriverException: Message: Can't load the profile. Possible firefox version mismatch. You must use GeckoDriver instead for Firefox 48+. Profile Dir: /tmp/tmp 48P If you specified a log_file in the FirefoxBinary constructor, check it for details.
    • Unless you are using legacy geckodriver/Firefox, do not set DesiredCapabilities().FIREFOX["marionette"] = False (see https://github.com/SeleniumHQ/selenium/issues/5106).
    • display = Display(visible=0, size=(1024, 768)) is needed to avoid the error described in "How to fix Selenium WebDriverException: The browser appears to have exited before we could connect?"
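
Since version mismatches are the usual culprit, it can help to log which versions are actually installed in the container. The following is a minimal sketch, not part of the original setup: the helper names tool_version and report_versions are made up for illustration, and it assumes the firefox and geckodriver binaries are on the PATH.

```python
import subprocess


def tool_version(command):
    """Return the first line of a tool's version output, or None if it cannot be run."""
    try:
        output = subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # Binary missing from PATH, not executable, or exited with an error
        return None
    lines = output.decode("utf-8", "replace").strip().splitlines()
    return lines[0] if lines else None


def report_versions():
    """Collect the versions of the three pieces that have to agree."""
    versions = {
        "firefox": tool_version(["firefox", "--version"]),
        "geckodriver": tool_version(["geckodriver", "--version"]),
    }
    try:
        import selenium
        versions["selenium"] = selenium.__version__
    except ImportError:
        versions["selenium"] = None
    return versions
```

Calling report_versions() once at startup and printing the result gives you something concrete to compare against the geckodriver release notes when the profile error shows up.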

    To test locally:

    docker build . -t my-docker-image-tag
    docker run -p 8080:8080 --name=my-docker-container-name my-docker-image-tag
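
Once the container is up, the /test/ route can be exercised with a small script. This is only a sketch: the function name check_headless_endpoint is hypothetical, the default base URL is assumed from the -p 8080:8080 mapping above, and the long timeout is chosen to sit just above the 180-second gunicorn timeout.

```python
import json
from urllib.request import urlopen


def check_headless_endpoint(base_url="http://localhost:8080", timeout=200):
    """Call /test/ and return (ok, detail) based on the JSON the app returns."""
    url = base_url.rstrip("/") + "/test/"
    with urlopen(url, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    if payload.get("success"):
        # 'result' holds the first part of the rendered page source
        return True, payload.get("result", "")
    # 'msg' holds the exception text reported by the app
    return False, payload.get("msg", "unknown error")
```

Running ok, detail = check_headless_endpoint() against the local container should give ok as True with the start of the python.org markup in detail; if ok is False, detail carries the WebDriver error message.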
    

    To deploy to App Engine:

    gcloud app deploy app.yaml --version dev --project my-app-engine-project-id
    
  • 2020-12-05 17:18

    That's a super meta idea: a web request being fulfilled by a web server using a headless web browser to render a page and return the result. Phew.

    Take a look at the following answer on headless browsers, paying special attention to the Python-based ones.

    headless browser question: headless internet browser?

    Looks like the ones that support JavaScript all use WebKit and require PyQt or PySide, meaning that you're not going to be able to run them on App Engine due to the runtime restrictions that are in place.

    For SEO purposes, I would suggest doing some sort of user-agent detection and emitting a super scaled-down version of your page using Jinja2 templates or something. You'd probably get better performance that way anyway.
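
A rough sketch of that user-agent approach might look like this. Everything here is an assumption for illustration: the crawler list is not exhaustive, the names is_crawler and render_for are made up, and string.Template from the standard library stands in for a Jinja2 template so the example has no dependencies.

```python
import html
import re
from string import Template

# Hypothetical, non-exhaustive list of crawler markers
CRAWLER_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|baiduspider|duckduckbot", re.IGNORECASE
)

# Stand-in for a Jinja2 template: a bare page with no client-side rendering
STATIC_PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$summary</p></body></html>"
)


def is_crawler(user_agent):
    """Return True if the User-Agent header looks like a known crawler."""
    return bool(user_agent) and CRAWLER_PATTERN.search(user_agent) is not None


def render_for(user_agent, title, summary):
    """Return static HTML for crawlers, or None so the caller serves the normal app."""
    if not is_crawler(user_agent):
        return None
    # Escape the values by hand, since string.Template does no escaping
    return STATIC_PAGE.substitute(
        title=html.escape(title), summary=html.escape(summary)
    )
```

In a webapp2 handler you would pass the request's User-Agent header to render_for and fall back to the regular Angular page whenever it returns None.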
