How to run Airflow PythonOperator in a virtual environment

粉色の甜心 2021-02-07 10:58

I have several Python files that I'm currently executing using BashOperator. This gives me the flexibility to choose the Python virtual environment easily.

fro         


        
2 answers
  •  被撕碎了的回忆
    2021-02-07 11:37

    First things first: in general, you should not rely on pre-existing resources for your Operators. Your operators should be portable, so using long-standing virtualenvs goes somewhat against that principle. That said, it is not a big deal: just as you have to preinstall packages into the global environment, you can pre-bake a few environments. Or you can let the Operator create the environment and have subsequent operators reuse it, which is, I believe, the easiest and most dangerous approach.

    Implementing a "virtualenv cache" shouldn't be difficult. Reading the implementation of PythonVirtualenvOperator's execution method:

    def execute_callable(self):
        with TemporaryDirectory(prefix='venv') as tmp_dir:
            ...
            self._execute_in_subprocess(
                self._generate_python_cmd(tmp_dir,
                                          script_filename,
                                          input_filename,
                                          output_filename,
                                          string_args_filename))
            return self._read_result(output_filename)
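
    For context, here is a small standard-library sketch of the behavior this code depends on: tempfile.TemporaryDirectory removes the directory, and anything created inside it, as soon as the with-block exits. The variable names are illustrative only.

```python
import os
from tempfile import TemporaryDirectory

with TemporaryDirectory(prefix='venv') as tmp_dir:
    # Inside the block the directory exists and could hold a virtualenv
    existed_inside = os.path.isdir(tmp_dir)

# On exit the context manager has removed the directory and its contents
exists_after = os.path.isdir(tmp_dir)
print(existed_inside, exists_after)  # prints: True False
```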
    

    So it looks like it doesn't delete the virtualenv explicitly (it relies on TemporaryDirectory to do that). You can subclass PythonVirtualenvOperator and simply use your own context manager that reuses temporary directories:

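    import glob
    import os
    from contextlib import contextmanager
    from tempfile import gettempdir, mkdtemp

    @contextmanager
    def ReusableTemporaryDirectory(prefix):
        try:
            # Look in the same base directory that mkdtemp uses by default
            existing = glob.glob(os.path.join(gettempdir(), prefix + '*'))
            if existing:
                # Reuse the first directory that matches the prefix
                name = existing[0]
            else:
                name = mkdtemp(prefix=prefix)
            yield name
        finally:
            # simply don't delete the tmp dir
            pass

    def execute_callable(self):
        with ReusableTemporaryDirectory(prefix='cached-venv') as tmp_dir:
            ...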
    

    Naturally, you can get rid of the try-finally in ReusableTemporaryDirectory and put back the usual suffix and dir arguments. I made minimal changes to make it easy to compare with the original TemporaryDirectory class.
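
    A minimal sketch of that cleaned-up variant, assuming only the standard library: no try-finally, and the suffix and dir arguments restored. The name reusable_temporary_directory and the rule of picking the first match in sorted order are my own illustrative choices, not part of Airflow or of the tempfile module.

```python
import glob
import os
from contextlib import contextmanager
from tempfile import gettempdir, mkdtemp


@contextmanager
def reusable_temporary_directory(prefix, suffix=None, dir=None):
    """Yield an existing directory matching prefix/suffix, or create one.

    The directory is never deleted, so later callers can reuse it.
    """
    base = dir if dir is not None else gettempdir()
    # Escape the literal parts so only the '*' acts as a wildcard
    pattern = os.path.join(
        glob.escape(base),
        glob.escape(prefix) + '*' + glob.escape(suffix or ''),
    )
    # Sort so the same directory is picked on every run
    existing = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if existing:
        name = existing[0]
    else:
        name = mkdtemp(suffix=suffix, prefix=prefix, dir=dir)
    yield name
    # No cleanup on exit: the directory stays in place for the next caller
```

    Calling it twice with the same prefix and dir should yield the same path both times, which is what lets a later task find the virtualenv that an earlier one created.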

    With this, your virtualenv won't be discarded, but newer dependencies will still be installed by the Operator eventually.
