Removing Airflow task logs

走了就别回头了 2021-02-03 19:52

I'm running 5 DAGs which have generated a total of about 6GB of log data in the base_log_folder over a month's period. I just added a remote_base_log_folder

6 Answers
  • 2021-02-03 20:21

    The Airflow maintainers don't consider truncating logs to be part of Airflow's core logic (see this). In this issue, they suggest changing LOG_LEVEL to avoid producing too much log data.

    In this PR, you can see how to change the log level in airflow.cfg.
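To illustrate why raising the log level shrinks log volume, here is a minimal sketch using only Python's standard logging module. The logger name `demo.task` and the helper `count_emitted` are hypothetical, and nothing here touches Airflow's own configuration; it only shows that records below the configured level are dropped before they reach a handler.

```python
import io
import logging


def count_emitted(level):
    """Log one message per severity and return how many lines were written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("demo.task")  # hypothetical logger name
    logger.handlers = [handler]              # start from a clean handler list
    logger.propagate = False
    logger.setLevel(level)

    logger.debug("debug detail")
    logger.info("routine progress")
    logger.warning("something looks off")
    logger.error("something failed")

    handler.flush()
    return len(stream.getvalue().splitlines())
```

Calling `count_emitted(logging.WARNING)` writes fewer lines than `count_emitted(logging.INFO)`, which is the effect the maintainers' suggestion relies on.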

    Good luck.

  • 2021-02-03 20:29

    I don't think there is a rotation mechanism, but you can store the logs in S3 or Google Cloud Storage as described here: https://airflow.incubator.apache.org/configuration.html#logs

  • 2021-02-03 20:30

    Please refer to https://github.com/teamclairvoyant/airflow-maintenance-dags

    This repository contains DAGs that kill halted tasks and clean up logs. You can borrow the concepts and write a new DAG that cleans up according to your requirements.
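The core of such a cleanup DAG is an age-based delete. Below is a minimal sketch of that idea in plain Python, not the code from the repository above. The function name `remove_old_logs` and its parameters are hypothetical, and it assumes task logs are files ending in `.log` somewhere under the base log folder. It could be called from a PythonOperator on a daily schedule.

```python
import os
import time


def remove_old_logs(base_log_folder, max_age_days):
    """Delete *.log files older than max_age_days; return the removed paths."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    # Walk bottom-up so emptied directories can be pruned after their files.
    for dirpath, dirnames, filenames in os.walk(base_log_folder, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".log") and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
        # Remove the directory if it is now empty (but never the base folder).
        if dirpath != base_log_folder and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed
```

Age is taken from the file's modification time, so a log that is still being written to is not touched as long as `max_age_days` is larger than your longest task run.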

  • 2021-02-03 20:38

    I know it sounds savage, but have you tried pointing base_log_folder to /dev/null? I run Airflow inside a container, so I don't care about the files either, as long as the logger pipes to STDOUT as well.

    Not sure how well this plays with S3 though.

  • 2021-02-03 20:40

    I have some suggestions for your concrete problems. All of them require a custom logging config, as described in this answer: https://stackoverflow.com/a/54195537/2668430

    • automatically remove old log files and rotate them

    I don't have any practical experience with the TimedRotatingFileHandler from the Python standard library yet, but you might give it a try: https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler

    It not only offers to rotate your files based on a time interval, but if you specify the backupCount parameter, it even deletes your old log files:

    If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.

    That sounds pretty much like the best solution for your first problem.
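A minimal sketch of what that handler setup could look like with the standard library alone. The temporary directory, the logger name `demo.rotating`, and the choice of daily rotation with seven backups are all assumptions for illustration. I have not verified how well this fits Airflow's one-file-per-try task log layout.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
log_path = os.path.join(log_dir, "task.log")

handler = logging.handlers.TimedRotatingFileHandler(
    log_path,
    when="midnight",  # roll over once per day
    backupCount=7,    # keep at most 7 rotated files; older ones are deleted
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("demo.rotating")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("this line goes to task.log and is rotated daily")
handler.close()
```

In a real setup the same `class`, `when` and `backupCount` values would go into the handler entry of your logging config dict instead of being built by hand.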


    • force airflow to not log on disk (base_log_folder), but only in remote storage?

    In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.

    Instead, look for logging handlers that send the output directly to a remote address, e.g. CMRESHandler, which logs directly to Elasticsearch but needs some extra fields in the log calls. Alternatively, write your own handler class that inherits from the Python standard library's HTTPHandler.
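A minimal sketch of the "write your own handler" option, assuming an HTTP endpoint that accepts form-encoded POSTs. The class name `RemoteLogHandler`, the host `logs.example.com:8080` and the path `/ingest` are placeholders, not a real service. Only `mapLogRecord` is overridden, which is the hook HTTPHandler provides for choosing what gets sent.

```python
import logging
import logging.handlers


class RemoteLogHandler(logging.handlers.HTTPHandler):
    """Hypothetical handler that posts a trimmed-down record to an HTTP endpoint."""

    def mapLogRecord(self, record):
        # HTTPHandler sends whatever dict this returns as form fields;
        # keep only a few fields instead of the whole record __dict__.
        return {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }


# Host and URL below are placeholders, not a real service.
handler = RemoteLogHandler("logs.example.com:8080", "/ingest", method="POST")
```

Constructing the handler does not open a connection; a request is only made when a record is emitted. Note that HTTPHandler sends one request per record, so for chatty tasks you may want to put some buffering in front of it.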


    A final suggestion would be to combine the TimedRotatingFileHandler with an Elasticsearch and Filebeat setup. That way your logs are stored in Elasticsearch (i.e. remotely), but they don't pile up on your Airflow disk, since old files are removed by the backupCount retention policy of the TimedRotatingFileHandler.

  • 2021-02-03 20:45

    We remove the task logs by implementing our own FileTaskHandler and pointing to it in airflow.cfg. In other words, we override the default log handler to keep only the N most recent task logs, without scheduling additional DAGs.

    We are using Airflow==1.10.1.

    [core]
    logging_config_class = log_config.LOGGING_CONFIG
    

    log_config.LOGGING_CONFIG

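    import os
    
    from airflow import configuration as conf
    
    # LOG_LEVEL and JOB_LOG_LEVEL are used below but not shown in this excerpt;
    # they are assumed to be defined in this module.
    BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')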
    FOLDER_TASK_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}'
    FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
    
    LOGGING_CONFIG = {
        'formatters': {},
        'handlers': {
            '...': {},
            'task': {
                'class': 'file_task_handler.FileTaskRotationHandler',
                'formatter': 'airflow.job',
                'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
                'filename_template': FILENAME_TEMPLATE,
                'folder_task_template': FOLDER_TASK_TEMPLATE,
                'retention': 20
            },
            '...': {}
        },
        'loggers': {
            'airflow.task': {
                'handlers': ['task'],
                'level': JOB_LOG_LEVEL,
                'propagate': False,
            },
            'airflow.task_runner': {
                'handlers': ['task'],
                'level': LOG_LEVEL,
                'propagate': True,
            },
            '...': {}
        }
    }
    

    file_task_handler.FileTaskRotationHandler

    import os
    import shutil
    
    from airflow.utils.helpers import parse_template_string
    from airflow.utils.log.file_task_handler import FileTaskHandler
    
    
    class FileTaskRotationHandler(FileTaskHandler):
    
        def __init__(self, base_log_folder, filename_template, folder_task_template, retention):
            """
            :param base_log_folder: Base log folder to place logs.
            :param filename_template: template filename string.
            :param folder_task_template: template folder task path.
            :param retention: Number of folder logs to keep
            """
            super(FileTaskRotationHandler, self).__init__(base_log_folder, filename_template)
            self.retention = retention
            self.folder_task_template, self.folder_task_template_jinja_template = \
                parse_template_string(folder_task_template)
    
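        @staticmethod
        def _get_directories(path='.'):
            # os.walk gives no ordering guarantee, so sort the names: with ISO
            # timestamps as folder names this orders them oldest to newest.
            # The default covers a task folder that does not exist yet.
            return sorted(next(os.walk(path), (None, [], None))[1])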
    
        def _render_folder_task_path(self, ti):
            if self.folder_task_template_jinja_template:
                jinja_context = ti.get_template_context()
                return self.folder_task_template_jinja_template.render(**jinja_context)
    
            return self.folder_task_template.format(dag_id=ti.dag_id, task_id=ti.task_id)
    
        def _init_file(self, ti):
            relative_path = self._render_folder_task_path(ti)
            folder_task_path = os.path.join(self.local_base, relative_path)
            subfolders = self._get_directories(folder_task_path)
            to_remove = set(subfolders) - set(subfolders[-self.retention:])
    
            for dir_to_remove in to_remove:
                full_dir_to_remove = os.path.join(folder_task_path, dir_to_remove)
                print('Removing', full_dir_to_remove)
                shutil.rmtree(full_dir_to_remove)
    
            return FileTaskHandler._init_file(self, ti)
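If you want to try the pruning step outside Airflow first, here is a minimal standalone sketch of the same idea. `prune_run_folders` is a hypothetical helper, not part of Airflow or of the handler above, and it assumes run folders are named with ISO timestamps so that sorting the names sorts them by time.

```python
import os
import shutil


def prune_run_folders(task_folder, retention):
    """Keep the `retention` newest run folders under task_folder; delete the rest."""
    if not os.path.isdir(task_folder):
        return []  # nothing logged yet for this task
    # Sort names so that ISO-timestamp folders are ordered oldest to newest.
    runs = sorted(
        name for name in os.listdir(task_folder)
        if os.path.isdir(os.path.join(task_folder, name))
    )
    to_remove = runs[:-retention] if retention > 0 else runs
    for name in to_remove:
        shutil.rmtree(os.path.join(task_folder, name))
    return to_remove
```

The explicit `retention > 0` check matters because a slice like `runs[-0:]` returns the whole list, which would silently keep everything.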
    