I'm running 5 DAGs which have generated a total of about 6 GB of log data in the base_log_folder
over a month's period. I just added a remote_base_log_folder
For your concrete problems, I have some suggestions. For all of them you would need a specialized logging config, as described in this answer: https://stackoverflow.com/a/54195537/2668430
automatically remove old log files and rotate them
I don't have any practical experience with the TimedRotatingFileHandler
from the Python standard library yet, but you might give it a try:
https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler
It not only rotates your files based on a time interval, but if you specify the backupCount
parameter, it even deletes your old log files:
If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.
Which sounds pretty much like the best solution for your first problem.
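As an illustration (plain Python logging, not Airflow-specific), here is a minimal sketch of a daily-rotating, self-cleaning handler; the file path, interval and backupCount values are just placeholders to adapt to your setup:

    import logging
    from logging.handlers import TimedRotatingFileHandler

    # Rotate the log file at midnight and keep at most 7 old files;
    # anything older is deleted automatically by the handler.
    handler = TimedRotatingFileHandler(
        filename="/path/to/base_log_folder/my_dag.log",  # placeholder path
        when="midnight",
        interval=1,
        backupCount=7,
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
    )

    logger = logging.getLogger("my_dag")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.info("Goes to a file that rotates daily and is pruned after 7 days.")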
force airflow to not log on disk (base_log_folder), but only in remote storage?
In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.
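A rough sketch of how that could look with a custom logging config module (the one you point logging_config_class at): note that the structure and handler key names of DEFAULT_LOGGING_CONFIG differ between Airflow versions, so treat the names below as assumptions to verify against your installation:

    # log_config.py - assumed to be referenced via logging_config_class in airflow.cfg
    from copy import deepcopy

    from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

    LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

    # Drop every handler that writes to local disk; keep console/remote ones.
    # Matching on the class name is a heuristic, not an official Airflow API.
    for name, handler in list(LOGGING_CONFIG["handlers"].items()):
        handler_class = handler.get("class", "")
        if "FileHandler" in handler_class or "FileTaskHandler" in handler_class:
            del LOGGING_CONFIG["handlers"][name]

    # Make sure no logger still references a removed handler.
    for logger in LOGGING_CONFIG["loggers"].values():
        logger["handlers"] = [
            h for h in logger.get("handlers", []) if h in LOGGING_CONFIG["handlers"]
        ]
    LOGGING_CONFIG["root"]["handlers"] = [
        h for h in LOGGING_CONFIG["root"].get("handlers", [])
        if h in LOGGING_CONFIG["handlers"]
    ]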
Rather, try to find logging handlers that send the output directly to a remote address, e.g. CMRESHandler, which logs directly to ElasticSearch but needs some extra fields in the log calls. Alternatively, write your own handler class and let it inherit from the Python standard library's HTTPHandler.
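If you go the custom-handler route, here is a minimal sketch of a subclass of the standard library's HTTPHandler that posts each record as JSON; the host, endpoint and payload fields are hypothetical and would need to match whatever your remote log collector expects:

    import json
    import logging
    import logging.handlers
    import urllib.request


    class JsonHTTPHandler(logging.handlers.HTTPHandler):
        """Send each log record as a JSON payload to a (hypothetical) HTTP endpoint."""

        def emit(self, record):
            try:
                payload = json.dumps({
                    "logger": record.name,
                    "level": record.levelname,
                    "message": self.format(record),
                    "created": record.created,
                }).encode("utf-8")
                req = urllib.request.Request(
                    f"http{'s' if self.secure else ''}://{self.host}{self.url}",
                    data=payload,
                    headers={"Content-Type": "application/json"},
                    method="POST",
                )
                urllib.request.urlopen(req, timeout=5)
            except Exception:
                self.handleError(record)


    # Usage (placeholder host and path):
    handler = JsonHTTPHandler(host="logs.example.com", url="/ingest",
                              method="POST", secure=True)
    logging.getLogger("my_dag").addHandler(handler)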
A final suggestion would be to combine the TimedRotatingFileHandler with ElasticSearch plus FileBeat. That way your logs end up in ElasticSearch (i.e. remote), but you won't accumulate a huge amount of logs on your Airflow disk, since they are removed by the backupCount retention policy of your TimedRotatingFileHandler.