luigi

How to reset luigi task status?

Submitted by 点点圈 on 2019-12-05 03:47:11
Currently, I have a bunch of Luigi tasks queued together in a simple dependency chain (a -> b -> c -> d). d gets executed first and a at the end; a is the task that gets triggered. All the tasks except a return a luigi.LocalTarget() object and have a single generic luigi.Parameter(), which is a string containing a date and a time. The pipeline runs on a Luigi central server with history enabled. The problem is that when I rerun the said task a, Luigi checks the history to see whether that particular task has been run before; if it had a status of DONE, it doesn't run the tasks (d in this …
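
The usual way to force a rerun is to make the task look incomplete again, since Luigi's default complete() just checks whether output() exists. A minimal sketch (task names and the run_ts value are placeholders, not the poster's code):

```python
# Minimal sketch: "reset" a Luigi task by removing its output target,
# since the default Task.complete() only checks output().exists().
import luigi

class A(luigi.Task):
    run_ts = luigi.Parameter()  # e.g. "2019-12-05T03:47" -- hypothetical value

    def output(self):
        return luigi.LocalTarget('a_{}.txt'.format(self.run_ts))

    def run(self):
        with self.output().open('w') as f:
            f.write('done')

def reset(task):
    """Delete the task's output so the scheduler treats it as not complete."""
    out = task.output()
    if out.exists():
        out.remove()  # LocalTarget.remove() deletes the file on disk

if __name__ == '__main__':
    task = A(run_ts='2019-12-05T03:47')
    reset(task)
    luigi.build([task], local_scheduler=True)
```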

Luigi Pipeline beginning in S3

Submitted by 陌路散爱 on 2019-12-04 18:36:57
Question: My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task? I reviewed the documentation and found luigi.S3, but it is not clear to me what to do with it; I then searched the web and only got links about mortar-luigi and implementations on top of Luigi. UPDATE: After following the example provided by @matagus (I created the ~/.boto file as suggested too): # coding: utf-8 import luigi from luigi.s3 import S3Target, S3Client class MyS3File(luigi …
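
For reference, a minimal sketch of wiring an existing S3 object into a pipeline via an ExternalTask; the bucket and key are placeholders, and in recent Luigi releases the S3 classes live in luigi.contrib.s3 rather than luigi.s3 (credentials come from your boto/boto3 configuration):

```python
# Minimal sketch: an ExternalTask represents a file that already exists in S3,
# so Luigi never tries to run it -- it only checks that the target exists.
import luigi
from luigi.contrib.s3 import S3Target

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/path/to/input.csv')  # placeholder URI

class ProcessS3File(luigi.Task):
    def requires(self):
        return MyS3File()

    def output(self):
        return luigi.LocalTarget('processed.csv')

    def run(self):
        # Stream the S3 object and write a local copy (placeholder logic).
        with self.input().open('r') as src, self.output().open('w') as dst:
            for line in src:
                dst.write(line)

if __name__ == '__main__':
    luigi.build([ProcessS3File()], local_scheduler=True)
```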

Scheduling spark jobs on a timely basis

Submitted by 筅森魡賤 on 2019-12-04 18:22:43
Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow. Thanks in advance. Joe Harris: Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines. Airflow: Try this first. Decent UI, Python-ish job definitions, semi-accessible for non-programmers; the dependency declaration syntax is weird. Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled, so make sure you build your pipelines to support this. Azkaban: Nice UI, …
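
As an illustration of the Airflow suggestion, a minimal daily DAG in Airflow 1.x style (matching the 2019 context) that shells out to spark-submit; the script path, dates, and cluster settings are placeholders:

```python
# Minimal sketch: schedule a Spark job once per day via spark-submit.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='daily_spark_job',
    start_date=datetime(2019, 12, 1),
    schedule_interval='@daily',   # use '@weekly' for a weekly cadence
    catchup=True,                 # lets Airflow backfill missed runs
)

submit = BashOperator(
    task_id='spark_submit',
    bash_command=(
        'spark-submit --master yarn --deploy-mode cluster '
        '/opt/jobs/my_job.py --run-date {{ ds }}'  # {{ ds }} = execution date
    ),
    dag=dag,
)
```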

Can I use Luigi with Python Celery?

Submitted by 雨燕双飞 on 2019-12-04 16:52:53
I am using Celery for my web application. Celery executes parent tasks which then execute a further pipeline of tasks. The issues with Celery: I can't get the dependency graph and visualizer I get with Luigi to see the status of my parent task, and Celery does not provide a mechanism to restart a failed pipeline and resume from where it failed. These two things I can easily get from Luigi. So I was thinking that once Celery runs the parent task, inside that task I execute the Luigi pipeline. Is there going to be any issue with that, i.e. I need to autoscale the Celery workers based on queue size …
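
A minimal sketch of that idea, i.e. a Celery task that kicks off a Luigi pipeline with luigi.build(); the broker URL and task names are placeholders, not from the question:

```python
# Minimal sketch: run a small Luigi pipeline from inside a Celery task.
import luigi
from celery import Celery

app = Celery('pipelines', broker='redis://localhost:6379/0')  # placeholder broker

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget('extract.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('raw data')

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget('transform.txt')

    def run(self):
        with self.input().open('r') as src, self.output().open('w') as dst:
            dst.write(src.read().upper())

@app.task
def run_pipeline():
    # local_scheduler=True keeps the sketch self-contained; point this at a
    # central luigid instead if you want Luigi's dependency visualizer.
    return luigi.build([Transform()], local_scheduler=True)
```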

How to Dynamically create a Luigi Task

Submitted by 谁都会走 on 2019-12-04 04:47:26
I am building a wrapper for Luigi tasks and I ran into a snag with the Register class, which is actually an ABC metaclass, not being picklable when I create a dynamic type. The following code, more or less, is what I'm using to develop the dynamic class: class TaskWrapper(object): '''Luigi Spark Factory from the provided JobClass Args: JobClass(ScrubbedClass): The job to wrap options: Options as passed into the JobClass ''' def __new__(self, JobClass, **options): # Validate we have a good job valid_classes = ( ScrubbedClass01, # ScrubbedClass02, # ScrubbedClass03, ) if any(vc == JobClass for …
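
A minimal sketch of the dynamic-class part on its own (separate from the wrapper above): building a luigi.Task subclass with type() and binding it to a module-level name so pickle can find it by name; make_task and its arguments are illustrative, not part of any Luigi API:

```python
# Minimal sketch: create a Luigi task class at runtime with type().
# Luigi's Register metaclass picks it up automatically when the class is created.
import luigi

def make_task(name, target_path):
    """Build a luigi.Task subclass dynamically (illustrative helper)."""
    def output(self):
        return luigi.LocalTarget(target_path)

    def run(self):
        with self.output().open('w') as f:
            f.write('built by {}'.format(name))

    return type(name, (luigi.Task,), {'output': output, 'run': run})

# Bind the generated class at module level so pickle can resolve it by name.
DynamicTask = make_task('DynamicTask', 'dynamic_task.txt')

if __name__ == '__main__':
    luigi.build([DynamicTask()], local_scheduler=True)
```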

Luigi - Unfulfilled %s at run time

Submitted by 对着背影说爱祢 on 2019-12-04 03:37:29
I am trying to learn, in a very simple way, how Luigi works. Just as a newbie I came up with this code: import luigi class class1(luigi.Task): def requires(self): return class2() def output(self): return luigi.LocalTarget('class1.txt') def run(self): print 'IN class A' class class2(luigi.Task): def requires(self): return [] def output(self): return luigi.LocalTarget('class2.txt') if __name__ == '__main__': luigi.run() Running this at the command prompt gives an error saying raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', This happens because you define an output for class2 but never …
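
A minimal sketch of the fix the answer points at (ported to Python 3 print syntax): each task's run() must actually create the target it declares in output(), otherwise Luigi raises "Unfulfilled dependency at run time":

```python
# Minimal sketch: every task writes the file its output() promises.
import luigi

class class2(luigi.Task):
    def output(self):
        return luigi.LocalTarget('class2.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('IN class2\n')   # creating the file satisfies the target

class class1(luigi.Task):
    def requires(self):
        return class2()

    def output(self):
        return luigi.LocalTarget('class1.txt')

    def run(self):
        print('IN class A')
        with self.output().open('w') as f:
            f.write('IN class1\n')

if __name__ == '__main__':
    # run e.g.: python this_file.py class1 --local-scheduler
    luigi.run()
```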

Python based asynchronous workflow modules : What is difference between celery workflow and luigi workflow?

Submitted by 独自空忆成欢 on 2019-12-03 02:06:56
I am using Django as a web framework. I need a workflow engine that can run synchronous as well as asynchronous (batch) chains of tasks. I found Celery and Luigi as batch-processing workflow tools. My first question is: what is the difference between these two modules? Luigi allows us to rerun a failed chain of tasks, and only the failed sub-tasks get re-executed. What about Celery: if we rerun the chain (after fixing the failed sub-task's code), will it rerun the already succeeded sub-tasks? Suppose I have two sub-tasks. The first one creates some files and the second one reads those files. When I put these into …
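
A minimal sketch of that two-sub-task scenario expressed in Luigi (file names are placeholders), showing why already-succeeded steps are skipped on a rerun:

```python
# Minimal sketch: CreateFiles is considered complete once its target exists,
# so after a failure in ReadFiles only ReadFiles runs again on the next build.
import luigi

class CreateFiles(luigi.Task):
    def output(self):
        return luigi.LocalTarget('data.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('some data\n')

class ReadFiles(luigi.Task):
    def requires(self):
        return CreateFiles()

    def output(self):
        return luigi.LocalTarget('report.txt')

    def run(self):
        # If this step raises, fix the code and rerun: CreateFiles' output
        # already exists, so it is not re-executed.
        with self.input().open('r') as src, self.output().open('w') as dst:
            dst.write(src.read())

if __name__ == '__main__':
    luigi.build([ReadFiles()], local_scheduler=True)
```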