Luigi Pipeline beginning in S3

青春惊慌失措 2021-02-13 17:43

My initial files are in AWS S3. Could someone show me how to set this up in a Luigi Task?

I reviewed the documentation and found l

1 answer
  •  时光说笑
    2021-02-13 17:44

    The key here is to define an ExternalTask that has no inputs and whose outputs are the files you already have in S3. The Luigi docs mention this in "Requiring another Task":

    Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class

    So, basically you end up with something like this:

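    import luigi
    # in current Luigi releases S3Target lives in luigi.contrib.s3
    # (older releases exposed it as luigi.s3)
    from luigi.contrib.s3 import S3Target
    
    # placeholder for your own processing function
    from somewhere import do_something_with
    
    
    class MyS3File(luigi.ExternalTask):
        """Wraps a file that already exists in S3; it has no requires() and no run()."""
    
        def output(self):
            return S3Target('s3://my-bucket/path/to/file')
    
    
    class ProcessS3File(luigi.Task):
    
        def requires(self):
            return MyS3File()
    
        def output(self):
            return S3Target('s3://my-bucket/path/to/output-file')
    
        def run(self):
            # self.input() is the target returned by MyS3File.output();
            # open('r') returns a file stream that reads the file from your AWS S3 bucket
            with self.input().open('r') as f:
                result = do_something_with(f)
    
            # and then you write the result to the output target; the with block
            # closes the file, which is what completes the write.
            # it would be better to serialize this result before writing it to a file,
            # but this is a pretty simple example
            with self.output().open('w') as out_file:
                out_file.write(result)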
    

    UPDATE:

    Luigi uses boto to read files from and write files to AWS S3, so to make this code work you'll need to provide your credentials in your boto config file, ~/.boto (the boto documentation lists other possible config file locations):

    [Credentials]
    aws_access_key_id = 
    aws_secret_access_key = 
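
    Before running the pipeline it can help to confirm that the config file parses and that both keys are filled in. The following is only a minimal sketch using Python's standard configparser; the helper name missing_boto_credentials is made up for this example and is not part of Luigi or boto, and it checks only the one file you pass in, not the other locations boto may search.

```python
import configparser


def missing_boto_credentials(config_path):
    """Return the credential keys that are absent or empty in a boto-style config.

    Hypothetical helper for illustration; it only inspects the file at
    config_path and does not contact AWS or validate the keys themselves.
    """
    # interpolation=None so that a '%' in a secret is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file is skipped without raising

    required = ("aws_access_key_id", "aws_secret_access_key")
    if not parser.has_section("Credentials"):
        return list(required)

    return [
        key for key in required
        if not parser.get("Credentials", key, fallback="").strip()
    ]
```

    Calling missing_boto_credentials(os.path.expanduser('~/.boto')) returns an empty list when both values are present; any names it returns are the ones still left blank, as in the template above.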
    
