Spark Streaming on an S3 Directory

我寻月下人不归 2021-02-03 16:03

So I have thousands of events being streamed through Amazon Kinesis into SQS and then dumped into an S3 directory. About every 10 minutes, a new text file is created to dump the data into.

1 Answer
  • 2021-02-03 16:21

    To stream from an S3 bucket, you need to provide the path to the bucket. Spark will then stream the data from the files in that bucket: whenever a new file is created there, it will be streamed. If you append data to an existing file that has already been read, those new updates will not be read.
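
    The "new files are read, appends are ignored" behaviour can be pictured with a small plain-Scala model. This is only a sketch under simplifying assumptions, not Spark's actual implementation: the `NewFilesOnly` object, the `nextBatch` helper, and the file names are all hypothetical, and a file is treated as "new" purely by its name.

    ```scala
    // Simplified model of directory monitoring (an illustration, not Spark's code):
    // a file is emitted once, the first time its name shows up; later changes are ignored.
    object NewFilesOnly {
      // Files visible in the monitored directory, as name -> contents
      type Listing = Map[String, String]

      // Returns the contents to emit for this batch and the updated set of seen names
      def nextBatch(seen: Set[String], listing: Listing): (Seq[String], Set[String]) = {
        val fresh = listing.keys.filterNot(seen.contains).toSeq.sorted
        (fresh.map(listing), seen ++ fresh)
      }

      def main(args: Array[String]): Unit = {
        // Batch 1: one file exists (hypothetical name)
        val (batch1, seen1) = nextBatch(Set.empty, Map("events-0001.txt" -> "a"))
        // Batch 2: the first file was appended to, and a second file was created
        val (batch2, _) = nextBatch(
          seen1,
          Map("events-0001.txt" -> "a\nb", "events-0002.txt" -> "c"))
        println(batch1) // contents of events-0001.txt as first seen
        println(batch2) // only events-0002.txt; the appended line "b" is never emitted
      }
    }
    ```

    In practice this means the producer should write each dump to a new file rather than appending to an old one, which matches the setup in the question where a new text file is created about every 10 minutes.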

    Here is a small piece of code that works:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming._

    // myAccessKey and mySecretKey are Strings holding your AWS credentials, defined elsewhere
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
    hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

    // The fs.s3.* settings above may be deprecated; the s3n keys below match the s3n:// path
    hadoopConf.set("fs.s3n.awsAccessKeyId", myAccessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", mySecretKey)

    // Use a batch interval of 60 seconds
    val ssc = new StreamingContext(sc, Seconds(60))
    val lines = ssc.textFileStream("s3n://path to bucket")
    lines.print()           // Print the first few lines of each batch

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
    

    Hope this helps.
