Spark FileStreaming issue

Submitted by 社会主义新天地 on 2020-01-30 08:25:32

Question


I am trying a simple file-streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1):

public class DStreamExample {

    public static void main(final String[] args) {

        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local

        final JavaSparkContext sc = new JavaSparkContext(sparkConf);

        final JavaStreamingContext ssc = new JavaStreamingContext(sc,
            new Duration(2000));

        final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");
        lines.print();

        ssc.start();
        ssc.awaitTermination();
    }
}

When I run this code on a single file or directory, it does not print anything from the file. I can see in the logs that it is constantly polling, but nothing is printed. I also tried moving a file into the directory while the program was running.

Is there something I am missing? I also tried applying a map function on the lines DStream; that does not work either.


Answer 1:


The textFileStream API is not meant to read existing directory contents. Instead, its purpose is to monitor the given Hadoop-compatible filesystem path for changes: files must be written into the monitored location by "moving" (renaming) them from another location within the same file system. In short, you are subscribing to directory changes, and you will receive the contents of files that newly appear in the monitored location, in whatever state they are in at the moment of the monitoring snapshot (every 2000 ms in your case). Any further updates to an existing file will not reach the stream; only directory updates (new files) will.
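A minimal sketch of the write-then-move pattern described above, using plain `java.nio.file` (the paths here are temporary directories standing in for a staging location and the monitored `/opt/test/` from the question, so the sketch is runnable on its own):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicDrop {
    public static void main(String[] args) throws IOException {
        // Stand-ins for a staging directory and the Spark-monitored directory.
        Path staging = Files.createTempDirectory("staging");
        Path monitored = Files.createTempDirectory("monitored");

        // 1. Write the file OUTSIDE the monitored directory first,
        //    so no reader ever sees it half-written.
        Path tmp = staging.resolve("newfile1");
        Files.write(tmp, "whatever".getBytes());

        // 2. Then move it in. On the same file system this is a rename,
        //    so the file appears in the monitored directory atomically.
        Path target = monitored.resolve("newfile1");
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        System.out.println(Files.exists(target)); // prints "true"
    }
}
```

The same idea applies to HDFS: write to a temporary path, then rename into the directory that `textFileStream` watches.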

One way to emulate such updates is to create a new file during your monitoring session:

import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class DStreamExample {

    public static void main(final String[] args) throws IOException {

        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local

        final JavaSparkContext sc = new JavaSparkContext(sparkConf);

        final JavaStreamingContext ssc = new JavaStreamingContext(sc,
                new Duration(2000));

        final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");

        // spawn a thread that will create a new file inside the monitored directory shortly
        Runnable r = () -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            try {
                FileUtils.write(new File("/opt/test/newfile1"), "whatever");
            } catch (IOException e) {
                e.printStackTrace();
            }
        };

        new Thread(r).start();

        lines.foreachRDD((Function<JavaRDD<String>, Void>) rdd -> {
            List<String> lines1 = rdd.collect();
            lines1.forEach(System.out::println);
            return null;
        });

        ssc.start();
        ssc.awaitTermination();
    }
}



Source: https://stackoverflow.com/questions/33704326/spark-filestreaming-issue
