Question
I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamExample {
    public static void main(final String[] args) {
        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local
        final JavaSparkContext sc = new JavaSparkContext(sparkConf);
        final JavaStreamingContext ssc = new JavaStreamingContext(sc,
                new Duration(2000));
        final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");
        lines.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
When I run this code on a single file or directory, it does not print anything from the file. I can see in the logs that it is constantly polling, but nothing is printed. I also tried moving a file into the directory while the program was running.
Is there something I am missing? I tried applying a map function on the lines DStream; that also does not work.
Answer 1:
The textFileStream API is not meant to read pre-existing directory content. Instead, its purpose is to monitor the given Hadoop-compatible filesystem path for changes: files must be written into the monitored location by "moving" them from another location within the same file system. In short, you are subscribing to directory changes and will receive the content of files that newly appear in the monitored location, in whatever state they are in at the moment of the monitoring snapshot (every 2000 ms in your case). Any further updates to an existing file will not reach the stream; only directory updates (new files) will.
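To make the "moving" requirement concrete, here is a minimal sketch (the staging path and file names are hypothetical) that writes a file outside the monitored directory and then moves it in atomically on the same file system, so the stream only ever observes a fully written file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveIntoMonitoredDir {
    public static void main(final String[] args) throws IOException {
        // Hypothetical staging location; must be on the same file
        // system as the monitored directory for ATOMIC_MOVE to work.
        final Path staged = Paths.get("/opt/staging/newfile1");
        Files.createDirectories(staged.getParent());
        Files.write(staged, "whatever".getBytes());

        // Atomic move: the file appears in /opt/test/ complete, so
        // textFileStream never picks up a half-written file.
        final Path target = Paths.get("/opt/test/newfile1");
        Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}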
One way to emulate updates is to create a new file during your monitoring session:
import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class DStreamExample {
    public static void main(final String[] args) throws IOException {
        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local
        final JavaSparkContext sc = new JavaSparkContext(sparkConf);
        final JavaStreamingContext ssc = new JavaStreamingContext(sc,
                new Duration(2000));
        final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");

        // Spawn a thread that creates a new file inside the monitored
        // directory shortly after the streaming context starts.
        final Runnable r = () -> {
            try {
                Thread.sleep(5000);
            } catch (final InterruptedException e) {
                e.printStackTrace();
            }
            try {
                FileUtils.write(new File("/opt/test/newfile1"), "whatever");
            } catch (final IOException e) {
                e.printStackTrace();
            }
        };
        new Thread(r).start();

        // Print every line of each micro-batch; in Spark 1.5 foreachRDD
        // takes a Function that must return Void, hence the cast and
        // the explicit "return null".
        lines.foreachRDD((Function<JavaRDD<String>, Void>) rdd -> {
            final List<String> lines1 = rdd.collect();
            lines1.stream().forEach(l -> System.out.println(l));
            return null;
        });

        ssc.start();
        ssc.awaitTermination();
    }
}
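As a side note, on newer Spark releases (1.6 and later, if I recall the API history correctly) foreachRDD also accepts a VoidFunction, so the Void-returning cast and the "return null" above become unnecessary. A minimal sketch of the same print loop under that assumption:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

public class PrintEachBatch {
    // Assumes Spark 1.6+, where foreachRDD(VoidFunction<R>) exists;
    // the lambda no longer has to return a value.
    static void printEachBatch(final JavaDStream<String> lines) {
        lines.foreachRDD((VoidFunction<JavaRDD<String>>) rdd -> {
            for (final String line : rdd.collect()) {
                System.out.println(line);
            }
        });
    }
}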
Source: https://stackoverflow.com/questions/33704326/spark-filestreaming-issue