How to read a file using Spark Streaming and write to a simple file using Scala?

Question

I'm trying to read a file using a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I'm trying to write it out as a new file on the same machine. But whenever I write my stream and store it as Parquet, I end up with empty folders.

This is my code:

 import org.apache.log4j.{Level, Logger}
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

 Logger.getLogger("org").setLevel(Level.ERROR)
 val spark = SparkSession
             .builder()
             .master("local[*]")
             .appName("StreamAFile")
             .config("spark.sql.warehouse.dir", "file:///C:/temp")
             .getOrCreate()

 import spark.implicits._
 val schemaforfile = new StructType()
             .add("SrNo", IntegerType)
             .add("Name", StringType)
             .add("Age", IntegerType)
             .add("Friends", IntegerType)

 val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")

 file.writeStream.format("parquet").start("C:\\Users\\roswal01\\Desktop\\streamed")

 spark.stop()

Is there anything missing from my code, or anything I've done wrong?

I also tried reading this file from an HDFS location, but the same code does not create any output folders on HDFS either.

Answer 1:

The mistake is here:

val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")

The csv() function should be given a directory path as its argument. Spark scans this directory and reads each new file as it is moved into it.

For checkpointing, you should also add this option to the writer:

.option("checkpointLocation", "path/to/HDFS/dir")

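For example (note the call to awaitTermination(), which keeps the application running while the stream is processed instead of stopping it right after start()):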

val query = file.writeStream.format("parquet")
    .option("checkpointLocation", "path/to/HDFS/dir")
    .start("C:\\Users\\roswal01\\Desktop\\streamed") 

query.awaitTermination()
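
Putting the pieces together, a minimal sketch of the whole program might look like the following. It assumes the CSV files are dropped into a directory such as C:\SparkScala\input and that C:\SparkScala\checkpoint is available for checkpoints; both paths are hypothetical and not from the question. It also uses spark.sparkContext.setLogLevel in place of the Logger call, purely to keep the imports short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object StreamAFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("StreamAFile")
      .getOrCreate()

    // Reduce log noise (stands in for the Logger call in the question).
    spark.sparkContext.setLogLevel("ERROR")

    // A streaming file source needs an explicit schema.
    val schemaForFile = new StructType()
      .add("SrNo", IntegerType)
      .add("Name", StringType)
      .add("Age", IntegerType)
      .add("Friends", IntegerType)

    // Hypothetical input directory: point at the folder, not at a single file.
    val input = spark.readStream
      .schema(schemaForFile)
      .csv("C:\\SparkScala\\input")

    // Hypothetical checkpoint directory; the output path is the one from the question.
    val query = input.writeStream
      .format("parquet")
      .option("checkpointLocation", "C:\\SparkScala\\checkpoint")
      .start("C:\\Users\\roswal01\\Desktop\\streamed")

    // Block until the query is stopped or fails; without this the program exits
    // before any micro-batch has been written.
    query.awaitTermination()
  }
}
```

With this running, copying fakefriends.csv into the input directory should cause Parquet part files to appear under the output folder after the next micro-batch.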

Source: https://stackoverflow.com/questions/41119084/how-to-read-a-file-using-sparkstreaming-and-write-to-a-simple-file-using-scala

Tags

scala

apache-spark

spark-streaming

parquet