How to read a file using Spark Streaming and write to a simple file using Scala?


Question


I'm trying to read a file with a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I'm trying to write it back out as a new file on my local machine as well. But whenever I write the stream out as Parquet, I end up with empty folders.

This is my code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

Logger.getLogger("org").setLevel(Level.ERROR)

val spark = SparkSession
            .builder()
            .master("local[*]")
            .appName("StreamAFile")
            .config("spark.sql.warehouse.dir", "file:///C:/temp")
            .getOrCreate()

import spark.implicits._

val schemaforfile = new StructType()
  .add("SrNo", IntegerType)
  .add("Name", StringType)
  .add("Age", IntegerType)
  .add("Friends", IntegerType)

val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")

file.writeStream.format("parquet").start("C:\\Users\\roswal01\\Desktop\\streamed")

spark.stop()

Is there anything missing from my code, or is there somewhere I've gone wrong?

I also tried reading the file from an HDFS location, but the same code creates no output folders on my HDFS either.


Answer 1:


You've made a mistake here:

val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")  

The csv() function expects a directory path as its argument, not a single file. Spark monitors that directory and reads new files as they are moved into it.
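A minimal sketch, reusing spark and schemaforfile from the question; "C:\\SparkScala\\input" is a placeholder for whatever directory your CSV files get dropped into:

val file = spark.readStream
    .schema(schemaforfile)
    .csv("C:\\SparkScala\\input")  // a directory, not a single file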

For checkpointing, you should add:

.option("checkpointLocation", "path/to/HDFS/dir")

For example:

val query = file.writeStream.format("parquet")
    .option("checkpointLocation", "path/to/HDFS/dir")
    .start("C:\\Users\\roswal01\\Desktop\\streamed")

// Block here; otherwise the application exits before the
// stream has processed anything.
query.awaitTermination()
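Putting the pieces together, here is a minimal end-to-end sketch. The input and checkpoint paths are placeholders, and note that the spark.stop() from the question must not run until the query has finished, which is why awaitTermination() comes first:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object StreamAFile {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("StreamAFile")
      .getOrCreate()

    val schemaforfile = new StructType()
      .add("SrNo", IntegerType)
      .add("Name", StringType)
      .add("Age", IntegerType)
      .add("Friends", IntegerType)

    // Point the stream at a directory (placeholder path); files moved
    // into it are picked up as they arrive.
    val file = spark.readStream
      .schema(schemaforfile)
      .csv("C:\\SparkScala\\input")

    val query = file.writeStream
      .format("parquet")
      .option("checkpointLocation", "C:\\SparkScala\\checkpoint")
      .start("C:\\Users\\roswal01\\Desktop\\streamed")

    // Block until the stream is stopped externally; only then shut down.
    query.awaitTermination()
    spark.stop()
  }
}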


Source: https://stackoverflow.com/questions/41119084/how-to-read-a-file-using-sparkstreaming-and-write-to-a-simple-file-using-scala
