Question
I am running the following code in the Spark shell:
$ spark-shell
scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._
scala> import org.apache.spark._
import org.apache.spark._
scala> object sparkClient{
| def main(args : Array[String])
| {
| val ssc = new StreamingContext(sc,Seconds(1))
| val Dstreaminput = ssc.textFileStream("hdfs:///POC/SPARK/DATA/*")
| val transformed = Dstreaminput.flatMap(word => word.split(" "))
| val mapped = transformed.map(word => if(word.contains("error"))(word,"defect")else(word,"non-defect"))
| mapped.print()
| ssc.start()
| ssc.awaitTermination()
| }
| }
defined object sparkClient
scala> sparkClient.main(null)
The output is blank, as follows: no file is read and no streaming takes place.
Time: 1510663547000 ms
Time: 1510663548000 ms
Time: 1510663549000 ms
Time: 1510663550000 ms
Time: 1510663551000 ms
Time: 1510663552000 ms
Time: 1510663553000 ms
Time: 1510663554000 ms
Time: 1510663555000 ms
The input path given in the code above contains the following files:
[hadoopadmin@master ~]$ hadoop fs -ls /POC/SPARK/DATA/
Found 3 items
-rw-r--r--   2 hadoopadmin supergroup   17881 2017-09-21 11:02 /POC/SPARK/DATA/LICENSE
-rw-r--r--   2 hadoopadmin supergroup   24645 2017-09-21 11:04 /POC/SPARK/DATA/NOTICE
-rw-r--r--   2 hadoopadmin supergroup     845 2017-09-21 12:35 /POC/SPARK/DATA/confusion.txt
Could anyone please explain where I am going wrong? Or is there anything wrong with the syntax (although I did not encounter any errors)? I am new to Spark.
Answer 1:
textFileStream won't read pre-existing data. It picks up only new files, i.e. files "created in the dataDirectory by atomically moving or renaming them into the data directory":
https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
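In other words, the three files already sitting in /POC/SPARK/DATA/ when the job starts are never picked up; only files atomically moved or renamed into the monitored directory after ssc.start() produce batches. Also note that textFileStream should be given the directory itself (hdfs:///POC/SPARK/DATA, without the trailing *). To verify the transformation logic itself without a cluster, the lambda from the question's map step can be lifted into a plain function; the names classify and processLine below are ours, not Spark API:

```scala
// Classification logic from the question's DStream map, as a plain function.
def classify(word: String): (String, String) =
  if (word.contains("error")) (word, "defect") else (word, "non-defect")

// Same per-line pipeline as the question: split on spaces, then classify each word.
def processLine(line: String): Seq[(String, String)] =
  line.split(" ").toSeq.map(classify)
```

Once the stream is running, feed it by moving a file into the monitored directory from another shell, e.g. with hdfs dfs -mv (a rename within the same filesystem, which is atomic), rather than pointing the stream at files that existed before startup.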
Source: https://stackoverflow.com/questions/47286564/unable-to-get-any-data-when-spark-streaming-program-in-run-taking-source-as-text