Spark Streaming: Consuming an Arbitrary Data Source as a Stream
Motivation
When a stream-processing problem comes up in engineering work, the usual choices are Spark Streaming or Storm: Storm ingests data through Spouts, while Spark Streaming ingests it as Streams. For the sake of easy local testing I went with Spark Streaming, but out of the box it only supports a handful of sources (basic ones such as file and socket streams, plus advanced ones such as Kafka, Flume and Kinesis). When some other high-throughput source has to be consumed as a stream, the protagonist of this post, the Receiver, takes the stage.
The key class
Receiver is a mechanism implemented inside Spark: define a custom data source by subclassing Receiver, then pass an instance to ssc.receiverStream and the received data becomes RDDs, after which you can work with Spark Streaming exactly as you would with Kafka, Flume, etc. What receiverStream actually returns is a ReceiverInputDStream.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
  def onStart(): Unit = {
    // Setup stuff (start threads, open sockets, etc.) to start receiving data.
    // Must start a new thread to receive data, as onStart() must be non-blocking.
    // Call store(...) in those threads to store received data into Spark's memory.
    // Call stop(...), restart(...) or reportError(...) on any thread based on how
    // different errors need to be handled.
    // See the corresponding method documentation for more details.
  }
  def onStop(): Unit = {
    // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
  }
}
Two methods have to be implemented here: onStart and onStop. onStart holds the concrete logic of your data source. In the words of the official docs, onStart is where you set things up (start threads, open sockets, and so on) to begin receiving data; the receiving itself must run on a new thread, because onStart() is required to be non-blocking. Inside those threads you call store() to save the received data into Spark's memory as the content of the stream; store() is provided by Receiver, so there is nothing to implement yourself. One caveat: the client you connect with must be non-blocking. If you connect to several ports at the same time, or a key may only be consumed by a single thread, you will run into exceptions.
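As a minimal, self-contained sketch of this pattern, here is a receiver that reads UTF-8 text lines from a plain TCP socket; the class name SocketLineReceiver and the host/port parameters are placeholders, and error handling is reduced to restart():

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // The blocking read loop lives on its own thread; onStart() returns at once.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { } // the loop below checks isStopped

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // each line becomes one record in the stream
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Socket closed, trying to reconnect")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    }
  }
}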
Implementation
The Spark Streaming main class:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    val appName = "custom-receiver-demo" // placeholder application name
    val interval = "5"                   // batch interval in seconds
    // Placeholders for whatever your MyClient connection needs:
    val (host, port, key) = ("localhost", "9999", "my-key")

    val sparkConf = new SparkConf().setAppName(appName)
    val ssc = new StreamingContext(sparkConf, Seconds(interval.toInt))
    val stream = ssc.receiverStream(new MyReceiver(host, port, key))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach(line => println(line))
      }
    }

    try {
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception => e.printStackTrace()
    }
  }
}
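Note: each receiver permanently occupies one executor core. When testing locally, therefore, run with a master of local[n] where n is greater than the number of receivers (for example spark-submit --master "local[2]" for a single receiver); with local or local[1] there is no core left over to actually process the received data.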
The MyReceiver class:
A rough walkthrough: onStart starts a thread that runs the receive function; receive initializes the connection to your data server, fetches data, and calls store() on everything it gets, which puts it into Spark's memory. Under normal circumstances receive just loops forever in the spirit of while (true) (here the loop runs until isStopped returns true), unless you are doing time-limited stream processing, which is rare.
1) onStop may be left empty; the method that matters is onStart.
2) Adjust the StorageLevel (the constructor argument passed to Receiver) to suit your environment.
3) If the client is non-blocking, you can also start several threads in onStart to increase throughput; see the sketch after the class below.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(host: String, port: String, key: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receiving runs on its own thread so that onStart() returns immediately.
    new Thread("Socket Receiver") {
      override def run(): Unit = { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up; the receive loop exits once isStopped returns true.
  }

  // MyClient is a stand-in for any client connection; swap in whatever your source needs.
  private def receive(): Unit = {
    var client: MyClient = null
    try {
      client = new MyClient(host, port)
    } catch {
      case e: Exception =>
        // Connecting failed: ask Spark to restart the receiver rather than loop on a null client.
        restart("MyClient failed to connect!", e)
        return
    }
    while (!isStopped) {
      try {
        val message = client.get(key)
        if (message != null) store(message) // hand the record to Spark's memory
      } catch {
        case e: Exception => e.printStackTrace()
      }
    }
  }
}
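One way to realize note 3: since receive() opens its own MyClient, onStart can simply fan out to several threads, each ending up with its own connection. A sketch, assuming the source tolerates concurrent consumers and with an arbitrary thread count:

// Drop-in replacement for MyReceiver.onStart above: several receive threads,
// each of which opens its own MyClient inside receive().
def onStart(): Unit = {
  val numThreads = 4 // assumed value; tune to your source's parallelism
  (1 to numThreads).foreach { i =>
    new Thread(s"Socket Receiver $i") {
      override def run(): Unit = { receive() }
    }.start()
  }
}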
Tips:
For concrete Receiver implementations, Spark itself also ships RawNetworkReceiver and SocketReceiver; if you are interested, you can study them alongside the documentation and the code above. The core is always the same: onStart defines how the data source is hooked in.
Source: oschina
Link: https://my.oschina.net/u/4303989/blog/4308581