Spark Streaming
Overview
Stream computing is usually discussed in contrast with batch computing. In the streaming model the input is continuous and can be regarded as unbounded in time, which means the complete data set is never available before the computation runs. The results are emitted continuously as well, so the output is also unbounded in time. Stream computing usually has strict real-time requirements: the target computation is defined first, and the logic is then applied to the data as it arrives; to improve efficiency, incremental computation replaces full recomputation wherever possible. In the batch model, by contrast, the full data set exists first, the computation logic is then defined and applied to all of it, and the result is produced once, in full.
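To make the contrast concrete, here is a minimal sketch in plain Scala (no Spark involved; the word lists are made up for illustration) of full recomputation versus incremental updating:

object IncrementalVsBatch {
  def main(args: Array[String]): Unit = {
    // Batch model: the full data set exists up front and is processed in one pass
    val allWords = List("this", "is", "a", "demo", "a")
    val batchResult = allWords.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    println(batchResult)

    // Streaming model: records arrive one at a time and the result is updated
    // incrementally instead of being recomputed over the unbounded history
    var runningCounts = Map.empty[String, Int]
    def onNewWord(word: String): Unit =
      runningCounts = runningCounts + (word -> (runningCounts.getOrElse(word, 0) + 1))
    List("this", "is", "a", "demo", "a").foreach(onNewWord)
    println(runningCounts)
  }
}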
The mainstream stream-processing frameworks today are Kafka Streams, Storm (JStorm), Spark Streaming, and Flink (Blink).
①: Kafka Streams: a stream-processing library shipped as a plain jar on top of Kafka; it has a low barrier to entry and is simple to integrate into applications.
②: Apache Storm: a pure stream-processing engine, a low-latency framework that can process on the order of a million records per second.
③: Spark Streaming: a stream-processing framework built on top of Spark's batch engine. Unlike batch processing, the data it computes over is an unbounded stream and the output is produced continuously. Internally, Spark Streaming splits the continuous input into a sequence of small RDD micro-batches to approximate stream processing, so at the micro level it is still a batch framework.
④: Flink DataStream: a large step forward in latency, usability and performance, and currently the most popular stream-computing engine.
Quick Start
① Add the dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.5</version>
</dependency>
② Write the Driver
package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("spark://zly:7077")
      //.setMaster("local[2]")
    // one micro-batch every second (batch boundaries can only be defined by time)
    val ssc = new StreamingContext(conf, Seconds(1))
    // set the log level
    ssc.sparkContext.setLogLevel("FATAL")
    // 2. Create the continuous input DStream (a SocketReceiver; test data is sent by hand through nc)
    // every receiver occupies one core
    val lines: DStream[String] = ssc.socketTextStream("zly", 9999)
    // val lines1: DStream[String] = ssc.socketTextStream("zly", 9999)
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    // block until the job is killed, then release resources
    ssc.awaitTermination()
  }
}
③ Package the project with mvn package
<build>
<plugins>
<!-- Scala compiler plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>4.0.1</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- fat-jar (shade) plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
<!-- Java compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
④ Install the nc utility
# yum -y install nmap-ncat
⑤ Start an nc server
# nc -lk 9999
⑥ Submit the job
./bin/spark-submit \
  --master spark://zly:7077 \
  --name SparkWordCountTopology \
  --deploy-mode client \
  --class com.baizhi.quickstart.SparkWordCountTopology \
  --total-executor-cores 6 \
  /root/spark-dstream-1.0-SNAPSHOT.jar
⑦ Check the results
1) Send some test data through the nc session
2) The streaming job receives the data and performs the word count
Note: every receiver occupies one thread, and threads are assigned to the receivers first; only the remaining threads are available for the actual computation, so pay attention to how many cores you allocate.
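As a rough sketch of this point (assuming a second nc server on the hypothetical port 9998), a driver with two socket receivers needs at least local[3]: two threads are taken by the receivers, and at least one must remain for the batch computation; local[2] would leave none and nothing would ever be printed.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceiversTopology {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TwoReceiversTopology")
      .setMaster("local[3]")                                     // 2 receivers + at least 1 core for processing
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("FATAL")
    val s1: DStream[String] = ssc.socketTextStream("zly", 9999)  // receiver #1
    val s2: DStream[String] = ssc.socketTextStream("zly", 9998)  // receiver #2 (hypothetical second nc server)
    s1.union(s2)
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}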
Discretized Streams (DStreams)
A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input stream received from a source or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a series of consecutive RDDs, Spark's abstraction for an immutable, distributed data set; each RDD in a DStream contains the data from one specific time interval.
Note: given how DStreams run internally, the Seconds() interval configured on the StreamingContext should be slightly larger than the time it takes to compute one micro-batch; this is what keeps data from piling up in Spark's memory.
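The "sequence of RDDs" model can be made visible with foreachRDD, which hands every micro-batch to the application as an ordinary RDD. A rough sketch (the helper name and the timing code are only illustrative; lines can be any DStream, such as the one from the quick-start example):

import org.apache.spark.streaming.dstream.DStream

object BatchInspection {
  // log how many records each micro-batch contains and how long it takes to process
  def logBatchSizes(lines: DStream[String]): Unit = {
    lines.foreachRDD { (rdd, time) =>
      val start = System.currentTimeMillis()
      val records = rdd.count()                 // each micro-batch is a plain RDD, so any RDD action works
      val elapsedMs = System.currentTimeMillis() - start
      // if elapsedMs regularly approaches the batch interval, batches will start to queue up
      println(s"batch $time: $records records, processed in $elapsedMs ms")
    }
  }
}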
DStreams & Receivers
Every input DStream (except file streams, discussed later) is associated with a Receiver object, which receives the data from its source and stores it in Spark's memory for processing. Spark Streaming provides two categories of built-in input sources for consuming data from external systems:
Built-in input sources
Basic sources
Sources that can be created directly from the StreamingContext API, e.g. fileStream (reading files) and socketTextStream (for testing).
① socketTextStream
package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      //.setMaster("spark://zly:7077")
      .setMaster("local[2]")
    // one micro-batch every second (batch boundaries can only be defined by time)
    val ssc = new StreamingContext(conf, Seconds(1))
    // set the log level
    ssc.sparkContext.setLogLevel("FATAL")
    // 2. Create the continuous input DStream (a SocketReceiver; test data is sent by hand through nc)
    // every receiver occupies one core
    val lines: DStream[String] = ssc.socketTextStream("zly", 9999)
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    // block until the job is killed, then release resources
    ssc.awaitTermination()
  }
}
② File Streams
textFileStream is implemented on top of fileStream.
package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopologyFileStream {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")
    // 2. Create the continuous input DStream
    val lines = ssc.textFileStream("hdfs://zly:9000/demo/words")
    //val lines: DStream[(LongWritable,Text)] = ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://zly:9000/demo/words")
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Or, using fileStream directly:
package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopologyFileStream {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")
    // 2. Create the continuous input DStream
    val lines: DStream[(LongWritable, Text)] = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://zly:9000/demo/words")
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.map(_._2.toString).flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Note: once the job is running, it checks hdfs://zly:9000/demo/words at every batch interval for new files; whenever a new file appears it is read automatically. Changes to the contents of existing files are not monitored. Tip: when testing, make sure the clocks of the client and the cluster are synchronized. The monitored files must also satisfy the following:
① The files must all have the same data format.
② The files must be created in dataDirectory by atomically moving or renaming them into it (one way to do this is sketched after this list).
③ Once moved, a file must not be modified; data appended to it afterwards will not be read.
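One way to satisfy requirement ② (a sketch only, using the Hadoop FileSystem API and a hypothetical /demo/staging directory) is to write the file outside the watched directory first and then rename it into hdfs://zly:9000/demo/words, so the job only ever sees complete files:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AtomicFileDrop {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(URI.create("hdfs://zly:9000"), new Configuration())
    val staging = new Path("/demo/staging/words-001.txt")   // hypothetical staging location
    val target  = new Path("/demo/words/words-001.txt")
    val out = fs.create(staging)                            // write the whole file outside the watched directory
    out.write("this is a demo\nhello hello\n".getBytes("UTF-8"))
    out.close()
    fs.rename(staging, target)                              // rename is atomic within one HDFS namespace
    fs.close()
  }
}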
③ Queue of RDDs (for testing)
package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue

object SparkWordCountTopologyQueueRDDS {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")
    val queueRDDS = new Queue[RDD[String]]()
    // generate test data
    new Thread(new Runnable {
      override def run(): Unit = {
        // endless loop that keeps appending batches
        while (true) {
          // += adds a test RDD to the queue
          queueRDDS += ssc.sparkContext.makeRDD(List("this is a demo", "hello hello"))
          Thread.sleep(500)
        }
      }
    }).start()
    // 2. Create the continuous input DStream
    val lines: DStream[String] = ssc.queueStream(queueRDDS)
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Advanced input sources
Advanced sources
Sources that are not bundled with Spark itself, e.g. Kafka, Flume and Kinesis; these generally require additional third-party dependencies.
① Custom Receiver
You need to define a class that extends Receiver:
package com.baizhi.quickstart

import java.net.ConnectException

import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

import scala.util.Random

// type parameter: the type of the stored records; constructor argument: the storage level to use
class CustomReceiver(values: List[String]) extends Receiver[String](StorageLevel.MEMORY_ONLY) with Logging {
  override def onStart(): Unit = {
    // start a thread that calls receive()
    new Thread(new Runnable {
      override def run(): Unit = receive()
    }).start()
  }

  override def onStop(): Unit = {
  }

  // receive data from the external system
  private def receive(): Unit = {
    try {
      while (!isStopped()) {
        Thread.sleep(500)
        val line = values(new Random().nextInt(values.length))
        // push a randomly picked line into Spark
        store(line)
      }
      // if the loop exits, try to restart
      restart("Trying to restart again")
    } catch {
      case t: Throwable =>
        restart("Error trying to restart again", t)
    }
  }
}
package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue

object SparkWordCountTopologyCustomReceiver {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")
    // 2. Create the continuous input DStream
    val arrays = List("this is a demo", "good good ", "study come on")
    // use the class extending Receiver defined above
    val lines: DStream[String] = ssc.receiverStream(new CustomReceiver(arrays))
    // 3. Transform the discretized stream
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    // 4. Print the results to the console
    result.print()
    // 5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
② Spark Streaming with Kafka
Reference: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
<dependency>
<groupId>org.apache.spark</groupId>
<!-- the Scala version is 2.11 -->
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.4.5</version>
</dependency>
package com.baizhi.quickstart

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue

object SparkWordCountTopologyKafka {
  def main(args: Array[String]): Unit = {
    // 1. Create the streaming environment
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("FATAL")
    // consumer configuration for reading the data
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "zly:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      // consumer group id
      "group.id" -> "g1",
      // start reading from the latest offset
      "auto.offset.reset" -> "latest",
      // offsets are managed by Spark rather than auto-committed by Kafka
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("topic01")
    // create the stream
    val kafkaInputs: DStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,                              // location strategy for reading partitions
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams) // subscription strategy
    )
    kafkaInputs.map(record => record.value())
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
      .print()
    // start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
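Because enable.auto.commit is set to false above, Kafka does not commit offsets on its own. If the consumed offsets should be stored back in Kafka after each batch, the spark-streaming-kafka-0-10 API exposes HasOffsetRanges and CanCommitOffsets for exactly this; a rough sketch (the stream parameter stands for the value returned by KafkaUtils.createDirectStream, kept as an InputDStream):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object KafkaOffsetCommit {
  // commit the consumed offsets back to Kafka once each micro-batch has been processed
  def commitAfterEachBatch(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here (for example the word count above) ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}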
Start the Kafka broker, create the topic, and start a console producer to send some test lines:
./bin/kafka-server-start.sh -daemon config/server.properties
./bin/kafka-topics.sh --bootstrap-server zly:9092 --create --topic topic01 --partitions 30 --replication-factor 1
./bin/kafka-console-producer.sh --broker-list zly:9092 --topic topic01
Result:
Source: CSDN
Author: 无敌火车滴滴开
Link: https://blog.csdn.net/qq_36915093/article/details/104593100