Spark Stream Computing - Day 1

Submitted by 一世执手 on 2020-03-01 22:59:38

Spark Stream Computing

Overview

Streaming computation is usually contrasted with batch computation. In the streaming model, the input is continuous and can be regarded as unbounded in time, which means the complete data set is never available before computing; the results are also emitted continuously, so the output is likewise unbounded in time. Streaming computation generally has strict latency requirements: the target computation is defined first, and the logic is then applied to data as it arrives. To improve efficiency, incremental computation is preferred over full recomputation wherever possible. In the batch model, by contrast, the full data set exists first, the computation logic is defined afterwards and applied to the whole data set; the computation is over the full data, and the result is produced once, in full.
Mainstream stream-processing frameworks today include Kafka Streams, Storm (JStorm), Spark Streaming, and Flink (Blink):
① Kafka Streams: a set of stream-computing tool jars based on the Kafka Streams library; it has a low barrier to entry and is simple to integrate.
② Apache Storm: a pure stream-computing engine; a low-latency framework that can process millions of records per second.
③ Spark Streaming: a stream-processing framework built on top of Spark's batch engine. Unlike batch processing, the data being computed is an unbounded stream and the output is produced continuously. Internally, Spark Streaming splits the input into a series of micro RDD batches to achieve stream-like processing, so at the micro level Spark Streaming is still a batch framework.
④ Flink DataStream: a big step forward in latency, usability, and performance, and currently the most popular stream-computing engine.

Quick Start

① Add the dependencies

<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.11</artifactId>
	<version>2.4.5</version>
</dependency> 
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-streaming_2.11</artifactId>
	<version>2.4.5</version>
</dependency>

② Write the Driver

package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("spark://zly:7077")
                    //.setMaster("local[2]")

    //Cut one micro-batch every second; batches can only be defined by time
    val ssc = new StreamingContext(conf, Seconds(1))
    //Set the log level
    ssc.sparkContext.setLogLevel("FATAL")

    //2. Create the continuous input DStream (a SocketReceiver simulates the incoming data)
    //Each receiver occupies one core
    val lines: DStream[String] = ssc.socketTextStream("zly", 9999)
   // val lines1: DStream[String] = ssc.socketTextStream("zly", 9999)

    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()

    //5. Start the streaming computation
    ssc.start()
    //Block until termination; resources are released when the job is killed
    ssc.awaitTermination()
  }
}

③ Package with mvn package

<build>
        <plugins>
            <!--Scala compiler plugin-->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.0.1</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!--Shade plugin for building the fat jar-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!--Java compiler plugin-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
 

④ Install nc (netcat)

# yum -y install nmap-ncat

⑤ Start the nc server

# nc -lk 9999

⑥ Submit the application

 ./bin/spark-submit --master spark://zly:7077 --name SparkWordCountTopology --deploy-mode client --class com.baizhi.quickstart.SparkWordCountTopology --total-executor-cores 6 /root/spark-dstream-1.0-SNAPSHOT.jar

⑦ Check the results
1) Send test data through nc.
2) Once the data is received, the streaming job computes and prints the word counts to the console.

Note: every receiver occupies a dedicated thread (core). Cores are assigned to the Receivers first, and only the remaining cores are used for computation, so pay attention to how many cores you allocate; the sketch below illustrates the point.
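For example, running with local[2], the single socket receiver above already pins one of the two cores, leaving just one for the batch jobs; enabling the commented-out second socketTextStream would therefore call for at least three cores. A minimal sketch under that assumption (the second port 9998 is hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceiverSketch {
  def main(args: Array[String]): Unit = {
    //Two receivers pin two cores, so at least one more core is needed for the batch computation
    val conf = new SparkConf()
      .setAppName("TwoReceiverSketch")
      .setMaster("local[3]")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("FATAL")

    //Each socketTextStream creates its own SocketReceiver (port 9998 is an assumption)
    val s1: DStream[String] = ssc.socketTextStream("zly", 9999)
    val s2: DStream[String] = ssc.socketTextStream("zly", 9998)

    //Merge the two input streams before the word count
    s1.union(s2)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}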

Discretized Streams (DStreams)

Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream produced by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains the data from one specific time interval, as shown in the following figure.
(figure: a DStream as a continuous series of RDDs, one per batch interval)
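Because each micro-batch is just an RDD, the underlying RDDs can be inspected directly with foreachRDD. A minimal sketch, assuming the lines DStream from the quick-start driver above:

//Each batch interval the micro-batch arrives as an ordinary RDD together with its batch time
lines.foreachRDD { (rdd, time) =>
  println(s"batch at $time contains ${rdd.count()} records in ${rdd.getNumPartitions} partitions")
}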

Note: given how DStreams execute underneath, the Seconds() interval configured on the StreamingContext should be slightly larger than the time it takes to compute one micro-batch; otherwise unprocessed batches will pile up in Spark's memory.
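One way to verify this in practice is to register a StreamingListener and compare each batch's processing time with the configured interval. A minimal sketch, assuming a 1-second interval and registration before ssc.start():

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

//Warns whenever a completed batch took longer than the 1s batch interval
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val processingMs = info.processingDelay.getOrElse(0L)
    val schedulingMs = info.schedulingDelay.getOrElse(0L)
    if (processingMs > 1000) {
      println(s"batch ${info.batchTime} is falling behind: processing=${processingMs}ms, scheduling delay=${schedulingMs}ms")
    }
  }
})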

DStreams & Receivers

Every input DStream (except file streams, discussed later) is associated with a Receiver object, which receives data from a source and stores it in Spark's memory for processing. Spark Streaming provides two categories of input sources for receiving data from external systems:

Built-in input sources

Basic sources

Data sources that can be obtained directly through the StreamingContext API, for example fileStream (reading files) and socket streams (for testing).

① socketTextStream

package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    //.setMaster("spark://zly:7077")
                    .setMaster("local[2]")

    //Cut one micro-batch every second; batches can only be defined by time
    val ssc = new StreamingContext(conf, Seconds(1))
    //Set the log level
    ssc.sparkContext.setLogLevel("FATAL")

    //2. Create the continuous input DStream (a SocketReceiver simulates the incoming data)
    //Each receiver occupies one core
    val lines: DStream[String] = ssc.socketTextStream("zly", 9999)

    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()

    //5. Start the streaming computation
    ssc.start()
    //Block until termination; resources are released when the job is killed
    ssc.awaitTermination()
  }
}

② File Streams

textFileStream calls fileStream under the hood.

package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopologyFileStream {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("local[*]")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")

    //2. Create the continuous input DStream
    val lines = ssc.textFileStream("hdfs://zly:9000/demo/words")
    //val lines: DStream[(LongWritable,Text)] =  ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://zly:9000/demo/words")
    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()

    //5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Alternatively, using fileStream directly:

package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopologyFileStream {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("local[*]")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")

    //2. Create the continuous input DStream
    
    val lines: DStream[(LongWritable,Text)] =  ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://zly:9000/demo/words")
    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.map((_._2.toString)).flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()

    //5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Note: once the job is started, the directory hdfs://zly:9000/demo/words is checked by time for new files; whenever a new file appears it is read automatically. Changes to the contents of an existing file are not monitored. Tip: when testing, make sure the client and cluster clocks are synchronized. The monitored files must also satisfy the following conditions (see the sketch after this list):
① The files must all have the same format.
② The files must be created in dataDirectory by atomically moving or renaming them into it.
③ Once moved, a file must not be modified; data appended to it afterwards will not be read.
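A common pattern that satisfies ② and ③ is to write the complete file somewhere else first and then rename it into the monitored directory, since a rename within a single HDFS namespace is atomic. A minimal sketch using the Hadoop FileSystem API (the temporary path /demo/_tmp and the file name are assumptions):

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AtomicFileDrop {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(URI.create("hdfs://zly:9000"), new Configuration())

    //1. Write the complete file outside the monitored directory (/demo/_tmp is hypothetical)
    val tmp = new Path("/demo/_tmp/words-001.txt")
    val out = fs.create(tmp)
    out.write("this is a demo\nhello hello\n".getBytes("UTF-8"))
    out.close()

    //2. Atomically rename it into the directory watched by textFileStream
    fs.rename(tmp, new Path("/demo/words/words-001.txt"))
    fs.close()
  }
}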

③ Queue of RDDs (for testing)

package com.baizhi.quickstart

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue


object SparkWordCountTopologyQueueRDDS {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("local[*]")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")

    val queueRDDS = new Queue[RDD[String]]()
    //Generate test data on a background thread
    new Thread(new Runnable {
      override def run(): Unit = {
        //死循环,不停追加
        while(true){
          //+= enqueues another test RDD
          queueRDDS += ssc.sparkContext.makeRDD(List("this is a demo","hello hello"))
          Thread.sleep(500)
        }
      }
    }).start()
    //2. Create the continuous input DStream
    val lines: DStream[String] =  ssc.queueStream(queueRDDS)

    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()


    //5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Advanced input sources

Advanced sources

Input sources that are not bundled with Spark, such as Kafka, Flume, and Kinesis; these generally require third-party libraries.

① Custom Receiver

Define a class that extends Receiver:

package com.baizhi.quickstart

import java.net.ConnectException

import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

import scala.util.Random

//Type parameter: the type of the stored data; constructor argument: the storage level to use
class CustomReceiver (values:List[String]) extends Receiver[String](StorageLevel.MEMORY_ONLY) with Logging {

  override def onStart(): Unit = {
    //Start a thread that calls receive()
    new Thread(new Runnable {
      override def run(): Unit =  receive()
    } ).start()

  }

  override def onStop(): Unit = {

  }

  //Receive data from the external system
  private def receive(): Unit = {
    try {
      while (!isStopped()){
        Thread.sleep(500)
        val line = values(new Random().nextInt(values.length))
        //Hand the randomly chosen line over to Spark
        store(line)
      }
      //Once the loop exits, try to restart the receiver
      restart("Trying to restart again")
    } catch {
      case t: Throwable =>
        restart("Error trying to restart again", t)
    }
  }
}

package com.baizhi.quickstart

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue


object SparkWordCountTopologyCustomReceiver {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("local[*]")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("FATAL")

    //2. Create the continuous input DStream
    var arrays=List("this is a demo","good good ","study come on")
    //Use the custom Receiver subclass defined above
    val lines: DStream[String] =  ssc.receiverStream(new CustomReceiver(arrays))
    //3. Transform the discretized stream
    val result:DStream[(String,Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)

    //4. Print the results to the console
    result.print()


    //5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

② Integrating Spark with Kafka

Reference: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

<dependency>
	<groupId>org.apache.spark</groupId>
	<!--Scala version 2.11-->
	<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
	<version>2.4.5</version>
</dependency>

package com.baizhi.quickstart

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.Queue


object SparkWordCountTopologyKafka {
  def main(args: Array[String]): Unit = {
    //1. Create the environment for the streaming computation
    val conf = new SparkConf()
                    .setAppName("SparkWordCountTopology")
                    .setMaster("local[*]")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("FATAL")

    //Kafka consumer configuration
    val kafkaParams = Map[String, Object](
              ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "zly:9092",
              ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG-> classOf[StringDeserializer],
              ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
              //Consumer group id
              "group.id" -> "g1",
              //With no committed offset, start reading from the latest offset
              "auto.offset.reset" -> "latest",
              //Disable Kafka auto-commit; offsets are managed by the Spark application
              "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("topic01")
    //Create the direct stream
    val kafkaInputs: DStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, //Location strategy: distribute partitions evenly across executors
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams) //Subscription strategy
    )

    kafkaInputs.map(record=>record.value())
               .flatMap(line=>line.split("\\s+"))
               .map(word=>(word,1))
               .reduceByKey((v1,v2)=>v1+v2)
               .print()

    //5. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
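Because enable.auto.commit is set to false, the processed offsets are not saved anywhere by default. The 0-10 integration lets the application commit them back to Kafka after a batch has been handled, through HasOffsetRanges and CanCommitOffsets. A minimal sketch that could replace the chained map/print above (kafkaInputs must be the stream returned by createDirectStream, before any transformation):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaInputs.foreachRDD { rdd =>
  //Kafka offset ranges backing this micro-batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  //Per-batch processing: the same word count as above
  rdd.map(_.value())
     .flatMap(_.split("\\s+"))
     .map((_, 1))
     .reduceByKey(_ + _)
     .collect()
     .foreach(println)

  //Commit the offsets back to Kafka asynchronously once the batch has been processed
  kafkaInputs.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}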

./bin/kafka-server-start.sh -daemon config/server.properties 

./bin/kafka-topics.sh --bootstrap-server zly:9092 --create --topic topic01 --partitions 30 --replication-factor 1

./bin/kafka-console-producer.sh  --broker-list zly:9092 --topic topic01

Result: the messages typed into the Kafka console producer are consumed by the streaming job, which prints the corresponding word counts to the console.
