I. Spark Overview
Spark framework links:
- Official site: http://spark.apache.org/
- Source code: https://github.com/apache/spark
- Company behind Spark: https://databricks.com/
- Official blogs: https://databricks.com/blog/ and https://databricks.com/blog/category/engineering/spark
1. Official Definition
http://spark.apache.org/docs/2.2.0/
Spark is a framework for large-scale data analysis, similar in role to the MapReduce framework.
2. Types of Big Data Analysis
- Batch (offline) processing: the data being analyzed is static, e.g. the MapReduce and Hive frameworks
- Interactive analysis: ad-hoc queries, e.g. Impala
- Real-time analysis: streaming data is processed and the results presented as they arrive
3. Introduction to the Spark Framework
Sorting 100 TB of data on disk, Spark is considerably faster and more efficient than Hadoop.
Why is the Spark framework so fast?
Data structures
RDD (Resilient Distributed Dataset): Spark wraps the data to be processed in an RDD and processes it by calling the RDD's functions.
RDD data can be kept in memory; when memory is insufficient, it can spill to disk.
Task execution model
When a MapReduce application runs, every MapTask and ReduceTask is a separate JVM process, and starting a JVM process is slow.
In Spark, each Task runs as a thread inside a long-lived process; threads are cheap to create and destroy, so execution is more efficient.
4. Spark Framework Features
- Fast: compared with Hadoop MapReduce, Spark's in-memory computation is more than 100x faster, and even its disk-based computation is more than 10x faster.
- Easy to use: Spark provides APIs for Java, Python, R, and Scala, plus more than 80 high-level operators.
- General: Spark covers batch processing, interactive queries, real-time stream processing, machine learning, and graph computation.
- Compatible: Spark can use Hadoop's YARN as its resource manager and scheduler.
II. Framework Modules
A: Spark Core: the core of the framework, centered on the RDD; used for offline (batch) analysis of massive data, similar to the MapReduce framework
B: Spark SQL: the most widely used module, similar to the Hive framework; provides SQL for analyzing data, and far more than SQL: it also offers a DSL
C: Spark Streaming: module for stream processing applications
D: Structured Streaming: a new stream processing framework introduced in Spark 2.x
E: Spark MLlib: machine learning library
F: Spark GraphX: graph computation
G: PySpark: module for developing in Python
H: SparkR: module for developing in R
III. Spark Run Modes
1. Local Mode
Mainly used for development and testing.
2. Cluster Mode
Spark Standalone: the cluster manager that ships with Spark.
Hadoop YARN: in production, MapReduce, Flink, and Spark applications are commonly run on YARN.
IV. Quick Start
1. Local Mode
Essence: a single JVM process is started and Tasks execute inside it as threads.
--master local | local[*] | local[K], where K is a positive integer (K >= 2 is recommended)
Start the shell with the following command:
```shell
bin/spark-shell --master local[2]
```
2. Word Count
```scala
// Prepare the data: put the file on HDFS first (run in a shell):
//   /export/servers/hadoop/bin/hdfs dfs -put wordcount.input /datas

// Read the HDFS text file into an RDD; each line of the file
// becomes one element of the collection
val inputRDD = sc.textFile("/datas/wordcount.input")

// Split each line on whitespace using a regex
// (regex reference: https://www.runoob.com/regexp/regexp-syntax.html)
val wordsRDD = inputRDD.flatMap(line => line.split("\\s+")) // or: inputRDD.flatMap(_.split("\\s+"))

// Map each word to a tuple (word, 1), meaning the word occurred once
val tuplesRDD = wordsRDD.map(word => (word, 1)) // or: wordsRDD.map((_, 1))

// Group by key and aggregate the values
// In Scala, a two-element tuple plays the role of a Java Key/Value pair
// reduceByKey: group first, then aggregate
// val wordcountsRDD = tuplesRDD.reduceByKey((a, b) => a + b)
val wordcountsRDD = tuplesRDD.reduceByKey((tmp, item) => tmp + item)

// Inspect the result
wordcountsRDD.take(5)

// Save the result to HDFS
wordcountsRDD.saveAsTextFile("/datas/spark-wc")

// Check the output (run in a shell):
//   /export/servers/hadoop/bin/hdfs dfs -text /datas/spark-wc/par*
```
3. Aggregation Functions on Lists
```scala
// Higher-order function: a function whose parameter is itself a function.
// Signature of reduce on a List[Int]:
//   def reduce[A1 >: Int](op: (A1, A1) => A1): A1
// op takes two parameters of the same type and returns that type:
//   parameter one: the intermediate accumulator of the aggregation
//   parameter two: the current element of the collection

scala> val list = (1 to 10).toList
list: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

// Conceptually:
//   var tmp = 0
//   tmp = tmp + item
//   return tmp
// tmp is the intermediate accumulator of the aggregation.
// Related exercise: average = total sum / element count

list.reduce((tmp, item) => {
  println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
  tmp + item
})

list.reduceLeft((tmp, item) => {
  println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
  tmp + item
})

list.reduceRight((item, tmp) => {
  println(s"tmp = $tmp, item = $item, sum = ${tmp + item}")
  tmp + item
})
```
Note the difference in how reduceRight is used: it folds from the right, and the accumulator is the second parameter.
4. Running the Pi Example in Local Mode
```shell
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

# The argument 10 means: run 10 slices, each sampling 100000 points
V. Spark Runtime Components
1. MapReduce Components
A MapReduce program runs as a single Job. It has one AppMaster, the manager of the application, responsible for executing all of its Tasks; every MapTask and ReduceTask runs as its own process.
2. Spark Application Components
Each Spark Application can contain multiple Jobs, and when an Application runs on a cluster it likewise consists of two parts.
The first part is the Driver Program, the counterpart of the AppMaster: it manages the whole application and schedules the execution of all of its Jobs. It is a JVM process that runs the program's main function and must create the SparkContext object.
The second part is the Executors. Each Executor is a JVM process that works like a thread pool: it runs many threads, and each thread runs one Task. Since each Task needs one CPU core, the number of Task threads an Executor runs concurrently equals its number of CPU cores.
VI. Developing a Spark Application
A Spark application has three main parts: read the data, process the data, and write out the results.
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word count (WordCount) implemented with Spark Core in Scala:
 * read data from HDFS, count the words, and save the result back to HDFS.
 */
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf object with the application settings,
    // such as the application name and run mode
    val sparkConf: SparkConf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("SparkWordCount")
    // TODO: build the SparkContext, which reads data and schedules Jobs
    val sc: SparkContext = new SparkContext(sparkConf)
    // Set the log level; valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
    sc.setLogLevel("WARN")

    // Step 1: read the data
    // The data is wrapped in an RDD, which can be thought of as a List
    val inputRDD: RDD[String] = sc.textFile("/datas/wordcount.input")

    // Step 2: process the data by calling RDD functions,
    // much like calling functions on a List
    // a. split each line into words
    val wordsRDD = inputRDD.flatMap(line => line.split("\\s+"))
    // b. map each word to a tuple meaning "this word occurred once"
    val tuplesRDD: RDD[(String, Int)] = wordsRDD.map(word => (word, 1))
    // c. group by key and aggregate
    val wordCountsRDD: RDD[(String, Int)] = tuplesRDD.reduceByKey((tmp, item) => tmp + item)

    // Step 3: write out the results, e.g. to a storage system such as HDFS
    wordCountsRDD.saveAsTextFile(s"/datas/swc-output-${System.currentTimeMillis()}")
    wordCountsRDD.foreach(println)

    // For testing only: sleep so the WEB UI can be inspected
    Thread.sleep(10000000)

    // TODO: the application has finished; release the resources
    sc.stop()
  }
}
```
VII. Submitting a Spark Application
1. Spark Submit
http://spark.apache.org/docs/2.2.0/submitting-applications.html
```shell
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

Some of the commonly used options are:
- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
- --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap "key=value" in quotes (as shown).
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any
Use the --help option to see all parameters:
```shell
# bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files in
                              executors can be accessed via SparkFiles.get(fileName).
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
```
Submitting an application:
Usage: spark-submit [options] <app jar | python file> [app arguments]
1) options: optional settings that configure how the application runs, e.g. locally or on a cluster; this is the important part
2) <app jar | python file>: for Java or Scala programs, the compiled jar package; for Python, the script file
3) [app arguments]: arguments passed to the application itself; optional
2. Submitting the Word Count Program
Submit in local mode:
```shell
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master local[2] \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs
```
Submit to the Standalone cluster:
```shell
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master spark://bigdata-cdh02.itcast.cn:7077,bigdata-cdh03.itcast.cn:7077 \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 1 \
--total-executor-cores 2 \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
${SPARK_HOME}/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs
```
3. Spark on YARN
Documentation: http://spark.apache.org/docs/2.2.0/running-on-yarn.html
Submitting a Spark Application to YARN means talking to the ResourceManager. Commands:
```shell
# Run the SparkPi example on YARN
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.0.jar \
10

# Run the word count application on YARN in client deploy mode
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--num-executors 2 \
--queue default \
hdfs://bigdata-cdh01.itcast.cn:8020/spark/apps/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs
```
As the submission shows, when Spark runs on YARN the jars and configuration files it depends on must be uploaded before they can be used, which takes a long time. By uploading the application jar to HDFS once and pointing each submission at it, we avoid re-uploading the jar on every run and wasting all that time.
4. The Spark Application Jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
According to the official documentation, the jar's location must be globally visible: either an HDFS path, or the same local path on every node of the cluster. Uploading the application jar to an HDFS directory is recommended. The modified submit command:
```shell
# Submit the WordCount application to the Standalone cluster
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0
${SPARK_HOME}/bin/spark-submit \
--master spark://bigdata-cdh02.itcast.cn:7077 \
--deploy-mode client \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--total-executor-cores 2 \
hdfs://bigdata-cdh01.itcast.cn:8020/spark/apps/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs
```
5. Deploy Mode
The deploy mode of a Spark Application determines where the Driver Program runs: either on the client that submits the application (client), or on a worker node of the cluster (Standalone: Worker; YARN: NodeManager) (cluster).
In client mode the Driver runs locally and the Executors run on the cluster's Workers:
In cluster mode both the Driver and the Executors run on the Workers:
```shell
# Submit the WordCount application to the Standalone cluster in cluster deploy mode
SPARK_HOME=/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0
${SPARK_HOME}/bin/spark-submit \
--master spark://bigdata-cdh02.itcast.cn:6066 \
--deploy-mode cluster \
--class cn.itcast.bigdata.spark.submit.SparkSubmit \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
--total-executor-cores 2 \
hdfs://bigdata-cdh01.itcast.cn:8020/spark/apps/day01-core_2.11-1.0-SNAPSHOT.jar \
/datas/wordcount.input /datas/swcs
```
Appendix: Maven pom.xml
```xml
<!-- Repository locations, in order: aliyun, cloudera, and jboss -->
<repositories>
    <repository>
        <id>aliyun</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>jboss</id>
        <url>http://repository.jboss.com/nexus/content/groups/public</url>
    </repository>
</repositories>

<properties>
    <scala.version>2.11.8</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
</properties>

<dependencies>
    <!-- Scala language dependency -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- Spark Core dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Hadoop Client dependency -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```