Setting Up a Spark Development Environment in Eclipse
1 Installing the Spark Environment
- First download a pre-built Spark release that matches your cluster's Hadoop version and extract it to the desired location; watch the user permissions on the files.
- Change into the extracted directory, which will serve as SPARK_HOME.
- Set SPARK_HOME in /etc/profile or in ~/.bashrc.
- cd $SPARK_HOME/conf && cp spark-env.sh.template spark-env.sh
- vim spark-env.sh and add the following:
export SCALA_HOME=/home/hadoop/cluster/scala-2.10.5
export JAVA_HOME=/home/hadoop/cluster/jdk1.7.0_79
export HADOOP_HOME=/home/hadoop/cluster/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Note: this must be set to the IP address; otherwise Eclipse will fail to connect
# later with the error: All masters are unresponsive! Giving up.
SPARK_MASTER_IP=10.16.112.121
SPARK_LOCAL_DIRS=/home/hadoop/cluster/spark-1.4.0-bin-hadoop2.6
SPARK_DRIVER_MEMORY=1G
2 Starting Spark in Standalone Mode
sbin/start-master.sh
sbin/start-slave.sh spark://10.16.112.121:7077
You can now open http://yourip:8080 in a browser to check the status of the Spark cluster.
The default Spark master URL is then: spark://10.16.112.121:7077
3 Building a Spark Development Environment with the Scala IDE for Eclipse and Maven
- First download the Scala IDE for Eclipse; it is available from the official Scala site.
- Open the IDE, create a new Maven project, and fill in pom.xml as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark.test</groupId>
<artifactId>FirstTrySpark</artifactId>
<version>0.0.1-SNAPSHOT</version>
<properties>
<!-- Fill in the versions matching your cluster -->
<hadoop.version>2.6.0</hadoop.version>
<spark.version>1.4.0</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
<!-- Remember to exclude the servlet dependency, otherwise you will get a conflict -->
<exclusions>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/java</sourceDirectory>
<plugins>
<!-- bind the maven-assembly-plugin to the package phase; this will create
a jar file, without the provided-scope dependencies, suitable for deployment to a cluster. -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>2.10.5</scalaVersion>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.5.5</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
<resources>
<resource>
<directory>src/main/resources</directory>
</resource>
</resources>
</build>
</project>
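One detail worth double-checking is that the _2.10 suffix of the spark-core artifact must match the Scala version the plugin compiles with; a mismatch usually only shows up at runtime as a NoSuchMethodError. A quick sanity check, just a sketch you can drop into any main method, is to print the Scala version that actually ends up on the classpath:
// Prints the Scala runtime version pulled in by Maven; it should report
// "version 2.10.x" to match the spark-core_2.10 artifact declared above.
println(scala.util.Properties.versionString)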
- Create the following Source Folders:
src/main/java #Java source code
src/main/scala #Scala source code
src/main/resources #resource files
src/test/java #Java test code
src/test/scala #Scala test code
src/test/resources #test resource files
At this point the environment is fully set up!
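Before wiring anything up to the cluster, a quick local-mode job is a convenient way to confirm that the Scala plugin and the Spark dependency resolve correctly inside Eclipse. The following is only a minimal sketch; local[2] runs entirely inside the IDE's JVM, so no cluster is needed:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// A minimal local-mode sanity check: runs inside the IDE, no cluster required.
object LocalSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")               // two local threads, no master needed
      .setAppName("local-smoke-test")
    val sc = new SparkContext(conf)
    val sum = sc.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")                 // expect 5050
    sc.stop()
  }
}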
4 Writing Test Code to Verify the Connection
- The test code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
/**
* @author clebeg
*/
object FirstTry {
  def main(args: Array[String]): Unit = {
    // Point the driver at the standalone master started above and give the app a name.
    val conf = new SparkConf()
    conf.setMaster("spark://yourip:7077")
    conf.set("spark.app.name", "first-tryspark")
    val sc = new SparkContext(conf)
    // Read the linkage dataset from HDFS and print its first line.
    val rawblocks = sc.textFile("hdfs://yourip:9000/user/hadoop/linkage")
    println(rawblocks.first)
    sc.stop()
  }
}
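Note that when the driver runs inside Eclipse against the standalone cluster, the executors do not automatically receive the project's compiled classes, so the job may fail with a ClassNotFoundException for FirstTry. A common remedy is to ship the assembly jar built by the maven-assembly-plugin above; the line below is only a sketch, assuming the default target/ output path, and goes right after the conf is created:
// Ship the application jar to the executors; the path assumes the
// jar-with-dependencies assembly produced by running `mvn package`.
conf.setJars(Seq("target/FirstTrySpark-0.0.1-SNAPSHOT-jar-with-dependencies.jar"))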
5 Summary of Common Errors
Most of the issues have already been covered above, so they are not repeated here; a few of the main ones follow:
- Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Analysis: opening the log for the corresponding run ID reveals the following error:
15/10/10 08:49:01 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/10/10 08:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/10 08:49:02 INFO spark.SecurityManager: Changing view acls to: hadoop,Administrator
15/10/10 08:49:02 INFO spark.SecurityManager: Changing modify acls to: hadoop,Administrator
15/10/10 08:49:02 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop, Administrator); users with modify permissions: Set(hadoop, Administrator)
15/10/10 08:49:02 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/10/10 08:49:02 INFO Remoting: Starting remoting
15/10/10 08:49:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.16.112.121:58708]
15/10/10 08:49:02 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 58708.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:159)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
... 4 more
15/10/10 08:51:02 INFO util.Utils: Shutdown hook called
A closer look shows this is a permissions problem: stop Hadoop right away and add the following to etc/hadoop/core-site.xml:
<property>
<name>hadoop.security.authorization</name>
<value>false</value>
</property>
With this setting anyone is allowed access, and the problem is solved immediately.
- java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
- Download the recompiled Hadoop 2.6 build that includes winutils.exe from http://www.barik.net/archive/2015/01/19/172716/. Be sure to download the build that matches your own Hadoop version.
- Extract it to a suitable location and set the HADOOP_HOME environment variable. Be sure to restart Eclipse afterwards (if that is inconvenient, see the sketch below). Done!
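If restarting Eclipse is inconvenient, the same effect can usually be achieved from code, because Hadoop's Shell utility also honors the hadoop.home.dir system property. This is only a sketch and the path is a hypothetical example; point it at the directory whose bin folder contains winutils.exe, before the SparkContext is created:
// Hypothetical path: replace with your own extraction directory
// (the folder that contains bin\winutils.exe). Must run before new SparkContext(...).
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.6.0")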
- Where can the data used in this article be obtained? From http://bit.ly/1Aoywaq; the commands are as follows:
mkdir linkage
cd linkage/
curl -o donation.zip http://bit.ly/1Aoywaq
unzip donation.zip
unzip "block_*.zip"
hdfs dfs -mkdir /user/hadoop/linkage
hdfs dfs -put block_*.csv /user/hadoop/linkage
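With the block_*.csv files in HDFS, a slightly fuller first pass over the data can be written in the same style as FirstTry above. This is only a sketch: the header detection relies on the fact that each CSV's header line contains the column name id_1, and the master URL and HDFS path are the same placeholders used earlier in this post.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Counts the non-header records of the linkage dataset.
object LinkageCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://yourip:7077")
      .setAppName("linkage-count")
    val sc = new SparkContext(conf)
    val rawblocks = sc.textFile("hdfs://yourip:9000/user/hadoop/linkage")
    // Every block_*.csv starts with a header line containing the column name "id_1".
    def isHeader(line: String): Boolean = line.contains("id_1")
    val noheader = rawblocks.filter(line => !isHeader(line))
    println("records without header lines: " + noheader.count())
    sc.stop()
  }
}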
6 Useful Links
- http://www.migle.me/post/Spark%E5%AD%A6%E4%B9%A0-hello%20spark/ a first look at Spark
- http://spark.apache.org/docs/latest/index.html the official Spark documentation
- http://www.cnblogs.com/hseagle/category/569175.html Spark source code analysis
- http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/ Spark cluster deployment on YARN
- http://www.barik.net/archive/2015/01/19/172716/ recompiled Hadoop 2.6 build including winutils.exe
Source: oschina
Link: https://my.oschina.net/u/1244232/blog/515075