How to set up IntelliJ 14 Scala Worksheet to run Spark


Question


I'm trying to create a SparkContext in an IntelliJ 14 Scala Worksheet.

Here are my dependencies:

name := "LearnSpark"
version := "1.0"
scalaVersion := "2.11.7"
// for working with Spark API
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

Here is the code I run in the worksheet:

import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().setMaster("local").setAppName("spark-play")
val sc = new SparkContext(conf)

The error:

15/08/24 14:01:59 ERROR SparkContext: Error initializing SparkContext.
java.lang.ClassNotFoundException: rg.apache.spark.rpc.akka.AkkaRpcEnvFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)

When I run Spark as a standalone app it works fine. For example:

import org.apache.spark.{SparkContext, SparkConf}

// stops verbose logs
import org.apache.log4j.{Level, Logger}

object TestMain {

  Logger.getLogger("org").setLevel(Level.OFF)

  def main(args: Array[String]): Unit = {

    //Create SparkContext
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("mySparkApp")
      .set("spark.executor.memory", "1g")
      .set("spark.rdd.compress", "true")
      .set("spark.storage.memoryFraction", "1")

    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 10000000).collect().filter(_ < 1000)
    data.foreach(println)
  }
}

Can someone provide some guidance on where I should look to resolve this exception?

Thanks.


Answer 1:


Since there is still some doubt about whether it is possible at all to run an IntelliJ IDEA Scala Worksheet with Spark, and this question is the most direct one, I wanted to share my screenshot and a cookbook-style recipe for getting Spark code evaluated in the Worksheet.

I am using Spark 2.1.0 with Scala Worksheet in IntelliJ IDEA (CE 2016.3.4).

The first step is to have a build.sbt file for importing the dependencies into IntelliJ. I have used the same simple.sbt as in the Spark Quick Start:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"

The second step is to uncheck the 'Run worksheet in the compiler process' checkbox in Settings -> Languages and Frameworks -> Scala -> Worksheet. I have also tested the other Worksheet settings, and they had no effect on the warning about duplicate Spark context creation.

Here is the version of the code from the SimpleApp.scala example in the same guide, modified to work in the Worksheet. The master and appName parameters have to be set in the Worksheet itself:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("Simple Application")

val sc = new SparkContext(conf)

val logFile = "/opt/spark-latest/README.md"
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()

println(s"Lines with a: $numAs, Lines with b: $numBs")

Here is a screenshot of the functioning Scala Worksheet with Spark:

UPDATE for IntelliJ CE 2017.1 (Worksheet in REPL mode)

In 2017.1, IntelliJ introduced a REPL mode for the Worksheet. I have tested the same code with the 'Use REPL' option checked. For this mode to run, you need to leave the 'Run worksheet in the compiler process' checkbox in the Worksheet settings described above checked (it is checked by default).

The code runs fine in Worksheet REPL mode.

Here is the screenshot:




Answer 2:


I use IntelliJ CE 2016.3 and Spark 2.0.2 and run the Scala worksheet in Eclipse compatibility mode. So far most things work; only minor problems remain.

Open Preferences -> type "scala" -> under Languages & Frameworks, choose Scala -> choose Worksheet -> select only 'eclipse compatibility mode', or select nothing.

Previously, when selecting "Run worksheet in the compiler process", I experienced a lot of problems, not just with Spark but also with Elasticsearch. I guess that with "Run worksheet in the compiler process" selected, IntelliJ does some tricky optimization, perhaps adding lazy to variables and so on, which in some situations makes the worksheet behave rather weirdly.

I also find that when a class defined in the worksheet does not work or behaves abnormally, putting it in a separate file, compiling it, and then using it from the worksheet solves a lot of problems, as in the sketch below.
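A minimal sketch of that workflow, with hypothetical names not taken from the answer: the case class lives in a normal compiled source file, and the worksheet only uses it.

// Models.scala -- a regular source file in the project, compiled as usual
case class LogLine(level: String, message: String)

// In the worksheet: use the already-compiled class (assumes sc was set up as shown earlier)
val lines = sc.parallelize(Seq(LogLine("INFO", "started"), LogLine("ERROR", "boom")))
val errorCount = lines.filter(_.level == "ERROR").count()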




Answer 3:


According to the Spark 1.4.0 site you should be using Scala 2.10.x:

Spark runs on Java 6+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.4.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
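For reference, a minimal build.sbt along those lines, a sketch assuming you stay on Spark 1.4.0 (the project name and exact 2.10.x patch version are arbitrary):

name := "LearnSpark"

version := "1.0"

// Spark 1.4.0 is documented against Scala 2.10, so the build uses a 2.10.x compiler
scalaVersion := "2.10.5"

// %% appends the Scala binary version, resolving to spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"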

EDITED:

When you click "Create New Project" in IntelliJ, select an SBT project, and click "Next", a menu appears where you can choose the Scala version:

EDITED 2:

You can also use this spark-core package for Scala 2.11.x (note the single %, since the Scala suffix is already spelled out in the artifact name):

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.4.0"



Answer 4:


I was facing the same problem and wasn't able to resolve it, though I tried several times. Instead of the worksheet, I am using the Scala console for now, which is at least better than nothing.




Answer 5:


I too came across a similar issue with IntelliJ, where the libraries were not resolved by SBT after adding the libraryDependencies in build.sbt. IDEA does not download the dependencies by default. Restarting IntelliJ solved the problem and triggered the dependency downloads.

So,

Make sure the dependencies are downloaded in your local project; if not, restart the IDE or trigger it to download the required dependencies.

Make sure the repositories can be resolved; if not, add the repository location with resolvers += (a sketch follows below).
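A minimal build.sbt sketch for adding a resolver (the repository name and URL are placeholders, not taken from the original answer):

// Hypothetical extra Maven repository used for dependency resolution
resolvers += "Example Repo" at "https://repo.example.com/maven2"

// sbt also ships helpers for common repositories, e.g. Sonatype releases
resolvers += Resolver.sonatypeRepo("releases")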




Answer 6:


Below is my Maven dependency configuration; it has always worked and is stable. I usually write a Spark program and submit it to yarn-cluster to run on the cluster.

The key jar is ${spark.home}/lib/spark-assembly-1.5.2-hadoop2.6.0.jar; it contains almost all Spark dependencies and is included with every Spark release. (Actually, spark-submit will distribute this jar to the cluster, so you don't need to worry about ClassNotFoundException anymore :D)

I think you can replace your libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" with a similar configuration to the one below (Maven uses systemPath to point to a local jar dependency; I think SBT has a similar mechanism, see the sketch after the Maven snippet).

Note: the logging jar exclusions are optional; I added them because they conflict with my other jars.

<!-- Apache Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-assembly</artifactId>
    <version>1.5.2</version>
    <scope>system</scope>
    <systemPath>${spark.home}/lib/spark-assembly-1.5.2-hadoop2.6.0.jar</systemPath>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.2</version>
</dependency>
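For what it's worth, here is a rough sbt counterpart of the Maven systemPath trick, a sketch only (the SPARK_HOME-based layout and the fallback path are assumptions, not from the original answer):

// build.sbt (sbt 0.13-era syntax): add the local spark-assembly jar as an unmanaged
// dependency instead of resolving spark-core from a repository.
// Paths are assumptions; adjust them to your Spark installation.
unmanagedJars in Compile ++= {
  val sparkLib = file(sys.env.getOrElse("SPARK_HOME", "/opt/spark")) / "lib"
  (sparkLib ** "spark-assembly*.jar").classpath
}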


Source: https://stackoverflow.com/questions/32189206/how-to-setup-intellij-14-scala-worksheet-to-run-spark
