Trying to run a simple GraphFrame example using pyspark.
Spark version: 2.0
GraphFrames version: 0.2.0
I am able to import graphframes in Jupyter.
The simplest way to run Jupyter with pyspark and graphframes is to start Jupyter from pyspark with the respective packages. Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this is also that if you later want to run your code via spark-submit, you can use the same start command.
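For instance, a short script like this runs unchanged inside the notebook or through spark-submit with the same --packages flag (a minimal sketch; the file name and the toy data are my own, not from the original example):

# graphframes_demo.py
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

# toy vertex and edge DataFrames
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

# build the graph and run a simple query
g = GraphFrame(v, e)
g.inDegrees.show()

spark.stop()

Then: spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 graphframes_demo.py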
All you have to do is download the GraphFrames jar corresponding to your Spark version from https://spark-packages.org/package/graphframes/graphframes, then copy the downloaded jar to your Spark jars directory:
root@93d8398b53f2:/usr/local/spark/jars# wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
Here is the little trick: launch pyspark with these arguments the first time, so that it downloads all of the graphframes jar dependencies:
root@93d8398b53f2:/usr/local/spark/bin# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
This should come up:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
downloading http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar ...
[SUCCESSFUL ] graphframes#graphframes;0.3.0-spark2.0-s_2.11!graphframes.jar (269ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2!scala-logging-api_2.11.jar (53ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2!scala-logging-slf4j_2.11.jar (66ms)
downloading https://repo1.maven.org/maven2/org/scala-lang/scala-reflect/2.11.0/scala-reflect-2.11.0.jar ...
[SUCCESSFUL ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1409ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.7/slf4j-api-1.7.7.jar ...
[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (53ms)
:: resolution report :: resolve 6161ms :: artifacts dl 1877ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
5 artifacts copied, 0 already retrieved (4713kB/39ms)
Warning: Local jar /usr/local/spark-2.0.0-bin-hadoop2.7/bin/graphframes-0.3.0-spark2.0-s_2.11.jar does not exist, skipping.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:43:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:43:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
This means it has downloaded all the required dependencies. The important thing right here is "Ivy Default Cache set to: /root/.ivy2/cache", and more precisely the jars stored in /root/.ivy2/jars.
You can exit right after. If you insist on proceeding with the Python code that calls GraphFrame at this point, it will raise the error:
Py4JJavaError: An error occurred while calling o561.newInstance.
: java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.GraphFrame.
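You can reproduce the check without running any graph code by asking the JVM to load the class directly (a sanity check of my own, assuming the SparkSession is available as spark):

# raises the same error until the graphframes jars are on the JVM classpath
spark._jvm.java.lang.Class.forName("org.graphframes.GraphFrame")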
Let's see what's inside the directory /root/.ivy2/jars:
root@93d8398b53f2:/usr/local/spark/bin# ls /root/.ivy2/jars/
com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar org.scala-lang_scala-reflect-2.11.0.jar org.slf4j_slf4j-api-1.7.7.jar
Now copy all the jars appearing in /root/.ivy2/jars to your Spark jars directory:
root@93d8398b53f2:/usr/local/spark/jars# cp /root/.ivy2/jars/* .
Launch pyspark for the second time:
root@93d8398b53f2:/usr/local/spark/jars# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
This should come up:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
:: resolution report :: resolve 748ms :: artifacts dl 27ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 0 | 0 | 0 || 5 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/24ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:53:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
You can now enjoy GraphFrame:
>>> # Create a Vertex DataFrame with a unique ID column "id"
... v = sqlContext.createDataFrame([
...   ("a", "Alice", 34),
...   ("b", "Bob", 36),
...   ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> # Create an Edge DataFrame with "src" and "dst" columns
... e = sqlContext.createDataFrame([
...   ("a", "b", "friend"),
...   ("b", "c", "follow"),
...   ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> # Create a GraphFrame
... from graphframes import *
>>> g = GraphFrame(v, e)
>>>
>>> # Query: Get in-degree of each vertex.
... g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
>>>
>>> # Query: Count the number of "follow" connections in the graph.
... g.edges.filter("relationship = 'follow'").count()
2
>>> # Run the PageRank algorithm, and show results.
... results = g.pageRank(resetProbability=0.01, maxIter=20)
>>> results.vertices.select("id", "pagerank").show()
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9059:
[rdd_337_0]
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9060:
[rdd_337_1]
+---+-------------------+
| id| pagerank|
+---+-------------------+
| a| 0.01|
| b| 0.2808611427228327|
| c|0.27995525261339177|
+---+-------------------+
For PyCharm, go to the run configurations and add the environment variable:
Name: PYSPARK_SUBMIT_ARGS
Value: --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell
I've found it doesn't work for me without the pyspark-shell at the end.
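If you'd rather not touch the run configuration at all, the same variable can be set from code before pyspark is first imported (a sketch; adjust the version string to your Spark/Scala setup):

import os

# must be set before the first pyspark import
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

import pyspark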
Follow-up on @Gilles Essoki's solution. Make sure you have the right Spark version and Scala version for your environment:
graphframes:(latest version)-spark(your spark version)-s_(your scala version)
I did not have to specify the jar file or copy it to the default Spark jars directory once I had the right versions. Note: you can check the versions by running the 'spark-shell' command:
%spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Get the correct version for this setup from Spark Packages. For my environment, I had to use the following command:
%pyspark --packages graphframes:graphframes:0.3.0-spark1.6-s_2.10
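If you prefer checking from pyspark rather than spark-shell, something like this works too (my own snippet, assuming a running SparkContext named sc):

print(sc.version)                                     # e.g. 1.6.0
print(sc._jvm.scala.util.Properties.versionString())  # e.g. version 2.10.5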
Make sure that PYSPARK_SUBMIT_ARGS is updated to include "--packages graphframes:graphframes:0.2.0-spark2.0" in your kernel.json, ~/.ipython/kernels//kernel.json.
You probably already looked at the following link. It has more details on the Jupyter setup. Basically, pyspark has to be supplied the graphframes.jar.
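Alternatively, you can hand the jar to the session yourself when building it (my own sketch; the path is an example, and adding the jar to sys.path is what makes the Python import work, since the graphframes jar also carries the Python package):

import sys
from pyspark.sql import SparkSession

# the jar doubles as a zipimport source for the graphframes Python package
sys.path.insert(0, "/path/to/graphframes-0.2.0-spark2.0-s_2.11.jar")

spark = (SparkSession.builder
         .appName("graphframes")
         .config("spark.jars", "/path/to/graphframes-0.2.0-spark2.0-s_2.11.jar")
         .getOrCreate())

from graphframes import GraphFrame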