Unable to run a basic GraphFrames example


Trying to run a simple GraphFrame example using pyspark.

Spark version: 2.0

GraphFrames version: 0.2.0

I am able to import graphframes in Jupyter.

5 Answers
  • 2021-02-07 16:45

    The simplest way to use Jupyter with pyspark and graphframes is to launch Jupyter from pyspark with the respective packages.

    Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:

    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS=notebook
    pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
    
    

    The advantage of this approach is that if you later want to run your code via spark-submit, you can use the same start command.
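
    For illustration, here is a minimal sketch of such a script; the file name graph_example.py and the toy data are mine, not from the answer:

        # graph_example.py -- a minimal GraphFrames smoke test (illustrative)
        from pyspark.sql import SparkSession
        from graphframes import GraphFrame

        spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

        # Tiny vertex and edge DataFrames, just enough to prove the package loads
        v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
        e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

        g = GraphFrame(v, e)
        g.inDegrees.show()

        spark.stop()

    You would then submit it with the same packages flag, e.g. spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 graph_example.py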

  • 2021-02-07 16:46

    I was able to make it work.

    All you have to do is download the graphframes jar corresponding to your Spark version from https://spark-packages.org/package/graphframes/graphframes.

    Then copy the downloaded jar to your Spark jars directory:

        root@93d8398b53f2:/usr/local/spark/jars# wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
    

    Here's the little trick: launch pyspark with these arguments the first time, so that it downloads all of graphframes' jar dependencies:

        root@93d8398b53f2:/usr/local/spark/bin# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
    

    This should come up:

    Ivy Default Cache set to: /root/.ivy2/cache
    The jars for the packages stored in: /root/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    graphframes#graphframes added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
        found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
        found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
        found org.scala-lang#scala-reflect;2.11.0 in central
        found org.slf4j#slf4j-api;1.7.7 in central
    downloading http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar ...
        [SUCCESSFUL ] graphframes#graphframes;0.3.0-spark2.0-s_2.11!graphframes.jar (269ms)
    downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar ...
        [SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2!scala-logging-api_2.11.jar (53ms)
    downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar ...
        [SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2!scala-logging-slf4j_2.11.jar (66ms)
    downloading https://repo1.maven.org/maven2/org/scala-lang/scala-reflect/2.11.0/scala-reflect-2.11.0.jar ...
        [SUCCESSFUL ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1409ms)
    downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.7/slf4j-api-1.7.7.jar ...
        [SUCCESSFUL ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (53ms)
    :: resolution report :: resolve 6161ms :: artifacts dl 1877ms
        :: modules in use:
        com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
        com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
        graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
        org.scala-lang#scala-reflect;2.11.0 from central in [default]
        org.slf4j#slf4j-api;1.7.7 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   5   |   5   |   0   ||   5   |   5   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        5 artifacts copied, 0 already retrieved (4713kB/39ms)
    Warning: Local jar /usr/local/spark-2.0.0-bin-hadoop2.7/bin/graphframes-0.3.0-spark2.0-s_2.11.jar does not exist, skipping.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/11/17 15:43:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/11/17 15:43:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
          /_/
    
    Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
    SparkSession available as 'spark'.
    >>> 
    

    This means it has downloaded all the required dependencies. The important thing here is the Ivy output: the cache is at /root/.ivy2/cache, and more precisely the jars are stored in /root/.ivy2/jars.

    You can exit right after. If you insist on proceeding with the Python code that calls GraphFrame at this point, it will raise this error:

        Py4JJavaError: An error occurred while calling o561.newInstance.
        : java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.GraphFrame. 
    

    Let's see what's inside the directory /root/.ivy2/jars:

    root@93d8398b53f2:/usr/local/spark/bin# ls /root/.ivy2/jars/
    com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar  com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar  graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar  org.scala-lang_scala-reflect-2.11.0.jar  org.slf4j_slf4j-api-1.7.7.jar
    

    Now you'll want to copy all the jars in /root/.ivy2/jars to Spark's jars directory:

        root@93d8398b53f2:/usr/local/spark/jars# cp /root/.ivy2/jars/* .
    

    Launch pyspark for the second time:

        root@93d8398b53f2:/usr/local/spark/jars# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
    

    This should come up:

    Ivy Default Cache set to: /root/.ivy2/cache
    The jars for the packages stored in: /root/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    graphframes#graphframes added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
        found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
        found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
        found org.scala-lang#scala-reflect;2.11.0 in central
        found org.slf4j#slf4j-api;1.7.7 in central
    :: resolution report :: resolve 748ms :: artifacts dl 27ms
        :: modules in use:
        com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
        com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
        graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
        org.scala-lang#scala-reflect;2.11.0 from central in [default]
        org.slf4j#slf4j-api;1.7.7 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 5 already retrieved (0kB/24ms)
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/11/17 15:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/11/17 15:53:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
          /_/
    
    Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
    SparkSession available as 'spark'.
    >>> 
    

    You can now enjoy GraphFrame:

    >>> # Create a Vertex DataFrame with a unique "id" column
    ... v = sqlContext.createDataFrame([
    ...   ("a", "Alice", 34),
    ...   ("b", "Bob", 36),
    ...   ("c", "Charlie", 30),
    ... ], ["id", "name", "age"])
    >>> # Create an Edge DataFrame with "src" and "dst" columns
    ... e = sqlContext.createDataFrame([
    ...   ("a", "b", "friend"),
    ...   ("b", "c", "follow"),
    ...   ("c", "b", "follow"),
    ... ], ["src", "dst", "relationship"])
    >>> # Create a GraphFrame
    ... from graphframes import *
    >>> g = GraphFrame(v, e)
    >>> 
    >>> # Query: Get in-degree of each vertex.
    ... g.inDegrees.show()
    +---+--------+                                                                  
    | id|inDegree|
    +---+--------+
    |  c|       1|
    |  b|       2|
    +---+--------+
    >>> 
    >>> # Query: Count the number of "follow" connections in the graph.
    ... g.edges.filter("relationship = 'follow'").count()
    2       
    >>> # Run PageRank to produce 'results' (parameters per the GraphFrames quick-start)
    ... results = g.pageRank(resetProbability=0.01, maxIter=20)
    >>> results.vertices.select("id", "pagerank").show()
    16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9059:
    [rdd_337_0]
    16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9060:
    [rdd_337_1]
    +---+-------------------+
    | id|           pagerank|
    +---+-------------------+
    |  a|               0.01|
    |  b| 0.2808611427228327|
    |  c|0.27995525261339177|
    +---+-------------------+
    
  • 2021-02-07 17:05

    For PyCharm, go to your run configuration and add the environment variable:

    Name: PYSPARK_SUBMIT_ARGS

    Value: --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell

    I've found it doesn't work for me without the pyspark-shell at the end.
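
    If you'd rather not touch the run configuration, the same effect can be had in code. A sketch (mine, not from the answer); the variable must be set before the first SparkSession is created, because the JVM reads PYSPARK_SUBMIT_ARGS only once at startup:

        import os

        # Equivalent to the PyCharm environment variable above; must run
        # before pyspark launches the JVM
        os.environ["PYSPARK_SUBMIT_ARGS"] = (
            "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
        )

        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()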

  • 2021-02-07 17:07

    Follow-up on @Gilles Essoki's solution. Make sure you have the right Spark version and Scala version for your environment.

    graphframes:(latest version)-spark(your spark version)-s_(your scala version)

    I did not have to specify the jar file or copy it to Spark's default jar directory once I had the right versions. Note: you can check both versions by running the spark-shell command:

    %spark-shell
    ...
    ...
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
          /_/
    Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
    

    Get the correct version for this setup from Spark Packages.

    For my environment I had to use the following command:

    %pyspark --packages graphframes:graphframes:0.3.0-spark1.6-s_2.10
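
    If you are unsure which two versions to plug into the coordinate, here is a quick sketch (mine, not from the answer) that reads both from a running pyspark shell, where sc is the predefined SparkContext:

        # Spark version
        print(sc.version)  # e.g. '1.6.0'

        # Scala version, via the py4j gateway into scala.util.Properties
        print(sc._jvm.scala.util.Properties.versionNumberString())  # e.g. '2.10.5'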

  • 2021-02-07 17:09

    Make sure that your PYSPARK_SUBMIT_ARGS is updated to have "--packages graphframes:graphframes:0.2.0-spark2.0" in your kernel.json (~/.ipython/kernels/<kernel_name>/kernel.json).

    You probably already looked at the following link. It has more details on Jupyter setup. Basically, pyspark has to be supplied the graphframes.jar.
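
    As a sketch of how that kernel.json edit might be scripted (the kernel directory name "pyspark" is a placeholder; adjust it to your kernel, and note the pyspark-shell suffix mentioned in the answer above):

        import json, os

        # Add the --packages flag to an existing Jupyter kernel spec (illustrative)
        path = os.path.expanduser("~/.ipython/kernels/pyspark/kernel.json")
        with open(path) as f:
            spec = json.load(f)

        # Kernel specs support an "env" mapping of extra environment variables
        spec.setdefault("env", {})["PYSPARK_SUBMIT_ARGS"] = (
            "--packages graphframes:graphframes:0.2.0-spark2.0 pyspark-shell"
        )

        with open(path, "w") as f:
            json.dump(spec, f, indent=2)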
