Unable to run a basic GraphFrames example

忘掉有多难 2021-02-07 16:28

Trying to run a simple GraphFrame example using pyspark.

spark version : 2.0

graphframe version : 0.2.0

I am able to import graphframes in Jupyter:

5 answers
  •  别那么骄傲
    2021-02-07 16:46

    I was able to make it work.

    Download the GraphFrames jar that matches your Spark version from https://spark-packages.org/package/graphframes/graphframes.

    Then put the downloaded jar in your Spark jars directory (here it is fetched straight into that directory):

        root@93d8398b53f2:/usr/local/spark/jars# wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
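
The jar name above combines three versions: GraphFrames, Spark, and Scala. As a minimal sketch (the function name `graphframes_artifact` is hypothetical, and the pattern is only inferred from the file name in this answer), the version tag, jar file name, and the Maven-style coordinate that `--packages` takes can be assembled like this:

```python
def graphframes_artifact(gf_version, spark_version, scala_version):
    """Build the version tag, jar file name, and --packages coordinate.

    Assumption: the naming pattern is inferred from the single artifact
    used in this answer (graphframes-0.3.0-spark2.0-s_2.11.jar).
    """
    tag = "{}-spark{}-s_{}".format(gf_version, spark_version, scala_version)
    jar_name = "graphframes-{}.jar".format(tag)
    coordinate = "graphframes:graphframes:{}".format(tag)
    return tag, jar_name, coordinate


tag, jar_name, coordinate = graphframes_artifact("0.3.0", "2.0", "2.11")
print(jar_name)
print(coordinate)
```

Check the package page for the exact tag that exists for your Spark and Scala versions before relying on a generated name.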
    

    Here is the trick: the first time, launch pyspark with the --packages argument so that it downloads all of the GraphFrames jar dependencies:

        root@93d8398b53f2:/usr/local/spark/bin# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
    

    This should come up:

    Ivy Default Cache set to: /root/.ivy2/cache
    The jars for the packages stored in: /root/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    graphframes#graphframes added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
        found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
        found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
        found org.scala-lang#scala-reflect;2.11.0 in central
        found org.slf4j#slf4j-api;1.7.7 in central
    downloading http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar ...
        [SUCCESSFUL ] graphframes#graphframes;0.3.0-spark2.0-s_2.11!graphframes.jar (269ms)
    downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar ...
        [SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2!scala-logging-api_2.11.jar (53ms)
    downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar ...
        [SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2!scala-logging-slf4j_2.11.jar (66ms)
    downloading https://repo1.maven.org/maven2/org/scala-lang/scala-reflect/2.11.0/scala-reflect-2.11.0.jar ...
        [SUCCESSFUL ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1409ms)
    downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.7/slf4j-api-1.7.7.jar ...
        [SUCCESSFUL ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (53ms)
    :: resolution report :: resolve 6161ms :: artifacts dl 1877ms
        :: modules in use:
        com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
        com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
        graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
        org.scala-lang#scala-reflect;2.11.0 from central in [default]
        org.slf4j#slf4j-api;1.7.7 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   5   |   5   |   0   ||   5   |   5   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        5 artifacts copied, 0 already retrieved (4713kB/39ms)
    Warning: Local jar /usr/local/spark-2.0.0-bin-hadoop2.7/bin/graphframes-0.3.0-spark2.0-s_2.11.jar does not exist, skipping.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/11/17 15:43:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/11/17 15:43:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
          /_/
    
    Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
    SparkSession available as 'spark'.
    >>> 
    

    This means all the required dependencies have been downloaded. The important lines are "Ivy Default Cache set to: /root/.ivy2/cache" and, more precisely, the one saying the jars are stored in /root/.ivy2/jars.

    You can exit right away. If you go ahead at this point and run Python code that calls GraphFrame, it raises this error:

        Py4JJavaError: An error occurred while calling o561.newInstance.
        : java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.GraphFrame. 
    

    Let's see what's inside the directory /root/.ivy2/jars:

    root@93d8398b53f2:/usr/local/spark/bin# ls /root/.ivy2/jars/
    com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar  com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar  graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar  org.scala-lang_scala-reflect-2.11.0.jar  org.slf4j_slf4j-api-1.7.7.jar
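
Before copying anything, it can help to see which of these jars Spark does not have yet. The following is a small sketch, not part of the original answer; the helper name `jars_missing_from_spark` is hypothetical, and both directory paths are supplied by the caller:

```python
import os


def jars_missing_from_spark(ivy_jars_dir, spark_jars_dir):
    """Return the sorted names of .jar files that exist in the Ivy jars
    directory but not in the Spark jars directory."""
    ivy_jars = {n for n in os.listdir(ivy_jars_dir) if n.endswith(".jar")}
    spark_files = set(os.listdir(spark_jars_dir))
    return sorted(ivy_jars - spark_files)
```

With the paths from this answer, the call would be `jars_missing_from_spark("/root/.ivy2/jars", "/usr/local/spark/jars")`.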
    

    Now copy all the jars in /root/.ivy2/jars to your Spark jars directory:

        root@93d8398b53f2:/usr/local/spark/jars# cp /root/.ivy2/jars/* .
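
The same copy step can be scripted, for example when building a container image. This is a sketch under assumptions: `copy_ivy_jars` is a hypothetical helper, and unlike `cp /root/.ivy2/jars/* .` it copies only files ending in `.jar`:

```python
import glob
import os
import shutil


def copy_ivy_jars(ivy_jars_dir, spark_jars_dir):
    """Copy every .jar file from the Ivy jars directory into the Spark
    jars directory and return the copied file names in sorted order."""
    copied = []
    for src in sorted(glob.glob(os.path.join(ivy_jars_dir, "*.jar"))):
        shutil.copy(src, spark_jars_dir)  # overwrites a file of the same name
        copied.append(os.path.basename(src))
    return copied
```

With the paths from this answer, the call would be `copy_ivy_jars("/root/.ivy2/jars", "/usr/local/spark/jars")`.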
    

    Launch pyspark for the second time:

        root@93d8398b53f2:/usr/local/spark/jars# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
    

    This should come up:

    Ivy Default Cache set to: /root/.ivy2/cache
    The jars for the packages stored in: /root/.ivy2/jars
    :: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    graphframes#graphframes added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
        found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
        found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
        found org.scala-lang#scala-reflect;2.11.0 in central
        found org.slf4j#slf4j-api;1.7.7 in central
    :: resolution report :: resolve 748ms :: artifacts dl 27ms
        :: modules in use:
        com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
        com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
        graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
        org.scala-lang#scala-reflect;2.11.0 from central in [default]
        org.slf4j#slf4j-api;1.7.7 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 5 already retrieved (0kB/24ms)
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/11/17 15:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/11/17 15:53:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
          /_/
    
    Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
    SparkSession available as 'spark'.
    >>> 
    

    You can now enjoy GraphFrame:

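    >>> # Create a Vertex DataFrame with a unique ID column "id"
    ... # (not shown in the original post; values assumed from the GraphFrames quick-start)
    ... v = sqlContext.createDataFrame([
    ...   ("a", "Alice", 34),
    ...   ("b", "Bob", 36),
    ...   ("c", "Charlie", 30),
    ... ], ["id", "name", "age"])
    >>> # Create an Edge DataFrame with "src" and "dst" columns
    ... e = sqlContext.createDataFrame([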
    ...   ("a", "b", "friend"),
    ...   ("b", "c", "follow"),
    ...   ("c", "b", "follow"),
    ... ], ["src", "dst", "relationship"])
    >>> # Create a GraphFrame
    ... from graphframes import *
    >>> g = GraphFrame(v, e)
    >>> 
    >>> # Query: Get in-degree of each vertex.
    ... g.inDegrees.show()
    +---+--------+                                                                  
    | id|inDegree|
    +---+--------+
    |  c|       1|
    |  b|       2|
    +---+--------+
    >>> 
    >>> # Query: Count the number of "follow" connections in the graph.
    ... g.edges.filter("relationship = 'follow'").count()
    2       
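    >>> # Run PageRank and show the results.
    ... # (call not shown in the original post; parameters assumed from the GraphFrames quick-start)
    ... results = g.pageRank(resetProbability=0.01, maxIter=20)
    >>> results.vertices.select("id", "pagerank").show()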
    16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9059:
    [rdd_337_0]
    16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9060:
    [rdd_337_1]
    +---+-------------------+
    | id|           pagerank|
    +---+-------------------+
    |  a|               0.01|
    |  b| 0.2808611427228327|
    |  c|0.27995525261339177|
    +---+-------------------+
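
Since the question mentions Jupyter, where pyspark is not started from the command line, one possible way to pass the same `--packages` argument is the `PYSPARK_SUBMIT_ARGS` environment variable. This is a sketch that has not been tested against the setup above. It assumes the notebook itself starts the Spark JVM through pyspark and that the variable is set before any SparkContext or SparkSession is created; `build_submit_args` is a hypothetical helper:

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value for the given package coordinates.

    The value ends with "pyspark-shell", which pyspark expects as the
    final token when it launches the JVM.
    """
    return "--packages {} pyspark-shell".format(",".join(packages))


# Assumption: this runs in the first notebook cell, before Spark is started.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["graphframes:graphframes:0.3.0-spark2.0-s_2.11"]
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After this cell, create the SparkSession as usual and import graphframes.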
    
