Using graphframes with PyCharm

Submitted by 落花浮王杯 on 2019-12-05 06:32:48


I have spent almost two days searching the internet without being able to solve this problem. I am trying to install the graphframes package (version 0.2.0-spark2.0-s_2.11) to run with Spark through PyCharm but, despite my best efforts, I have not managed to get it working.

I have tried almost everything. Please note that I also checked this site before posting my question.

Here is the code I am trying to run:

# IMPORT OTHER LIBS --------------------------------------------------------
import os
import sys
import pandas as pd

# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME

# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/")

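try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    # graphframes was copied into pyspark's directory, hence this import path;
    # the module is bound to `gf` because the code below calls gf.GraphFrame
    from pyspark.graphframes import graphframe as gf

except ImportError as ex:
    print "Can not import Spark Modules", ex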

# GLOBAL VARIABLES --------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)

# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":

    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'

    # LOAD DATA CSV USING  PANDAS -----------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
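    # Read csv file and load as df
    # NOTE: this call was lost from the post; the path prefix and chunk size are assumed
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        iterator=True,
                        chunksize=1000)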

    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))

    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')

    # Name Columns
    GENES_DF_CLEAN.columns = ['gene_id']

    # Output dataFrame
    print GENES_DF_CLEAN

    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)

    # Show some vertices
    print VERTICES.take(5)

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
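    # Read csv file and load as df
    # NOTE: only the usecols line survived in the post; the path prefix and chunk size are assumed
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        iterator=True,
                        chunksize=1000)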

    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))

    # Name Columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]

    # Output dataFrame
    print EDGES_DF

    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)

    # Show some edges
    print EDGES.take(5)

    g = gf.GraphFrame(VERTICES, EDGES)

Needless to say, I have tried copying the graphframes directory into Spark's pyspark directory, but that does not seem to be enough, and everything else I have tried has failed as well. I would appreciate some help with this. Below is the output and the error message I am getting:

Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

STEP 1: Loading Gene Nodes -------------------------------------------------------------
0         MAP2K4
1           MYPN
2          ACVR1
3          GATA2
4           RPA2
5           ARF1
6           ARF3
8           XRN1
9            APP
10         APLP1
11        CITED2
12         EP300
13          APOB
14         ARRB2
15         CSF1R
16        PRRC2A
17          LSM1
18        SLC4A1
19          BCL3
20         ADRB1
21         BRCA1
25         ARVCF
26         PCBD1
27         PSEN2
28         CAPN3
29         ITPR1
30         MAGI1
31           RB1
32        TSG101
33          ORC1
...          ...
49379      WDR26
49380      WDR5B
49382       NLE1
49383      WDR12
49385      WDR53
49386      WDR59
49387      WDR61
49409       CHD6
49422      DACT1
49424      KMT2B
49438    SMARCA1
49459    DCLRE1A
49469      F2RL1
49472      SENP8
49475      TSPY1
49479   SERPINB5
49521     HOXA11
49548       SYF2
49553      FOXN3
49557      MLANA
49608     REPIN1
49609       GMNN
49670  HIST2H2BE
49767      BCL7C
49797      SIRT3
49810       KLF4
49858        RHO
49896     MAGEA2
49907   SUV420H2
49958     SAP30L

[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
           src       dst                  rel_type
0       MAP2K4      FLNC                Two-hybrid
1         MYPN     ACTN2                Two-hybrid
2        ACVR1      FNTA                Two-hybrid
3        GATA2       PML                Two-hybrid
4         RPA2     STAT3                Two-hybrid
5         ARF1      GGA3                Two-hybrid
6         ARF3    ARFIP2                Two-hybrid
7         ARF3    ARFIP1                Two-hybrid
8         XRN1     ALDOA                Two-hybrid
9          APP    APPBP2                Two-hybrid
10       APLP1      DAB1                Two-hybrid
11      CITED2    TFAP2A                Two-hybrid
12       EP300    TFAP2A                Two-hybrid
13        APOB      MTTP                Two-hybrid
14       ARRB2    RALGDS                Two-hybrid
15       CSF1R      GRB2                Two-hybrid
16      PRRC2A      GRB2                Two-hybrid
17        LSM1      NARS                Two-hybrid
18      SLC4A1  SLC4A1AP                Two-hybrid
19        BCL3     BARD1                Two-hybrid
20       ADRB1     GIPC1                Two-hybrid
21       BRCA1      ATF1                Two-hybrid
22       BRCA1      MSH2                Two-hybrid
23       BRCA1     BARD1                Two-hybrid
24       BRCA1      MSH6                Two-hybrid
25       ARVCF     CDH15                Two-hybrid
26       PCBD1   CACNA1C                Two-hybrid
27       PSEN2     CAPN1                Two-hybrid
28       CAPN3       TTN                Two-hybrid
29       ITPR1       CA8                Two-hybrid
...        ...       ...                       ...
49969    SAP30     HDAC3  Affinity Capture-Western
49970    BRCA1     RBBP8           Co-localization
49971    BRCA1     BRCA1      Biochemical Activity
49972      SET     TREX1           Co-purification
49973      SET     TREX1     Reconstituted Complex
49974   PLAGL1     EP300     Reconstituted Complex
49975   PLAGL1    CREBBP     Reconstituted Complex
49976    EP300    PLAGL1  Affinity Capture-Western
49977     MTA1      ESR1     Reconstituted Complex
49978    SIRT2     EP300  Affinity Capture-Western
49979    EP300     SIRT2  Affinity Capture-Western
49980    EP300     HDAC1  Affinity Capture-Western
49981    EP300     SIRT2      Biochemical Activity
49982    MIER1    CREBBP     Reconstituted Complex
49983  SMARCA4     SIN3A  Affinity Capture-Western
49984  SMARCA4     HDAC2  Affinity Capture-Western
49985     ESR1     NCOA6  Affinity Capture-Western
49986     ESR1     TOP2B  Affinity Capture-Western
49987     ESR1     PRKDC  Affinity Capture-Western
49988     ESR1     PARP1  Affinity Capture-Western
49989     ESR1     XRCC5  Affinity Capture-Western
49990     ESR1     XRCC6  Affinity Capture-Western
49991    PARP1     TOP2B  Affinity Capture-Western
49992    PARP1     PRKDC  Affinity Capture-Western
49993    PARP1     XRCC5  Affinity Capture-Western
49994    PARP1     XRCC6  Affinity Capture-Western
49995    SIRT3     XRCC6  Affinity Capture-Western
49996    SIRT3     XRCC6     Reconstituted Complex
49997    SIRT3     XRCC6      Biochemical Activity
49998    HDAC1      PAX3  Affinity Capture-Western

[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
  File "/Users/username/PycharmProjects/GenesAssociation/", line 99, in <module>
    g = gf.GraphFrame(VERTICES, EDGES)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/", line 62, in __init__
    self._jvm_gf_api = _java_api(self._sc)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/", line 34, in _java_api
    return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/", line 933, in __call__
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/", line 63, in deco
    return f(*a, **kw)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
    at java.lang.ClassLoader.loadClass(
    at java.lang.ClassLoader.loadClass(
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at py4j.reflection.MethodInvoker.invoke(
    at py4j.reflection.ReflectionEngine.invoke(
    at py4j.Gateway.invoke(
    at py4j.commands.AbstractCommand.invokeMethod(
    at py4j.commands.CallCommand.execute(

Process finished with exit code 1

Thanks in advance.


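You can set PYSPARK_SUBMIT_ARGS either in your code, before the SparkSession (or SparkContext) is created:

import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()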

or in PyCharm, by editing the run configuration (Run -> Edit Configurations -> choose the configuration -> Configuration tab -> Environment variables -> add PYSPARK_SUBMIT_ARGS).

Here is a minimal working example:

import os
import sys

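# Placeholder path: adjust to your own Spark installation
SPARK_HOME = "/path/to/spark-2.0.0-bin-hadoop2.7"
os.environ["SPARK_HOME"] = SPARK_HOME
# os.environ["PYSPARK_SUBMIT_ARGS"] = ... If not set in PyCharm config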

sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python/lib/"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

v = spark.createDataFrame([("a", "foo"), ("b", "bar")], ["id", "attr"])
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])

# Imported after the session exists, once the package has been loaded via --packages
from graphframes import GraphFrame

g = GraphFrame(v, e)
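
If more than one package is needed, the value of PYSPARK_SUBMIT_ARGS can be assembled in code. This is only a minimal sketch: `build_submit_args` is a hypothetical helper, not part of PySpark. It relies on two things: `--packages` takes a comma-separated list of Maven coordinates, and the value has to end with `pyspark-shell`.

```python
import os


def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value for the given Maven coordinates.

    Hypothetical helper for illustration; not a PySpark API.
    """
    if not packages:
        # No extra packages: only the trailing pyspark-shell token is needed.
        return "pyspark-shell"
    # --packages expects comma-separated group:artifact:version coordinates.
    return "--packages {} pyspark-shell".format(",".join(packages))


# Set this before the first SparkSession / SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["graphframes:graphframes:0.2.0-spark2.0-s_2.11"]
)
```

Keeping the coordinates in a list makes it harder to forget the trailing `pyspark-shell` when another package is added later.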


You could also add the packages or jars to your spark-defaults.conf.
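
For the spark-defaults.conf route, the relevant property is `spark.jars.packages` (a comma-separated list of Maven coordinates). As a rough sketch, the hypothetical helper below (not part of Spark) adds a coordinate to that property in a conf file. It assumes the file uses whitespace-separated `key value` lines, and that the file lives in the conventional location `$SPARK_HOME/conf/spark-defaults.conf`.

```python
import os


def ensure_spark_package(conf_path, coordinate):
    """Add `coordinate` to spark.jars.packages in a spark-defaults.conf file.

    Hypothetical helper for illustration. Assumes "key value" lines
    separated by whitespace. Returns True if the file was changed.
    """
    key = "spark.jars.packages"
    lines = []
    if os.path.exists(conf_path):
        with open(conf_path) as handle:
            lines = handle.read().splitlines()

    for index, line in enumerate(lines):
        parts = line.split(None, 1)  # split key from value on whitespace
        if parts and parts[0] == key:
            current = parts[1].split(",") if len(parts) > 1 else []
            current = [item.strip() for item in current if item.strip()]
            if coordinate in current:
                return False  # already configured, leave the file alone
            lines[index] = "{} {}".format(key, ",".join(current + [coordinate]))
            break
    else:
        # Property not present yet: append a new line for it.
        lines.append("{} {}".format(key, coordinate))

    with open(conf_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return True


# Example call (commented out so that nothing is modified by accident):
# conf = os.path.join(os.environ["SPARK_HOME"], "conf", "spark-defaults.conf")
# ensure_spark_package(conf, "graphframes:graphframes:0.2.0-spark2.0-s_2.11")
```

Editing the file by hand works just as well; the resulting line is simply `spark.jars.packages graphframes:graphframes:0.2.0-spark2.0-s_2.11`.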

If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python libraries from the JAR, so you'll have to do it manually. You can, for example, download the JAR file, unzip it, and make sure that the root directory containing graphframes is on your Python path. This has been fixed in graphframes 0.3.
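
The manual workaround could be scripted roughly as follows. This is a sketch under assumptions: `add_package_from_jar` is a hypothetical helper, and it assumes the Python sources sit in a top-level `graphframes/` directory inside the JAR, as described above. A JAR is a ZIP archive, so the standard `zipfile` module can read it.

```python
import os
import sys
import zipfile


def add_package_from_jar(jar_path, target_dir, package="graphframes"):
    """Extract the Python package `package` from a JAR and put it on sys.path.

    Hypothetical helper for illustration. Returns the extracted package path.
    """
    prefix = package + "/"
    with zipfile.ZipFile(jar_path) as jar:
        # Keep only the Python package; skip the compiled JVM classes.
        members = [name for name in jar.namelist() if name.startswith(prefix)]
        if not members:
            raise ValueError(
                "no '{}' directory found in {}".format(package, jar_path)
            )
        jar.extractall(target_dir, members=members)
    # The directory that *contains* graphframes/ must be on the Python path.
    if target_dir not in sys.path:
        sys.path.append(target_dir)
    return os.path.join(target_dir, package)
```

This only makes the Python side importable. The JVM classes still have to be loaded, for example through `--packages` as shown earlier.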

