Reading from the Google Storage gs:// filesystem from a local Spark instance

2021-01-07 05:58

The problem is quite simple: you have a local Spark instance (either a cluster or just running in local mode) and you want to read from gs://

3 Answers
  • 2021-01-07 06:31

    Considering that it has been a while since the last answer, I thought I would share my recent solution. Note that the following instructions are for Spark 2.4.4.

    1. Download the "gcs-connector" for the type of Spark/Hadoop you have got from here. Search for "Other Spark/Hadoop clusters" topic.
    2. Move the "gcs-connector" to $SPARK_HOME/jars. See more about $SPARK_HOME below.
    3. Make sure that all the environment variables are properly set up for you Spark application to run. This is:
      a. SPARK_HOME pointing to the location where you have saved Spark installations.
      b. GOOGLE_APPLICATION_CREDENTIALS pointing to the location where json key is. If you have just downloaded it, it will be in your ~/Downloads
      c. JAVA_HOME pointing to the location where you have your Java 8* "Home" folder.

      If you are on Linux/macOS you can use export VAR=DIR, where VAR is the variable name and DIR the location; if you want to set them permanently, add them to your ~/.bash_profile or ~/.zshrc file. For Windows users, in cmd write set VAR=DIR to set a variable for the current shell, or setx VAR DIR to store it permanently.

    That has worked for me, and I hope it helps others too. A quick read test is sketched below.

    * Spark works on Java 8, therefore some of its features might not be compatible with the latest Java Development Kit.
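
    As a quick sanity check of this setup, here is a minimal sketch (the bucket and file names are placeholders): with the connector jar in $SPARK_HOME/jars and GOOGLE_APPLICATION_CREDENTIALS exported, a read from gs:// in spark-shell should just work.

      // In spark-shell, the gcs-connector jar in $SPARK_HOME/jars is picked up automatically,
      // and GOOGLE_APPLICATION_CREDENTIALS supplies the service-account key.
      val df = spark.read
        .option("header", "true")
        .csv("gs://my-bucket/path/to/file.csv")   // placeholder bucket/path

      df.printSchema()
      df.show(5)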

  • 2021-01-07 06:37

    I am submitting here the solution I have come up with by combining different resources:

    1. Download the Google Cloud Storage connector: gcs-connector, and store it in the $SPARK/jars/ folder (check Alternative 1 at the bottom)

    2. Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which Spark uses).

    3. Store the core-site.xml file in a folder. Personally I create the $SPARK/conf/hadoop/conf/ folder and store it there.

    4. In the spark-env.sh file, indicate the Hadoop conf folder by adding the following line: export HADOOP_CONF_DIR=</absolute/path/to/hadoop/conf/>

    5. Create an OAuth2 key from the respective page of the Google Console (Google Console -> API Manager -> Credentials).

    6. Copy the credentials to the core-site.xml file.

    Alternative 1: Instead of copying the file to the $SPARK/jars folder, you can store the jar in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH in the spark-env.sh file, but SPARK_CLASSPATH is now deprecated. Therefore one can look here on how to add a jar to the Spark classpath; a short sketch follows below.
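
    For example, a minimal sketch (the jar path is a placeholder) of putting the connector on the classpath without copying it into $SPARK/jars, by pointing spark.jars at it when the session is built:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("gcs-connector-via-spark-jars")
        // spark.jars adds the listed jars to the driver and executor classpaths,
        // replacing the deprecated SPARK_CLASSPATH approach.
        .config("spark.jars", "/path/to/gcs-connector-shaded.jar") // placeholder path
        .getOrCreate()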

    <configuration>
        <property>
            <name>fs.gs.impl</name>
            <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
            <description>Register GCS Hadoop filesystem</description>
        </property>
        <property>
            <name>fs.gs.auth.service.account.enable</name>
            <value>false</value>
            <description>Force OAuth2 flow</description>
         </property>
         <property>
            <name>fs.gs.auth.client.id</name>
            <value>32555940559.apps.googleusercontent.com</value>
            <description>Client id of Google-managed project associated with the Cloud SDK</description>
         </property>
         <property>
            <name>fs.gs.auth.client.secret</name>
            <value>fslkfjlsdfj098ejkjhsdf</value>
            <description>Client secret of Google-managed project associated with the Cloud SDK</description>
         </property>
         <property>
            <name>fs.gs.project.id</name>
            <value>_THIS_VALUE_DOES_NOT_MATTER_</value>
            <description>This value is required by GCS connector, but not used in the tools provided here.
      The value provided is actually an invalid project id (starts with `_`).
          </description>
       </property>
    </configuration>
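
    For completeness, here is a minimal sketch (project id, client credentials, bucket and path are placeholders) that applies the same properties programmatically on the Hadoop configuration instead of through core-site.xml, which can be handy for quick local tests:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("gcs-core-site-example")
        .getOrCreate()

      // The same settings as in core-site.xml, applied directly to the Hadoop configuration.
      val hadoopConf = spark.sparkContext.hadoopConfiguration
      hadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
      hadoopConf.set("fs.gs.auth.service.account.enable", "false")   // force the OAuth2 flow, as above
      hadoopConf.set("fs.gs.auth.client.id", "<client id>")          // from the Google Console credentials
      hadoopConf.set("fs.gs.auth.client.secret", "<client secret>")
      hadoopConf.set("fs.gs.project.id", "my-project-id")            // placeholder

      // Placeholder bucket/path; any object you can read works.
      spark.read.text("gs://my-bucket/some/path.txt").show(5, truncate = false)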
    
  • 2021-01-07 06:39

    In my case, on Spark 2.4.3, I needed to do the following to enable GCS access from local Spark. I used a JSON keyfile instead of the client.id/secret proposed above.

    1. In $SPARK_HOME/jars/, use the shaded gcs-connector jar from here: http://repo2.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-1.9.17/; otherwise I had various failures with transitive dependencies.

    2. (Optional) To my build.sbt I added:

      libraryDependencies += ("com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-1.9.17")
        .exclude("javax.jms", "jms")
        .exclude("com.sun.jdmk", "jmxtools")
        .exclude("com.sun.jmx", "jmxri")
      
    3. In $SPARK_HOME/conf/spark-defaults.conf, add:

      spark.hadoop.google.cloud.auth.service.account.enable       true
      spark.hadoop.google.cloud.auth.service.account.json.keyfile /path/to/my/keyfile
      

    And everything is working.
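
    If you prefer not to edit spark-defaults.conf, the same two settings can be passed when building the session. A minimal sketch (the keyfile path and bucket are placeholders):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("gcs-json-keyfile-example")
        // Equivalent to the spark-defaults.conf entries above; spark.hadoop.* settings
        // are forwarded to the underlying Hadoop configuration.
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/my/keyfile") // placeholder
        .getOrCreate()

      // Placeholder bucket/path.
      println(spark.read.textFile("gs://my-bucket/some/file.txt").count())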
