The problem is quite simple: you have a local Spark instance (a cluster, or just local mode) and you want to read from `gs://`.
I am posting here the solution I came up with by combining different resources:
1. Download the Google Cloud Storage connector: gs-connector, and store it in the `$SPARK/jars/` folder (check Alternative 1 at the bottom).
2. Download the `core-site.xml` file from here, or copy it from below. This is a configuration file used by Hadoop (which Spark uses under the hood).
3. Store the `core-site.xml` file in a folder. Personally I create the `$SPARK/conf/hadoop/conf/` folder and store it there.
4. In the `spark-env.sh` file, point to the Hadoop conf folder by adding the following line: `export HADOOP_CONF_DIR=$SPARK/conf/hadoop/conf` (or whichever folder you used in the previous step).
5. Create an OAuth2 key from the respective page of the Google Console (Google Console -> API Manager -> Credentials).
6. Copy the credentials into the `core-site.xml` file.
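Once these steps are in place, Spark should resolve `gs://` paths through the GCS connector. A minimal sanity check in PySpark could look like the sketch below; the bucket and file names are placeholders, not part of the setup above.

```python
# Minimal sanity check (PySpark). Assumes HADOOP_CONF_DIR points at the folder
# containing core-site.xml and that the gs-connector jar is on the classpath.
# "my-bucket" and "some-file.csv" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# Any gs:// path is now handled by the GoogleHadoopFileSystem registered in core-site.xml.
df = spark.read.csv("gs://my-bucket/some-file.csv", header=True)
df.show(5)
```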
Alternative 1: Instead of copying the jar to the `$SPARK/jars` folder, you can store it in any folder and add that folder to the Spark classpath. One way is to edit `SPARK_CLASSPATH` in the `spark-env.sh` file, but `SPARK_CLASSPATH` is now deprecated. Therefore one can look here on how to add a jar to the Spark classpath; one option is sketched below.
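For example (a sketch, not the only way), the connector jar can be passed through the `spark.jars` property when the session is built; the jar path below is a placeholder for wherever you stored it.

```python
# Sketch: load the connector from an arbitrary folder via spark.jars,
# avoiding the deprecated SPARK_CLASSPATH. The jar path is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gcs-read-test")
         .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
         .getOrCreate())
```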
Here is the `core-site.xml` file referenced in step 2:

```xml
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>Register GCS Hadoop filesystem</description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>false</value>
    <description>Force OAuth2 flow</description>
  </property>
  <property>
    <name>fs.gs.auth.client.id</name>
    <value>32555940559.apps.googleusercontent.com</value>
    <description>Client id of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.auth.client.secret</name>
    <value>fslkfjlsdfj098ejkjhsdf</value>
    <description>Client secret of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>_THIS_VALUE_DOES_NOT_MATTER_</value>
    <description>This value is required by the GCS connector, but not used in the tools provided here.
    The value provided is actually an invalid project id (it starts with `_`).</description>
  </property>
</configuration>
```
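If you would rather not maintain a separate Hadoop conf folder, the same keys can also be passed to Spark with the `spark.hadoop.*` prefix, which Spark copies into the Hadoop configuration. This is only a sketch of that alternative; the client id and secret placeholders stand for the OAuth2 credentials created in step 5.

```python
# Sketch: set the core-site.xml keys through spark.hadoop.* instead of
# HADOOP_CONF_DIR. Replace the client id/secret placeholders with the
# OAuth2 credentials created in the Google Console.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gcs-read-test")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.gs.auth.service.account.enable", "false")
         .config("spark.hadoop.fs.gs.auth.client.id", "<your-client-id>")
         .config("spark.hadoop.fs.gs.auth.client.secret", "<your-client-secret>")
         .config("spark.hadoop.fs.gs.project.id", "_THIS_VALUE_DOES_NOT_MATTER_")
         .getOrCreate())
```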