google-cloud-dataproc

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

时间秒杀一切 submitted on 2019-12-07 05:58:28
Question: I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the Dataproc 1.1 image. When running the project locally and building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When submitting the job to Dataproc, the Dataproc versions of these libraries are preferred and the job will reference class…

Error while running PySpark DataProc Job due to python version

爷,独闯天下 submitted on 2019-12-06 02:49:27
Question: I create a Dataproc cluster using the following command:

    gcloud dataproc clusters create datascience \
        --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh \

However, when I submit my PySpark job, I get the following error:

    Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Any thoughts?

Answer 1: This…
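
The error itself says the driver and the workers are running different Python minor versions. A quick way to confirm which interpreter each side actually uses — a minimal diagnostic sketch, not taken from the original post:

    import sys
    import pyspark

    sc = pyspark.SparkContext()

    # Interpreter version on the driver
    driver_version = sys.version_info[:3]

    # Interpreter versions the executors actually run tasks with
    worker_versions = (sc.parallelize(range(4), 2)
                         .map(lambda _: sys.version_info[:3])
                         .distinct()
                         .collect())

    print("driver: ", driver_version)
    print("workers:", worker_versions)

If the two disagree, pointing PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) at the same interpreter on every node — for example via cluster properties or an initialization action — is the usual remedy.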

ImportError: No module named numpy - Google Cloud Dataproc when using Jupyter Notebook

ぃ、小莉子 submitted on 2019-12-05 19:12:30
When starting Jupyter Notebook on Google Dataproc, importing modules fails. I have tried to install the modules using different commands. Some examples:

    import os
    os.system("sudo apt-get install python-numpy")
    os.system("sudo pip install numpy")         # after having installed pip
    os.system("sudo pip install python-numpy")  # after having installed pip
    import numpy

None of the above works; each attempt still ends in the ImportError. When using the command line I am able to install modules, but the import error remains. I guess I am installing the modules in the wrong location. Any…
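
A likely explanation — an assumption, since the notebook setup isn't shown — is that sudo pip installs the package for a different interpreter than the one backing the notebook kernel, and in any case only on the master node. A minimal sketch that installs into the kernel's own interpreter:

    import subprocess
    import sys

    # Install numpy into the exact interpreter running this notebook kernel.
    # Note: this only affects the master node, not the workers.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])

    import numpy
    print(numpy.__version__)

If worker tasks also need the package (for example inside PySpark map functions), installing it cluster-wide through an initialization action at cluster-creation time is the usual approach.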

load table from bigquery to spark cluster with pyspark script

喜欢而已 submitted on 2019-12-05 13:27:43
Question: I have a data table loaded in BigQuery, and I want to import it into my Spark cluster via a PySpark .py file. I saw in "Dataproc + BigQuery examples - any available?" that there is a way to load a BigQuery table into the Spark cluster with Scala, but is there a way to do it in a PySpark script?

Answer 1 (James): This comes from @MattJ in this question. Here's an example that connects to BigQuery from Spark and performs a word count:

    import json
    import pyspark

    sc = pyspark.SparkContext()
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")
    conf = {"mapred.bq.project.id": "<project_id>", …
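
The answer is cut off above. A fuller version of the same pattern, sketched around the Hadoop BigQuery connector that ships on Dataproc clusters — the project, dataset, table, and column names below are placeholders:

    import json
    import pyspark

    sc = pyspark.SparkContext()
    hadoopConf = sc._jsc.hadoopConfiguration()
    bucket = hadoopConf.get("fs.gs.system.bucket")

    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.input.project.id": "<project_id>",
        "mapred.bq.input.dataset.id": "<dataset_id>",
        "mapred.bq.input.table.id": "<table_id>",
    }

    # Each record comes back as (row index, JSON string describing the row).
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)

    # Word count over one string column of the table.
    counts = (table_data
              .map(lambda record: json.loads(record[1]))
              .map(lambda row: (row["<column_name>"], 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))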

Pausing Dataproc cluster - Google Compute engine

孤街浪徒 submitted on 2019-12-05 11:49:38
Question: Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/ only show how to destroy a cluster, but I have installed the Spark Cassandra connector API, for example. Is my only alternative to just create an image that I'll need to install every time?

Answer 1: In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization…

pickle.PicklingError: Cannot pickle files that are not opened for reading

白昼怎懂夜的黑 submitted on 2019-12-05 11:48:31
Question: I'm getting this error while running a PySpark job on Dataproc. What could be the reason? This is the stack trace of the error:

    File "/usr/lib/python2.7/pickle.py", line 331, in save
      self.save_reduce(obj=obj, *rv)
    File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 553, in save_reduce
    File "/usr/lib/python2.7/pickle.py", line 286, in save
      f(self, obj)  # Call unbound method with explicit self
    File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
      self._batch_setitems(obj.iteritems())
    File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
      save(v)
    File "/usr/lib…
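
The trace does not show which object failed to serialize, but this particular message is what cloudpickle raises when a task closure captures a file handle opened for writing (or an object that holds one, such as a logger or a client). That diagnosis is an assumption here, since the job code isn't shown; a minimal sketch of the anti-pattern and one way around it, with placeholder paths:

    import pyspark

    sc = pyspark.SparkContext()
    log_file = open("/tmp/driver.log", "w")  # exists only on the driver

    # Anti-pattern: the lambda closes over the open file handle, so cloudpickle
    # must serialize it to ship the function to the executors and fails with
    # "Cannot pickle files that are not opened for reading".
    # sc.parallelize(range(10)).foreach(lambda x: log_file.write("%d\n" % x))

    # Workaround: keep non-picklable objects out of the closure and create
    # whatever the task needs on the worker itself.
    def process(x):
        with open("/tmp/worker.log", "a") as f:
            f.write("%d\n" % x)
        return x * x

    print(sc.parallelize(range(10)).map(process).collect())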

How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

a 夏天 submitted on 2019-12-04 23:51:08
Question: I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use Spring Boot to create the resources. The steps of the solution are the following:

1. The user makes a POST request with a JSON file as the input.
2. The JSON file is stored in the Google Cloud Storage bucket associated with the Dataproc cluster.
3. An aggregating Spark job is triggered from within the REST method, with the specified jars and classes, and the link to the JSON file as the argument.

I want the job to be triggered using Dataproc's…
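
The question is cut off, but the submission it describes (a main class, jar URIs, and arguments, all referencing GCS paths) maps directly onto Dataproc's JobController API, which is what the Java client wraps. For consistency with the other examples in this digest, here is the same request shape sketched with the Python client library; the Java client exposes equivalent JobControllerClient and SparkJob types, exact method names vary between client-library versions, and the project, region, cluster, bucket, and class names are placeholders:

    from google.cloud import dataproc_v1

    region = "us-central1"
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "%s-dataproc.googleapis.com:443" % region}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            "main_class": "com.example.JsonAggregator",
            "jar_file_uris": ["gs://my-bucket/jobs/aggregator.jar"],
            "args": ["gs://my-bucket/uploads/input.json"],
        },
    }

    submitted = job_client.submit_job(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    print("Submitted job:", submitted.reference.job_id)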