google-cloud-dataproc

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

时间秒杀一切 submitted on 2019-12-07 05:58:28
Question: I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the Dataproc 1.1 image. When running the project locally and building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When submitting the job to Dataproc, the Dataproc versions of these libraries are preferred and the job will reference class…

Error while running PySpark DataProc Job due to python version

爷,独闯天下 submitted on 2019-12-06 02:49:27
Question: I create a Dataproc cluster using the following command:

    gcloud dataproc clusters create datascience \
        --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh \

However, when I submit my PySpark job, I get the following error:

    Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Any thoughts?

Answer 1: This…
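
The error itself says the driver and the workers are running different Python minor versions. A quick way to confirm which interpreter each side actually uses — a minimal diagnostic sketch, not taken from the original post:

    import sys
    import pyspark

    sc = pyspark.SparkContext()

    # Interpreter version on the driver
    driver_version = sys.version_info[:3]

    # Interpreter versions the executors actually run tasks with
    worker_versions = (sc.parallelize(range(4), 2)
                         .map(lambda _: sys.version_info[:3])
                         .distinct()
                         .collect())

    print("driver: ", driver_version)
    print("workers:", worker_versions)

If the two disagree, pointing PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) at the same interpreter on every node — for example via cluster properties or an initialization action — is the usual remedy.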

ImportError: No module named numpy - Google Cloud Dataproc when using Jupyter Notebook

ぃ、小莉子 submitted on 2019-12-05 19:12:30
When starting Jupyter Notebook on Google Dataproc, importing modules fails. I have tried to install the modules using different commands. Some examples:

    import os
    os.system("sudo apt-get install python-numpy")
    os.system("sudo pip install numpy")         # after having installed pip
    os.system("sudo pip install python-numpy")  # after having installed pip
    import numpy

None of the above works; each attempt still ends in the ImportError. When using the command line I am able to install modules, but the import error remains. I guess I am installing the modules in the wrong location. Any…
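
A likely explanation — an assumption, since the notebook setup isn't shown — is that sudo pip installs the package for a different interpreter than the one backing the notebook kernel, and in any case only on the master node. A minimal sketch that installs into the kernel's own interpreter:

    import subprocess
    import sys

    # Install numpy into the exact interpreter running this notebook kernel.
    # Note: this only affects the master node, not the workers.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])

    import numpy
    print(numpy.__version__)

If worker tasks also need the package (for example inside PySpark map functions), installing it cluster-wide through an initialization action at cluster-creation time is the usual approach.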

load table from bigquery to spark cluster with pyspark script

喜欢而已 submitted on 2019-12-05 13:27:43
Question: I have a data table loaded in BigQuery, and I want to import it into my Spark cluster via a PySpark .py file. I saw in "Dataproc + BigQuery examples - any available?" that there is a way to load a BigQuery table into the Spark cluster with Scala, but is there a way to do it in a PySpark script?

Answer 1 (James): This comes from @MattJ in this question. Here's an example that connects to BigQuery from Spark and performs a word count:

    import json
    import pyspark

    sc = pyspark.SparkContext()
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.get("fs.gs.system.bucket")
    conf = {"mapred.bq.project.id": "<project_id>", …
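
The answer is cut off above. A fuller version of the same pattern, sketched around the Hadoop BigQuery connector that ships on Dataproc clusters — the project, dataset, table, and column names below are placeholders:

    import json
    import pyspark

    sc = pyspark.SparkContext()
    hadoopConf = sc._jsc.hadoopConfiguration()
    bucket = hadoopConf.get("fs.gs.system.bucket")

    conf = {
        "mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.input.project.id": "<project_id>",
        "mapred.bq.input.dataset.id": "<dataset_id>",
        "mapred.bq.input.table.id": "<table_id>",
    }

    # Each record comes back as (row index, JSON string describing the row).
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)

    # Word count over one string column of the table.
    counts = (table_data
              .map(lambda record: json.loads(record[1]))
              .map(lambda row: (row["<column_name>"], 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))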

Pausing Dataproc cluster - Google Compute engine

孤街浪徒 submitted on 2019-12-05 11:49:38
Question: Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/ only show how to destroy a cluster, but I have installed the Spark Cassandra connector API, for example. Is my only alternative to just create an image that I'll need to install every time?

Answer 1: In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization…

pickle.PicklingError: Cannot pickle files that are not opened for reading

白昼怎懂夜的黑 submitted on 2019-12-05 11:48:31
Question: I'm getting this error while running a PySpark job on Dataproc. What could be the reason? This is the stack trace of the error:

    File "/usr/lib/python2.7/pickle.py", line 331, in save
      self.save_reduce(obj=obj, *rv)
    File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 553, in save_reduce
    File "/usr/lib/python2.7/pickle.py", line 286, in save
      f(self, obj)  # Call unbound method with explicit self
    File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
      self._batch_setitems(obj.iteritems())
    File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
      save(v)
    File "/usr/lib…
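
The trace does not show which object failed to serialize, but this particular message is what cloudpickle raises when a task closure captures a file handle opened for writing (or an object that holds one, such as a logger or a client). That diagnosis is an assumption here, since the job code isn't shown; a minimal sketch of the anti-pattern and one way around it, with placeholder paths:

    import pyspark

    sc = pyspark.SparkContext()
    log_file = open("/tmp/driver.log", "w")  # exists only on the driver

    # Anti-pattern: the lambda closes over the open file handle, so cloudpickle
    # must serialize it to ship the function to the executors and fails with
    # "Cannot pickle files that are not opened for reading".
    # sc.parallelize(range(10)).foreach(lambda x: log_file.write("%d\n" % x))

    # Workaround: keep non-picklable objects out of the closure and create
    # whatever the task needs on the worker itself.
    def process(x):
        with open("/tmp/worker.log", "a") as f:
            f.write("%d\n" % x)
        return x * x

    print(sc.parallelize(range(10)).map(process).collect())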

How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

a 夏天 submitted on 2019-12-04 23:51:08
Question: I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use Spring Boot to create the resources. The steps of the solution are the following:

1. The user makes a POST request with a JSON file as the input.
2. The JSON file is stored in the Google Cloud Storage bucket associated with the Dataproc cluster.
3. An aggregating Spark job is triggered from within the REST method, with the specified jars and classes, and the link to the JSON file as the argument.

I want the job to be triggered using Dataproc's…
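
The question is cut off, but the submission it describes (a main class, jar URIs, and arguments, all referencing GCS paths) maps directly onto Dataproc's JobController API, which is what the Java client wraps. For consistency with the other examples in this digest, here is the same request shape sketched with the Python client library; the Java client exposes equivalent JobControllerClient and SparkJob types, exact method names vary between client-library versions, and the project, region, cluster, bucket, and class names are placeholders:

    from google.cloud import dataproc_v1

    region = "us-central1"
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "%s-dataproc.googleapis.com:443" % region}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            "main_class": "com.example.JsonAggregator",
            "jar_file_uris": ["gs://my-bucket/jobs/aggregator.jar"],
            "args": ["gs://my-bucket/uploads/input.json"],
        },
    }

    submitted = job_client.submit_job(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    print("Submitted job:", submitted.reference.job_id)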