google-cloud-dataproc

Understanding GCP Dataproc billing and how it is affected by labels

Submitted by 江枫思渺然 on 2019-12-24 00:59:51
Question: I'm trying to make sure I have a clear understanding of how my organisation gets billed for Google Cloud Platform Dataproc. We have exported our billing history to BigQuery so that we can analyse it. This morning we had two Dataproc clusters running, and the screenshot below shows a subset of the billing history for those two clusters. I have filtered on labels.key = "goog-dataproc-cluster-uuid" or labels.key = "goog-dataproc-cluster-name" or labels.key = "goog-dataproc-location". Here is a
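
For reference, a minimal sketch of querying a billing export like this for Dataproc label costs with the Python BigQuery client. The project, dataset and table names are placeholders, and it assumes the standard billing export schema in which labels is a repeated key/value record:

from google.cloud import bigquery

# Placeholder: the fully-qualified billing export table id is hypothetical.
BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX"

query = f"""
SELECT
  l.value AS cluster_name,
  SUM(cost) AS total_cost
FROM `{BILLING_TABLE}`, UNNEST(labels) AS l
WHERE l.key = 'goog-dataproc-cluster-name'
GROUP BY cluster_name
ORDER BY total_cost DESC
"""

client = bigquery.Client()
for row in client.query(query).result():
    print(row.cluster_name, row.total_cost)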

How to get jobId that was submitted using Dataproc Workflow Template

Submitted by 爱⌒轻易说出口 on 2019-12-24 00:53:18
Question: I have submitted a Hive job using a Dataproc Workflow Template with the help of the Airflow operator (DataprocWorkflowTemplateInstantiateInlineOperator), written in Python. Once the job is submitted, some name will be assigned to it as jobId (example: job0-abc2def65gh12). Since I was not able to get the jobId, I tried to pass jobId as a parameter from the REST API, which isn't working. Can I fetch the jobId or, if that's not possible, can I pass jobId as a parameter? Answer 1: The JobId will be available as part of metadata
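
To illustrate the metadata route the answer points at, here is a hedged sketch using the google-cloud-dataproc Python client directly rather than the Airflow operator. The project, region, cluster label and Hive query are assumptions; it relies on the instantiate operation's metadata being a WorkflowMetadata message whose graph nodes carry the job IDs Dataproc generated:

from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # hypothetical project/region

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Minimal inline template: run one Hive step on an existing labelled cluster.
template = {
    "placement": {
        "cluster_selector": {"cluster_labels": {"goog-dataproc-cluster-name": "my-cluster"}}
    },
    "jobs": [
        {"step_id": "hive-step", "hive_job": {"query_list": {"queries": ["SHOW DATABASES"]}}}
    ],
}

operation = client.instantiate_inline_workflow_template(
    request={"parent": f"projects/{project_id}/regions/{region}", "template": template}
)
operation.result()  # wait for the workflow to finish

# The operation metadata describes the workflow graph; each node should expose
# the jobId Dataproc assigned (e.g. job0-abc2def65gh12).
for node in operation.metadata.graph.nodes:
    print(node.step_id, node.job_id)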

Google Cloud Dataproc drop BigQuery table not working

Submitted by ぐ巨炮叔叔 on 2019-12-23 22:27:32
Question: Hi, I tried to delete a table from BigQuery using the Java client library in Dataproc. I started spark-shell as below:

spark-shell --packages com.google.cloud:google-cloud-bigquery:1.59.0

and imported the following dependencies:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google
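
The snippet above shows the Scala/Java route and is cut off; for comparison, a minimal sketch of the same drop using the Python BigQuery client (project, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"  # placeholder fully-qualified table id

# not_found_ok avoids raising if the table has already been deleted.
client.delete_table(table_id, not_found_ok=True)
print(f"Deleted table {table_id}")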

How can I perform data lineage in GCP?

Submitted by 只谈情不闲聊 on 2019-12-23 15:15:08
Question: When we build the data lake with GCP Cloud Storage, and data processing with cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP? Thanks. Answer 1: Google Cloud Platform doesn't have a serverless data lineage offering. Instead, you may want to install Apache Atlas on Google Cloud Dataproc and use it for data lineage. Answer 2: If data lineage is important to you, you will find yourself wanting an Enterprise Data Cloud. Cloudera is the main supplier in this space,

ModuleNotFoundError because PySpark serializer is not able to locate library folder

Submitted by 房东的猫 on 2019-12-23 10:26:17
Question: I have the following folder structure:

- libfolder
  - lib1.py
  - lib2.py
- main.py

main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'libfolder'

I have zipped the folder into xyz.zip and run the
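
A sketch of one common way to make such a zipped package importable on the executors, assuming xyz.zip contains libfolder/ at its top level; the function name and submission details are assumptions, not part of the question:

# main.py
from pyspark import SparkContext

sc = SparkContext()

# Ship the zip to the driver and every executor. It must contain libfolder/ at
# its root, i.e. be zipped from the parent directory: zip -r xyz.zip libfolder
sc.addPyFile("xyz.zip")

from libfolder import lib1  # import after addPyFile so workers can resolve it

rdd = sc.parallelize(range(10))
print(rdd.map(lib1.some_function).collect())  # lib1.some_function is hypothetical

# Alternatively, attach the zip at submit time, e.g.:
#   gcloud dataproc jobs submit pyspark main.py --cluster <name> --py-files xyz.zip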

Google Dataproc - disconnect with executors often

Submitted by 三世轮回 on 2019-12-23 03:23:07
Question: I am using Dataproc to run Spark commands over a cluster using spark-shell. I frequently get error/warning messages indicating that I lose the connection with my executors. The messages look like this:

[Stage 6:> (0 + 2) / 2]
16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka

Why does Spark running in Google Dataproc store temporary files on external storage (GCS) instead of local disk or HDFS while using saveAsTextFile?

Submitted by 对着背影说爱祢 on 2019-12-23 02:55:23
Question: I have run the following PySpark code:

from pyspark import SparkContext

sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)

The job finished successfully. However, during the job execution Spark created many temporary blobs in the following path: gs://bucket-name/output_blob_path/_temporary/0/
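
Those _temporary/ blobs come from the Hadoop FileOutputCommitter that saveAsTextFile uses: task attempts write under <output path>/_temporary/ and the files are renamed into place on commit, so the temporary data lives under the GCS output path rather than on local disk or HDFS. A hedged sketch of opting into committer algorithm v2, which commits (renames) each task's output as soon as the task finishes; the bucket names mirror the question, the rest is an assumption:

from pyspark import SparkConf, SparkContext

# Assumption: FileOutputCommitter algorithm v2. This changes when the
# _temporary/ objects are moved to their final names, not where they are written.
conf = SparkConf().set(
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"
)
sc = SparkContext(conf=conf)

data = sc.textFile("gs://bucket-name/input_blob_path")
data.saveAsTextFile("gs://bucket-name/output_blob_path")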

ImportError: No module named numpy - Google Cloud Dataproc when using Jupyter Notebook

Submitted by 一世执手 on 2019-12-22 10:36:42
Question: When starting Jupyter Notebook on Google Dataproc, importing modules fails. I have tried to install the modules using different commands. Some examples:

import os
os.system("sudo apt-get install python-numpy")
os.system("sudo pip install numpy")         # after having installed pip
os.system("sudo pip install python-numpy")  # after having installed pip
import numpy

None of the above examples work; they all return an import error (screenshot omitted). When using the command line I am able to install
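
A frequent cause of this pattern is that sudo pip installs into a different Python than the one backing the Jupyter kernel. A hedged sketch of installing into the kernel's own interpreter from a notebook cell; only the package name comes from the question, the rest is an assumption, and it only covers the master node (worker nodes would still need an initialization action or similar):

import subprocess
import sys

# Install numpy into the exact interpreter this notebook kernel is running,
# rather than whichever python/pip "sudo pip" happens to resolve to.
subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "numpy"])

import numpy
print(numpy.__version__)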

Dataproc set number of vcores per executor container

Submitted by 瘦欲@ on 2019-12-20 04:24:26
Question: I'm building a Spark application which will run on Dataproc. I plan to use ephemeral clusters and spin a new one up for each execution of the application. So I basically want my job to eat up as much of the cluster resources as possible, and I have a very good idea of the requirements. I've been playing around with turning off dynamic allocation and setting up the executor instances and cores myself. Currently I'm using 6 instances and 30 cores a pop. Perhaps it's more of a YARN question,
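
For context, a sketch of the configuration described above (dynamic allocation off, 6 executors, 30 cores each); the executor memory value and application name are placeholders, and whether YARN actually grants 30 vcores per container also depends on the cluster's YARN resource calculator settings:

from pyspark.sql import SparkSession

# 6 instances and 30 cores come from the question; executor memory is a placeholder.
spark = (
    SparkSession.builder
    .appName("ephemeral-cluster-job")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "30")
    .config("spark.executor.memory", "20g")
    .getOrCreate()
)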