google-cloud-dataproc

Understanding GCP Dataproc billing and how it is affected by labels

Submitted by 江枫思渺然 on 2019-12-24 00:59:51
Question: I'm trying to make sure I have a clear understanding of how my organisation gets billed for Google Cloud Platform Dataproc. We have exported our billing history to BigQuery so that we can analyse it. This morning we had two Dataproc clusters running, and the screenshot below shows a subset of the billing history for those two clusters. I have filtered on labels.key = "goog-dataproc-cluster-uuid" or labels.key = "goog-dataproc-cluster-name" or labels.key = "goog-dataproc-location". Here is a
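
For reference, a minimal sketch of querying a billing export like this for Dataproc label costs with the Python BigQuery client. The project, dataset and table names are placeholders, and it assumes the standard billing export schema in which labels is a repeated key/value record:

from google.cloud import bigquery

# Placeholder: the fully-qualified billing export table id is hypothetical.
BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX"

query = f"""
SELECT
  l.value AS cluster_name,
  SUM(cost) AS total_cost
FROM `{BILLING_TABLE}`, UNNEST(labels) AS l
WHERE l.key = 'goog-dataproc-cluster-name'
GROUP BY cluster_name
ORDER BY total_cost DESC
"""

client = bigquery.Client()
for row in client.query(query).result():
    print(row.cluster_name, row.total_cost)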

How to get jobId that was submitted using Dataproc Workflow Template

Submitted by 爱⌒轻易说出口 on 2019-12-24 00:53:18
Question: I have submitted a Hive job using a Dataproc Workflow Template with the help of the Airflow operator (DataprocWorkflowTemplateInstantiateInlineOperator), written in Python. Once the job is submitted, some name will be assigned to it as jobId (example: job0-abc2def65gh12). Since I was not able to get the jobId, I tried to pass jobId as a parameter from the REST API, which isn't working. Can I fetch the jobId or, if that's not possible, can I pass jobId as a parameter? Answer 1: The JobId will be available as part of metadata
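
To illustrate the metadata route the answer points at, here is a hedged sketch using the google-cloud-dataproc Python client directly rather than the Airflow operator. The project, region, cluster label and Hive query are assumptions; it relies on the instantiate operation's metadata being a WorkflowMetadata message whose graph nodes carry the job IDs Dataproc generated:

from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # hypothetical project/region

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Minimal inline template: run one Hive step on an existing labelled cluster.
template = {
    "placement": {
        "cluster_selector": {"cluster_labels": {"goog-dataproc-cluster-name": "my-cluster"}}
    },
    "jobs": [
        {"step_id": "hive-step", "hive_job": {"query_list": {"queries": ["SHOW DATABASES"]}}}
    ],
}

operation = client.instantiate_inline_workflow_template(
    request={"parent": f"projects/{project_id}/regions/{region}", "template": template}
)
operation.result()  # wait for the workflow to finish

# The operation metadata describes the workflow graph; each node should expose
# the jobId Dataproc assigned (e.g. job0-abc2def65gh12).
for node in operation.metadata.graph.nodes:
    print(node.step_id, node.job_id)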

Google Cloud Dataproc drop BigQuery table not working

Submitted by ぐ巨炮叔叔 on 2019-12-23 22:27:32
Question: Hi, I tried to delete a table from BigQuery using the Java client library in Dataproc. I started spark-shell as below:

spark-shell --packages com.google.cloud:google-cloud-bigquery:1.59.0

and imported the following dependencies:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google
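
The snippet above shows the Scala/Java route and is cut off; for comparison, a minimal sketch of the same drop using the Python BigQuery client (project, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"  # placeholder fully-qualified table id

# not_found_ok avoids raising if the table has already been deleted.
client.delete_table(table_id, not_found_ok=True)
print(f"Deleted table {table_id}")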

How can I perform data lineage in GCP?

Submitted by 只谈情不闲聊 on 2019-12-23 15:15:08
Question: When we build the data lake with GCP Cloud Storage, and data processing with cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP? Thanks. Answer 1: Google Cloud Platform doesn't have a serverless data lineage offering. Instead, you may want to install Apache Atlas on Google Cloud Dataproc and use it for data lineage. Answer 2: If data lineage is important to you, you will find yourself wanting an Enterprise Data Cloud. Cloudera is the main supplier in this space,

ModuleNotFoundError because PySpark serializer is not able to locate library folder

Submitted by 房东的猫 on 2019-12-23 10:26:17
Question: I have the following folder structure:

- libfolder
  - lib1.py
  - lib2.py
- main.py

main.py calls libfolder.lib1.py, which then calls libfolder.lib2.py and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'libfolder'

I have zipped the folder into xyz.zip and run the
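
A sketch of one common way to make such a zipped package importable on the executors, assuming xyz.zip contains libfolder/ at its top level; the function name and submission details are assumptions, not part of the question:

# main.py
from pyspark import SparkContext

sc = SparkContext()

# Ship the zip to the driver and every executor. It must contain libfolder/ at
# its root, i.e. be zipped from the parent directory: zip -r xyz.zip libfolder
sc.addPyFile("xyz.zip")

from libfolder import lib1  # import after addPyFile so workers can resolve it

rdd = sc.parallelize(range(10))
print(rdd.map(lib1.some_function).collect())  # lib1.some_function is hypothetical

# Alternatively, attach the zip at submit time, e.g.:
#   gcloud dataproc jobs submit pyspark main.py --cluster <name> --py-files xyz.zip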

Google Dataproc - disconnect with executors often

Submitted by 三世轮回 on 2019-12-23 03:23:07
Question: I am using Dataproc to run Spark commands over a cluster using spark-shell. I frequently get error/warning messages indicating that I lose the connection with my executors. The messages look like this:

[Stage 6:> (0 + 2) / 2]
16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka

Why does Spark running in Google Dataproc store temporary files on external storage (GCS) instead of local disk or HDFS while using saveAsTextFile?

Submitted by 对着背影说爱祢 on 2019-12-23 02:55:23
Question: I have run the following PySpark code:

from pyspark import SparkContext

sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)

The job finished successfully. However, during the job execution Spark created many temporary blobs in the following path: gs://bucket-name/output_blob_path/_temporary/0/
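
Those _temporary/ blobs come from the Hadoop FileOutputCommitter that saveAsTextFile uses: task attempts write under <output path>/_temporary/ and the files are renamed into place on commit, so the temporary data lives under the GCS output path rather than on local disk or HDFS. A hedged sketch of opting into committer algorithm v2, which commits (renames) each task's output as soon as the task finishes; the bucket names mirror the question, the rest is an assumption:

from pyspark import SparkConf, SparkContext

# Assumption: FileOutputCommitter algorithm v2. This changes when the
# _temporary/ objects are moved to their final names, not where they are written.
conf = SparkConf().set(
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"
)
sc = SparkContext(conf=conf)

data = sc.textFile("gs://bucket-name/input_blob_path")
data.saveAsTextFile("gs://bucket-name/output_blob_path")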

ImportError: No module named numpy - Google Cloud Dataproc when using Jupyter Notebook

Submitted by 一世执手 on 2019-12-22 10:36:42
Question: When starting Jupyter Notebook on Google Dataproc, importing modules fails. I have tried to install the modules using different commands. Some examples:

import os
os.system("sudo apt-get install python-numpy")
os.system("sudo pip install numpy")         # after having installed pip
os.system("sudo pip install python-numpy")  # after having installed pip
import numpy

None of the above examples work; they all return an import error (screenshot omitted). When using the command line I am able to install
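
A frequent cause of this pattern is that sudo pip installs into a different Python than the one backing the Jupyter kernel. A hedged sketch of installing into the kernel's own interpreter from a notebook cell; only the package name comes from the question, the rest is an assumption, and it only covers the master node (worker nodes would still need an initialization action or similar):

import subprocess
import sys

# Install numpy into the exact interpreter this notebook kernel is running,
# rather than whichever python/pip "sudo pip" happens to resolve to.
subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "numpy"])

import numpy
print(numpy.__version__)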

Dataproc set number of vcores per executor container

Submitted by 瘦欲@ on 2019-12-20 04:24:26
Question: I'm building a Spark application which will run on Dataproc. I plan to use ephemeral clusters and spin a new one up for each execution of the application. So I basically want my job to eat up as much of the cluster resources as possible, and I have a very good idea of the requirements. I've been playing around with turning off dynamic allocation and setting up the executor instances and cores myself. Currently I'm using 6 instances and 30 cores a pop. Perhaps it's more of a YARN question,
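
For context, a sketch of the configuration described above (dynamic allocation off, 6 executors, 30 cores each); the executor memory value and application name are placeholders, and whether YARN actually grants 30 vcores per container also depends on the cluster's YARN resource calculator settings:

from pyspark.sql import SparkSession

# 6 instances and 30 cores come from the question; executor memory is a placeholder.
spark = (
    SparkSession.builder
    .appName("ephemeral-cluster-job")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "30")
    .config("spark.executor.memory", "20g")
    .getOrCreate()
)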