google-cloud-dataproc

Invalid region error when using google-cloud-python API to access Dataproc

Submitted by 江枫思渺然 on 2019-12-11 02:56:34
Question: I am trying to create a cluster in Dataproc using the google-cloud-python library; however, when I set region = 'us-central1' I get the exception below:

google.api_core.exceptions.InvalidArgument: 400 Region 'us-central1' is invalid. Please see https://cloud.google.com/dataproc/docs/concepts/regional-endpoints for additional information on regional endpoints

Code (based on the example):

#!/usr/bin/python
from google.cloud import dataproc_v1
client = dataproc_v1.ClusterControllerClient()
project_id =
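The error points at the regional-endpoints page because, for any region other than 'global', the client has to be constructed against that region's endpoint. Below is a minimal sketch assuming a recent google-cloud-dataproc release; the project, cluster name, and empty config are illustrative placeholders.

    from google.cloud import dataproc_v1

    region = "us-central1"
    # Non-global regions require the matching regional API endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",      # placeholder
        "cluster_name": "my-cluster",    # placeholder
        "config": {},                    # default cluster shape
    }
    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until creation finishes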

What is the most elegant and robust way on dataproc to adjust log levels for Spark?

Submitted by 安稳与你 on 2019-12-11 00:45:46
Question: As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to change the corresponding log4j.properties. However, on Dataproc Spark runs on YARN, so we have to adjust the global configuration and not /usr/lib/spark/conf.

Several suggestions: On Dataproc we have several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying --properties
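Not a cluster-wide answer, but one knob that works regardless of how the cluster was created is the per-application override; a minimal PySpark sketch:

    from pyspark import SparkContext

    sc = SparkContext(appName="quiet-job")
    # Overrides the log4j root level for this application only;
    # accepted values include ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF.
    sc.setLogLevel("WARN")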

Kinesis Stream with Empty Records in Google Dataproc with Spark 1.6.1 Hadoop 2.7.2

Submitted by 陌路散爱 on 2019-12-10 20:14:38
Question: I am trying to connect to an Amazon Kinesis stream from Google Dataproc but am only getting empty RDDs.

Command:

spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX

Detailed log: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c

More details: Spark 1.6.1, Hadoop 2.7.2. Assembly used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar. Surprisingly, that works
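For reference, a minimal sketch of the Spark 1.6-era PySpark Kinesis receiver; the app name, stream name, endpoint, and region are placeholders, and it assumes the spark-streaming-kinesis-asl package version matches the cluster's Spark version.

    from pyspark import SparkContext, StorageLevel
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    sc = SparkContext(appName="demo_kinesis_streaming")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Stream name, endpoint, and region are placeholders.
    lines = KinesisUtils.createStream(
        ssc, "demo_kinesis_app", "my-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 10,
        StorageLevel.MEMORY_AND_DISK_2,
        awsAccessKeyId="XXXXX", awsSecretKey="XXXX")

    lines.pprint()  # print a sample of each batch to the driver log
    ssc.start()
    ssc.awaitTermination()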

PySpark Yarn Application fails on groupBy

Submitted by 非 Y 不嫁゛ on 2019-12-10 18:48:47
Question: I'm trying to run a job in YARN mode that processes a large amount of data (2 TB) read from Google Cloud Storage. The pipeline can be summarized like this:

sc.textFile("gs://path/*.json")\
  .map(lambda row: json.loads(row))\
  .map(toKvPair)\
  .groupByKey().take(10)
[...] later processing on collections and output to GCS.

The computation over the elements of the collections is not associative; each element is sorted within its keyspace. When run on 10 GB of data, it completes without any issue.
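A sketch of the usual first mitigation when groupByKey over ~2 TB overwhelms the default shuffle layout: request far more (and therefore smaller) partitions so no single task has to hold an oversized group. The partition counts are illustrative, and toKvPair here is only a placeholder for the question's own function.

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="groupby-2tb")

    def toKvPair(obj):
        # Placeholder for the key extraction used in the question.
        return obj.get("key"), obj

    result = (sc.textFile("gs://path/*.json", minPartitions=2000)  # more, smaller input splits
                .map(json.loads)
                .map(toKvPair)
                .groupByKey(numPartitions=2000)                    # spread the shuffle wider
                .take(10))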

Container killed by YARN for exceeding memory limits

Submitted by 好久不见. on 2019-12-10 18:22:17
Question: I am creating a cluster in Google Dataproc with the following characteristics:

Master: Standard (1 master, N workers), machine type n1-highmem-2 (2 vCPU, 13.0 GB memory), primary disk 250 GB
Workers: 2 nodes, machine type n1-highmem-2 (2 vCPU, 13.0 GB memory), primary disk size 250 GB

I am also adding, as an initialization action, the .sh file from this repository in order to use Zeppelin. The code that I use works fine with some data, but if I use a bigger amount I get the following error: Container killed
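A sketch of the Spark properties that are usually raised for this error under YARN; the values are illustrative rather than tuned for n1-highmem-2, and the memoryOverhead property name shown is the Spark 1.x/2.x spelling.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")                  # heap per executor
            .set("spark.yarn.executor.memoryOverhead", "1024")   # MB of off-heap headroom
            .set("spark.executor.cores", "1"))                   # fewer concurrent tasks per executor
    sc = SparkContext(conf=conf)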

I am not finding evidence of NodeInitializationAction for Dataproc having run

Submitted by 故事扮演 on 2019-12-10 16:06:22
Question: I am specifying a NodeInitializationAction for Dataproc as follows:

ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(...);
clusterConfig.setMasterConfig(...);
clusterConfig.setWorkerConfig(...);
List<NodeInitializationAction> initActions = new ArrayList<>();
NodeInitializationAction action = new NodeInitializationAction();
action.setExecutableFile("gs://mybucket/myExecutableFile");
initActions.add(action);
clusterConfig.setInitializationActions(initActions)
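For comparison, the equivalent field in the google-cloud-dataproc Python client is initialization_actions on the cluster config; a small sketch with placeholder project and cluster names (the bucket path mirrors the question):

    from google.cloud import dataproc_v1

    cluster = {
        "project_id": "my-project",      # placeholder
        "cluster_name": "my-cluster",    # placeholder
        "config": {
            "initialization_actions": [
                {"executable_file": "gs://mybucket/myExecutableFile"}
            ],
        },
    }
    # Passed to ClusterControllerClient.create_cluster as in the
    # regional-endpoint sketch shown earlier in this listing.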

Read from BigQuery into Spark in an efficient way?

Submitted by 好久不见. on 2019-12-10 13:38:37
Question: When using the BigQuery connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel, but with a big table the copy stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading from GCS in parallel). Is the copying stage affected by the Spark cluster size, or does it take a fixed amount of time?

Answer 1:
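One option worth noting, as a sketch: the newer spark-bigquery-connector reads through the BigQuery Storage API and skips the GCS staging step entirely. The table name is illustrative, and the connector jar is assumed to be available on the cluster (e.g. via --jars or --packages).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-direct-read").getOrCreate()

    # Reads directly over the BigQuery Storage API; no intermediate GCS export.
    df = (spark.read.format("bigquery")
            .option("table", "my-project.my_dataset.my_table")
            .load())
    df.show(10)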

How can I create a Dataproc cluster, run a job, and delete the cluster from a Cloud Function

Submitted by 半世苍凉 on 2019-12-10 10:11:24
Question: I would like to start a Dataproc job in response to log files arriving in a GCS bucket. I also do not want to keep a persistent cluster running, as new log files arrive only a few times a day and the cluster would be idle most of the time.

Answer 1: I can use the WorkflowTemplate API to manage the cluster lifecycle for me. With Dataproc Workflows I don't have to poll for the cluster to be created or the job to be submitted, or do any error handling. Here's my Cloud Function, set to trigger on the Cloud Storage bucket
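A sketch of that pattern with the google-cloud-dataproc Python client in a background Cloud Function (not the answer's original function); the project, region, cluster name, and job URI are placeholders, and the request shape assumes a recent library release.

    from google.cloud import dataproc_v1

    def on_log_file(event, context):
        """Triggered by a new object in the GCS bucket; runs an ephemeral workflow."""
        region = "us-central1"  # placeholder
        client = dataproc_v1.WorkflowTemplateServiceClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

        template = {
            "placement": {
                "managed_cluster": {"cluster_name": "ephemeral-cluster", "config": {}}
            },
            "jobs": [{
                "step_id": "process-logs",
                "pyspark_job": {
                    "main_python_file_uri": "gs://my-bucket/process_logs.py",  # placeholder
                    "args": [f"gs://{event['bucket']}/{event['name']}"],
                },
            }],
        }

        parent = f"projects/my-project/regions/{region}"  # placeholder project
        # The workflow creates the cluster, runs the job, then deletes the cluster.
        client.instantiate_inline_workflow_template(
            request={"parent": parent, "template": template})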

Google Dataproc - disconnect with executors often

Submitted by 谁都会走 on 2019-12-08 14:13:30
I am using Dataproc to run Spark commands over a cluster using spark-shell. I frequently get error/warning messages indicating that I lose connection with my executors. The messages look like this:

[Stage 6:> (0 + 2) / 2]16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is

How do I connect to a Dataproc cluster with Jupyter notebooks from Cloud Shell

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-08 04:05:51
Question: I have seen the instructions at https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook for setting up Jupyter notebooks with Dataproc, but I can't figure out how to alter the process to use Cloud Shell instead of creating an SSH tunnel locally. I have been able to connect to a Datalab notebook by running datalab connect vmname from Cloud Shell and then using the preview function. I would like to do something similar, but with Jupyter notebooks and a Dataproc cluster.