google-cloud-dataproc

Google Dataproc timing out and killing executors

一曲冷凌霜 submitted on 2019-12-12 03:34:19
Question: I have a Google Dataproc Spark cluster set up with one master node and 16 worker nodes. The master has 2 CPUs and 13 GB of memory, and each worker has 2 CPUs and 3.5 GB of memory. I am running a rather network-intensive job where I have an array of 16 objects, and I partition this array into 16 partitions so each worker gets one object. The objects make about 2.5 million web requests and aggregate them to send back to the master. Each request is a Solr response and is less than 50 KB. One field (an…
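The partitioning pattern described above can be expressed directly in PySpark. A minimal sketch, assuming the 16 work descriptions fit in a plain Python list; the fetch/aggregate logic (`fetch_and_aggregate`) is a hypothetical placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="fan-out-requests")

# Stand-ins for the 16 objects described in the question.
objects = list(range(16))

def fetch_and_aggregate(obj):
    # Hypothetical placeholder: issue the web requests for this object
    # and reduce the Solr responses to a small summary.
    return {"object": obj, "aggregated_count": 0}

results = (
    sc.parallelize(objects, numSlices=16)   # 16 partitions -> one object per worker
      .map(fetch_and_aggregate)
      .collect()                            # only the small aggregates return to the driver
)
print(results)
```

With one partition per worker, a single slow or stalled partition holds up the whole stage, which is consistent with the executor-timeout symptom in the title.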

Insufficient number of DataNodes reporting when creating dataproc cluster

泄露秘密 submitted on 2019-12-11 16:58:16
Question: I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the Dataproc cluster: gcloud dataproc clusters create cluster-538f --image-version 1.2 \ --bucket dataproc_bucket_test --subnet default --zone asia-south1-b \ --master-machine-type n1-standard-1 --master-boot-disk-size 500 \ --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \ --scopes 'https://www.googleapis.com/auth…
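A commonly suggested workaround is to leave HDFS as the cluster's default filesystem (so the DataNodes can register at startup) and address Cloud Storage explicitly with gs:// URIs. A minimal PySpark sketch, assuming hypothetical paths inside the bucket named above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-explicit-paths").getOrCreate()

# Read from and write to Cloud Storage directly via gs:// URIs instead of
# making gs:// the default FS for the whole cluster.
df = spark.read.csv("gs://dataproc_bucket_test/input/data.csv", header=True)
df.write.parquet("gs://dataproc_bucket_test/output/data_parquet")
```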

Unable to create cluster on Dataproc after deleting default service account

坚强是说给别人听的谎言 submitted on 2019-12-11 12:46:41
Question: I had mistakenly deleted the default service account for my project: {project_id}-compute@developer.gserviceaccount.com. Now whenever I try to create a cluster on Dataproc I get the following error: The resource '{project_id}-compute@developer.gserviceaccount.com' of type 'serviceAccount' was not found. Is there an easy way to resolve this issue without losing any data for the project? Answer 1: To clarify for anyone else who encounters this issue, this error is caused by actually deleting the…

Passing typesafe config conf files to DataProcSparkOperator

℡╲_俬逩灬. submitted on 2019-12-11 12:42:37
Question: I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties. I rely on .conf files (Typesafe config files) to pass arguments to my Spark jobs. I am using the following Python code for the Airflow Dataproc operator: t3 = dataproc_operator.DataProcSparkOperator( task_id='execute_spark_job_cluster_test', dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar', cluster_name='cluster', main_class='com…
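One commonly suggested approach is to ship the .conf file with the job through the operator's `files` parameter and point Typesafe Config at it with -Dconfig.file. A minimal sketch against the contrib DataProcSparkOperator; the DAG, main class, and GCS location of the .conf file are hypothetical:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators import dataproc_operator

dag = DAG("spark_conf_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

t3 = dataproc_operator.DataProcSparkOperator(
    task_id="execute_spark_job_cluster_test",
    dataproc_spark_jars=["gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar"],
    cluster_name="cluster",
    main_class="com.example.Main",                   # hypothetical; truncated in the question
    # Stage the Typesafe .conf file into the job's working directory.
    files=["gs://config-bucket/application.conf"],   # hypothetical GCS path
    # Tell Typesafe Config to load the staged file on the driver JVM.
    dataproc_spark_properties={
        "spark.driver.extraJavaOptions": "-Dconfig.file=application.conf",
    },
    dag=dag,
)
```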

In Dataproc how can I access the Spark and Hadoop job history?

僤鯓⒐⒋嵵緔 submitted on 2019-12-11 11:56:48
Question: In Google Cloud Dataproc, how can I access the Spark or Hadoop job history servers? I want to be able to look at my job history details when I run jobs. Answer 1: To do this, you will need to create an SSH tunnel to the cluster and then use a SOCKS proxy with your browser. This is because, while the web interfaces are running on the cluster, firewall rules prevent anyone from connecting to them (for security). To access the Spark or Hadoop job history server, you will first need to create an SSH…
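A minimal sketch of the tunnel step, wrapped in Python for consistency with the other examples on this page; the instance name and zone are hypothetical, and the browser must then be configured to use the SOCKS proxy on localhost:1080:

```python
import subprocess

MASTER = "my-cluster-m"    # hypothetical Dataproc master instance name
ZONE = "us-central1-a"     # hypothetical zone

# Equivalent to: gcloud compute ssh my-cluster-m --zone us-central1-a -- -D 1080 -N
# -D 1080 starts a local SOCKS proxy; -N keeps the connection open without a shell.
tunnel = subprocess.Popen([
    "gcloud", "compute", "ssh", MASTER,
    "--zone", ZONE,
    "--", "-D", "1080", "-N",
])

print("SOCKS proxy listening on localhost:1080.")
print("Point a browser at it, then open http://%s:18080 (Spark history)" % MASTER)
print("or http://%s:19888 (MapReduce job history)." % MASTER)
```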

Dataproc Spark returns java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II) when accessing Hive

这一生的挚爱 submitted on 2019-12-11 09:12:12
Question: I'm moving from Dataproc 1.2 to 1.3. When I created a new Spark cluster on Dataproc using image version 1.3, I got HiveMetaException: Metastore schema version is not compatible. Hive Version: 2.3.0, Database Schema Version: 2.1.0 because of the database schema incompatibility. So I SSH-ed into the Dataproc master instance, ran schematool -dbType mysql -upgradeSchemaFrom 2.1.0, and everything worked as expected. I then recreated a new Spark cluster to make sure it doesn't throw this exception again.

Dataproc: Jupyter pyspark notebook unable to import graphframes package

时光毁灭记忆、已成空白 submitted on 2019-12-11 07:58:33
Question: In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook. PySpark kernel config: PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'. The following is the command used to initialize the cluster: gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization…
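One commonly suggested workaround inside the notebook itself is to request the package before any SparkContext exists, so the resolved jar (which also carries the Python module) ends up on the path. A minimal sketch, assuming the kernel has not yet started a context; it uses the standard Spark Packages coordinate for graphframes rather than a com.databricks-prefixed one:

```python
import os

# Must be set before pyspark creates the SparkContext in this kernel.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="graphframes-check")
sqlContext = SQLContext(sc)

from graphframes import GraphFrame  # should resolve once the package is on the path

vertices = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = sqlContext.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
print(g.inDegrees.collect())
```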

Reading a GCS file using a standalone on-premise Spark Java program

▼魔方 西西 submitted on 2019-12-11 07:24:34
Question: I am trying to read a file stored in a GCS bucket using an on-premise standalone Spark job in Java. I have configured the SparkContext with all the necessary Spark configuration. I am getting the following error: at com.vr.HadoopSample.main(HadoopSample.java:78) java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount…
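The error indicates the GCS connector is falling back to the GCE metadata server, which only exists on machines inside Google Cloud; off-GCP, credentials have to come from a service-account key file instead. A minimal PySpark sketch of the equivalent Hadoop configuration (the same keys apply from Java via the Hadoop Configuration object); the project id, key path, and bucket are hypothetical, and the gcs-connector jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-standalone")
    # Register the Cloud Storage connector as the handler for gs:// URIs.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.gs.project.id", "my-project")              # hypothetical
    # Authenticate with a service-account JSON key instead of the metadata server.
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")                                        # hypothetical
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some-file.txt")  # hypothetical bucket/object
df.show(5)
```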

Can't connect to Bigtable to scan HTable data due to hardcoded managed=true in hbase client jars

时光毁灭记忆、已成空白 submitted on 2019-12-11 06:37:43
Question: I'm working on a custom load function to load data from Bigtable using Pig on Dataproc. I compile my Java code using the following list of jar files I grabbed from Dataproc. When I run the following Pig script, it fails when it tries to establish a connection with Bigtable. The error message is: Bigtable does not support managed connections. Questions: Is there a workaround for this problem? Is this a known issue, and is there a plan to fix or adjust it? Is there a different way of implementing…

GCP: You do not have sufficient permissions to SSH into this instance

心已入冬 submitted on 2019-12-11 04:25:52
Question: I have a (non-admin) account on one GCP project. When I start the Dataproc cluster, GCP spins up 3 VMs. When I try to access one of the VMs via SSH (in the browser), I get the following error: I tried to add the recommended permissions, but I cannot add the iam.serviceAccounts.actAs permission. Any idea how to solve this? I read through the GCP documentation, but I just cannot find a solution for this. I have the following roles associated with my account: Answer 1: In the end, we managed to solve it by…