google-cloud-dataproc

Google Dataproc timing out and killing executors

一曲冷凌霜 submitted on 2019-12-12 03:34:19
Question: I have a Google Dataproc Spark cluster set up with one master node and 16 worker nodes. The master has 2 CPUs and 13 GB of memory, and each worker has 2 CPUs and 3.5 GB of memory. I am running a rather network-intensive job where I have an array of 16 objects, and I partition this array into 16 partitions so each worker gets one object. The objects make about 2.5 million web requests and aggregate them to send back to the master. Each request is a Solr response and is less than 50 KB. One field (an…
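The partitioning pattern described above can be expressed directly in PySpark. A minimal sketch, assuming the 16 work descriptions fit in a plain Python list; the fetch/aggregate logic (`fetch_and_aggregate`) is a hypothetical placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="fan-out-requests")

# Stand-ins for the 16 objects described in the question.
objects = list(range(16))

def fetch_and_aggregate(obj):
    # Hypothetical placeholder: issue the web requests for this object
    # and reduce the Solr responses to a small summary.
    return {"object": obj, "aggregated_count": 0}

results = (
    sc.parallelize(objects, numSlices=16)   # 16 partitions -> one object per worker
      .map(fetch_and_aggregate)
      .collect()                            # only the small aggregates return to the driver
)
print(results)
```

With one partition per worker, a single slow or stalled partition holds up the whole stage, which is consistent with the executor-timeout symptom in the title.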

Insufficient number of DataNodes reporting when creating dataproc cluster

泄露秘密 submitted on 2019-12-11 16:58:16
Question: I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the Dataproc cluster: gcloud dataproc clusters create cluster-538f --image-version 1.2 \ --bucket dataproc_bucket_test --subnet default --zone asia-south1-b \ --master-machine-type n1-standard-1 --master-boot-disk-size 500 \ --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \ --scopes 'https://www.googleapis.com/auth…
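A commonly suggested workaround is to leave HDFS as the cluster's default filesystem (so the DataNodes can register at startup) and address Cloud Storage explicitly with gs:// URIs. A minimal PySpark sketch, assuming hypothetical paths inside the bucket named above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-explicit-paths").getOrCreate()

# Read from and write to Cloud Storage directly via gs:// URIs instead of
# making gs:// the default FS for the whole cluster.
df = spark.read.csv("gs://dataproc_bucket_test/input/data.csv", header=True)
df.write.parquet("gs://dataproc_bucket_test/output/data_parquet")
```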

Unable to create cluster on Dataproc after deleting default service account

坚强是说给别人听的谎言 submitted on 2019-12-11 12:46:41
Question: I had mistakenly deleted the default service account for my project: {project_id}-compute@developer.gserviceaccount.com. Now whenever I try to create a cluster on Dataproc I get the following error: The resource '{project_id}-compute@developer.gserviceaccount.com' of type 'serviceAccount' was not found. Is there an easy way to resolve this issue without losing any data for the project? Answer 1: To clarify for anyone else who encounters this issue, this error is caused by actually deleting the…

Passing typesafe config conf files to DataProcSparkOperator

℡╲_俬逩灬. submitted on 2019-12-11 12:42:37
Question: I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties. I rely on .conf files (Typesafe config files) to pass arguments to my Spark jobs. I am using the following Python code for the Airflow Dataproc operator: t3 = dataproc_operator.DataProcSparkOperator( task_id='execute_spark_job_cluster_test', dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar', cluster_name='cluster', main_class='com…
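One commonly suggested approach is to ship the .conf file with the job through the operator's `files` parameter and point Typesafe Config at it with -Dconfig.file. A minimal sketch against the contrib DataProcSparkOperator; the DAG, main class, and GCS location of the .conf file are hypothetical:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators import dataproc_operator

dag = DAG("spark_conf_example", start_date=datetime(2019, 1, 1), schedule_interval=None)

t3 = dataproc_operator.DataProcSparkOperator(
    task_id="execute_spark_job_cluster_test",
    dataproc_spark_jars=["gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar"],
    cluster_name="cluster",
    main_class="com.example.Main",                   # hypothetical; truncated in the question
    # Stage the Typesafe .conf file into the job's working directory.
    files=["gs://config-bucket/application.conf"],   # hypothetical GCS path
    # Tell Typesafe Config to load the staged file on the driver JVM.
    dataproc_spark_properties={
        "spark.driver.extraJavaOptions": "-Dconfig.file=application.conf",
    },
    dag=dag,
)
```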

In Dataproc how can I access the Spark and Hadoop job history?

僤鯓⒐⒋嵵緔 submitted on 2019-12-11 11:56:48
Question: In Google Cloud Dataproc, how can I access the Spark or Hadoop job history servers? I want to be able to look at my job history details when I run jobs. Answer 1: To do this, you will need to create an SSH tunnel to the cluster and then use a SOCKS proxy with your browser. This is because, while the web interfaces are running on the cluster, firewall rules prevent anyone from connecting to them (for security). To access the Spark or Hadoop job history server, you will first need to create an SSH…
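A minimal sketch of the tunnel step, wrapped in Python for consistency with the other examples on this page; the instance name and zone are hypothetical, and the browser must then be configured to use the SOCKS proxy on localhost:1080:

```python
import subprocess

MASTER = "my-cluster-m"    # hypothetical Dataproc master instance name
ZONE = "us-central1-a"     # hypothetical zone

# Equivalent to: gcloud compute ssh my-cluster-m --zone us-central1-a -- -D 1080 -N
# -D 1080 starts a local SOCKS proxy; -N keeps the connection open without a shell.
tunnel = subprocess.Popen([
    "gcloud", "compute", "ssh", MASTER,
    "--zone", ZONE,
    "--", "-D", "1080", "-N",
])

print("SOCKS proxy listening on localhost:1080.")
print("Point a browser at it, then open http://%s:18080 (Spark history)" % MASTER)
print("or http://%s:19888 (MapReduce job history)." % MASTER)
```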

Dataproc Spark returns java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II) when accessing Hive

这一生的挚爱 submitted on 2019-12-11 09:12:12
Question: I'm moving from Dataproc 1.2 to 1.3. When I created a new Spark cluster on Dataproc using image version 1.3, I got HiveMetaException: Metastore schema version is not compatible. Hive Version: 2.3.0, Database Schema Version: 2.1.0 because of the database schema incompatibility. So I SSH-ed into the Dataproc master instance, ran schematool -dbType mysql -upgradeSchemaFrom 2.1.0, and everything worked as expected. I then recreated a new Spark cluster to make sure it doesn't throw this exception again.

Dataproc: Jupyter pyspark notebook unable to import graphframes package

时光毁灭记忆、已成空白 submitted on 2019-12-11 07:58:33
Question: In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook. PySpark kernel config: PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'. The following is the command used to initialize the cluster: gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization…
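One commonly suggested workaround inside the notebook itself is to request the package before any SparkContext exists, so the resolved jar (which also carries the Python module) ends up on the path. A minimal sketch, assuming the kernel has not yet started a context; it uses the standard Spark Packages coordinate for graphframes rather than a com.databricks-prefixed one:

```python
import os

# Must be set before pyspark creates the SparkContext in this kernel.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="graphframes-check")
sqlContext = SQLContext(sc)

from graphframes import GraphFrame  # should resolve once the package is on the path

vertices = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = sqlContext.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
print(g.inDegrees.collect())
```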

Reading a GCS file using a standalone on-premise Spark Java program

▼魔方 西西 submitted on 2019-12-11 07:24:34
Question: I am trying to read a file stored in a GCS bucket using an on-premise standalone Spark job in Java. I have configured the SparkContext with all the necessary Spark configuration. I am getting the following error: at com.vr.HadoopSample.main(HadoopSample.java:78) java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount…
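The error indicates the GCS connector is falling back to the GCE metadata server, which only exists on machines inside Google Cloud; off-GCP, credentials have to come from a service-account key file instead. A minimal PySpark sketch of the equivalent Hadoop configuration (the same keys apply from Java via the Hadoop Configuration object); the project id, key path, and bucket are hypothetical, and the gcs-connector jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-gcs-standalone")
    # Register the Cloud Storage connector as the handler for gs:// URIs.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.gs.project.id", "my-project")              # hypothetical
    # Authenticate with a service-account JSON key instead of the metadata server.
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")                                        # hypothetical
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some-file.txt")  # hypothetical bucket/object
df.show(5)
```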

Can't connect to Bigtable to scan HTable data due to hardcoded managed=true in hbase client jars

时光毁灭记忆、已成空白 submitted on 2019-12-11 06:37:43
Question: I'm working on a custom load function to load data from Bigtable using Pig on Dataproc. I compile my Java code using the following list of jar files I grabbed from Dataproc. When I run the following Pig script, it fails when it tries to establish a connection with Bigtable. The error message is: Bigtable does not support managed connections. Questions: Is there a workaround for this problem? Is this a known issue, and is there a plan to fix or adjust it? Is there a different way of implementing…

GCP: You do not have sufficient permissions to SSH into this instance

心已入冬 submitted on 2019-12-11 04:25:52
Question: I have a (non-admin) account on one GCP project. When I start the Dataproc cluster, GCP spins up 3 VMs. When I try to access one of the VMs via SSH (in the browser), I get the following error: I tried to add the recommended permissions, but I cannot add the iam.serviceAccounts.actAs permission. Any idea how to solve this? I read through the GCP documentation, but I just cannot find a solution for this. I have the following roles associated with my account: Answer 1: In the end, we managed to solve it by…