google-cloud-dataproc

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

二次信任 submitted on 2019-12-25 04:22:28
Question: This was already discussed in a previous post; however, I'm not convinced by the answers, as the Google docs state that it is possible to create a cluster while setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, it is sometimes more convenient to set it from the command line. So I wanted to know why the following option does not work when passed to my cluster creation command: --properties core:fs.defaultFS=gs://my-bucket ? Please note I…
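For reference, a minimal sketch of setting the same cluster property programmatically with the google-cloud-dataproc Python client; the project, region, and cluster names are placeholders, the exact call shape depends on the client library version, and whether fs.defaultFS is honored at creation time is exactly what the question is about.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # Intended equivalent of --properties core:fs.defaultFS=gs://my-bucket
        "software_config": {
            "properties": {"core:fs.defaultFS": "gs://my-bucket"}
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)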

How can I use Dataproc to pull data from BigQuery that is not in the same project as my Dataproc cluster?

て烟熏妆下的殇ゞ submitted on 2019-12-25 03:17:21
Question: I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and since both the client and ourselves use GCP, it makes sense to use Dataproc to achieve this. I have read Use the BigQuery connector with Spark, which looks very useful; however, it seems to assume that the Dataproc cluster, the BigQuery dataset, and the storage bucket for the temporary BigQuery export are all in the same GCP project, which is not the case for me. I have a…
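A hedged sketch of a cross-project read using the spark-bigquery connector (a different mechanism from the Hadoop MapReduce BigQuery connector described in the linked doc); the project, dataset, and table names are placeholders, and it assumes the cluster's service account has BigQuery read access in the client's project.

from pyspark.sql import SparkSession

# Assumes the spark-bigquery connector jar is available on the cluster,
# e.g. passed with --jars at job submission.
spark = SparkSession.builder.appName("cross-project-bq-read").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "client-project.client_dataset.client_table")  # placeholder: table in the client's project
    .option("parentProject", "our-project")                         # placeholder: project billed for the read
    .load()
)
df.show(5)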

Errors for block matrix multiplication in Spark

旧巷老猫 submitted on 2019-12-24 20:30:29
Question: I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to perform the operation cmat.T * cmat. I first converted cmat to a block matrix bmat: bmat = cmat.toBlockMatrix(1000, 1000) However, I got errors when performing multiply(): mtm = bmat.transpose.multiply(bmat) Traceback (most recent call last): File "", line 1, in AttributeError: 'function' object has no attribute 'multiply' The Spark version is 2.2.0, the Scala version is 2.11.8, on Dataproc, Google Cloud…
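A small sketch of the intended operation on a toy matrix: in PySpark, BlockMatrix.transpose is a method, so the AttributeError above comes from referencing it without parentheses; the matrix entries below are placeholders standing in for cmat.

from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.appName("blockmatrix-sketch").getOrCreate()
sc = spark.sparkContext

# Toy coordinate matrix standing in for the 9M x 85K cmat in the question.
entries = sc.parallelize([MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 3.0), MatrixEntry(2, 1, 2.0)])
cmat = CoordinateMatrix(entries)

bmat = cmat.toBlockMatrix(1000, 1000)
mtm = bmat.transpose().multiply(bmat)   # note the () after transpose
print(mtm.numRows(), mtm.numCols())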

Spark on Dataproc fails with java.io.FileNotFoundException:

守給你的承諾、 submitted on 2019-12-24 19:35:48
Question: A Spark job launched in a Dataproc cluster fails with the exception below. I have tried various cluster configs, but the result is the same. I am getting this error on Dataproc image 1.2. Note: there are no preemptible workers, and there is sufficient space on the disks. However, I have noticed that there is no /hadoop/yarn/nm-local-dir/usercache/root folder at all on the worker nodes, though I can see a folder named dr.who. java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir…

Tachyon on Dataproc Master Replication Error

試著忘記壹切 submitted on 2019-12-24 17:17:03
Question: I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed. I get a replication error when writing to Tachyon from Spark. Is there any way to specify that no replication is needed? 15/10/17 08:45:21 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/tachyon/workers/1445071000001/3/8 could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s)…

Accessing Google Cloud Storage using the Hadoop FileSystem API

左心房为你撑大大i submitted on 2019-12-24 13:59:00
Question: From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the equivalent from Java using: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); FileStatus[] status = fs.listStatus(new Path("gs://mybucket/")); I get the files under root in my local HDFS instead of in gs://mybucket/, but with those…

Keep the Dataproc master node running

不想你离开。 submitted on 2019-12-24 10:18:31
Question: Is it possible to keep the master machine running in Dataproc? Every time I run a job, after a while (~1 hour) I see that the master node has stopped. It is not a real issue, since I can easily start it again, but I would like to know if there is a way to keep it awake. A possible approach that occurs to me is to set up a scheduled job on the master machine, but I want to know if there is a more official way to achieve this. Answer 1: Dataproc does not stop any cluster nodes (including the master) when they are…

Add a file to the Spark driver classpath on Dataproc

江枫思渺然 submitted on 2019-12-24 08:29:30
Question: I need to add a config file to the Spark driver classpath on Google Dataproc. I have tried to use the --files option of gcloud dataproc jobs submit spark, but this does not work. Is there a way to do it on Google Dataproc? Answer 1: In Dataproc, anything listed as a --jar will be added to the classpath, and anything listed as a --file will be made available in each Spark executor's working directory. Even though the flag is --jars, it should be safe to put non-jar entries in this list if you require the file to be…
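A small sketch of the working-directory behavior the answer describes: assuming a file named myconfig.conf (a placeholder) was passed with --files at submit time, an executor task can locate it with SparkFiles.

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-option-sketch").getOrCreate()
sc = spark.sparkContext

def read_conf(_):
    # Files shipped with --files land in each executor's working directory;
    # SparkFiles.get resolves the local path by file name.
    with open(SparkFiles.get("myconfig.conf")) as f:
        return f.read()

print(sc.parallelize([0], 1).map(read_conf).first())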

Issue with partitioning SQL table data when reading from Spark

隐身守侯 submitted on 2019-12-24 05:09:08
Question: I have written a Scala program that loads data from MS SQL Server and writes it to BigQuery. I execute this on a Spark cluster (Google Dataproc). My issue is that even though I have a cluster with 64 cores, specify the executor parameters when running the job, and partition the data I'm reading, Spark only reads data from a single executor. When I start the job I can see all the executors firing up, and on the SQL Server I can see connections from all 4 workers, but within a…
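For context, a hedged sketch of the JDBC partitioning options that control how many parallel reads Spark issues; shown in PySpark for brevity (the same options apply in the Scala API), and the connection details, column name, and bounds are placeholders rather than the asker's actual settings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

# Without partitionColumn/lowerBound/upperBound/numPartitions, a JDBC read
# runs as a single partition on a single executor.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver-host;databaseName=mydb")  # placeholder
    .option("dbtable", "dbo.my_table")                                   # placeholder
    .option("user", "spark_reader")                                      # placeholder
    .option("password", "secret")                                        # placeholder
    .option("partitionColumn", "id")   # numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "9000000")
    .option("numPartitions", "64")
    .load()
)
print(df.rdd.getNumPartitions())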

Where are the individual Dataproc Spark logs?

时间秒杀一切 submitted on 2019-12-24 05:04:59
Question: Where are the Dataproc Spark job logs located? I know there are logs from the driver under the "Logging" section, but what about the executor nodes? Also, where are the detailed steps that Spark is executing logged (I know I can see them in the Application Master)? I am attempting to debug a script that seems to hang, and Spark seems to freeze. Answer 1: The task logs are stored on each worker node under /tmp. It is possible to collect them in one place via YARN log aggregation. Set these properties…