google-cloud-dataproc

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

二次信任 submitted on 2019-12-25 04:22:28
Question: This was already discussed in a previous post; however, I'm not convinced by the answers, as the Google docs state that it is possible to create a cluster while setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, it is sometimes more convenient to set it from the command line. So I wanted to know why the following option does not work when passed to my cluster creation command: --properties core:fs.defaultFS=gs://my-bucket ? Please note I…
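For reference, a minimal sketch of setting the same cluster property programmatically with the google-cloud-dataproc Python client; the project, region, and cluster names are placeholders, the exact call shape depends on the client library version, and whether fs.defaultFS is honored at creation time is exactly what the question is about.

from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # Intended equivalent of --properties core:fs.defaultFS=gs://my-bucket
        "software_config": {
            "properties": {"core:fs.defaultFS": "gs://my-bucket"}
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)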

How can I use Dataproc to pull data from BigQuery that is not in the same project as my Dataproc cluster?

て烟熏妆下的殇ゞ submitted on 2019-12-25 03:17:21
Question: I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and since both the client and ourselves use GCP, it makes sense to use Dataproc to achieve this. I have read Use the BigQuery connector with Spark, which looks very useful; however, it seems to assume that the Dataproc cluster, the BigQuery dataset, and the storage bucket for the temporary BigQuery export are all in the same GCP project, which is not the case for me. I have a…
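A hedged sketch of a cross-project read using the spark-bigquery connector (a different mechanism from the Hadoop MapReduce BigQuery connector described in the linked doc); the project, dataset, and table names are placeholders, and it assumes the cluster's service account has BigQuery read access in the client's project.

from pyspark.sql import SparkSession

# Assumes the spark-bigquery connector jar is available on the cluster,
# e.g. passed with --jars at job submission.
spark = SparkSession.builder.appName("cross-project-bq-read").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "client-project.client_dataset.client_table")  # placeholder: table in the client's project
    .option("parentProject", "our-project")                         # placeholder: project billed for the read
    .load()
)
df.show(5)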

Errors for block matrix multiplication in Spark

旧巷老猫 submitted on 2019-12-24 20:30:29
Question: I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to perform the operation cmat.T * cmat. I first converted cmat to a block matrix bmat: bmat = cmat.toBlockMatrix(1000, 1000) However, I got errors when performing multiply(): mtm = bmat.transpose.multiply(bmat) Traceback (most recent call last): File "", line 1, in AttributeError: 'function' object has no attribute 'multiply' The Spark version is 2.2.0, the Scala version is 2.11.8, on Dataproc, Google Cloud…
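A small sketch of the intended operation on a toy matrix: in PySpark, BlockMatrix.transpose is a method, so the AttributeError above comes from referencing it without parentheses; the matrix entries below are placeholders standing in for cmat.

from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.appName("blockmatrix-sketch").getOrCreate()
sc = spark.sparkContext

# Toy coordinate matrix standing in for the 9M x 85K cmat in the question.
entries = sc.parallelize([MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 3.0), MatrixEntry(2, 1, 2.0)])
cmat = CoordinateMatrix(entries)

bmat = cmat.toBlockMatrix(1000, 1000)
mtm = bmat.transpose().multiply(bmat)   # note the () after transpose
print(mtm.numRows(), mtm.numCols())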

Spark on Dataproc fails with java.io.FileNotFoundException:

守給你的承諾、 submitted on 2019-12-24 19:35:48
Question: A Spark job launched in a Dataproc cluster fails with the exception below. I have tried various cluster configs, but the result is the same. I am getting this error on Dataproc image 1.2. Note: there are no preemptible workers, and there is sufficient space on the disks. However, I have noticed that there is no /hadoop/yarn/nm-local-dir/usercache/root folder at all on the worker nodes, though I can see a folder named dr.who. java.io.IOException: Failed to create local dir in /hadoop/yarn/nm-local-dir…

Tachyon on Dataproc Master Replication Error

試著忘記壹切 submitted on 2019-12-24 17:17:03
Question: I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed. I get a replication error when writing to Tachyon from Spark. Is there any way to specify that no replication is needed? 15/10/17 08:45:21 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/tachyon/workers/1445071000001/3/8 could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s)…

Accessing Google Cloud Storage using the Hadoop FileSystem API

左心房为你撑大大i submitted on 2019-12-24 13:59:00
Question: From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the equivalent from Java using: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); FileStatus[] status = fs.listStatus(new Path("gs://mybucket/")); I get the files under root in my local HDFS instead of in gs://mybucket/, but with those…

Keep the Dataproc master node running

不想你离开。 submitted on 2019-12-24 10:18:31
Question: Is it possible to keep the master machine running in Dataproc? Every time I run a job, after a while (~1 hour) I see that the master node has stopped. It is not a real issue, since I can easily start it again, but I would like to know if there is a way to keep it awake. A possible approach that occurs to me is to set up a scheduled job on the master machine, but I want to know if there is a more official way to achieve this. Answer 1: Dataproc does not stop any cluster nodes (including the master) when they are…

Add a file to the Spark driver classpath on Dataproc

江枫思渺然 submitted on 2019-12-24 08:29:30
Question: I need to add a config file to the Spark driver classpath on Google Dataproc. I have tried to use the --files option of gcloud dataproc jobs submit spark, but this does not work. Is there a way to do it on Google Dataproc? Answer 1: In Dataproc, anything listed as a --jar will be added to the classpath, and anything listed as a --file will be made available in each Spark executor's working directory. Even though the flag is --jars, it should be safe to put non-jar entries in this list if you require the file to be…
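A small sketch of the working-directory behavior the answer describes: assuming a file named myconfig.conf (a placeholder) was passed with --files at submit time, an executor task can locate it with SparkFiles.

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-option-sketch").getOrCreate()
sc = spark.sparkContext

def read_conf(_):
    # Files shipped with --files land in each executor's working directory;
    # SparkFiles.get resolves the local path by file name.
    with open(SparkFiles.get("myconfig.conf")) as f:
        return f.read()

print(sc.parallelize([0], 1).map(read_conf).first())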

Issue with partitioning SQL table data when reading from Spark

隐身守侯 submitted on 2019-12-24 05:09:08
Question: I have written a Scala program that loads data from MS SQL Server and writes it to BigQuery. I execute this on a Spark cluster (Google Dataproc). My issue is that even though I have a cluster with 64 cores, specify the executor parameters when running the job, and partition the data I'm reading, Spark only reads data from a single executor. When I start the job I can see all the executors firing up, and on the SQL Server I can see connections from all 4 workers, but within a…
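For context, a hedged sketch of the JDBC partitioning options that control how many parallel reads Spark issues; shown in PySpark for brevity (the same options apply in the Scala API), and the connection details, column name, and bounds are placeholders rather than the asker's actual settings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

# Without partitionColumn/lowerBound/upperBound/numPartitions, a JDBC read
# runs as a single partition on a single executor.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver-host;databaseName=mydb")  # placeholder
    .option("dbtable", "dbo.my_table")                                   # placeholder
    .option("user", "spark_reader")                                      # placeholder
    .option("password", "secret")                                        # placeholder
    .option("partitionColumn", "id")   # numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "9000000")
    .option("numPartitions", "64")
    .load()
)
print(df.rdd.getNumPartitions())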

Where are the individual Dataproc Spark logs?

时间秒杀一切 submitted on 2019-12-24 05:04:59
Question: Where are the Dataproc Spark job logs located? I know there are logs from the driver under the "Logging" section, but what about the executor nodes? Also, where are the detailed steps that Spark is executing logged (I know I can see them in the Application Master)? I am attempting to debug a script that seems to hang, and Spark seems to freeze. Answer 1: The task logs are stored on each worker node under /tmp. It is possible to collect them in one place via YARN log aggregation. Set these properties…