google-cloud-dataproc

PySpark reduceByKey causes out of memory

Submitted by 馋奶兔 on 2020-01-01 19:38:08
Question: I'm trying to run a job in YARN mode that processes a large amount of data (2TB) read from Google Cloud Storage. My pipeline works just fine with 10GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy. Here is the rest of the pipeline:

input.groupByKey()\
    [...] processing on sorted groups for each key shard
    .mapPartitions(sendPartition)\
    .map(mergeShardsbyKey)
    .reduceByKey(lambda list1, list2: list1 + list2).take(10)
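A minimal mitigation sketch, not taken from the question (input path, key extraction, and partition count are illustrative): concatenating Python lists inside reduceByKey materialises every value for a key at each merge step, so one option is to keep the records flat, raise the partition count, and let groupByKey hand back an iterable per key.

```python
# Sketch under assumptions: input path, key extraction and partition count are
# placeholders. The idea is to avoid building ever-larger intermediate lists in
# reduceByKey(lambda l1, l2: l1 + l2) and to spread the 2TB across more tasks.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-oom-sketch")

pairs = (sc.textFile("gs://my-bucket/input/")           # placeholder input
           .map(lambda line: (line.split(",")[0], line))
           .repartition(2000))                          # illustrative partition count

# groupByKey yields one lazily-consumed iterable per key instead of repeatedly
# copying concatenated Python lists at every reduce step.
grouped = pairs.groupByKey()
print(grouped.mapValues(lambda values: sum(1 for _ in values)).take(10))
```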

Spark - Adding JDBC Driver JAR to Google Dataproc

Submitted by 最后都变了- on 2019-12-30 19:57:09
Question: I am trying to write via JDBC:

df.write.jdbc("jdbc:postgresql://123.123.123.123:5432/myDatabase", "myTable", props)

The Spark docs explain that the configuration option spark.driver.extraClassPath cannot be used to add JDBC driver JARs when running in client mode (the mode Dataproc runs in), since the JVM has already been started. I tried adding the JAR path in Dataproc's submit command:

gcloud beta dataproc jobs submit spark ... --jars file:///home/bryan/org.postgresql.postgresql-9.4
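For illustration only (placeholder connection details and a toy DataFrame): since spark.driver.extraClassPath cannot take effect after the driver JVM is up, a common pattern is to ship the PostgreSQL JAR at submit time, for example through the job's --jars flag, and name the driver class explicitly in the JDBC properties. A minimal PySpark sketch of the write itself:

```python
# Sketch, assuming the PostgreSQL JDBC JAR was distributed at submit time
# (e.g. via --jars); connection details and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-sketch").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # toy data

df.write.jdbc(
    url="jdbc:postgresql://123.123.123.123:5432/myDatabase",
    table="myTable",
    mode="append",
    properties={
        "user": "bryan",                    # placeholder
        "password": "secret",               # placeholder
        "driver": "org.postgresql.Driver",  # make the driver class explicit
    },
)
```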

“No Filesystem for Scheme: gs” when running spark job locally

Submitted by 那年仲夏 on 2019-12-30 18:08:10
Question: I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I get the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs paths to be supported. One is to install the GCS connector, and the other is to have the following
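A minimal sketch of wiring the GCS connector into the Hadoop configuration when running locally; the JAR and key-file paths are placeholders, and sc._jsc is an internal handle rather than a public API:

```python
# Sketch: register the GCS connector classes so "gs://" paths resolve outside
# Dataproc. Paths are placeholders; the class names are the standard
# GCS-connector ones.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("gcs-local-sketch")
        .set("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar"))  # placeholder

sc = SparkContext(conf=conf)
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")  # placeholder

print(sc.textFile("gs://mybucket/folder").count())
```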

Spark streaming on dataproc throws FileNotFoundException

Submitted by 偶尔善良 on 2019-12-30 14:48:13
Question: When I try to submit a Spark Streaming job to a Google Dataproc cluster, I get this exception:

16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java
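Not the accepted fix, just a sketch of one way to avoid stale file:/tmp references across driver restarts: keep the streaming checkpoint directory on GCS and rebuild the context with getOrCreate. The bucket, batch interval, and socket source are placeholders.

```python
# Sketch under assumptions: checkpoint bucket, batch interval and the input
# source are placeholders; the point is that nothing the restarted driver
# needs lives under a transient local /tmp path.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "gs://my-bucket/checkpoints/skyfall"    # placeholder

def create_context():
    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, 10)                       # 10s batches, illustrative
    ssc.checkpoint(CHECKPOINT_DIR)
    lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
    lines.count().pprint()
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```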

Flink checkpoints to Google Cloud Storage

Submitted by 无人久伴 on 2019-12-30 11:05:21
Question: I am trying to configure checkpoints for Flink jobs in GCS. Everything works fine if I run a test job locally (no Docker and no cluster setup), but it fails with an error if I run it using docker-compose or a cluster setup and deploy the fat jar with the jobs in the Flink dashboard. Any thoughts on this? Thanks!

Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file
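As a sketch only (PyFlink on a recent Flink release, whereas the question almost certainly uses the Java/Scala API on an older version): once the GCS/Hadoop filesystem support is actually visible to the Flink runtime inside each container, pointing checkpoints at gs:// is just configuration. The bucket name is a placeholder.

```python
# Sketch, assuming a recent PyFlink and that the GCS/Hadoop filesystem classes
# are already on the Flink classpath of every container; the bucket is a
# placeholder. The UnsupportedFileSystemSchemeException itself indicates that
# classpath piece is missing inside the Docker/cluster images.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s, illustrative
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "gs://my-bucket/flink-checkpoints"
)

env.from_collection([1, 2, 3]).print()
env.execute("gcs-checkpoint-sketch")
```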

GCP Dataproc - configure YARN fair scheduler

Submitted by 北城以北 on 2019-12-29 09:07:58
Question: I was trying to set up a Dataproc cluster that would compute only one job (or a specified maximum number of jobs) at a time, with the rest queued. I have found this solution, How to configure monopolistic FIFO application queue in YARN?, but as I'm always creating a new cluster, I needed to automate this. I have added this to cluster creation:

"softwareConfig": {
    "properties": {
        "yarn:yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
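A sketch of automating the same property through the google-cloud-dataproc Python client rather than raw JSON; project, region, and cluster name are placeholders, and this only mirrors the softwareConfig fragment above, not the full queueing setup.

```python
# Sketch with placeholder project/region/cluster names: the scheduler property
# from the question is passed as a softwareConfig property at creation time.
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder
region = "us-central1"         # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "queued-jobs-cluster",  # placeholder
    "config": {
        "software_config": {
            "properties": {
                "yarn:yarn.resourcemanager.scheduler.class":
                    "org.apache.hadoop.yarn.server.resourcemanager"
                    ".scheduler.fair.FairScheduler",
            }
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```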

Which HBase connector for Spark 2.0 should I use?

Submitted by ぐ巨炮叔叔 on 2019-12-28 13:51:51
Question: Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. Spark 2.0 and new Dataset API support is not clear to me for the connectors I have found:

spark-hbase: https://github.com/apache/hbase/tree/master/hbase-spark
spark-hbase-connector: https://github.com/nerdammer/spark-hbase-connector
hortonworks-spark/shc: https://github.com/hortonworks-spark/shc

The project is written in Scala 2.11
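For context, a sketch of what reading through hortonworks-spark/shc looks like from PySpark, assuming the SHC JAR and a Bigtable-compatible hbase-site.xml are on the classpath at submit time; the table and column layout below are placeholders, and this is not an endorsement of one connector over the others.

```python
# Sketch: placeholder table/column catalog read through the SHC data source,
# assuming the connector jar and HBase/Bigtable client config are available
# to both driver and executors.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-sketch").getOrCreate()

catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},   # placeholder
    "rowkey": "key",
    "columns": {
        "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
        "value": {"cf": "cf1",    "col": "value", "type": "string"},
    },
})

df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .option("catalog", catalog)
      .load())
df.show(10)
```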

Request had insufficient authentication scopes [403] when creating a cluster with Google Cloud Dataproc

Submitted by 亡梦爱人 on 2019-12-25 04:59:17
Question: In Google Cloud Platform the Dataproc API is enabled. I am using the same key I use to access GCS and BigQuery to create a new cluster per this example. I get a "Request had insufficient authentication scopes" error on the following line:

Operation createOperation = service.Projects.Regions.Clusters.Create(newCluster, project, dataprocGlobalRegion).Execute();

My complete code:

public static class DataProcClient { public static void Test() { string project = ConfigurationManager.AppSettings[
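A sketch in Python rather than the question's C# (key path, project, and region are placeholders): the 403 usually points at the credential's scopes, so the service-account key can be loaded with an explicit cloud-platform scope before talking to the Dataproc API.

```python
# Sketch: placeholder key path and project; the explicit cloud-platform scope
# is the part the "insufficient authentication scopes" error is about.
from google.oauth2 import service_account
from google.cloud import dataproc_v1

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/key.json",                                       # placeholder
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = dataproc_v1.ClusterControllerClient(credentials=credentials)

# Smoke test that the scope is sufficient before attempting a create call.
for cluster in client.list_clusters(
    request={"project_id": "my-project", "region": "global"}   # placeholders
):
    print(cluster.cluster_name)
```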

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

Submitted by 落爺英雄遲暮 on 2019-12-25 04:22:33
Question: This was already the subject of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs specify that it is possible to create a cluster setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, sometimes it's more convenient to set it from the command line. So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I
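For comparison, a sketch of the equivalent request fragment through the Python client, reusing the pattern from the fair-scheduler sketch above (names are placeholders); whether a gs:// value is actually accepted for fs.defaultFS is precisely what the question is about.

```python
# Sketch only: mirrors --properties core:fs.defaultFS=gs://my-bucket as a
# softwareConfig entry in the create request; names are placeholders.
cluster = {
    "project_id": "my-project",
    "cluster_name": "defaultfs-test",
    "config": {
        "software_config": {
            "properties": {
                "core:fs.defaultFS": "gs://my-bucket",
            }
        }
    },
}
```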