google-cloud-dataproc

PySpark reduceByKey causes out of memory

Submitted by 馋奶兔 on 2020-01-01 19:38:08
Question: I'm trying to run a job in YARN mode that processes a large amount of data (2TB) read from Google Cloud Storage. My pipeline works just fine with 10GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy. Here is the rest of the pipeline:

input.groupByKey()\
    [...] processing on sorted groups for each key shard
    .mapPartitions(sendPartition)\
    .map(mergeShardsbyKey)
    .reduceByKey(lambda list1, list2: list1 + list2).take(10)
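A minimal mitigation sketch, not taken from the question (input path, key extraction, and partition count are illustrative): concatenating Python lists inside reduceByKey materialises every value for a key at each merge step, so one option is to keep the records flat, raise the partition count, and let groupByKey hand back an iterable per key.

```python
# Sketch under assumptions: input path, key extraction and partition count are
# placeholders. The idea is to avoid building ever-larger intermediate lists in
# reduceByKey(lambda l1, l2: l1 + l2) and to spread the 2TB across more tasks.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-oom-sketch")

pairs = (sc.textFile("gs://my-bucket/input/")           # placeholder input
           .map(lambda line: (line.split(",")[0], line))
           .repartition(2000))                          # illustrative partition count

# groupByKey yields one lazily-consumed iterable per key instead of repeatedly
# copying concatenated Python lists at every reduce step.
grouped = pairs.groupByKey()
print(grouped.mapValues(lambda values: sum(1 for _ in values)).take(10))
```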

Spark - Adding JDBC Driver JAR to Google Dataproc

Submitted by 最后都变了- on 2019-12-30 19:57:09
Question: I am trying to write via JDBC:

df.write.jdbc("jdbc:postgresql://123.123.123.123:5432/myDatabase", "myTable", props)

The Spark docs explain that the configuration option spark.driver.extraClassPath cannot be used to add JDBC driver JARs when running in client mode (the mode Dataproc runs in), since the JVM has already been started. I tried adding the JAR path in Dataproc's submit command:

gcloud beta dataproc jobs submit spark ... --jars file:///home/bryan/org.postgresql.postgresql-9.4
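For illustration only (placeholder connection details and a toy DataFrame): since spark.driver.extraClassPath cannot take effect after the driver JVM is up, a common pattern is to ship the PostgreSQL JAR at submit time, for example through the job's --jars flag, and name the driver class explicitly in the JDBC properties. A minimal PySpark sketch of the write itself:

```python
# Sketch, assuming the PostgreSQL JDBC JAR was distributed at submit time
# (e.g. via --jars); connection details and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-sketch").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # toy data

df.write.jdbc(
    url="jdbc:postgresql://123.123.123.123:5432/myDatabase",
    table="myTable",
    mode="append",
    properties={
        "user": "bryan",                    # placeholder
        "password": "secret",               # placeholder
        "driver": "org.postgresql.Driver",  # make the driver class explicit
    },
)
```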

“No Filesystem for Scheme: gs” when running spark job locally

Submitted by 那年仲夏 on 2019-12-30 18:08:10
Question: I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac machine, I get the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs paths to be supported. One is to install the GCS connector, and the other is to have the following
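A minimal sketch of wiring the GCS connector into the Hadoop configuration when running locally; the JAR and key-file paths are placeholders, and sc._jsc is an internal handle rather than a public API:

```python
# Sketch: register the GCS connector classes so "gs://" paths resolve outside
# Dataproc. Paths are placeholders; the class names are the standard
# GCS-connector ones.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("gcs-local-sketch")
        .set("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar"))  # placeholder

sc = SparkContext(conf=conf)
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")  # placeholder

print(sc.textFile("gs://mybucket/folder").count())
```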

Spark streaming on dataproc throws FileNotFoundException

Submitted by 偶尔善良 on 2019-12-30 14:48:13
Question: When I try to submit a Spark Streaming job to a Google Dataproc cluster, I get this exception:

16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java
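Not the accepted fix, just a sketch of one way to avoid stale file:/tmp references across driver restarts: keep the streaming checkpoint directory on GCS and rebuild the context with getOrCreate. The bucket, batch interval, and socket source are placeholders.

```python
# Sketch under assumptions: checkpoint bucket, batch interval and the input
# source are placeholders; the point is that nothing the restarted driver
# needs lives under a transient local /tmp path.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "gs://my-bucket/checkpoints/skyfall"    # placeholder

def create_context():
    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, 10)                       # 10s batches, illustrative
    ssc.checkpoint(CHECKPOINT_DIR)
    lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
    lines.count().pprint()
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```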

Flink checkpoints to Google Cloud Storage

Submitted by 无人久伴 on 2019-12-30 11:05:21
Question: I am trying to configure checkpoints for Flink jobs in GCS. Everything works fine if I run a test job locally (no Docker and no cluster setup), but it fails with an error if I run it using docker-compose or a cluster setup and deploy the fat jar with the jobs in the Flink dashboard. Any thoughts on this? Thanks!

Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file
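As a sketch only (PyFlink on a recent Flink release, whereas the question almost certainly uses the Java/Scala API on an older version): once the GCS/Hadoop filesystem support is actually visible to the Flink runtime inside each container, pointing checkpoints at gs:// is just configuration. The bucket name is a placeholder.

```python
# Sketch, assuming a recent PyFlink and that the GCS/Hadoop filesystem classes
# are already on the Flink classpath of every container; the bucket is a
# placeholder. The UnsupportedFileSystemSchemeException itself indicates that
# classpath piece is missing inside the Docker/cluster images.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s, illustrative
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "gs://my-bucket/flink-checkpoints"
)

env.from_collection([1, 2, 3]).print()
env.execute("gcs-checkpoint-sketch")
```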

GCP Dataproc - configure YARN fair scheduler

Submitted by 北城以北 on 2019-12-29 09:07:58
Question: I was trying to set up a Dataproc cluster that would compute only one job (or a specified maximum number of jobs) at a time, with the rest queued. I have found this solution, How to configure monopolistic FIFO application queue in YARN?, but as I'm always creating a new cluster, I needed to automate this. I have added this to cluster creation:

"softwareConfig": {
    "properties": {
        "yarn:yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
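A sketch of automating the same property through the google-cloud-dataproc Python client rather than raw JSON; project, region, and cluster name are placeholders, and this only mirrors the softwareConfig fragment above, not the full queueing setup.

```python
# Sketch with placeholder project/region/cluster names: the scheduler property
# from the question is passed as a softwareConfig property at creation time.
from google.cloud import dataproc_v1

project_id = "my-project"      # placeholder
region = "us-central1"         # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "queued-jobs-cluster",  # placeholder
    "config": {
        "software_config": {
            "properties": {
                "yarn:yarn.resourcemanager.scheduler.class":
                    "org.apache.hadoop.yarn.server.resourcemanager"
                    ".scheduler.fair.FairScheduler",
            }
        }
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```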

Which HBase connector for Spark 2.0 should I use?

Submitted by ぐ巨炮叔叔 on 2019-12-28 13:51:51
Question: Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. Spark 2.0 and new Dataset API support is not clear to me for the connectors I have found:

spark-hbase: https://github.com/apache/hbase/tree/master/hbase-spark
spark-hbase-connector: https://github.com/nerdammer/spark-hbase-connector
hortonworks-spark/shc: https://github.com/hortonworks-spark/shc

The project is written in Scala 2.11
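For context, a sketch of what reading through hortonworks-spark/shc looks like from PySpark, assuming the SHC JAR and a Bigtable-compatible hbase-site.xml are on the classpath at submit time; the table and column layout below are placeholders, and this is not an endorsement of one connector over the others.

```python
# Sketch: placeholder table/column catalog read through the SHC data source,
# assuming the connector jar and HBase/Bigtable client config are available
# to both driver and executors.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-sketch").getOrCreate()

catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},   # placeholder
    "rowkey": "key",
    "columns": {
        "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
        "value": {"cf": "cf1",    "col": "value", "type": "string"},
    },
})

df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .option("catalog", catalog)
      .load())
df.show(10)
```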

Request had insufficient authentication scopes [403] when creating a cluster with Google Cloud Dataproc

Submitted by 亡梦爱人 on 2019-12-25 04:59:17
Question: In Google Cloud Platform the Dataproc API is enabled. I am using the same key I use to access GCS and BigQuery to create a new cluster per this example. I get a "Request had insufficient authentication scopes" error on the following line:

Operation createOperation = service.Projects.Regions.Clusters.Create(newCluster, project, dataprocGlobalRegion).Execute();

My complete code:

public static class DataProcClient { public static void Test() { string project = ConfigurationManager.AppSettings[
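A sketch in Python rather than the question's C# (key path, project, and region are placeholders): the 403 usually points at the credential's scopes, so the service-account key can be loaded with an explicit cloud-platform scope before talking to the Dataproc API.

```python
# Sketch: placeholder key path and project; the explicit cloud-platform scope
# is the part the "insufficient authentication scopes" error is about.
from google.oauth2 import service_account
from google.cloud import dataproc_v1

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/key.json",                                       # placeholder
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = dataproc_v1.ClusterControllerClient(credentials=credentials)

# Smoke test that the scope is sufficient before attempting a create call.
for cluster in client.list_clusters(
    request={"project_id": "my-project", "region": "global"}   # placeholders
):
    print(cluster.cluster_name)
```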

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

Submitted by 落爺英雄遲暮 on 2019-12-25 04:22:33
Question: This was already the subject of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs specify that it is possible to create a cluster setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, sometimes it's more convenient to set it from the command line. So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I
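For comparison, a sketch of the equivalent request fragment through the Python client, reusing the pattern from the fair-scheduler sketch above (names are placeholders); whether a gs:// value is actually accepted for fs.defaultFS is precisely what the question is about.

```python
# Sketch only: mirrors --properties core:fs.defaultFS=gs://my-bucket as a
# softwareConfig entry in the create request; names are placeholders.
cluster = {
    "project_id": "my-project",
    "cluster_name": "defaultfs-test",
    "config": {
        "software_config": {
            "properties": {
                "core:fs.defaultFS": "gs://my-bucket",
            }
        }
    },
}
```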