google-cloud-dataproc

PySpark reduceByKey causes out of memory

Submitted by 馋奶兔 on 2019-12-04 19:34:54
I'm trying to run a job in YARN mode that processes a large amount of data (2 TB) read from Google Cloud Storage. My pipeline works just fine with 10 GB of data. The specs of my cluster and the beginning of my pipeline are detailed here: PySpark Yarn Application fails on groupBy. Here is the rest of the pipeline:

    input.groupByKey() \
        [...] processing on sorted groups for each key shard
        .mapPartitions(sendPartition) \
        .map(mergeShardsbyKey) \
        .reduceByKey(lambda list1, list2: list1 + list2).take(10)
    [...] output

The map function that is applied over partitions is the following: def sendPartition
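
One common mitigation here, not drawn from the question itself: reduceByKey(lambda list1, list2: list1 + list2) materializes the full per-key list in executor memory, which is what typically blows up at the 2 TB scale. If only a small sample is needed downstream (take(10)), capping the per-key accumulation with aggregateByKey keeps memory roughly bounded. A minimal, self-contained sketch, where shard_pairs, N, and the partition count are hypothetical stand-ins for the real upstream (key, list) RDD and its tuning:

    from pyspark import SparkContext

    sc = SparkContext(appName="bounded-reduce-sketch")

    # Hypothetical stand-in for the (key, list-of-records) pairs produced by the
    # earlier stages of the pipeline.
    shard_pairs = sc.parallelize([(i % 10, [i]) for i in range(100000)])

    N = 100  # hypothetical cap on how many records to keep per key

    capped = shard_pairs.aggregateByKey(
        [],                              # start each key with an empty accumulator
        lambda acc, xs: (acc + xs)[:N],  # fold one list shard in, never exceeding N items
        lambda a, b: (a + b)[:N],        # merge two partial accumulators, still capped at N
        numPartitions=2048,              # widen the shuffle so no single task holds too much
    )
    print(capped.take(10))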

PySpark + Google Cloud Storage (wholeTextFiles)

Submitted by 假装没事ソ on 2019-12-04 12:35:49
I am trying to parse about 1 million HTML files using PySpark (Google Dataproc) and write the relevant fields out to a condensed file. Each HTML file is about 200 KB, so all the data adds up to about 200 GB. The code below works fine on a subset of the data, but runs for hours and then crashes when run on the whole dataset. Furthermore, the worker nodes are not being utilized (<5% CPU), so I know there is some issue. I believe the system is choking on ingesting the data from GCS. Is there a better way to do this? Also, when I use wholeTextFiles in this fashion, does the master attempt to download
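
Not part of the question itself, but one knob worth noting: wholeTextFiles takes a minPartitions hint, which controls how many partitions the (filename, content) pairs are spread across; with a million small files and no hint, the read can collapse into a handful of very large tasks. A minimal sketch with placeholder GCS paths and a dummy extraction function:

    from pyspark import SparkContext

    sc = SparkContext(appName="html-parse-sketch")

    # (path, content) pairs; the minPartitions hint asks Spark to split the files
    # across many more tasks than the default.
    pages = sc.wholeTextFiles("gs://my-bucket/html/*", minPartitions=1000)  # placeholder path

    def extract_fields(pair):
        path, html = pair
        # ... real HTML parsing would go here; this just records the document size ...
        return (path, len(html))

    pages.map(extract_fields).saveAsTextFile("gs://my-bucket/condensed/")  # placeholder output path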

Where is the Spark UI on Google Dataproc?

Submitted by 六月ゝ 毕业季﹏ on 2019-12-04 10:49:53
Question: What port should I use to access the Spark UI on Google Dataproc? I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln. The firewall is properly configured.

Answer 1: Dataproc runs Spark on top of YARN, so you won't find the typical "Spark standalone" ports; instead, when running a Spark job, you can visit port 8088, which shows the YARN ResourceManager's main page. Any running Spark jobs will be accessible through the Application Master link on that page. The

How to resolve Guava dependency issue while submitting Uber Jar to Google Dataproc

Submitted by 。_饼干妹妹 on 2019-12-04 05:25:45
Question: I am using the Maven Shade plugin to build an uber jar for submitting as a job to a Google Dataproc cluster. Google has installed Apache Spark 2.0.2 and Apache Hadoop 2.7.3 on the cluster. Apache Spark 2.0.2 uses com.google.guava 14.0.1 and Apache Hadoop 2.7.3 uses 11.0.2; both should already be on the classpath.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.0.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>

How to restart Spark Streaming job from checkpoint on Dataproc?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-04 04:40:52
Question: This is a follow-up to Spark streaming on dataproc throws FileNotFoundException. Over the past few weeks (not sure exactly since when), restarting a Spark Streaming job, even with the "kill dataproc.agent" trick, throws this exception:

    17/05/16 17:39:02 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at stream-event-processor-m/10.138.0.3:8032
    17/05/16 17:39:03 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application
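
For context, the standard way to make a Spark Streaming job resumable from a checkpoint is StreamingContext.getOrCreate, which rebuilds the driver state from the checkpoint directory when one exists. A minimal sketch, with the checkpoint path, batch interval, and socket source as placeholder values:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    CHECKPOINT_DIR = "gs://my-bucket/streaming-checkpoint"  # placeholder path

    def create_context():
        # Runs only when no usable checkpoint exists yet.
        sc = SparkContext(appName="stream-event-processor")
        ssc = StreamingContext(sc, 10)  # placeholder 10-second batch interval
        ssc.checkpoint(CHECKPOINT_DIR)
        lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
        lines.pprint()
        return ssc

    # On restart, recovers the context from the checkpoint directory if present;
    # otherwise builds a fresh one via create_context.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()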

How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

Submitted by 馋奶兔 on 2019-12-03 16:02:55
I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use Spring Boot to create the resources. The steps of the solution are as follows:

1. The user makes a POST request with a JSON file as the input.
2. The JSON file is stored in the Google Cloud Storage bucket associated with the Dataproc cluster.
3. An aggregating Spark job is triggered from within the REST method with the specified jars and classes, and the JSON file link as the argument.

I want the job to be triggered using Dataproc's Java client instead of the console or command line. How do you do it? We're hoping to have a more thorough
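
The question is about the Java client, but to illustrate the shape of the request, here is the same flow through the google-cloud-dataproc Python client, which wraps the same Dataproc JobController API as the Java client. This is a sketch with placeholder project, region, cluster, bucket, jar, and class names, assuming google-cloud-dataproc v2+:

    from google.cloud import dataproc_v1

    project_id, region, cluster_name = "my-project", "us-central1", "my-cluster"  # placeholders

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": cluster_name},
        "spark_job": {
            "main_class": "com.example.JsonAggregator",                # placeholder class
            "jar_file_uris": ["gs://my-bucket/jobs/aggregator.jar"],   # placeholder jar
            "args": ["gs://my-bucket/uploads/input.json"],             # link to the uploaded JSON file
        },
    }

    # Submits the job and blocks until it finishes; a REST handler could instead
    # return immediately and poll the job status later.
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    result = operation.result()
    print(result.status.state)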

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

Submitted by 我与影子孤独终老i on 2019-12-03 08:14:44
Question: I am using Google Cloud Dataflow to implement an ETL data warehouse solution. Looking into the Google Cloud offering, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit cheaper than Dataflow. Does anybody know the pros/cons of Dataflow over Dataproc? Why does Google offer both?

Answer 1: Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions. An overview of why each of these products exists can be found in the Google Cloud

Where is the Spark UI on Google Dataproc?

Submitted by ≡放荡痞女 on 2019-12-03 06:58:09
What port should I use to access the Spark UI on Google Dataproc? I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln. The firewall is properly configured.

Dataproc runs Spark on top of YARN, so you won't find the typical "Spark standalone" ports; instead, when running a Spark job, you can visit port 8088, which shows the YARN ResourceManager's main page. Any running Spark jobs will be accessible through the Application Master link on that page. The Spark Application Master's page looks the same as the familiar Spark-standalone landing page that you would
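
As a side note beyond the original answer, port 8088 also serves the YARN ResourceManager REST API, so the running applications and their Application Master tracking URLs can be listed programmatically. A sketch with a placeholder master hostname, assuming the firewall or an SSH tunnel gives you access to the port:

    import requests

    RM = "http://my-cluster-m:8088"  # placeholder Dataproc master hostname

    # The ResourceManager REST API lists YARN applications; running Spark jobs
    # show up here with a trackingUrl pointing at their ApplicationMaster / Spark UI.
    resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    for app in apps:
        print(app["id"], app["name"], app["trackingUrl"])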

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

Submitted by 跟風遠走 on 2019-12-02 23:36:00
I am using Google Cloud Dataflow to implement an ETL data warehouse solution. Looking into the Google Cloud offering, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit cheaper than Dataflow. Does anybody know the pros/cons of Dataflow over Dataproc? Why does Google offer both?

Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions. An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions Articles.

Quick takeaways: Cloud Dataproc provides you with a Hadoop cluster, on GCP,

pyspark rdd isCheckPointed() is false

Submitted by 天大地大妈咪最大 on 2019-12-02 09:53:10
Question: I was encountering StackOverflowErrors when iteratively adding over 500 columns to my PySpark DataFrame, so I included checkpoints. The checkpoints did not help, so I created the following toy application to test whether my checkpoints were working correctly. All I do in this example is iteratively create columns by copying the original column over and over again. I persist, checkpoint, and count every 10 iterations. I notice that my dataframe.rdd.isCheckpointed() always returns False. I
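
For reference, a minimal sketch of the DataFrame checkpointing pattern this question is exercising, with a placeholder checkpoint directory and assuming Spark 2.1+. Note that DataFrame.checkpoint returns a new DataFrame and must be reassigned, and that df.rdd yields a freshly converted RDD, so isCheckpointed() on it may read False even when the DataFrame's lineage was truncated:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("checkpoint-toy").getOrCreate()
    # A reliable checkpoint directory (GCS/HDFS) must be set first.
    spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints")  # placeholder path

    df = spark.range(1000).withColumnRenamed("id", "col_0")
    for i in range(1, 501):
        df = df.withColumn(f"col_{i}", F.col("col_0"))
        if i % 10 == 0:
            # checkpoint() returns a NEW DataFrame with truncated lineage;
            # it has to be reassigned for the checkpoint to take effect.
            df = df.checkpoint(eager=True)
            df.count()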