google-cloud-dataproc

StackOverflowError when applying PySpark ALS's "recommendProductsForUsers" (although a cluster with >300 GB RAM is available)

ぐ巨炮叔叔 submitted on 2019-12-20 03:43:24
Question: Looking for expertise to guide me on the issue below. Background: I'm trying to get going with a basic PySpark script inspired by this example. As deployment infrastructure I use a Google Cloud Dataproc cluster. The cornerstone of my code is the function "recommendProductsForUsers", documented here, which gives me back the top X products for all users in the model. Issue I run into: the ALS.train script runs smoothly and scales well on GCP (easily >1mn customers). However, applying the predictions, i.e. using …
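
For context, a minimal sketch of the workflow the question describes, using the MLlib API named in the title; the data, rank, and top_k values are illustrative and not taken from the original post:

    # Minimal MLlib ALS sketch: train a model, then request the top-K products per
    # user with recommendProductsForUsers (the call the question is about).
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-recommend-sketch")

    # Illustrative (user_id, product_id, rating) triples, not the poster's data.
    raw = sc.parallelize([(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0), (2, 30, 1.0)])
    ratings = raw.map(lambda r: Rating(int(r[0]), int(r[1]), float(r[2])))

    model = ALS.train(ratings, rank=10, iterations=10)

    # Returns an RDD of (user, list-of-Rating) pairs with the top_k products per
    # user; this is the step the question reports failing on large models.
    top_k = 10
    recommendations = model.recommendProductsForUsers(top_k)
    print(recommendations.take(1))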

Sqoop on Dataproc cannot export data to Avro format

吃可爱长大的小学妹 submitted on 2019-12-20 02:37:26
Question: I want to use Sqoop to pull data from a Postgres database, and I use Google Dataproc to execute Sqoop. However, I get an error when I submit the Sqoop job. I use the following commands. Create a cluster with the 1.3.24-deb9 image version: gcloud dataproc clusters create <CLUSTER_NAME> \ --region=asia-southeast1 --zone=asia-southeast1-a \ --properties=hive:hive.metastore.warehouse.dir=gs://<BUCKET>/hive-warehouse \ --master-boot-disk-size=100 Submit a job: gcloud dataproc jobs submit hadoop --cluster= …
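
The excerpt cuts off mid-command, so the following is not the poster's command but a hedged sketch of the general shape of running Sqoop as a Dataproc Hadoop job, driven from Python via subprocess to stay in one language; every gs:// path, the JDBC URL, and the jar names are placeholders:

    # Sketch: submit Sqoop as a Dataproc Hadoop job by pointing the job at the Sqoop
    # main class and shipping the Sqoop and JDBC driver jars. All gs:// paths and
    # the JDBC URL below are placeholders, not values from the original question.
    import subprocess

    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "hadoop",
        "--cluster=<CLUSTER_NAME>",
        "--region=asia-southeast1",
        "--class=org.apache.sqoop.Sqoop",
        "--jars=gs://<BUCKET>/jars/sqoop.jar,gs://<BUCKET>/jars/postgresql.jar",
        "--",
        "import",
        "--connect", "jdbc:postgresql://<HOST>/<DB>",
        "--username", "<USER>", "--password-file", "gs://<BUCKET>/password.txt",
        "--table", "<TABLE>",
        "--as-avrodatafile",
        "--target-dir", "gs://<BUCKET>/sqoop-output",
    ]
    subprocess.check_call(cmd)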

BigQuery connector for Spark on Dataproc - cannot authenticate using service account key file

走远了吗. submitted on 2019-12-19 11:17:52
Question: I have followed Use the BigQuery connector with Spark to successfully get data from a publicly available dataset. I now need to access a BigQuery dataset that is owned by one of our clients and for which I have been given a service account key file (I know that the service account key file is valid because I can use it to connect using the Google BigQuery library for Python). I have followed what Igor Dvorzhak recommended here: To use service account key file authorization you need to set …
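
For reference, a sketch of the property-based approach the answer points to, applied from PySpark; the exact Hadoop property names vary by connector version, so treat the ones below as assumptions to check against the BigQuery and GCS connector documentation, and the key file path is a placeholder:

    # Sketch: point the BigQuery and GCS connectors at a service-account JSON key
    # file via Hadoop configuration. The property names follow the commonly
    # documented conventions and should be verified against the installed versions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-sa-key-sketch").getOrCreate()
    conf = spark._jsc.hadoopConfiguration()

    keyfile = "/path/to/client-service-account.json"  # placeholder path

    # BigQuery connector (mapred.bq.*) ...
    conf.set("mapred.bq.auth.service.account.enable", "true")
    conf.set("mapred.bq.auth.service.account.json.keyfile", keyfile)

    # ... and the GCS connector, since BigQuery exports are staged through GCS.
    conf.set("google.cloud.auth.service.account.enable", "true")
    conf.set("google.cloud.auth.service.account.json.keyfile", keyfile)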

How do I install Python libraries automatically on Dataproc cluster startup?

风流意气都作罢 submitted on 2019-12-19 07:04:13
Question: How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of manually logging into the master and/or worker nodes to install the libraries I need. It would be great to also know whether this automated installation could install things only on the master and not the workers. Answer 1: Initialization actions are the best way to do this. Initialization actions are shell scripts which are run when the cluster is created. This …
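
To make the answer concrete, here is a sketch of an initialization action written as an executable Python script (init actions do not have to be shell scripts); it reads the node's dataproc-role from instance metadata and installs packages only on the master. The package list is illustrative:

    #!/usr/bin/env python
    # Sketch of an initialization action: install Python packages on the master only.
    # Upload this file to GCS and pass it via --initialization-actions at cluster create.
    import subprocess

    # Dataproc images ship this helper for reading instance metadata attributes.
    role = subprocess.check_output(
        ["/usr/share/google/get_metadata_value", "attributes/dataproc-role"]
    ).decode("utf-8").strip()

    if role == "Master":
        # Drop the role check (or invert it) to install on workers as well.
        subprocess.check_call(["pip", "install", "numpy", "pandas", "scikit-learn"])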

Guava version while using spark-shell

戏子无情 submitted on 2019-12-19 05:34:27
Question: I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc; however, I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath includes a much older Guava version from somewhere else, even when I specify the proper version on startup. I suspect this is likely caused by all the Hadoop dependencies put on the classpath by default. Is there any way to have spark-shell use only the proper version of Guava, without getting rid of all …
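
Not necessarily the accepted resolution for this question, but one commonly tried mitigation for Guava classpath clashes is asking Spark to prefer user-supplied jars over the cluster-provided classpath; a sketch from PySpark, with an illustrative connector coordinate:

    # Sketch of a commonly tried mitigation for Guava conflicts: prefer user-supplied
    # jars over the cluster classpath. These are standard (experimental) Spark
    # properties; the connector version below is illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("guava-conflict-sketch")
        .config("spark.jars.packages",
                "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0")
        .config("spark.driver.userClassPathFirst", "true")
        .config("spark.executor.userClassPathFirst", "true")
        .getOrCreate()
    )

With spark-shell, the same properties can be passed as --conf flags at startup.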

BigQuery Hadoop connector & Dataproc

帅比萌擦擦* submitted on 2019-12-13 18:45:41
Question: Is the BigQuery Hadoop connector automatically deployed with a Dataproc cluster? Answer 1: Yes, the BigQuery Hadoop connector is automatically deployed with Dataproc clusters. The Dataproc version detail page lists which versions of the Google Cloud Platform connectors, including the BigQuery connector, are included with each Dataproc release. Source: https://stackoverflow.com/questions/33006121/bigquery-hadoop-connector-dataproc
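
Because the connector ships with the image, a job can use it without any extra installation. The sketch below follows the public "Use the BigQuery connector with Spark" example and reads the public shakespeare sample table; <PROJECT> and <BUCKET> are placeholders:

    # Sketch: read a public BigQuery table through the preinstalled Hadoop connector.
    # Keys follow the documented mapred.bq.* convention; <PROJECT>/<BUCKET> are
    # placeholders for your project ID and a scratch bucket.
    from pyspark import SparkContext

    sc = SparkContext(appName="bq-connector-sketch")

    conf = {
        "mapred.bq.project.id": "<PROJECT>",
        "mapred.bq.gcs.bucket": "<BUCKET>",
        "mapred.bq.temp.gcs.path": "gs://<BUCKET>/bq_temp",
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare",
    }

    table = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)
    print(table.take(1))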

Upload Pandas DataFrame to GCP Bucket for Dataproc [duplicate]

孤者浪人 submitted on 2019-12-13 18:05:35
Question: This question already has answers here: Save pandas data frame as csv on to gcloud storage bucket (2 answers). Closed last year. I have been working on a Spark cluster using the Dataproc Google Cloud service for machine learning modelling. I have successfully loaded the data from the Google Storage bucket. However, I am not sure how to write the pandas DataFrame and Spark DataFrame to the cloud storage bucket as CSV. When I use the command below it gives me an error: df.to_csv("gs:/ …
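
One way to do what the question attempts without a local temp file is to serialize the DataFrame in memory and upload it with the google-cloud-storage client (writing gs:// paths directly from pandas also works if gcsfs is installed); the bucket and object names below are placeholders:

    # Sketch: write a pandas DataFrame to a GCS bucket as CSV without a local file,
    # using the google-cloud-storage client. Bucket/object names are placeholders.
    import pandas as pd
    from google.cloud import storage

    df = pd.DataFrame({"user": [1, 2], "score": [0.9, 0.4]})  # illustrative data

    client = storage.Client()
    bucket = client.bucket("<BUCKET>")
    blob = bucket.blob("exports/my_dataframe.csv")
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")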

Downloading files from Google Storage using Spark (Python) and Dataproc

丶灬走出姿态 submitted on 2019-12-13 14:08:02
Question: I have an application that parallelizes the execution of Python objects that process data to be downloaded from Google Storage (my project bucket). The cluster is created using Google Dataproc. The problem is that the data is never downloaded! I wrote a test program to try to understand the problem. I wrote the following function to copy the files from the bucket and to see if creating files on workers does work: from subprocess import call from os.path import join def copyDataFromBucket …
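
The excerpt stops mid-function, so rather than guess at the original copyDataFromBucket, here is a hedged sketch of the overall pattern: distribute a list of gs:// paths and have each worker shell out to gsutil. The paths, target directory, and the helper name copy_from_bucket are illustrative:

    # Sketch: download GCS objects on the worker nodes by running gsutil there.
    # The gs:// paths and /tmp target are placeholders, not from the original post.
    from subprocess import call
    from os.path import join
    from pyspark import SparkContext

    sc = SparkContext(appName="bucket-copy-sketch")

    def copy_from_bucket(gcs_path, local_dir="/tmp/data"):
        # Runs on whichever worker processes this element; gsutil is preinstalled
        # on Dataproc nodes. Returns gsutil's exit code (0 on success).
        call(["mkdir", "-p", local_dir])
        return call(["gsutil", "cp", gcs_path, join(local_dir, "")])

    paths = ["gs://<BUCKET>/data/part-0001.csv", "gs://<BUCKET>/data/part-0002.csv"]
    results = sc.parallelize(paths, len(paths)).map(copy_from_bucket).collect()
    print(results)  # files land on the workers that ran the tasks, not on the driver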

Google cloud dataproc failing to create new cluster with initialization scripts

ε祈祈猫儿з submitted on 2019-12-13 07:37:30
Question: I am using the command below to create a Dataproc cluster: gcloud dataproc clusters create informetis-dev --initialization-actions “gs://dataproc-initialization-actions/jupyter/jupyter.sh,gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh,gs://dataproc-initialization-actions/hue/hue.sh,gs://dataproc-initialization-actions/ipython-notebook/ipython.sh,gs://dataproc-initialization-actions/tez/tez.sh,gs://dataproc-initialization-actions/oozie/oozie.sh,gs://dataproc …

Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/

六月ゝ 毕业季﹏ submitted on 2019-12-13 03:14:52
Question: I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error: Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at …