google-cloud-dataproc

Dataproc + BigQuery examples - any available?

Submitted by 安稳与你 on 2019-11-28 10:08:59
According to the Dataproc docs, it has "native and automatic integrations with BigQuery". I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (via a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" The reason is that we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc. Are there any Dataproc + BigQuery examples available?
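A minimal PySpark sketch of the read/analyze/write round trip, assuming the spark-bigquery connector is available on the cluster; the project, dataset, table, column, and bucket names below are placeholders, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the spark-bigquery connector is on the cluster's classpath,
# e.g. supplied via --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
spark = SparkSession.builder.appName("bq-roundtrip").getOrCreate()

# Read the source table (placeholder names).
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.source_table") \
    .load()

# Stand-in for the "complex statistical model": a simple aggregation.
result = df.groupBy("some_column").agg(F.avg("some_metric").alias("avg_metric"))

# Write the results back to BigQuery; the connector stages data in a GCS bucket.
result.write.format("bigquery") \
    .option("table", "my-project.my_dataset.result_table") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .save()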

Which HBase connector for Spark 2.0 should I use?

Submitted by 余生长醉 on 2019-11-28 09:14:24
Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. Support for Spark 2.0 and the new Dataset API is not clear to me for the connectors I have found: spark-hbase (https://github.com/apache/hbase/tree/master/hbase-spark), spark-hbase-connector (https://github.com/nerdammer/spark-hbase-connector), and hortonworks-spark/shc (https://github.com/hortonworks-spark/shc). The project is written in Scala 2.11 with SBT. Thanks for your help. Update: SHC now seems to work with Spark 2 and the Table API. See https:/
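The question's project is in Scala, but since SHC is exposed as a plain Spark SQL data source it can be driven from any language binding; a hedged illustration is sketched below in PySpark for consistency with the other examples on this page. The catalog, table, and column names are made up, and writing to Bigtable assumes the HBase-compatible Bigtable client configuration is already on the cluster's classpath:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shc-sketch").getOrCreate()

# Hypothetical catalog mapping DataFrame columns to an HBase/Bigtable table.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},
    "rowkey": "key",
    "columns": {
        "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
        "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
    }
})

df = spark.createDataFrame([("row1", "value1")], ["key", "col1"])

# Write through SHC's data source, then read the table back with the same catalog.
df.write.options(catalog=catalog).format("org.apache.spark.sql.execution.datasources.hbase").save()
readback = spark.read.options(catalog=catalog).format("org.apache.spark.sql.execution.datasources.hbase").load()
readback.show()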

use an external library in pyspark job in a Spark cluster from google-dataproc

Submitted by 雨燕双飞 on 2019-11-28 07:50:45
I have a Spark cluster that I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster, then I entered:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
This launched a pyspark shell, in which I entered:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()
And it worked. My next step is to launch this job from my main machine using the command:
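The sentence above is cut off before the submission command. A hedged sketch of what the submitted script could look like is shown below; my_job.py is a placeholder name, and the gcloud invocation in the comment assumes the spark.jars.packages job property is used to pull in spark-csv at submission time:

# my_job.py -- placeholder name; submitted, for example, with:
#   gcloud dataproc jobs submit pyspark my_job.py --cluster <my-cluster> \
#       --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Same read as in the interactive test, now inside a submitted job.
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load('gs:/xxxx/foo.csv'))
df.show()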

Save pandas data frame as csv on to gcloud storage bucket

Submitted by 試著忘記壹切 on 2019-11-27 07:23:07
Question:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys

APP_NAME = "DataFrameToCSV"

spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]
dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05",
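The question body is cut off above. Since the title asks how to save a pandas data frame as a CSV in a Cloud Storage bucket, one common approach is sketched below as an assumption, not necessarily the answer given in the original thread: convert the pandas frame to a Spark DataFrame and let Spark write directly to the gs:// path. The bucket path and sample data are placeholders:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameToCSV").getOrCreate()

# Placeholder pandas frame standing in for the truncated example data.
pdf = pd.DataFrame({"group_id": [1, 1, 2, 2],
                    "date": ["2016-04-01", "2016-04-20", "2016-04-05", "2016-04-05"]})

# Convert to a Spark DataFrame and write straight to the bucket;
# coalesce(1) produces a single part file instead of many shards.
sdf = spark.createDataFrame(pdf)
sdf.coalesce(1).write.mode("overwrite").option("header", "true").csv("gs://my-bucket/output/")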

How to get path to the uploaded file

Submitted by 本秂侑毒 on 2019-11-27 04:55:28
I am running a Spark cluster on Google Cloud and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the SparkContext has been initialized? I am using Scala.
gcloud dataproc jobs submit spark --cluster my-cluster --class MyJob --files config/Configuration.properties --jars my.jar
Answer 1: The local path to a file distributed using the SparkFiles mechanism (the --files argument, or SparkContext.addFile) can be obtained with SparkFiles.get: org.apache.spark.SparkFiles.get
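The question uses Scala; the PySpark API mirrors org.apache.spark.SparkFiles, so an illustrative PySpark version is sketched below for consistency with the other examples on this page. Note that SparkFiles.get resolves the path only once the SparkContext exists, so the lookup happens after initialization; the script and file names are placeholders:

from pyspark import SparkContext, SparkFiles

# Submitted, for example, with:
#   gcloud dataproc jobs submit pyspark my_job.py --cluster my-cluster \
#       --files config/Configuration.properties
sc = SparkContext()

# Resolve the local path of the distributed file, then read it.
props_path = SparkFiles.get("Configuration.properties")
with open(props_path) as f:
    config_text = f.read()
print(config_text)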

spark.sql.crossJoin.enabled for Spark 2.x

Submitted by 六眼飞鱼酱① on 2019-11-26 21:05:44
I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. To complete one of my operations I have to compute a Cartesian product. Since version 2.0.0 there is a Spark configuration parameter (spark.sql.crossJoin.enabled) that prohibits Cartesian products, and an exception is thrown when one is attempted. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?
Answer 1: For changing default values of configuration settings in Dataproc, you don't even need an init action; you can use the --properties flag when creating your cluster.
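A hedged sketch of both routes follows: the cluster-wide default via --properties mentioned in the answer (the spark: prefix routes the key into Spark's configuration at cluster-creation time), and a per-application setting on the SparkSession builder. The cluster and app names are placeholders:

# Option 1 (cluster-wide default, no init action needed -- see the answer above):
#   gcloud dataproc clusters create my-cluster \
#       --properties 'spark:spark.sql.crossJoin.enabled=true'
#
# Option 2 (per-application): set the property when building the session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cross-join-example")
         .config("spark.sql.crossJoin.enabled", "true")
         .getOrCreate())

left = spark.createDataFrame([(1,), (2,)], ["a"])
right = spark.createDataFrame([("x",), ("y",)], ["b"])

# With the flag enabled, this condition-less join (a Cartesian product)
# no longer raises an AnalysisException.
left.join(right).show()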
