google-hadoop

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

Posted by 时间秒杀一切 on 2019-12-07 05:58:28
Question: I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the DataProc 1.1 image. When running the project locally and building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When submitting the job to DataProc, the DataProc versions of these libraries are preferred and the job will reference class
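
A common way around this kind of classpath conflict is to shade (relocate) the conflicting packages into the job's fat jar, so the Dataproc-provided Guava/Protobuf/gRPC no longer win at runtime. The question builds with Maven, where the maven-shade-plugin's relocation rules serve this purpose; the sketch below shows the same idea with sbt-assembly, and the "repackaged" prefix is just a placeholder.

```scala
// build.sbt (sbt-assembly 0.14.x): a minimal shading sketch, not a drop-in fix.
// Relocate the packages that clash with the cluster-provided versions; note that
// relocating io.grpc may need extra care (netty transports, service loader files).
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**"   -> "repackaged.com.google.common.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.@1").inAll,
  ShadeRule.rename("io.grpc.**"             -> "repackaged.io.grpc.@1").inAll
)
```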

SparkR collect method crashes with OutOfMemory on Java heap space

Posted by 可紊 on 2019-12-05 21:07:02
With SparkR, I'm trying, for a PoC, to collect an RDD that I created from text files and which contains around 4M lines. My Spark cluster is running in Google Cloud, is deployed with bdutil, and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use: Sys.setenv("SPARK_MEM" = "1g") sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g")) lines <- textFile(sc, "gs://xxxx/dir/") test <-
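
collect() materializes the entire RDD on the driver, so roughly 4M lines can exceed the default driver heap regardless of executor settings. The question uses SparkR, but the general remedies look the same from any Spark frontend: raise the driver memory before the driver JVM starts, cap result sizes, and prefer take() or sampling over a full collect. A minimal Scala sketch under those assumptions (paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver memory has to be set before the driver JVM starts, e.g.
//   spark-submit --driver-memory 4g --conf spark.driver.maxResultSize=2g ...
val sc = new SparkContext(new SparkConf().setAppName("collect-sketch"))

val lines  = sc.textFile("gs://bucket/dir/")  // placeholder path
val sample = lines.take(1000)                 // inspect a slice instead of collect()
sample.foreach(println)
```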

Rate limit with Apache Spark GCS connector

Posted by 删除回忆录丶 on 2019-12-05 16:16:08
I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows: java.io.IOException: Error inserting: bucket: *****, object: ***** at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
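
One mitigation that does not depend on connector internals: these insert errors usually appear when many tasks write objects into the same bucket concurrently, so cutting the number of output partitions (and therefore concurrent writers) lowers the request rate. A hedged Scala sketch with placeholder paths and an arbitrary partition count:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc   = new SparkContext(new SparkConf().setAppName("gcs-write-sketch"))
val data = sc.textFile("gs://bucket/input/")   // placeholder input path

data
  .coalesce(32)                                // fewer concurrent writers against the bucket
  .saveAsTextFile("gs://bucket/output/")       // placeholder output path
```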

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

Posted by 痴心易碎 on 2019-12-04 12:06:10
The original question was about deploying Spark 1.4 on Google Cloud. After downloading it and setting SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz', deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem
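
Not necessarily the root cause of the cast error, but a useful first check with gs:// paths is that the GCS connector is explicitly registered for the gs scheme in the Hadoop configuration Spark actually uses, and that only one copy of the connector jar is on the classpath. A sketch using the connector's documented implementation classes, with a placeholder project id:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc    = new SparkContext(new SparkConf().setAppName("gcs-fs-sketch"))
val hconf = sc.hadoopConfiguration

// Register the GCS connector for the gs:// scheme explicitly.
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("fs.gs.project.id", "<your-project-id>")  // placeholder
```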

How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine

Posted by ≡放荡痞女 on 2019-12-02 06:09:15
Question: I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is sitting in Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following exception: hadoop@hadoop-m:/home/salikeeno$ hadoop jar ${JAR} ${PROJECT} ${OUTPUT_TABLE} 14/08/21 19:56:00 INFO jaws.JawsApp: Using export bucket 'askbuckerthroughhadoop' as specified in 'mapred.bq.gcs.bucket' 14/08/21 19:56:00 INFO bigquery.BigQueryConfiguration: Using specified project
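
Snappy decompression needs the Hadoop native libraries (libhadoop plus libsnappy) on every node's java.library.path, with SnappyCodec listed among the compression codecs; the read itself is then an ordinary SequenceFile input. A Scala sketch under those assumptions, with placeholder key/value classes and path:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("snappy-read-sketch"))

// SnappyCodec only resolves if the native libraries are installed on the workers.
sc.hadoopConfiguration.set(
  "io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec")

// Key/value classes are placeholders; use whatever the SequenceFiles were written with.
val records = sc.sequenceFile("gs://bucket/compressed/", classOf[LongWritable], classOf[Text])
println(records.count())
```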

BigQuery connector for pyspark via Hadoop Input Format example

Posted by 我是研究僧i on 2019-11-29 08:25:50
Question: I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output format https://cloud.google.com/hadoop/writing-with-bigquery-connector and pyspark should be able to use this interface to create an RDD by using the method "newAPIHadoopRDD". http://spark.apache.org/docs/latest/api/python/pyspark.html Unfortunately, the documentation on both ends seems scarce and goes
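
For reference, the Hadoop InputFormat wiring looks like this in Scala; in pyspark the same class names are passed as strings to sc.newAPIHadoopRDD. The project id, staging bucket, and input table below are placeholders or examples, and the configuration keys follow the bigquery-connector documentation.

```scala
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc   = new SparkContext(new SparkConf().setAppName("bq-input-sketch"))
val conf = sc.hadoopConfiguration

conf.set("mapred.bq.project.id", "<your-project-id>")   // placeholder
conf.set("mapred.bq.gcs.bucket", "<staging-bucket>")    // placeholder staging bucket
// Fully-qualified "project:dataset.table" of the table to read (example public table).
BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

val tableRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])

println(tableRdd.take(1).mkString)
```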

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

Posted by 五迷三道 on 2019-11-26 23:08:21
I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage. I have explored GSUtil, and it seems to be the recommended option for moving big data sets to GCS; it appears to handle huge datasets. It seems, though, that GSUtil can only move data from the local machine to GCS (or between S3 and GCS), but cannot move data from a local Hadoop cluster. What is a recommended way of moving data from a local Hadoop cluster to GCS? In the case of GSUtil, can it directly move data from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy files onto the machine running GSUtil and then
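
For cluster-resident data the usual route is to install the GCS connector on the on-prem cluster and run Hadoop DistCp straight from HDFS to gs://, with no intermediate copy through the machine running gsutil (the CLI form is roughly: hadoop distcp hdfs:///data/json gs://bucket/data/json). The sketch below drives DistCp through its Hadoop 2.x Java API; paths and the project id are placeholders, and credential setup is omitted.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import scala.collection.JavaConverters._

val conf = new Configuration()
// The GCS connector jar must be on the cluster classpath; auth config is omitted here.
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.gs.project.id", "<your-project-id>")                 // placeholder

val options = new DistCpOptions(
  List(new Path("hdfs:///data/json")).asJava,                     // placeholder source
  new Path("gs://bucket/data/json"))                              // placeholder target

new DistCp(conf, options).execute()                               // runs the copy as an MR job
```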
