google-hadoop

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

Posted by 时间秒杀一切 on 2019-12-07 05:58:28
Question: I am working on a Scala Spark job which needs to use a Java library (youtube/vitess) that depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the DataProc 1.1 image. When running the project locally and building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When submitting the job to DataProc, the DataProc versions of these libraries are preferred and the job will reference class
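
A common way around this kind of classpath conflict is to shade (relocate) the conflicting packages into the job's fat jar, so the Dataproc-provided Guava/Protobuf/gRPC no longer win at runtime. The question builds with Maven, where the maven-shade-plugin's relocation rules serve this purpose; the sketch below shows the same idea with sbt-assembly, and the "repackaged" prefix is just a placeholder.

```scala
// build.sbt (sbt-assembly 0.14.x): a minimal shading sketch, not a drop-in fix.
// Relocate the packages that clash with the cluster-provided versions; note that
// relocating io.grpc may need extra care (netty transports, service loader files).
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**"   -> "repackaged.com.google.common.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.@1").inAll,
  ShadeRule.rename("io.grpc.**"             -> "repackaged.io.grpc.@1").inAll
)
```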

SparkR collect method crashes with OutOfMemory on Java heap space

Posted by 可紊 on 2019-12-05 21:07:02
With SparkR, I'm trying, for a PoC, to collect an RDD that I created from text files and which contains around 4M lines. My Spark cluster is running in Google Cloud, is deployed with bdutil, and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use: Sys.setenv("SPARK_MEM" = "1g") sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g")) lines <- textFile(sc, "gs://xxxx/dir/") test <-
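
collect() materializes the entire RDD on the driver, so roughly 4M lines can exceed the default driver heap regardless of executor settings. The question uses SparkR, but the general remedies look the same from any Spark frontend: raise the driver memory before the driver JVM starts, cap result sizes, and prefer take() or sampling over a full collect. A minimal Scala sketch under those assumptions (paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver memory has to be set before the driver JVM starts, e.g.
//   spark-submit --driver-memory 4g --conf spark.driver.maxResultSize=2g ...
val sc = new SparkContext(new SparkConf().setAppName("collect-sketch"))

val lines  = sc.textFile("gs://bucket/dir/")  // placeholder path
val sample = lines.take(1000)                 // inspect a slice instead of collect()
sample.foreach(println)
```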

Rate limit with Apache Spark GCS connector

Posted by 删除回忆录丶 on 2019-12-05 16:16:08
I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows: java.io.IOException: Error inserting: bucket: *****, object: ***** at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
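
One mitigation that does not depend on connector internals: these insert errors usually appear when many tasks write objects into the same bucket concurrently, so cutting the number of output partitions (and therefore concurrent writers) lowers the request rate. A hedged Scala sketch with placeholder paths and an arbitrary partition count:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc   = new SparkContext(new SparkConf().setAppName("gcs-write-sketch"))
val data = sc.textFile("gs://bucket/input/")   // placeholder input path

data
  .coalesce(32)                                // fewer concurrent writers against the bucket
  .saveAsTextFile("gs://bucket/output/")       // placeholder output path
```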

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

Posted by 痴心易碎 on 2019-12-04 12:06:10
The original question was about deploying Spark 1.4 on Google Cloud. After downloading it and setting SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz', deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem
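
Not necessarily the root cause of the cast error, but a useful first check with gs:// paths is that the GCS connector is explicitly registered for the gs scheme in the Hadoop configuration Spark actually uses, and that only one copy of the connector jar is on the classpath. A sketch using the connector's documented implementation classes, with a placeholder project id:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc    = new SparkContext(new SparkConf().setAppName("gcs-fs-sketch"))
val hconf = sc.hadoopConfiguration

// Register the GCS connector for the gs:// scheme explicitly.
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("fs.gs.project.id", "<your-project-id>")  // placeholder
```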

How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine

Posted by ≡放荡痞女 on 2019-12-02 06:09:15
Question: I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is sitting in Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following exception: hadoop@hadoop-m:/home/salikeeno$ hadoop jar ${JAR} ${PROJECT} ${OUTPUT_TABLE} 14/08/21 19:56:00 INFO jaws.JawsApp: Using export bucket 'askbuckerthroughhadoop' as specified in 'mapred.bq.gcs.bucket' 14/08/21 19:56:00 INFO bigquery.BigQueryConfiguration: Using specified project
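
Snappy decompression needs the Hadoop native libraries (libhadoop plus libsnappy) on every node's java.library.path, with SnappyCodec listed among the compression codecs; the read itself is then an ordinary SequenceFile input. A Scala sketch under those assumptions, with placeholder key/value classes and path:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("snappy-read-sketch"))

// SnappyCodec only resolves if the native libraries are installed on the workers.
sc.hadoopConfiguration.set(
  "io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec")

// Key/value classes are placeholders; use whatever the SequenceFiles were written with.
val records = sc.sequenceFile("gs://bucket/compressed/", classOf[LongWritable], classOf[Text])
println(records.count())
```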

BigQuery connector for pyspark via Hadoop Input Format example

Posted by 我是研究僧i on 2019-11-29 08:25:50
Question: I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output format https://cloud.google.com/hadoop/writing-with-bigquery-connector and pyspark should be able to use this interface to create an RDD by using the method "newAPIHadoopRDD". http://spark.apache.org/docs/latest/api/python/pyspark.html Unfortunately, the documentation on both ends seems scarce and goes
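
For reference, the Hadoop InputFormat wiring looks like this in Scala; in pyspark the same class names are passed as strings to sc.newAPIHadoopRDD. The project id, staging bucket, and input table below are placeholders or examples, and the configuration keys follow the bigquery-connector documentation.

```scala
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc   = new SparkContext(new SparkConf().setAppName("bq-input-sketch"))
val conf = sc.hadoopConfiguration

conf.set("mapred.bq.project.id", "<your-project-id>")   // placeholder
conf.set("mapred.bq.gcs.bucket", "<staging-bucket>")    // placeholder staging bucket
// Fully-qualified "project:dataset.table" of the table to read (example public table).
BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

val tableRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])

println(tableRdd.take(1).mkString)
```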

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

Posted by 五迷三道 on 2019-11-26 23:08:21
I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage. I have explored GSUtil, and it seems to be the recommended option for moving big data sets to GCS; it appears to handle huge datasets. It seems, though, that GSUtil can only move data from the local machine to GCS (or between S3 and GCS), but cannot move data from a local Hadoop cluster. What is a recommended way of moving data from a local Hadoop cluster to GCS? In the case of GSUtil, can it directly move data from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy files onto the machine running GSUtil and then
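
For cluster-resident data the usual route is to install the GCS connector on the on-prem cluster and run Hadoop DistCp straight from HDFS to gs://, with no intermediate copy through the machine running gsutil (the CLI form is roughly: hadoop distcp hdfs:///data/json gs://bucket/data/json). The sketch below drives DistCp through its Hadoop 2.x Java API; paths and the project id are placeholders, and credential setup is omitted.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import scala.collection.JavaConverters._

val conf = new Configuration()
// The GCS connector jar must be on the cluster classpath; auth config is omitted here.
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.gs.project.id", "<your-project-id>")                 // placeholder

val options = new DistCpOptions(
  List(new Path("hdfs:///data/json")).asJava,                     // placeholder source
  new Path("gs://bucket/data/json"))                              // placeholder target

new DistCp(conf, options).execute()                               // runs the copy as an MR job
```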
