google-hadoop

Hadoop 2.4.1 and Google Cloud Storage connector for Hadoop

Submitted by 陌路散爱 on 2020-01-15 06:51:07
Question: I am trying to run Oryx on top of Hadoop using Google's Cloud Storage connector for Hadoop: https://cloud.google.com/hadoop/google-cloud-storage-connector I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh set-up for the Hadoop cluster I create on Google Compute Engine, e.g.: ./bdutil -b <BUCKET_NAME> -n 2 --env_var_files hadoop2_env.sh \ --default_fs gs --prefix <PREFIX_NAME> deploy I face two main problems when I try to run Oryx using Hadoop. 1) Despite confirming that my …
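As a quick sanity check after a bdutil deployment, the snippet below (a minimal Java sketch; it assumes the gcs-connector jar and the cluster's core-site.xml are on the classpath) asks Hadoop which FileSystem class is registered for the gs:// scheme:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CheckGsScheme {
        public static void main(String[] args) throws Exception {
            // Load the cluster configuration (core-site.xml must be on the classpath).
            Configuration conf = new Configuration();
            // Resolve the implementation class registered for the "gs" scheme.
            Class<? extends FileSystem> gsImpl = FileSystem.getFileSystemClass("gs", conf);
            // Expected: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
            System.out.println("gs:// is handled by " + gsImpl.getName());
        }
    }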

Hadoop cannot connect to Google Cloud Storage

Submitted by 断了今生、忘了曾经 on 2020-01-01 11:45:29
Question: I'm trying to connect Hadoop running on a Google Cloud VM to Google Cloud Storage. I have: modified core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties; downloaded and referenced gcs-connector-latest-hadoop2.jar in the generated hadoop-env.sh; authenticated via gcloud auth login using my personal account (instead of a service account). I'm able to run gsutil ls gs://mybucket/ without any issues, but when I execute hadoop fs -ls gs://mybucket/ I get the output …
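For comparison, here is a hedged Java sketch that sets the same two properties programmatically and lists the bucket; the bucket name is a placeholder and credential-related properties are omitted:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RegisterGcsConnector {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same properties the post adds to core-site.xml.
            conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
            // Bind the FileSystem explicitly to the gs:// URI rather than the default FS.
            FileSystem fs = FileSystem.get(URI.create("gs://mybucket/"), conf);
            for (FileStatus status : fs.listStatus(new Path("gs://mybucket/"))) {
                System.out.println(status.getPath());
            }
        }
    }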

“No Filesystem for Scheme: gs” when running spark job locally

Submitted by 那年仲夏 on 2019-12-30 18:08:10
Question: I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder). When running the job locally on my Mac, I get the following error: 5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs I know that two things need to be done in order for gs paths to be supported. One is to install the GCS connector, and the other is to have the following …
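A common local-mode workaround is to register the connector classes on the Hadoop configuration that Spark uses; the sketch below is a minimal Java example under that assumption (the connector jar must be on the driver classpath, and the gs:// path is a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalGsJob {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("local-gs-job").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);
            // Register the GCS connector for the gs:// scheme on Spark's Hadoop configuration.
            sc.hadoopConfiguration().set("fs.gs.impl",
                    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            sc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl",
                    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
            JavaRDD<String> lines = sc.textFile("gs://mybucket/folder/*");  // placeholder path
            System.out.println("line count: " + lines.count());
            sc.stop();
        }
    }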

Accessing google cloud storage using hadoop FileSystem api

Submitted by 左心房为你撑大大i on 2019-12-24 13:59:00
Question: From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogue from Java using: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); FileStatus[] status = fs.listStatus(new Path("gs://mybucket/")); I get the files under root in my local HDFS instead of those in gs://mybucket/, but with those …
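This symptom usually means FileSystem.get(conf) handed back the default filesystem (local HDFS here) rather than the one for the gs:// scheme. A hedged sketch of the likely fix, resolving the FileSystem from the gs:// URI instead:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListGcsBucket {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // FileSystem.get(conf) returns the fs.defaultFS filesystem;
            // passing the gs:// URI selects the GCS connector instead.
            FileSystem gcs = FileSystem.get(URI.create("gs://mybucket/"), conf);
            // Equivalent alternative: new Path("gs://mybucket/").getFileSystem(conf)
            FileStatus[] status = gcs.listStatus(new Path("gs://mybucket/"));
            for (FileStatus s : status) {
                System.out.println(s.getPath());
            }
        }
    }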

SparkR collect method crashes with OutOfMemory on Java heap space

Submitted by 好久不见. on 2019-12-22 10:53:00
Question: With SparkR, for a PoC I'm trying to collect an RDD that I created from text files containing around 4M lines. My Spark cluster is running in Google Cloud, was deployed with bdutil, and is composed of 1 master and 2 workers with 15 GB of RAM and 4 cores each. My HDFS repository is based on Google Storage with gcs-connector 1.4.0. SparkR is installed on each machine, and basic tests work on small files. Here is the script I use: Sys.setenv("SPARK_MEM" = "1g") sc <- sparkR.init("spark:/ …
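The script in the post is SparkR, but the failure mode is generic: collect() materializes every row in the driver's heap. The Java sketch below shows the usual mitigation under that assumption, bounding what is brought back with take() (increasing driver memory at submit time is the other common fix); the input path is a placeholder:

    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BoundedCollect {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("bounded-collect").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("gs://mybucket/big-input/*");  // placeholder path
            // collect() would materialize all ~4M lines in the driver heap;
            // take(n) bounds how much is brought back.
            List<String> sample = lines.take(1000);
            System.out.println("sampled " + sample.size() + " lines");
            sc.stop();
        }
    }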

Rate limit with Apache Spark GCS connector

Submitted by 拈花ヽ惹草 on 2019-12-22 08:52:32
Question: I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and I get a lot of "rate limit" errors, as follows: java.io.IOException: Error inserting: bucket: *****, object: ***** at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475) at java.util.concurrent.ThreadPoolExecutor.runWorker …
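The connector may retry some failures on its own, but a coarse application-level backoff around bulk writes is a common additional mitigation. The sketch below is a generic, hedged Java example of that idea, not a documented connector setting; the bucket, object name, and retry parameters are illustrative:

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BackoffWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("gs://mybucket/"), new Configuration());
            Path target = new Path("gs://mybucket/output/part-00000");  // placeholder object
            int maxAttempts = 5;
            long delayMs = 500;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (FSDataOutputStream out = fs.create(target, true)) {
                    out.writeBytes("payload\n");
                    break;  // insert succeeded
                } catch (IOException e) {
                    if (attempt == maxAttempts) throw e;
                    Thread.sleep(delayMs);  // back off before retrying the insert
                    delayMs *= 2;           // exponential backoff
                }
            }
        }
    }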

GoogleHadoopFileSystem cannot be cast to hadoop FileSystem?

Submitted by 北战南征 on 2019-12-21 20:27:35
Question: The original question was about trying to deploy Spark 1.4 on Google Cloud. After downloading the tarball and setting SPARK_HADOOP2_TARBALL_URI='gs://my_bucket/my-images/spark-1.4.1-bin-hadoop2.6.tgz', deployment with bdutil was fine; however, when trying to call SqlContext.parquetFile("gs://my_bucket/some_data.parquet"), it runs into the following exception: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem at org.apache.hadoop.fs …
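A ClassCastException like this usually indicates two copies of org.apache.hadoop.fs.FileSystem on the classpath, for example a Spark assembly built against a different Hadoop version than the connector. A small, hedged Java diagnostic that prints which jar each class was loaded from:

    public class WhichJars {
        public static void main(String[] args) throws Exception {
            // Where org.apache.hadoop.fs.FileSystem comes from:
            System.out.println(org.apache.hadoop.fs.FileSystem.class
                    .getProtectionDomain().getCodeSource().getLocation());
            // Where the GCS connector's filesystem implementation comes from:
            System.out.println(Class.forName("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
                    .getProtectionDomain().getCodeSource().getLocation());
            // If these resolve to jars built against conflicting Hadoop versions, the cast can fail.
        }
    }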

Spark 1.4 image for Google Cloud?

Submitted by 只愿长相守 on 2019-12-12 01:40:45
Question: With bdutil, the latest version of the tarball I can find is for Spark 1.3.1: gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz There are a few new DataFrame features in Spark 1.4 that I want to use. Is there any chance a Spark 1.4 image will be made available for bdutil, or is there any workaround? UPDATE: Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz; the deployment went well. However, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this …

What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

Submitted by 怎甘沉沦 on 2019-12-11 22:35:12
Question: I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed a very strange behavior. Let me explain: Introduction I've launched a Hadoop cluster on Google Cloud (one click) set up to use a bucket. When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket: $ hadoop fs -ls / 14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2 Found 1 items -rwx------ 3 hadoop hadoop 40 …

Read from BigQuery into Spark in efficient way?

Submitted by 好久不见. on 2019-12-10 13:38:37
Question: When using the BigQuery connector to read data from BigQuery, I found that it first copies all data to Google Cloud Storage and then reads it into Spark in parallel; when reading a big table, the copying stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark? Another question: reading from BigQuery is composed of 2 stages (copying to GCS, then reading in parallel from GCS). Is the copying stage affected by the Spark cluster size, or does it take a fixed time? Answer 1: …
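For context, here is a hedged Java sketch of the usual read path with the BigQuery Hadoop connector, following the connector's published examples; the project, bucket, and table ids are placeholders, and the class and key names should be verified against the connector version in use. The staging of the table as files in GCS is part of this input format, which is why a copy stage shows up before the parallel read:

    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
    import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
    import com.google.gson.JsonObject;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BigQueryRead {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("bq-read"));
            Configuration conf = sc.hadoopConfiguration();
            // Placeholder project and staging bucket ids.
            conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project");
            conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket");
            // The connector stages the table as files in GCS before Spark reads them in parallel.
            BigQueryConfiguration.configureBigQueryInput(conf, "my-project:my_dataset.my_table");
            JavaPairRDD<LongWritable, JsonObject> rows = sc.newAPIHadoopRDD(
                    conf, GsonBigQueryInputFormat.class, LongWritable.class, JsonObject.class);
            System.out.println("row count: " + rows.count());
            sc.stop();
        }
    }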