emr

Nutch - Getting "Error: JAVA_HOME is not set." when trying to crawl

依然范特西╮ submitted on 2020-01-06 20:35:34
Question: First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to execute a crawl using the following command: sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5 I get "Error: JAVA_HOME is not set." If I run the command without sudo I get: Injector: starting at 2014-07-16 02:12:24 Injector: crawlDb: urls/crawldb Injector: urlDir: crawl Injector: Converting injected urls to crawl db entries. Injector: org…
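
A likely cause, with a hedged fix (the JVM path below is an assumption; check your instance): sudo starts a clean environment, so the JAVA_HOME exported in your login shell is not passed through to the crawl script. Either preserve the environment or set the variable explicitly for the sudo invocation:

    # assumed JVM path; verify with e.g. readlink -f $(which java)
    export JAVA_HOME=/usr/lib/jvm/java-openjdk
    # preserve the caller's environment for this command
    sudo -E bin/crawl crawl urls -dir crawl -depth 3 -topN 5
    # or pass the variable to this one command only
    sudo env JAVA_HOME="$JAVA_HOME" bin/crawl crawl urls -dir crawl -depth 3 -topN 5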

AWS EMR: how to use a shell script as a bootstrap action?

跟風遠走 submitted on 2020-01-04 06:27:14
Question: I need to be able to use Java 8 on EMR. I found this post, https://crazydoc1.wordpress.com/2015/08/23/java-8-on-amazon-emr-ami-4-0-0/, which provides a bootstrap shell script, https://gist.github.com/pstorch/c217d8324c4133a003c4, that installs Java 8. Looking at the documentation on how to use bootstrap scripts, it's not apparent at all how to use a shell script as a bootstrap action, since the documentation asks for a JAR location (https://docs.aws.amazon.com/ElasticMapReduce/latest…
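
For reference, a minimal sketch of launching a cluster with a shell-script bootstrap action via the AWS CLI; the bucket, script name, and cluster settings below are illustrative. Bootstrap actions take an S3 path to the script directly, so no JAR is involved:

    aws emr create-cluster \
        --name "java8-cluster" \
        --release-label emr-4.0.0 \
        --instance-type m3.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --bootstrap-actions Path="s3://mybucket/install-java8.sh",Name="Install Java 8"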

External checkpoints to S3 on EMR

放肆的年华 submitted on 2020-01-02 22:07:25
Question: I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, using YARN to run it. I am trying to configure RocksDB to write my checkpoints to an S3 bucket, going through these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying to run the program: java.lang…
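
Assuming the S3 filesystem classes from the linked AWS docs are already on the classpath, the flink-conf.yaml side of this usually looks like the sketch below (the bucket name is illustrative, and the key names are the Flink 1.3 spellings):

    state.backend: rocksdb
    # where the RocksDB state backend writes checkpoint data
    state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
    # where externalized checkpoint metadata is kept
    state.checkpoints.dir: s3://my-bucket/flink/ext-checkpoints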

Amazon EMR and Hive: Getting a “java.io.IOException: Not a file” exception when loading subdirectories to an external table

半城伤御伤魂 submitted on 2020-01-02 09:58:29
Question: I'm using Amazon EMR. I have some log data in S3, all in the same bucket but under different subdirectories, like: "s3://bucketname/2014/08/01/abc/file1.bz" "s3://bucketname/2014/08/01/abc/file2.bz" "s3://bucketname/2014/08/01/xyz/file1.bz" "s3://bucketname/2014/08/01/xyz/file3.bz" I'm using: Set hive.mapred.supports.subdirectories=true; Set mapred.input.dir.recursive=true; When trying to load all data from "s3://bucketname/2014/08/": CREATE EXTERNAL TABLE table1(id string, at string, custom…
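
A hedged sketch of the settings that commonly resolve the "Not a file" error (the column list is abbreviated; the mapreduce.* key is the newer spelling of mapred.input.dir.recursive, and forcing HiveInputFormat avoids the combining input format, which does not recurse into subdirectories):

    SET hive.mapred.supports.subdirectories=true;
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SET mapreduce.input.fileinputformat.input.dir.recursive=true;

    CREATE EXTERNAL TABLE table1 (id STRING, at STRING)
        LOCATION 's3://bucketname/2014/08/';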

SparkUI for pyspark - corresponding line of code for each stage?

懵懂的女人 submitted on 2020-01-02 02:55:09
Question: I have a PySpark program running on an AWS cluster. I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike Scala or Java Spark programs, where the UI shows which line of code each stage corresponds to, I can't find which stage corresponds to which line of the PySpark code. Is there a way to figure out which stage corresponds to which line of the PySpark code? Thanks! Source: https://stackoverflow.com/questions/38315344/sparkui-for-pyspark…
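
The Python call sites shown in the UI are rarely informative, but you can label the work yourself. A minimal sketch, assuming nothing about the original job (the app name and path are made up): setJobGroup attaches a description that the Spark UI displays for the stages each subsequent action triggers.

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-label-demo")

    sc.setJobGroup("load", "load raw events from S3")   # label shown in the Spark UI
    events = sc.textFile("s3://my-bucket/events/")      # illustrative path

    sc.setJobGroup("count", "count distinct event ids")
    print(events.map(lambda line: line.split(",")[0]).distinct().count())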

Terminating a Spark step in AWS

拟墨画扇 submitted on 2019-12-31 22:24:35
Question: I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a Spark step in EMR without terminating the entire cluster? Answer 1: That's easy:…
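
The answer is truncated above; the usual approach it points at is sketched below (the application id is illustrative). Spark on EMR runs under YARN, which is why hadoop jobs -list shows nothing; list and kill the YARN application instead:

    yarn application -list                                   # find the running Spark app
    yarn application -kill application_1484927398311_0001    # kill just that step's app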

Exporting a Hive table to an S3 bucket

故事扮演 submitted on 2019-12-30 00:52:08
Question: I've created a Hive table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport; I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance. Does anyone know how to do this? Answer 1: Yes, you have to export and import…
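
The answer is cut off above; a common version of the export step is sketched here (the bucket and path are illustrative): create an external table whose LOCATION is the S3 bucket, then copy the rows into it.

    -- external table backed by S3; data survives cluster termination
    CREATE EXTERNAL TABLE csvexport (id BIGINT, time STRING, log STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3://mybucket/csvexport/';

    -- copy the rows from the local table into the S3-backed one
    INSERT OVERWRITE TABLE csvexport
    SELECT id, time, log FROM csvimport;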

Spark AWS EMR checkpoint location

六月ゝ 毕业季﹏ submitted on 2019-12-25 09:10:45
Question: I'm running a Spark job on EMR but need to create a checkpoint. I tried using S3 but got this error message: 17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020 java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-…
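
One common workaround, sketched below with illustrative paths (the error suggests the checkpoint path is being resolved against the cluster's default HDFS filesystem): point the checkpoint directory at HDFS, which is also typically faster, and keep S3 for final output only.

    from pyspark import SparkContext

    sc = SparkContext(appName="checkpoint-demo")
    # matches the default filesystem the error message expects
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    squares = sc.parallelize(range(1000)).map(lambda x: x * x)
    squares.checkpoint()
    print(squares.count())  # the action materializes the RDD and writes the checkpoint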