Question
Basically, I am trying to solve the following problem after setting up PyCharm against the Glue ETL dev endpoint, following this tutorial:
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
The above file is owned by hadoop.
[glue@ip-xx.xx.xx.xx ~]$ ls -la /var/aws/emr/
total 32
drwxr-xr-x 4 root root 4096 Mar 24 19:35 .
drwxr-xr-x 3 root root 4096 Feb 12 2019 ..
drwxr-xr-x 3 root root 4096 Feb 12 2019 bigtop-deploy
drwxr-xr-x 3 root root 4096 Mar 24 19:35 packages
-rw-r--r-- 1 root root 1713 Feb 12 2019 repoPublicKey.txt
-r--r----- 1 hadoop hadoop 10221 Mar 24 19:34 userData.json
And I am not able to change its permissions as suggested by Eric here. I SSH into my dev endpoint using my private key:
ssh -i ~/.ssh/<my_private_key> glue@ec2-xx.xx.xx.xx.eu-west-1.compute.amazonaws.com
and I cannot switch to the hadoop user with `sudo -su hadoop`, because it prompts for a password I don't know: `[sudo] password for glue:`. Nor can I SSH into the endpoint as the hadoop user (instead of root (glue)); that fails with `permission denied (publickey)`. My question is: how on earth would I know the root user (glue) password of the dev endpoint? I was never asked to set one up while creating the dev endpoint. Or, how can I SSH into the dev endpoint as the hadoop user?
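For what it's worth, the failing read is easy to reproduce from the endpoint itself; a minimal check (run as the glue user, with the path from the error above):

import os

# userData.json is mode r--r----- and owned by hadoop:hadoop (see the
# listing above), so the glue user has no read permission on it.
path = "/var/aws/emr/userData.json"
print(os.access(path, os.R_OK))  # prints False when run as glue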
Answer 1:
So this wasn't the actual problem. I got a review from the AWS team, and they said you'll get these rubbish warnings and errors while running Spark scripts on EMR via PyCharm, but they shouldn't affect the actual task of your script. It turned out that the DynamicFrame I was creating:
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table")
was not showing me any schema when I called `persons_DyF.printSchema()`, whereas I am pretty sure I defined that table's schema. It just outputs `root`, and `persons_DyF.count()` returns 0.
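For reference, here is that failing check as a runnable sketch, with the usual Glue dev-endpoint boilerplate added around it (the database and table names are the placeholders from above):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Standard Glue setup; SparkContext.getOrCreate() reuses the endpoint's context.
glueContext = GlueContext(SparkContext.getOrCreate())

persons_DyF = glueContext.create_dynamic_frame.from_catalog(
    database="database", table_name="table")

persons_DyF.printSchema()   # only prints "root", no fields
print(persons_DyF.count())  # prints 0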
So I had to use PySpark instead:
from pyspark.sql import SparkSession

# Fall back to plain PySpark and read the catalog table directly.
spark = SparkSession.builder.getOrCreate()
df = spark.read.table("ingestion.login_emr_testing")
df.printSchema()  # printSchema() prints the schema itself and returns None
df.select(df["feed"], df["timestamp_utc"], df["date"], df["hour"]).show()
which gave me the following result:
...
a lot of rubbish errors and warnings, including `java.io.IOException: File '/var/aws/emr/userData.json' cannot be read`
...
+------+--------------------+----------+----+
| feed | timestamp_utc| date|hour|
+------+--------------------+----------+----+
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
+------+--------------------+----------+----+
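As a side note, if later steps still need a DynamicFrame rather than a plain DataFrame, the PySpark result can be wrapped back; a sketch, assuming the glueContext from the snippet above is in scope (the name "login_dyf" is arbitrary):

from awsglue.dynamicframe import DynamicFrame

# Wrap the Spark DataFrame back into a Glue DynamicFrame so Glue
# transforms and sinks can still be applied downstream.
login_dyf = DynamicFrame.fromDF(df, glueContext, "login_dyf")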
Source: https://stackoverflow.com/questions/60839451/ssh-into-glue-dev-endpoint-as-hadoop-user-file-var-aws-emr-userdata-json-can