Question
Basically, I am trying to solve the following problem after setting up PyCharm against the Glue ETL dev endpoint, following this tutorial:
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
The above file is owned by hadoop.
[glue@ip-xx.xx.xx.xx ~]$ ls -la /var/aws/emr/
total 32
drwxr-xr-x 4 root root 4096 Mar 24 19:35 .
drwxr-xr-x 3 root root 4096 Feb 12 2019 ..
drwxr-xr-x 3 root root 4096 Feb 12 2019 bigtop-deploy
drwxr-xr-x 3 root root 4096 Mar 24 19:35 packages
-rw-r--r-- 1 root root 1713 Feb 12 2019 repoPublicKey.txt
-r--r----- 1 hadoop hadoop 10221 Mar 24 19:34 userData.json
And I am not able to change its permissions as suggested by Eric here. I SSH into my dev endpoint using my private key:
ssh -i ~/.ssh/<my_private_key> glue@ec2-xx.xx.xx.xx.eu-west-1.compute.amazonaws.com
and I cannot switch to the hadoop user with `sudo -su hadoop`, because it prompts for a password I don't know: `[sudo] password for glue:`. Nor can I SSH into the endpoint as the hadoop user (instead of root (glue)); that fails with `permission denied (publickey)`. My question is: how on earth would I know the root user (glue) password of the dev endpoint? I was never asked to set one up while creating the dev endpoint. Or, how can I SSH into the dev endpoint as the hadoop user?
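For what it's worth, the failing read is easy to reproduce from the endpoint itself; a minimal check (run as the glue user, with the path from the error above):

import os

# userData.json is mode r--r----- and owned by hadoop:hadoop (see the
# listing above), so the glue user has no read permission on it.
path = "/var/aws/emr/userData.json"
print(os.access(path, os.R_OK))  # prints False when run as glue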
Answer 1:
So this wasn't the actual problem. I got a review from the AWS team, and they said you'll get these rubbish warnings and errors while running Spark scripts on EMR via PyCharm, but they shouldn't affect the actual task of your script. It turned out that the DynamicFrame I was creating:
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table")
was not showing me any schema when I called `persons_DyF.printSchema()`, whereas I am pretty sure I defined that table's schema. It just outputs `root`, and `persons_DyF.count()` returns 0.
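For reference, here is that failing check as a runnable sketch, with the usual Glue dev-endpoint boilerplate added around it (the database and table names are the placeholders from above):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Standard Glue setup; SparkContext.getOrCreate() reuses the endpoint's context.
glueContext = GlueContext(SparkContext.getOrCreate())

persons_DyF = glueContext.create_dynamic_frame.from_catalog(
    database="database", table_name="table")

persons_DyF.printSchema()   # only prints "root", no fields
print(persons_DyF.count())  # prints 0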
So I had to use PySpark instead:
from pyspark.sql import SparkSession

# Fall back to plain PySpark and read the catalog table directly.
spark = SparkSession.builder.getOrCreate()
df = spark.read.table("ingestion.login_emr_testing")
df.printSchema()  # printSchema() prints the schema itself and returns None
df.select(df["feed"], df["timestamp_utc"], df["date"], df["hour"]).show()
which gave me the following result:
...
a lot of rubbish errors and warnings, including `java.io.IOException: File '/var/aws/emr/userData.json' cannot be read`
...
+------+--------------------+----------+----+
| feed | timestamp_utc| date|hour|
+------+--------------------+----------+----+
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 18:59...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
|TWEAKS|19-Mar-2020 19:00...|2020-03-19| 19|
+------+--------------------+----------+----+
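As a side note, if later steps still need a DynamicFrame rather than a plain DataFrame, the PySpark result can be wrapped back; a sketch, assuming the glueContext from the snippet above is in scope (the name "login_dyf" is arbitrary):

from awsglue.dynamicframe import DynamicFrame

# Wrap the Spark DataFrame back into a Glue DynamicFrame so Glue
# transforms and sinks can still be applied downstream.
login_dyf = DynamicFrame.fromDF(df, glueContext, "login_dyf")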
Source: https://stackoverflow.com/questions/60839451/ssh-into-glue-dev-endpoint-as-hadoop-user-file-var-aws-emr-userdata-json-can