EMRFS file sync with S3 not working

野趣味 2021-02-20 08:51

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received an error when trying to write the output again.

3 Answers
  • 2021-02-20 09:07

    It turned out that I needed to run

    emrfs delete s3://bucket/folder
    

    first, before running emrfs sync. Running the delete first solved the issue.
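
    A minimal sketch of the full sequence, using the same placeholder bucket and folder names as above:

    # Remove the stale EMRFS metadata entries for the deleted output path
    emrfs delete s3://bucket/folder
    # Re-sync the EMRFS metadata with what actually exists in S3
    emrfs sync s3://bucket/folder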

  • 2021-02-20 09:18

    I arrived at this page because I was getting the error "key is marked as directory in metadata but is file in s3" and was very puzzled. I think what happened is that I accidentally created both a file and a directory with the same name. Deleting the file solved my issue:

    aws s3 rm s3://bucket/directory_name_without_trailing_slash
    
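    If you are not sure whether such a name collision exists, a quick check (the bucket and key names are placeholders) is to list the prefix:

    # If the collision exists, this shows both a plain object and a "PRE" (directory) entry with the same name
    aws s3 ls s3://bucket/directory_name_without_trailing_slash
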
  • 2021-02-20 09:24

    Most of these consistency problems come from the retry logic in the Spark and Hadoop layers. When a process fails while creating a file on S3 but the corresponding entry has already been written to the EMRFS metadata in DynamoDB, Hadoop retries the operation, finds the entry already present in DynamoDB, and throws a consistency error.

    If you want to delete the DynamoDB metadata for S3 objects that have already been removed, these are the steps.

    Delete all the metadata entries for the path. emrfs delete uses a hash function to locate the records to delete, so it may remove unwanted entries as well; that is why the import and sync are done in the subsequent steps.

    emrfs delete s3://path
    

    Import the metadata for the objects that are physically present in S3 back into DynamoDB:

    emrfs import s3://path
    

    Sync the metadata with the data in S3:

    emrfs sync s3://path
    

    After all these operations, to check whether a particular object is present in both S3 and the metadata, run:

    emrfs diff s3://path 
    

    http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html
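
    Putting the steps together, a minimal sketch of the whole sequence (s3://path is a placeholder for the affected prefix; run it where the emrfs CLI is available, for example on the EMR master node):

    # Placeholder prefix whose metadata is inconsistent
    PREFIX="s3://path"
    # 1. Drop the (possibly stale) metadata entries for the prefix
    emrfs delete "$PREFIX"
    # 2. Re-import metadata for the objects that actually exist in S3
    emrfs import "$PREFIX"
    # 3. Sync the metadata store with S3
    emrfs sync "$PREFIX"
    # 4. Report any remaining differences between S3 and the metadata
    emrfs diff "$PREFIX"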
