Question
These are my steps:
1. Submit the Spark app to an EMR cluster.
2. The driver starts and I can see the Spark UI (no stages have been created yet).
3. The driver reads an ORC file with ~3000 parts from S3, makes some transformations, and saves it back to S3.
4. The save should create some stages in the Spark UI, but the stages take a really long time to appear there.
5. The stages appear and execution starts.
Why am I getting that huge delay in step 4? During this time the cluster is apparently waiting for something and the CPU usage is 0%.
Thanks
Answer 1:
Despite its merits, S3 is not a file system, which makes it a suboptimal choice for working with complex binary formats, which are typically designed with an actual file system in mind. In many cases, secondary tasks (such as reading metadata) are more expensive than the actual data fetching.
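To see how much of your gap is pure metadata work, here is a minimal sketch (Scala, using the standard Hadoop FileSystem API; the bucket and path are placeholders) that times only the listing of the part files, with no data read at all:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListTiming {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Placeholder path: the ORC "directory" holding the ~3000 part files.
        val path = new Path("s3a://my-bucket/my-orc-table/")
        val fs   = FileSystem.get(path.toUri, conf)
        val t0   = System.nanoTime()
        // This issues S3 LIST requests only; on a real file system the same
        // call is a cheap local metadata lookup.
        val statuses = fs.listStatus(path)
        println(s"Listed ${statuses.length} objects in ${(System.nanoTime() - t0) / 1e6} ms")
      }
    }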
Answer 2:
It's probably the commit process between steps 3 and 4: the Hadoop MR and Spark committers assume that rename is an O(1) atomic operation, and rely on it to do atomic commits of work. On S3, rename is O(data) and non-atomic when multiple files in a directory are involved. The 0% CPU load is the giveaway: the client is just awaiting a response from S3, which is doing the COPY internally at 6-10 MB/s.
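The rename cost is easy to observe in isolation. Below is a hedged sketch (Scala, Hadoop FileSystem API; both paths are hypothetical placeholders) of what the classic committer's per-file rename amounts to on S3A, where rename is implemented as a server-side COPY followed by a DELETE, so its duration scales with object size:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object RenameCost {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Placeholder paths: a task's temporary output and its final location.
        val src = new Path("s3a://my-bucket/table/_temporary/0/part-00000")
        val dst = new Path("s3a://my-bucket/table/part-00000")
        val fs  = FileSystem.get(src.toUri, conf)
        val t0  = System.nanoTime()
        // On S3A this is not a metadata operation: the object is copied
        // server-side at copy bandwidth, then the source is deleted.
        val ok = fs.rename(src, dst)
        println(s"rename ok=$ok, took ${(System.nanoTime() - t0) / 1e6} ms")
      }
    }

Multiply one such copy by ~3000 part files and you get roughly the stage-less, zero-CPU gap you see in the UI.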
There's work underway in HADOOP-13345 to do a 0-rename commit in S3. For now, you can look for the famed-but-fails-in-interesting-ways Direct Committer from Databricks.
One more thing: make sure you are using "algorithm 2" for committing, as algorithm 1 does a lot more renaming in the final job-master commit. My full recommended settings for ORC/Parquet performance on Hadoop 2.7 are (along with using s3a: URLs; a sketch of applying them programmatically follows the list):
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
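These settings can go in spark-defaults.conf or be passed as --conf flags to spark-submit; as a convenience, here is a minimal sketch (Scala; the app name is a placeholder) of applying the same list programmatically when building the SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-on-s3a")  // placeholder name
      .config("spark.sql.parquet.filterPushdown", "true")
      .config("spark.sql.parquet.mergeSchema", "false")
      .config("spark.hadoop.parquet.enable.summary-metadata", "false")
      .config("spark.sql.orc.filterPushdown", "true")
      .config("spark.sql.orc.splits.include.file.footer", "true")
      .config("spark.sql.orc.cache.stripe.details.size", "10000")
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .config("spark.speculation", "false")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
      .getOrCreate()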
Source: https://stackoverflow.com/questions/41558052/huge-delays-translating-the-dag-to-tasks