Question
I found a similar question, Hadoop HDFS is not distributing blocks of data evenly, but my question is about the case where the replication factor = 1.
I still want to understand why HDFS does not evenly distribute a file's blocks across the cluster nodes. This causes data skew from the start when I load such files and run DataFrame operations on them. Am I missing something?
Answer 1:
Even if the replication factor is one, files are still split into chunks of the HDFS block size (128 MB by default) and stored block by block. Block placement is best-effort, AFAIK, not perfectly balanced. With a replication factor of 3, the default policy writes the first replica to the local DataNode (or a random node if the writer is not a DataNode), the second replica to a node on a different rack, and the third to a different node on that same remote rack.
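The split arithmetic above can be sketched quickly. This is a minimal illustration, assuming the default 128 MiB block size (`dfs.blocksize`); `num_blocks` is a hypothetical helper, not part of any HDFS API:

```python
import math

# Default dfs.blocksize in recent Hadoop versions; configurable per cluster.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB

def num_blocks(file_size_bytes: int) -> int:
    """How many HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 200 MiB file splits into 2 blocks; a 50 MiB file fits in 1 block,
# so a small file can never be "distributed" across more than one node.
print(num_blocks(200 * 1024 * 1024))  # → 2
print(num_blocks(50 * 1024 * 1024))   # → 1
```

The point for the question: with replication factor 1, a file smaller than one block lives on exactly one DataNode, so there is nothing to spread.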
You'll need to clarify how large your files are and where you are looking to see whether the data is being split. `hdfs fsck <path> -files -blocks -locations` will list each block of a file along with the DataNodes that hold it.
Note: not all file formats are splittable (gzip, for example, is not).
Source: https://stackoverflow.com/questions/59363801/hdfs-put-movefromlocal-not-distributing-data-across-data-nodes