Question
I use PySpark to write a Parquet file, and I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
Does this have to be set before starting the PySpark job? If so, how do I do it?
Answer 1:
Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # save the output with a 128 MB block size
In Scala:
sc.hadoopConfiguration.set("dfs.block.size", "128m")
Answer 2:
I had a similar issue and figured out the cause: the value needs to be a number of bytes, not "128m". Therefore this should work (it worked for me, at least):
block_size = str(1024 * 1024 * 128)
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
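As a minimal sketch building on the snippet above, the conversion to a plain byte count could be wrapped in a small helper, so that a value like "128m" is never passed through as-is. The names `to_bytes` and `set_block_size` are hypothetical helpers introduced here for illustration, not part of PySpark, and `sc` is assumed to be an already created SparkContext:

```python
def to_bytes(size):
    """Convert a size such as '128m' or 134217728 to a byte count string."""
    # Binary multipliers for the suffixes this sketch understands (assumption).
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = str(size).strip().lower()
    if text and text[-1] in units:
        return str(int(text[:-1]) * units[text[-1]])
    # No suffix: the value is taken to be a byte count already.
    return str(int(text))


def set_block_size(sc, size):
    """Set dfs.block.size on the Hadoop configuration of an existing SparkContext."""
    value = to_bytes(size)
    sc._jsc.hadoopConfiguration().set("dfs.block.size", value)
    return value
```

Calling `set_block_size(sc, "128m")` before the write would then set the same byte count as the snippet above. Setting it before the file is written is assumed to be necessary here, since the configuration is presumably read when the output file is created.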
Source: https://stackoverflow.com/questions/40954825/how-to-change-hdfs-block-size-in-pyspark