How to change hdfs block size in pyspark?

Question

I use PySpark to write a Parquet file, and I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do that?

Answer 1:

Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # save the output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")

Answer 2:

I had a similar issue and figured out the cause: the value needs to be a number of bytes, not "128m". So this should work (it worked for me, at least):

block_size = str(1024 * 1024 * 128)
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
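
If you would rather keep writing sizes like "128m" in your own code, the conversion to a byte count can be sketched as below. Note that size_to_bytes and its suffix table are hypothetical helpers written for this illustration, not part of PySpark or Hadoop, and the sketch assumes binary units (1k = 1024 bytes). The Spark call is left as a comment so the snippet runs without a cluster.

```python
import re

# Binary multipliers for the suffixes this sketch accepts (assumption: k/m/g/t only).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def size_to_bytes(size):
    """Convert a size such as "128m" or "134217728" to a plain byte-count string."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?)b?\s*", str(size).lower())
    if match is None:
        raise ValueError("unrecognised size: %r" % (size,))
    number, unit = match.groups()
    return str(int(number) * _UNITS[unit])


block_size = size_to_bytes("128m")  # same value as str(1024 * 1024 * 128)

# With a live SparkContext `sc`, the value would be passed exactly as in the answer:
# sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
```

The helper returns a string because the Hadoop configuration setter used above takes string values.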

Source: https://stackoverflow.com/questions/40954825/how-to-change-hdfs-block-size-in-pyspark

Tags

Hadoop

apache-spark

HDFS

pyspark

apache-spark-1.6
