Spark: difference when read in .gz and .bz2

Submitted by 可紊 on 2019-12-09 00:39:25

Question


I normally read and write files in Spark using .gz, where the number of files is the same as the number of RDD partitions, i.e. one giant .gz file is read into a single partition. However, if I read in a single .bz2 file, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?

Also, how do I know how many partitions an RDD has after Hadoop reads it in from one bz2 file? Thanks!


Answer 1:


    However, if I read in a single .bz2 file, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?

If you specify n partitions to read a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is set to sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).
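The reason bzip2 files can be read in parallel is the container format itself: a .bz2 stream is a sequence of independently compressed blocks, each introduced by a 48-bit magic number, and Hadoop's Bzip2Codec locates block boundaries by scanning for that marker (at arbitrary bit offsets, since blocks are not byte-aligned). A minimal sketch of this with Python's standard bz2 module — the input text and sizes here are illustrative, and the block count follows from compresslevel, which sets the block size to level × 100 kB:

```python
import bz2

# bzip2 output is a sequence of independently compressed blocks, each
# introduced by the 48-bit magic number 0x314159265359 (pi's digits).
# Splittable readers find block boundaries by scanning for this marker
# at arbitrary *bit* offsets, since blocks are not byte-aligned.
BLOCK_MAGIC = 0x314159265359

def count_bit_pattern(data: bytes, pattern: int, width: int = 48) -> int:
    """Count occurrences of a bit pattern at any bit offset in data."""
    n = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    hits = 0
    for shift in range(len(data) * 8 - width, -1, -1):
        if (n >> shift) & mask == pattern:
            hits += 1
    return hits

data = b"some line of text\n" * 20000            # ~360 kB uncompressed

# compresslevel=1 -> 100 kB blocks, so this input spans several blocks;
# compresslevel=9 -> 900 kB blocks, so it fits in a single block.
multi = bz2.compress(data, compresslevel=1)
single = bz2.compress(data, compresslevel=9)

print(count_bit_pattern(multi, BLOCK_MAGIC))   # several block markers
print(count_bit_pattern(single, BLOCK_MAGIC))  # one block marker
```

Each marker found this way is a point where an independent reader (one Spark task) could start decompressing, which is what makes the parallel read possible.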


    one giant .gz file is read into a single partition.

Please note that you can always do a

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

to get the desired number of partitions after the file has been read (note that repartition triggers a full shuffle).
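By contrast, a gzip file is a single DEFLATE stream with one header at the front: there is no block marker a reader could seek to mid-file, which is why the whole file must go to one task and repartitioning only helps after the decompressing read. A small illustration with Python's standard library (the data here is illustrative):

```python
import gzip
import zlib

data = b"line of text\n" * 50000
compressed = gzip.compress(data)

# The whole file decompresses fine from the start...
assert gzip.decompress(compressed) == data

# ...but a reader dropped into the middle of the stream cannot recover:
# there is no gzip header or block marker at the split point.
tail = compressed[len(compressed) // 2:]
try:
    zlib.decompress(tail, wbits=31)   # wbits=31 expects gzip framing
    recovered = True
except zlib.error:
    recovered = False

print("mid-stream read ok:", recovered)
```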


    Also, how do I know how many partitions an RDD has after Hadoop reads it in from one bz2 file.

That would be yourRDD.partitions.size for the Scala API or yourRDD.getNumPartitions() for the Python API.




Answer 2:


I didn't understand why my test program was running on only one executor; after some testing I think I've figureded it out. It happens with code like this:

in Spark SQL (Scala):

// Load a DataFrame of users. Each line in the file is a JSON
// document, representing one row.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val user = sqlContext.read.json("users.json.bz2")


Source: https://stackoverflow.com/questions/37445054/spark-difference-when-read-in-gz-and-bz2
