How to load only first n files in pyspark spark.read.csv from a single directory

Submitted by 可紊 on 2019-12-11 05:07:59

Question


  • I have a scenario where I am loading and processing 4 TB of data, which is about 15000 .csv files in a folder.
  • Since I have limited resources, I am planning to process them in two batches and then union them.
  • I am trying to understand whether I can load only 50% of the files (or the first n files in batch 1 and the rest in batch 2) using
    spark.read.csv.

  • I cannot use a regular expression, because these files are generated by multiple sources and their counts are uneven (some sources produce a few, others produce many). If I process the files in uneven batches using wildcards or a regex, I may not get optimized performance.

  • Is there a way to tell the spark.read.csv reader to pick the first n files, and then load the remaining files in a second pass?

  • I know this could be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.


Answer 1:


It's easy if you use the Hadoop FileSystem API to list the files first and then create DataFrames from chunks of that list. For example:

path = '/path/to/files/'

# List every file in the directory through the Hadoop FileSystem API (via the JVM gateway)
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# spark.read.csv accepts a list of paths, so slice the listing into two batches
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
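
Since listStatus may not return paths in any particular order on every filesystem, sorting the list before slicing makes the "first n files" split deterministic. The same idea also generalizes to more than two batches; the following is a minimal sketch, where batch_size and the union step are illustrative assumptions rather than part of the original answer:

from functools import reduce

batch_size = 7500                     # assumed chunk size; tune to the available resources
paths = sorted(paths)                 # optional: make the batch split deterministic
batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

# Read each batch separately, then union the results (assumes all files share one schema)
dfs = [spark.read.csv(batch) for batch in batches]
combined = reduce(lambda a, b: a.union(b), dfs)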


Source: https://stackoverflow.com/questions/46533508/how-to-load-only-first-n-files-in-pyspark-spark-read-csv-from-a-single-directory
