Question
I have a scenario where I am loading and processing 4 TB of data, spread across roughly 15,000 .csv files in a single folder.
Since I have limited resources, I plan to process them in two batches and then union the results.
I am trying to understand whether I can load only 50% of the files (i.e., the first n files in batch 1 and the rest in batch 2) using spark.read.csv. I cannot use a regular expression, because the files are generated by multiple sources and their counts are uneven (some sources produce few files, others produce many), so splitting them into uneven batches with wildcards or regexes may not give optimal performance.
Is there a way to tell the spark.read.csv reader to pick the first n files, and then in a second call load the remaining files?
I know this could be done by writing another program, but I would prefer not to, as I have more than 20,000 files and I don't want to iterate over them.
Answer 1:
It's easy if you use the Hadoop FileSystem API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# List the files via the Hadoop FileSystem API exposed through Spark's JVM gateway
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# Pass each half of the path list to spark.read.csv as its own DataFrame
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
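Since the question also mentions unioning the two batches afterwards, here is a minimal sketch of how that could look; the unionByName call and the output path are assumptions, not part of the original answer.

# Assumed follow-up (not in the original answer): combine the two halves.
# unionByName matches columns by name, which is safer than union() if the
# column order could differ between the two reads.
combined = df1.unionByName(df2)

# Alternative sketch under the same assumption: write each batch out separately
# (the output path is hypothetical) so only one batch is held in flight at a time,
# then read the combined output later.
# df1.write.mode('overwrite').parquet('/path/to/output')
# df2.write.mode('append').parquet('/path/to/output')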
Source: https://stackoverflow.com/questions/46533508/how-to-load-only-first-n-files-in-pyspark-spark-read-csv-from-a-single-directory