Question
I am trying to test a few ideas for recursively looping through all files in a folder and its sub-folders, and loading everything into a single dataframe. I have 12 different kinds of files, and the differences are based on the file naming conventions. So, I have file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following 3 ideas.
import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import input_file_name
df = sqlContext.read.format("com.databricks.spark.text").option("header", "false").load("dbfs/mnt/rawdata/2019/06/28/Parent/ABC*.gz")
df.withColumn('input', input_file_name())
print(df)
or
df = sc.textFile('/mnt/rawdata/2019/06/28/Parent/ABC*.gz')
print(df)
or
df = sc.sequenceFile('dbfs/mnt/rawdata/2019/06/28/Parent/ABC*.gz/').toDF()
df.withColumn('input', input_file_name())
print(df)
This can be done with PySpark or PySpark SQL. I just need to get everything loaded from a data lake into a dataframe so I can push the dataframe into Azure SQL Server. I'm doing all of the coding in Azure Databricks. If this were regular Python, I could do it pretty easily. I just don't know PySpark well enough to get this working.
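For context, once the dataframe is loaded, the push into Azure SQL Server can go through Spark's generic JDBC writer. Here is a minimal sketch of that step; the server, database, table name, and credentials are placeholders, not values from my environment:

# Hypothetical connection details -- replace with your own server, database, and credentials.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase"

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.ParentFiles")  # assumed target table name
   .option("user", "my_sql_user")         # placeholder credentials
   .option("password", "my_sql_password")
   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
   .mode("append")
   .save())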
Just to illustrate the point, I have 3 zipped files that look like this (ABC0006.gz, ABC00015.gz, and ABC0022.gz):
ABC0006.gz
0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993
ABC00015.gz
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003
ABC0022.gz
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017
I want to merge everything into one dataframe that looks like this (the .gz is the name of the file; each file has exactly the same headers):
0x0000fa00|ABC|T3|1995
0x00102c55|ABC|K2|2017
0x00024600|ABC|V0|1993
0x00102c54|ABC|G1|2016
0x00102cac|ABC|S4|2017
0x00038600|ABC|F6|2003
0x00102c57|ABC|J0|2017
0x0000fa00|ABC|J6|1994
0x00102cec|ABC|V2|2017
I've got thousands of these files to get through. Fortunately, there are just 12 distinct types of files and thus 12 types of names...starting with 'ABC', 'CN', 'CZ', etc. Thanks for the look here.
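One way I could imagine handling all 12 prefixes in a single pass is to read each prefix with a glob, tag every row with its source file, and union the pieces. This is only a sketch and assumes all 12 file types are pipe-delimited text with no header and the same column layout (if they differ, keep one dataframe per prefix instead of unioning):

from functools import reduce
from pyspark.sql.functions import input_file_name

path = '/mnt/rawdata/2019/06/28/Parent/'
prefixes = ['ABC', 'CN', 'CZ']  # plus the other 9 prefixes

dfs = []
for prefix in prefixes:
    # Read every gzipped, pipe-delimited file for this prefix; Spark decompresses .gz on the fly.
    part = (spark.read
                 .option("sep", "|")
                 .option("header", "false")
                 .csv(path + prefix + '*.gz')
                 .withColumn('input', input_file_name()))  # note: withColumn returns a new dataframe
    dfs.append(part)

# Union everything into one dataframe (all files share the same layout).
df = reduce(lambda a, b: a.unionByName(b), dfs)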
Based on your comments, Abraham, it seems like my code should look like this, right...
file_list = []
path = 'dbfs/rawdata/2019/06/28/Parent/'
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith('ABC'):
        file_list.append(file.name)
df = spark.read.load(path=file_list)
Is this correct, or is this not correct? Please advise. I think we are close, but this still doesn't work for me, or I wouldn't be re-posting here. Thanks!!
Answer 1:
PySpark supports loading a list of files using the load function. I believe this is what you are looking for:
file_list = []
path = 'dbfs/mnt/rawdata/2019/06/28/Parent/'
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith('ABC'):
        file_list.append(file.name)
df = spark.read.load(path=file_list)
If the files are CSV and have a header, use the command below:
df = spark.read.load(path=file_list,format="csv", sep=",", inferSchema="true", header="true")
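One note for your specific files: they look pipe-delimited with no header row, and dbutils.fs.ls returns names relative to the directory, so you may need the full path (file.path) plus sep="|". Something like this (untested against your data, so treat it as a sketch):

file_list = []
path = 'dbfs:/mnt/rawdata/2019/06/28/Parent/'
for file in dbutils.fs.ls(path):
    if file.name.startswith('ABC'):
        file_list.append(file.path)  # full path, not just the file name

# Pipe-delimited, gzipped text files with no header (per the sample data above).
df = spark.read.load(path=file_list, format="csv", sep="|", inferSchema="true", header="false")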
For more example code, refer to https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
Answer 2:
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
  .option("sep","|")
  .option("inferSchema","true")
  .option("header","false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.count()
Apparently the decompression of the gzipped files and the schema inference are all handled automatically. Thus, the code is super lightweight, and very fast too.
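Since I'm doing everything else in PySpark, the Python version of the same read should be essentially identical (same DataFrameReader options); something along these lines:

myDFCsv = (spark.read.format("csv")
                .option("sep", "|")
                .option("inferSchema", "true")
                .option("header", "false")
                .load("mnt/rawdata/2019/01/01/client/ABC*.gz"))

myDFCsv.show()
myDFCsv.count()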
Source: https://stackoverflow.com/questions/58205129/is-there-a-way-to-load-multiple-text-files-into-a-single-dataframe-using-databri