Question
I have a fixed-width file as below:
00120181120xyz12341
00220180203abc56792
00320181203pqr25483
And a corresponding JSON file that specifies the schema:
{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}
I read the schema file into a DataFrame using:
SchemaFile = spark.read\
    .format("json")\
    .load(r'C:\Temp\schemaFile\schema.json')
SchemaFile.show()
#+------+----+---+
#|Column|From| To|
#+------+----+---+
#| id| 1| 3|
#| date| 4| 8|
#| name| 12| 3|
#|salary| 15| 5|
#+------+----+---+
Likewise, I am reading the fixed-width file into a PySpark DataFrame as below:
File = spark.read\
    .format("csv")\
    .option("header", "false")\
    .load(r"C:\Temp\samplefile.txt")
File.show()
#+-------------------+
#| _c0|
#+-------------------+
#|00120181120xyz12341|
#|00220180203abc56792|
#|00320181203pqr25483|
#+-------------------+
I can obviously hard-code the positions and lengths of each column to get the desired output:
from pyspark.sql.functions import substring
data = File.select(
    substring(File._c0, 1, 3).alias('id'),
    substring(File._c0, 4, 8).alias('date'),
    substring(File._c0, 12, 3).alias('name'),
    substring(File._c0, 15, 5).alias('salary')
)
data.show()
#+---+--------+----+------+
#| id| date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+
But how can I use the SchemaFile DataFrame to specify the widths and column names for the lines, so that the schema can be applied dynamically (without hard-coding) at run time?
Answer 1:
The easiest thing to do here would be to collect the contents of SchemaFile and loop over its rows to extract the desired data.
First, read the schema file as JSON into a DataFrame. Then call collect and map each row to a dictionary:
sfDict = map(lambda x: x.asDict(), SchemaFile.collect())
print(sfDict)
#[{'Column': u'id', 'From': u'1', 'To': u'3'},
# {'Column': u'date', 'From': u'4', 'To': u'8'},
# {'Column': u'name', 'From': u'12', 'To': u'3'},
# {'Column': u'salary', 'From': u'15', 'To': u'5'}]
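Side note: the printed output above comes from Python 2 (hence the u'' prefixes). If you are running Python 3 (an assumption, not stated in the post), map returns a lazy iterator, so materialize it as a list if you want to print it or loop over it more than once:
# Plain list of dicts; works the same on Python 2 and 3.
sfDict = [row.asDict() for row in SchemaFile.collect()]
print(sfDict)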
Now you can loop over the rows in sfDict and use the values to substring your column:
from pyspark.sql.functions import substring
File.select(
    *[
        substring(
            str='_c0',
            pos=int(row['From']),
            len=int(row['To'])
        ).alias(row['Column'])
        for row in sfDict
    ]
).show()
#+---+--------+----+------+
#| id| date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+
Note that we have to cast To and From to integers, since they are specified as strings in your JSON file.
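For reference, here is a minimal end-to-end sketch of the same idea in one place (the file paths and the spark session name are taken from the question; treat them as placeholders for your own setup):
from pyspark.sql.functions import substring

# Read the schema description and the raw fixed-width data.
schema_df = spark.read.json(r'C:\Temp\schemaFile\schema.json')
raw_df = spark.read.format("csv").option("header", "false").load(r"C:\Temp\samplefile.txt")

# Build one substring expression per schema row, casting From/To to int.
cols = [
    substring(raw_df._c0, int(row['From']), int(row['To'])).alias(row['Column'])
    for row in schema_df.collect()
]
raw_df.select(*cols).show()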
Source: https://stackoverflow.com/questions/53817746/read-fixed-width-file-using-schema-from-json-file-in-pyspark