Read fixed-width file using schema from JSON file in PySpark


Question


I have a fixed-width file as below:

00120181120xyz12341
00220180203abc56792
00320181203pqr25483 

And a corresponding JSON file that specifies the schema (From is the starting position and To is, in effect, the field length):

{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

I read the schema file into a DataFrame using:

SchemaFile = spark.read\
    .format("json")\
    .option("header","true")\
    .json('C:\Temp\schemaFile\schema.json')

SchemaFile.show()
#+------+----+---+
#|Column|From| To|
#+------+----+---+
#|    id|   1|  3|
#|  date|   4|  8|
#|  name|  12|  3|
#|salary|  15|  5|
#+------+----+---+

Likewise, I am parsing the fixed-width file into a PySpark DataFrame as below:

File = spark.read\
    .format("csv")\
    .option("header","false")\
    .load("C:\Temp\samplefile.txt")

File.show()
#+-------------------+
#|                _c0|
#+-------------------+
#|00120181120xyz12341|
#|00220180203abc56792|
#|00320181203pqr25483|
#+-------------------+

I can obviously hard code the values for the positions and lengths of each column to get the desired output:

from pyspark.sql.functions import substring
data = File.select(
    substring(File._c0,1,3).alias('id'),
    substring(File._c0,4,8).alias('date'),
    substring(File._c0,12,3).alias('name'),
    substring(File._c0,15,5).alias('salary')
)

data.show()
#+---+--------+----+------+
#| id|    date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+

But how can I use the SchemaFile DataFrame to specify the widths and column names for the lines so that the schema can be applied dynamically (without hard coding) at run time?


Answer 1:


The easiest thing to do here would be to collect the contents of SchemaFile and loop over its rows to extract the desired data.

First read the schema file as JSON into a DataFrame. Then call collect and convert each row to a dictionary:

sfDict = [row.asDict() for row in SchemaFile.collect()]
print(sfDict)
#[{'Column': 'id', 'From': '1', 'To': '3'},
# {'Column': 'date', 'From': '4', 'To': '8'},
# {'Column': 'name', 'From': '12', 'To': '3'},
# {'Column': 'salary', 'From': '15', 'To': '5'}]

Now you can loop over the rows in sfDict and use those values in substring to build each column:

from pyspark.sql.functions import substring
File.select(
    *[
        substring(
            str='_c0',
            pos=int(row['From']),
            len=int(row['To'])
        ).alias(row['Column']) 
        for row in sfDict
    ]
).show()
#+---+--------+----+------+
#| id|    date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+

Note that we have to cast To and From to integers since they are specified as strings in your JSON file.
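As a side note, a minimal variant (assuming the same SchemaFile and File DataFrames as above; schemaRows is just a name introduced here for illustration) is to cast From and To to integers on the schema DataFrame itself before collecting, so the loop no longer needs int():

from pyspark.sql.functions import col, substring

# Sketch: cast the string positions to integers up front on the schema DataFrame,
# then collect and build the same substring expressions as before.
schemaRows = SchemaFile \
    .withColumn("From", col("From").cast("int")) \
    .withColumn("To", col("To").cast("int")) \
    .collect()

File.select(
    *[substring("_c0", row["From"], row["To"]).alias(row["Column"]) for row in schemaRows]
).show()

This produces the same output as the loop above; it simply moves the type conversion into Spark rather than doing it in Python.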



Source: https://stackoverflow.com/questions/53817746/read-fixed-width-file-using-schema-from-json-file-in-pyspark
