How to write / writeStream each row of a dataframe into a different delta table

南笙酒味 submitted on 2019-12-13 18:38:14

Question


Each row of my dataframe holds CSV content.

I am struggling to save each row in a different, specific table.

I believe I need to use a foreach or UDF in order to accomplish this, but this is simply not working.

All the content I managed to find was either simple prints inside foreach loops or code using .collect() (which I really don't want to use).

I also found the repartition way, but that doesn't allow me to choose where each row will go.

rows = df.count()
df.repartition(rows).write.csv('save-dir')

Can you give me a simple and working example of it?


Answer 1:


Saving each row as a separate table is a costly operation and is not recommended. But what you are trying to do can be achieved like this:

df.write.format("delta").partitionBy("<primary-key-column>").save("/delta/save-dir")

Each distinct value of the partition column now gets its own directory, with the data stored in Parquet format, and you can create an external table from each partition. This only gives one row per directory if the column has a unique value for every row, i.e. a primary key.
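As a rough sketch of the suggestion above, the write can be wrapped in a small function. The names `write_partitioned` and `partition_dir` are hypothetical helpers, not part of Spark or Delta; `df` is assumed to be an existing Spark DataFrame in an environment where the Delta format is available. `partition_dir` only illustrates the `column=value` directory naming that partitioned writes use, and it ignores the escaping Spark applies to special characters in values.

```python
def partition_dir(base_path, column, value):
    """Illustrative only: directory a partitioned write uses for one value of `column`."""
    return f"{base_path.rstrip('/')}/{column}={value}"


def write_partitioned(df, base_path, column, mode="append"):
    """Write `df` as a single Delta table, partitioned by `column`.

    `df` is assumed to be a Spark DataFrame; with a unique key column,
    every row ends up in its own partition directory.
    """
    (
        df.write
        .format("delta")
        .mode(mode)
        .partitionBy(column)
        .save(base_path)
    )
```

With a unique `ID` column, `write_partitioned(df, "/delta/save-dir", "ID")` would be expected to place the row with ID 7 under `/delta/save-dir/ID=7`. Note that this is still one Delta table with many partitions, not many tables.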




Answer 2:


Well, in the end, as always, it was something very simple, but I didn't see this anywhere.

Basically, this matters when you perform a foreach and the data you want to save is built inside the loop. The worker, unlike the driver, won't automatically set up the "/dbfs/" path when saving, so if you don't manually add "/dbfs/", it will save the data locally on the worker.
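A minimal sketch of that fix, under stated assumptions: each row is assumed to carry a hypothetical `table_name` column (where it should go) and a `csv_content` column (the CSV text), and the function writes plain CSV files with ordinary Python file I/O rather than Delta tables. `to_local_path` and `save_row` are illustrative names, not Databricks APIs; the only point is that the path used on the worker is explicitly prefixed with "/dbfs".

```python
import os


def to_local_path(dbfs_path, mount_root="/dbfs"):
    """Map a DBFS-style path such as '/save-dir/t1' to the local mount path."""
    return os.path.join(mount_root, dbfs_path.lstrip("/"))


def save_row(row, base_dir, mount_root="/dbfs"):
    """Write one row's CSV text to a location chosen from the row itself.

    `row` can be a Spark Row or a dict; the `table_name` and `csv_content`
    fields are assumptions made for this sketch.
    """
    target_dir = to_local_path(os.path.join(base_dir, row["table_name"]), mount_root)
    os.makedirs(target_dir, exist_ok=True)
    target_file = os.path.join(target_dir, "data.csv")
    with open(target_file, "w", encoding="utf-8") as handle:
        handle.write(row["csv_content"])
    return target_file


# On a cluster this would be called from the workers, for example:
# df.foreach(lambda row: save_row(row, "/save-dir"))
```

Without the prefix, the same `open` call would create the file on the worker's own local disk, which matches the behaviour described above.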

That is why my loops weren't working.




Answer 3:


Did you try .mode("append").partitionBy("ID")? It will create a directory for each ID. Don't forget to set the mode.



Source: https://stackoverflow.com/questions/56811304/how-to-write-writestream-each-row-of-a-dataframe-into-a-different-delta-table
