How to write / writeStream each row of a dataframe into a different delta table

Question

Each row of my dataframe holds the content of a CSV file.

I am struggling to save each row to a different, specific table.

I believe I need to use foreach or a UDF to accomplish this, but I cannot get it to work.

Everything I managed to find was either simple prints inside a foreach or code that uses .collect() (which I really don't want to use).

I also found the repartition approach, but that doesn't let me choose where each row goes.

rows = df.count()
df.repartition(rows).write.csv('save-dir')

Can you give me a simple and working example of it?

Answer 1:

Saving each row as a table is a costly operation and is not recommended. But what you are trying to do can be achieved like this:

df.write.format("delta").partitionBy("<primary-key-column>").save("/delta/save-dir")

Now each row will be saved as its own Parquet file in a separate partition directory, and you can create an external table from each partition. This only works if every row has a unique value in that column, i.e. a primary key.
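To find one partition afterwards, it helps to know the directory layout: Spark writes each distinct value of the partition column under a `<column>=<value>` subdirectory of the save path. Below is a minimal sketch of that naming. The helper partition_dir and the column name id are hypothetical, and the values are assumed to contain no characters that Spark would escape in a path.

```python
def partition_dir(base_path, column, value):
    """Return the directory where partitionBy(column) places rows with this value.

    Assumption: the value has no special characters (Spark escapes some, e.g. "/").
    """
    return f"{base_path.rstrip('/')}/{column}={value}"


# Hypothetical keys; in practice these are the distinct values of the key column.
example_keys = [101, 102, 103]
example_dirs = [partition_dir("/delta/save-dir", "id", key) for key in example_keys]
```

Note that the partition column is encoded in the directory name rather than stored inside the Parquet files, so a table defined over a single partition directory will not contain that column.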

Answer 2:

Well, in the end, as always, it was something very simple, but I didn't see this anywhere.

Basically, this happens when you perform a foreach and the dataframe you want to save is built inside the loop. The worker, unlike the driver, won't automatically add the "/dbfs/" prefix to the save path, so if you don't add "/dbfs/" manually, the data is saved locally on the worker.

That is why my loops weren't working.
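A rough sketch of what the fix can look like, assuming the code inside the foreach uses plain Python file I/O. The helpers to_local_dbfs_path and save_row, and the column names table_name and csv_content, are hypothetical and not from my real code. It writes one plain CSV file per row rather than a Delta table.

```python
import os


def to_local_dbfs_path(path, mount_root="/dbfs"):
    """Map 'dbfs:/x/y' or '/x/y' to '<mount_root>/x/y' for plain Python file I/O."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path == mount_root or path.startswith(mount_root + "/"):
        return path  # prefix is already there
    return mount_root + "/" + path.lstrip("/")


def save_row(row, base_dir, mount_root="/dbfs"):
    """Write one row's CSV text to its own file; meant to run on the worker."""
    target_dir = to_local_dbfs_path(base_dir, mount_root)
    os.makedirs(target_dir, exist_ok=True)
    # Assumed columns: table_name picks the file, csv_content holds the CSV text.
    target = os.path.join(target_dir, f"{row.table_name}.csv")
    with open(target, "w", newline="") as handle:
        handle.write(row.csv_content)
    return target
```

On the cluster this would be called as df.foreach(lambda row: save_row(row, "/save-dir")). The mount_root parameter exists only so the function can be tried outside Databricks against a temporary directory.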

Answer 3:

Did you try .mode("append").partitionBy("ID")? It will create a directory for each ID. Don't forget to set the mode.

Source: https://stackoverflow.com/questions/56811304/how-to-write-writestream-each-row-of-a-dataframe-into-a-different-delta-table

Tags

pyspark

azure-databricks

delta-lake