Question
Each row of my dataframe holds CSV content.
I am struggling to save each row to a different, specific table.
I believe I need a foreach or a UDF to accomplish this, but I can't get it to work.
Everything I managed to find was either a simple print inside a foreach or code using .collect() (which I really don't want to use).
I also found the repartition approach, but that doesn't let me choose where each row goes.
rows = df.count()
df.repartition(rows).write.csv('save-dir')
Can you give me a simple, working example of this?
Answer 1:
Saving each row as a table is a costly operation and not recommended. But what you are trying to do can be achieved like this:
df.write.format("delta").partitionBy("<primary-key-column>").save("/delta/save-dir")
Each distinct value of the partition column now gets its own directory of .parquet files, and you can create an external table from each partition. A row ends up in its own partition only if that column has a unique value for every row, i.e. it is a primary key.
Answer 2:
Well, in the end, as always, it was something very simple, but I didn't see it mentioned anywhere.
Basically, this applies when you run a foreach and build the dataframe you want to save inside the loop. Unlike the driver, the worker won't automatically set up the "/dbfs/" path when saving, so if you don't add "/dbfs/" manually, the data is saved locally on the worker.
That is why my loops weren't working.
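A minimal sketch of that fix, under stated assumptions: the column names "table" and "csv", the helper names to_local_api_path and save_row, and the file name part.csv are all hypothetical, and the sketch writes plain CSV files with Python's own file API rather than Delta tables. It only illustrates prefixing the target path with the "/dbfs" mount before a worker writes to it.

```python
import os


def to_local_api_path(path, mount="/dbfs"):
    """Prefix a DBFS-style path with the mount point so plain file APIs can reach it."""
    if path == mount or path.startswith(mount + "/"):
        return path  # already prefixed, leave untouched
    return mount + "/" + path.lstrip("/")


def save_row(row, base_dir, mount="/dbfs"):
    """Write one row's CSV content to <mount>/<base_dir>/<table>/part.csv.

    `row` only needs to support row["table"] and row["csv"] (hypothetical
    column names), which both a dict and a pyspark Row do.
    """
    target_dir = os.path.join(to_local_api_path(base_dir, mount), row["table"])
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, "part.csv")
    # Overwrites any earlier file for the same table name.
    with open(target, "w", newline="") as fh:
        fh.write(row["csv"])
    return target
```

On a cluster this could presumably be called as df.foreach(lambda r: save_row(r, "/delta/save-dir")), so that every worker writes through the "/dbfs" mount instead of its own local disk. Rows that share a table name would overwrite each other in this sketch.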
Answer 3:
Did you try .mode("append").partitionBy("ID")? It will create a directory for each ID. Don't forget to set the mode.
Source: https://stackoverflow.com/questions/56811304/how-to-write-writestream-each-row-of-a-dataframe-into-a-different-delta-table