Question
I have a requirement to load/delete specific records from a Postgres database in my Spark application. For loading, I am using a Spark DataFrame in the format below:
sqlContext.read.format("jdbc").options(Map(
  "url" -> "postgres url",
  "user" -> "user",
  "password" -> "xxxxxx",
  "dbtable" -> "(select * from employee where emp_id > 1000) as filtered_emp")).load()
To delete the data, I am writing direct SQL instead of using DataFrames:
delete from employee where emp_id > 1000
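For reference, that direct-SQL delete can be issued from the driver with plain JDBC. A minimal sketch, assuming a PostgreSQL JDBC URL and placeholder credentials (adjust to your environment):

```scala
import java.sql.DriverManager

// Placeholder connection details; substitute your own.
val url = "jdbc:postgresql://host:5432/mydb"
val conn = DriverManager.getConnection(url, "user", "xxxxxx")
try {
  val stmt = conn.createStatement()
  // executeUpdate returns the number of rows affected
  val deleted = stmt.executeUpdate("delete from employee where emp_id > 1000")
  println(s"deleted $deleted rows")
  stmt.close()
} finally {
  conn.close()
}
```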
The question is: is there a Spark way of deleting records in the database, something similar to the below? Or is the only way to use direct SQL?
sqlContext.read.format("jdbc").options(Map(
  "url" -> "postgres url",
  "user" -> "user",
  "password" -> "xxxxxx",
  "dbtable" -> "(delete from employee where emp_id > 1000) as filtered_emp")).load()
Answer 1:
If you want to modify (delete records from) the actual source of data, i.e. tables in Postgres, then Spark is not a great fit: its JDBC data source is read-only as far as deletes are concerned, so a subquery like the one above will not work. You can use a JDBC client directly to achieve the same.
If you want to do this anyway (in a distributed manner, based on keys that you are computing as part of a DataFrame), you can write the same JDBC client code in correspondence with the DataFrame that holds the logic/trigger info for deleting records, and have it executed on multiple workers in parallel.
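A sketch of that approach, assuming a hypothetical DataFrame `idsToDelete` whose `emp_id` column holds the keys to remove, with placeholder connection details. Each partition opens its own connection on the worker it runs on and issues the deletes as one batch:

```scala
import java.sql.DriverManager
import org.apache.spark.sql.Row

idsToDelete.foreachPartition { rows: Iterator[Row] =>
  // One connection per partition, opened on the executor.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://host:5432/mydb", "user", "xxxxxx")
  val ps = conn.prepareStatement("delete from employee where emp_id = ?")
  try {
    rows.foreach { row =>
      ps.setLong(1, row.getAs[Long]("emp_id"))
      ps.addBatch()
    }
    ps.executeBatch() // one round trip for the whole partition
  } finally {
    ps.close()
    conn.close()
  }
}
```

Note that connection objects are not serializable, which is why the connection must be created inside `foreachPartition` rather than on the driver.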
Source: https://stackoverflow.com/questions/39576874/delete-functionality-with-spark-sql-dataframe