What algorithms are known to perform the task of updating a database by inserting, updating, and deleting rows in the presence of database constraints?
More specifically
No, I don't find it fascinating. I don't find the quadrature-of-the-circle-problem fascinating either, and on that topic too there do exist people who strongly, or even violently, disagree with me.
When you say that "it has practical applications", do you mean to say that "the solution to this problem has practical applications"? I suggest that a solution that does not exist cannot, by definition, have "practical applications". (And I do suggest that the solution you're seeking does not exist, just like the quadrature of the circle.)
You argued something about "when other apps hang cascading deletes ...". Your initial problem statement contained no mention whatsoever of "other apps".
The problem I find way more fascinating is "how to build a DBMS that is good enough so that programmers will no longer be facing these kinds of problem, and no longer be forced to ask these kinds of question". Such a DBMS supports Multiple Assignment.
I wish you the best of luck.
It sounds like you are looking for a Database Diff tool. Such a tool would look for differences between two tables (or two databases), and generate the necessary scripts to align them.
See the following post for more information:
https://stackoverflow.com/questions/104203/anyone-know-of-any-good-database-diff-tools
OK, I think that this is it, though the Unique Key thing is pretty hard to figure out. Note that any errors encountered in the SQL execution should result in complete rollback of the entire transaction.
UPDATE: The original order that I implemented was:
Each Table, BottomUp(All Deletes for table)
Each Table, TopDown(All Updates, then All Inserts)
After a counter-example was posted, I believe that I know how to correct it for the restricted problem only (problem #1, without UCs), by changing the order to:
Each Table, TopDown(All Inserts)
Each Table, TopDown(All Updates)
Each Table, BottomUp(All Deletes)
This will definitely NOT work with Unique Constraints though, which as far as I can figure will need a row-content based dependency sort (as opposed to the static table FK dependency sort I am currently using). What makes this particularly difficult is that it may require getting info about record content other than the changed rows (in particular, checking for the existence of UC conflict values and child-dependent records for intermediate steps).
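To illustrate why Unique Constraints need row-content information, here is a minimal sketch in Python using the standard `sqlite3` module. The `item` table, its `code` column and the `__tmp__` placeholder value are all hypothetical, chosen only for the example. Swapping two unique values cannot be done by a direct UPDATE, because the target value is still held by another row, so an intermediate step is needed.

```python
import sqlite3

# isolation_level=None gives manual transaction control (BEGIN/COMMIT)
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, code TEXT UNIQUE)")
conn.executemany("INSERT INTO item (id, code) VALUES (?, ?)", [(1, "A"), (2, "B")])

# Goal: swap the codes so that row 1 holds 'B' and row 2 holds 'A'.
# A direct update fails because 'B' is still held by row 2.
try:
    conn.execute("UPDATE item SET code = 'B' WHERE id = 1")
    direct_update_ok = True
except sqlite3.IntegrityError:
    direct_update_ok = False

# Going through a placeholder value (assumed not to occur in real data) works.
conn.execute("BEGIN")
conn.execute("UPDATE item SET code = '__tmp__' WHERE id = 1")
conn.execute("UPDATE item SET code = 'A' WHERE id = 2")
conn.execute("UPDATE item SET code = 'B' WHERE id = 1")
conn.execute("COMMIT")

codes = conn.execute("SELECT id, code FROM item ORDER BY id").fetchall()
```

A generator would have to know that `'B'` is currently held by row 2 to decide that the extra step is required, which is exactly the kind of row-content lookup described above.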
Anyway, here's the current version:
    Imports System.Collections.Generic

    'NOTE: TopologicalSort and AddSQL are assumed to be defined elsewhere.
    Public Class TranformChangesToSQL
        Class ColVal
            Public name As String
            Public value As String 'note: assuming string values
        End Class

        Class Row
            Public Columns As List(Of ColVal)

            'Returns the value of the named column in this row
            Public Function ValueOf(ByVal colName As String) As String
                For Each c As ColVal In Columns
                    If c.name = colName Then Return c.value
                Next
                Throw New KeyNotFoundException("No such column: " & colName)
            End Function
        End Class

        Class FKDef
            'NOTE: all FKs are assumed to be of the same type: records in the FK table
            '      must have a record in the PK table matching on FK=PK columns.
            Public PKTableName As String
            Public FKTableName As String
            Public FK As String
        End Class

        Class TableInfo
            Public Name As String
            Public PK As String 'name of the PK column
            Public UniqueKeys As List(Of String) 'column name of each Unique key
            'This table's Foreign Keys (FK):
            Public DependsOn As List(Of FKDef)
            'Other tables' FKs that point to this table
            Public DependedBy As List(Of FKDef)
            Public Columns As List(Of String)
            'note: rows are matched up by the value of their PK column
            Public inserted As List(Of Row) 'inserted after-images
            Public deleted As List(Of Row) 'deleted before-images
            Public updBefore As List(Of Row)
            Public updAfter As List(Of Row)
        End Class

        'Quotes a value as a SQL string literal, doubling embedded single quotes
        Private Shared Function Quote(ByVal value As String) As String
            Return "'" & value.Replace("'", "''") & "'"
        End Function

        'Finds the row whose PK column has the given value (Nothing if absent)
        Private Shared Function FindByPK(ByVal rows As List(Of Row), ByVal pkCol As String, ByVal pkValue As String) As Row
            For Each r As Row In rows
                If r.ValueOf(pkCol) = pkValue Then Return r
            Next
            Return Nothing
        End Function

        Sub MakeSQL(ByVal tables As List(Of TableInfo))
            'Note: table dependencies (FKs) must NOT form a cycle
            'Sort the tables by dependency so that
            ' child tables (FKs) are always after their parents (PK tables)
            TopologicalSort(tables)

            For Each tbl As TableInfo In tables
                'Do INSERTs, they *must* be done first in parent->child order, because:
                ' they may have FKs dependent on parent inserts
                ' and there may be Updates that will make child records dependent on them
                For Each r As Row In tbl.inserted
                    Dim InsSQL As String = "INSERT INTO " & tbl.Name & "("
                    Dim valstr As String = ") VALUES("
                    Dim comma As String = ""
                    For Each col As ColVal In r.Columns
                        InsSQL = InsSQL & comma & col.name
                        valstr = valstr & comma & Quote(col.value)
                        comma = ", " 'needed for second and later columns
                    Next
                    AddSQL(InsSQL & valstr & ");")
                Next
            Next

            For Each tbl As TableInfo In tables
                'Do UPDATEs (assumes an update never changes the PK value itself)
                For Each aft As Row In tbl.updAfter
                    'get the matching before-update row
                    Dim pkValue As String = aft.ValueOf(tbl.PK)
                    Dim bef As Row = FindByPK(tbl.updBefore, tbl.PK, pkValue)
                    Dim UpdSql As String = "UPDATE " & tbl.Name & " SET "
                    Dim comma As String = ""
                    For Each col As ColVal In aft.Columns
                        If bef.ValueOf(col.name) <> col.value Then
                            UpdSql = UpdSql & comma & col.name & " = " & Quote(col.value)
                            comma = ", " 'needed for second and later columns
                        End If
                    Next
                    'only add it if any columns were different, and restrict it to this row:
                    If comma <> "" Then AddSQL(UpdSql & " WHERE " & tbl.PK & " = " & Quote(pkValue) & ";")
                Next
            Next

            'Now reverse the order so that DELETEs are done in child->parent order
            tables.Reverse()
            For Each tbl As TableInfo In tables
                'Do DELETEs, they *must* be done last, and in child->parent order because:
                ' Parents may have children that depend on them, so children must be deleted first,
                ' and there may be children dependent until after Updates pointed them away
                For Each r As Row In tbl.deleted
                    AddSQL("DELETE FROM " & tbl.Name & " WHERE " & tbl.PK & " = " & Quote(r.ValueOf(tbl.PK)) & ";")
                Next
            Next
        End Sub
    End Class
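The listing above calls `TopologicalSort` without showing it. As a rough sketch of what such a step could look like, here is Kahn's algorithm in Python. The function name, its arguments and the example table names are assumptions made for illustration, not the answer author's implementation. It orders tables so that every parent (PK table) comes before its children (FK tables), and raises an error if the foreign keys form a cycle.

```python
from collections import deque


def topological_sort(tables, fks):
    """Order tables so each parent (PK table) precedes its children (FK tables).

    tables: list of table names
    fks:    list of (child_table, parent_table) pairs
    Raises ValueError if the foreign keys form a cycle.
    """
    children = {t: [] for t in tables}
    pending_parents = {t: 0 for t in tables}
    for child, parent in fks:
        children[parent].append(child)
        pending_parents[child] += 1

    # Start with tables that depend on nothing
    ready = deque(t for t in tables if pending_parents[t] == 0)
    ordered = []
    while ready:
        t = ready.popleft()
        ordered.append(t)
        for c in children[t]:
            pending_parents[c] -= 1
            if pending_parents[c] == 0:
                ready.append(c)

    if len(ordered) != len(tables):
        raise ValueError("foreign keys form a cycle")
    return ordered


# Hypothetical schema: order_line -> orders -> customer
order = topological_sort(
    ["order_line", "customer", "orders"],
    [("orders", "customer"), ("order_line", "orders")],
)
```

Iterating `order` forwards gives the parent-to-child order used for INSERTs and UPDATEs; iterating it in reverse gives the child-to-parent order used for DELETEs.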
Why are you even trying to do this? The correct way to do it is to get the database engine to defer the checking of the constraints until the transaction is committed.
The problem that you pose is intractable in the general case. If you consider just the transitive closure of the foreign keys in the rows you want to update in the database, then it is only possible to solve this where the graph describes a tree. If there is a cycle in the graph and you can break the cycle by replacing a foreign key value with a NULL, then you can rewrite one SQL statement and add another to update the column later. If you can't replace a key value with a NULL, then it can't be solved.
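A minimal sketch of the NULL trick, using Python's standard `sqlite3` module with immediately checked foreign keys. The `department` and `employee` tables are hypothetical and exist only to create a two-table cycle in which one of the foreign keys is nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("""
    CREATE TABLE department (
        id INTEGER PRIMARY KEY,
        manager_id INTEGER REFERENCES employee(id)        -- nullable FK
    )
""")
conn.execute("""
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        dept_id INTEGER NOT NULL REFERENCES department(id)
    )
""")

# Inserting the department with its final manager_id fails: employee 7 does not exist yet.
try:
    conn.execute("INSERT INTO department (id, manager_id) VALUES (1, 7)")
    direct_insert_ok = True
except sqlite3.IntegrityError:
    direct_insert_ok = False

# Break the cycle: insert with NULL, insert the other row, then patch the FK.
conn.execute("BEGIN")
conn.execute("INSERT INTO department (id, manager_id) VALUES (1, NULL)")
conn.execute("INSERT INTO employee (id, dept_id) VALUES (7, 1)")
conn.execute("UPDATE department SET manager_id = 7 WHERE id = 1")
conn.execute("COMMIT")

manager = conn.execute("SELECT manager_id FROM department WHERE id = 1").fetchone()[0]
```

If `manager_id` were declared `NOT NULL` as well, no ordering of these statements would satisfy both constraints at every step.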
As I say, the correct way to do this is to turn off the constraints until all of the SQL has been run and then turn them back on for the commit. The commit will fail if the constraints aren't met. Postgres (for example) has a feature which makes this very easy.
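In PostgreSQL this is done with constraints declared `DEFERRABLE` together with `SET CONSTRAINTS ALL DEFERRED` inside the transaction. To keep the example self-contained, the sketch below shows the same idea with SQLite through Python's standard `sqlite3` module, where a foreign key can be declared `DEFERRABLE INITIALLY DEFERRED`. The `parent` and `child` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL
            REFERENCES parent(id) DEFERRABLE INITIALLY DEFERRED
    )
""")

# Child first, parent second: accepted, because the FK is only checked at COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO child (id, parent_id) VALUES (1, 10)")
conn.execute("INSERT INTO parent (id) VALUES (10)")
conn.execute("COMMIT")

# A transaction that still violates the FK at commit time is rejected.
conn.execute("BEGIN")
conn.execute("INSERT INTO child (id, parent_id) VALUES (2, 99)")
try:
    conn.execute("COMMIT")
    commit_failed = False
except sqlite3.IntegrityError:
    commit_failed = True
    conn.execute("ROLLBACK")  # the failed COMMIT leaves the transaction open

child_count = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
```

With deferred checking, the order of the statements inside the transaction no longer matters; only the final state has to satisfy the constraints.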
OpenDbDiff has source code available. You could look at that and figure out the algorithms.
http://opendbiff.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=25206
I wrote one once, but it's someone else's IP, so I can't go into too much detail. However, I'm willing to tell you the process that taught me how to do this. This was a tool to make a shadow copy of a customer's "database" residing on salesforce.com, written in .NET 1.1.
I started out doing it the brute-force way: create a DataSet and a database from the schema; turn off constraints in the DataSet; iterate through each table, loading the rows that are not yet in the table and ignoring errors; repeat until there are no more rows to add, no more errors, or no change in the number of errors; then dump the DataSet to the database, again repeating until there are no errors, and so on.
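The retry loop at the heart of that brute-force approach can be sketched roughly as follows, here in Python against SQLite rather than a .NET DataSet. The function name and the three-table example schema are assumptions for illustration only. Each pass attempts every pending INSERT, keeps the ones that hit a constraint violation, and stops when nothing is left or a pass makes no progress.

```python
import sqlite3


def load_with_retries(conn, statements):
    """Run INSERTs repeatedly until all succeed or a pass makes no progress.

    statements: list of (sql, params) pairs
    Returns (number_of_passes, statements_still_failing).
    """
    pending = list(statements)
    passes = 0
    while pending:
        passes += 1
        failed = []
        for sql, params in pending:
            try:
                conn.execute(sql, params)
            except sqlite3.IntegrityError:
                failed.append((sql, params))  # try again on the next pass
        if len(failed) == len(pending):
            break  # no progress: the remaining rows can never be loaded
        pending = failed
    return passes, pending


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL REFERENCES a(id))")
conn.execute("CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER NOT NULL REFERENCES b(id))")

# Rows deliberately supplied in the worst order: grandchild, child, parent.
passes, remaining = load_with_retries(conn, [
    ("INSERT INTO c (id, b_id) VALUES (?, ?)", (1, 1)),
    ("INSERT INTO b (id, a_id) VALUES (?, ?)", (1, 1)),
    ("INSERT INTO a (id) VALUES (?)", (1,)),
])
```

Every failed attempt costs a round trip and a constraint check in the database, so the number of passes grows with the depth of the dependency chain.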
Brute force was the starting point because it wasn't certain that we could do this at all. The "schema" of salesforce.com wasn't a true relational schema. For instance, if I remember correctly, there were some columns which were foreign keys relating to one of several parent tables.
This took forever, even while debugging. I began to notice that most of the time was being spent on handling the constraint violations in the database. I began to notice the pattern of constraint violations, as each iteration converged, slowly, toward getting all the rows saved.
All the revelations I had were due to my boredom, watching the system sit at near 100% CPU for 15-20 minutes at a time, even with a small database. "Necessity is the mother of invention", and "the prospect of waiting another 20 minutes for the same rows, tends to focus the mind", and I figured out how to speed things up by a factor of over 100.