Table1: Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'st
You might be able to use Maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess".
You could then write something that, after the sync, does various queries to generate a set of more sane tables that you can then report off.
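To give an idea of the "more sane tables" step, here is a rough Python sketch of the kind of per-column clean-up you might run over the synced copy. The function names are made up for illustration, and the day/month/year format is only an assumption; the question just says the year comes last.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def clean_date(text, fmt="%d/%m/%Y"):
    """Turn a year-last date string into ISO 8601 (YYYY-MM-DD), which sorts correctly.

    The default format is an assumption; returns None if the value cannot be parsed.
    """
    try:
        return datetime.strptime(text.strip(), fmt).date().isoformat()
    except (ValueError, AttributeError):
        return None


def clean_number(text):
    """Turn a number stored as text into a Decimal, or None if it is not numeric."""
    try:
        # Thousands separators are assumed to be commas.
        return Decimal(text.strip().replace(",", ""))
    except (InvalidOperation, AttributeError):
        return None
```

Returning None for anything unparseable lets you load the bad values as NULL and hunt them down later, rather than having the whole daily job fall over on one odd row.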
I imagine that this could be done on a daily basis without a performance problem.
Doing it all off a different server will avoid impacting the original database.
The only problem I can see is if some of the tables don't have primary keys.
I am not a MySQL person, so this is coming out of left field.
But I think the log files might be the answer.
Thankfully, you really only need to know 2 things from the log.
You need the record/rowid, and you need the operation.
In most DBs, and I assume MySQL, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.
Next, you need the operation. Notably whether it's an insert, update, or delete operation on the row.
You consolidate all of this information, in time order, and then run through it.
For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
You don't care about field values, they're just not important. Do the whole row.
You hopefully shouldn't have to "parse" binary log files yourself. MySQL must already have routines to do that; you just need to find them and figure out how to use them (there may even be some handy "dump log" utility you could use).
This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.
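The consolidation step might look something like the sketch below. It assumes you have already managed to pull (timestamp, row id, operation) tuples out of the log somehow; that tuple shape and the function name are my own invention, not anything MySQL hands you.

```python
def consolidate(events):
    """Reduce a day's log events to one action per row.

    events: iterable of (timestamp, row_id, op), where op is
    "insert", "update" or "delete" (a hypothetical, already-extracted form).
    Returns {row_id: "upsert" | "delete"}. Rows that were both created
    and deleted inside the window are left out entirely.
    """
    first_op = {}
    last_op = {}
    # Walk the events in time order, remembering the first and last op per row.
    for _, row_id, op in sorted(events, key=lambda e: e[0]):
        first_op.setdefault(row_id, op)
        last_op[row_id] = op

    actions = {}
    for row_id, op in last_op.items():
        if op == "delete":
            if first_op[row_id] == "insert":
                continue  # the destination never saw this row, nothing to do
            actions[row_id] = "delete"
        else:
            # Copy the whole current row from the source, whatever changed.
            actions[row_id] = "upsert"
    return actions
```

For every "upsert" you select the row from the original DB and write it to the destination; for every "delete" you remove it. Field values never enter into it.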
Obviously this takes a bit of arcane knowledge in order to actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.
The log files (binary logs) were my first thought too. If you knew how they did things you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE! For now I have settled on the hash approach. With some clever file memory paging this is quite fast.
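For anyone reading along, the general idea behind a hash approach can be sketched roughly as follows. This is not the actual implementation (there is no paging here, everything sits in plain dicts), and the function names and the choice of SHA-256 are just assumptions for illustration.

```python
import hashlib


def row_hash(row):
    """Return a stable digest of a row, given as a sequence of column values."""
    h = hashlib.sha256()
    for value in row:
        data = repr(value).encode("utf-8")
        # Length prefix so ("ab", "c") and ("a", "bc") cannot collide.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


def diff_snapshots(old_hashes, rows):
    """Compare today's rows against the digests stored on the previous run.

    old_hashes: {key: digest} from the last run.
    rows: {key: row} as read from the source today.
    Returns (new_hashes, inserted_keys, updated_keys, deleted_keys).
    """
    new_hashes = {key: row_hash(row) for key, row in rows.items()}
    inserted = sorted(k for k in new_hashes if k not in old_hashes)
    updated = sorted(
        k for k in new_hashes
        if k in old_hashes and new_hashes[k] != old_hashes[k]
    )
    deleted = sorted(k for k in old_hashes if k not in new_hashes)
    return new_hashes, inserted, updated, deleted
```

You keep new_hashes around for the next run. Note that this still needs some key per row, so tables without a usable primary key remain a problem.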
Can't you use the existing code which accesses this database and adapt it to your needs? Of course, the code must be horrible, but it might handle the database structure for you, no? You could hopefully concentrate on getting your work done instead of playing archaeologist then.