We have two tables: OriginalDocument and ProcessedDocument. In the first one we put an original, not processed document. After it\'s validated and processed (converted to our XM
Wow, so much bad advice and design myths in a single question it's hard to know where to start.
Is this a VLDB? are you talking 100's of TB, 100's of GB, 1-10 GB?
Is this an untra-high performance DB? Do you need to squeeze out microseconds?
Most advice tends to lean toward those extremes where you might break a few basic rules for the sake of performance.
An earlier poster said,
"Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table."
He was on the right track there. And for that matter, whether it's processed or unprocessed it's also a document. I strongly question the first table split.
He then says,
"Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit."
I have no idea what that advice is based upon. If your RDBMS supports indexes, more data will have a very marginal additional cost at certain sizes of your index because your b-tree gets one level deeper. If you take his statement at face value, you should limit your table to n rows each and keep making new ones because "more data in your table = slower queries." I have no idea why people persist in this notion. If you have queries that REQUIRE full table scans for one type or the other, let's talk partitioning, not a new table. It takes about an extra 10 milliseconds to find a row in a billion row table than it does in a million row table because an index will probably only be one blevel deeper between the two.
Another poster said,
"5-7 columns that do not apply to invalid documents NOT NULL so valid documents are required to have them. In my opinion, with that many columns empty for invalid documents, it justifies a different table."
I wish people would explain there reasons. HOW does it justify it? On what basis would you make that decision. Is 4 too many? Why not? But 5 is too many? Maybe he assumes you're using an ancient RDBMS with fixed field lengths. I can't tell. If you put the nullable columns at the end of the row, you'll pay no cost for them. In the middle, a few extra bytes. If that's a HUGE deal, if you're really scrambling to make this multi-TB table a wee smaller... we'll talk about vertical partitioning... not a whole new table. Since you'll be extending the length of n% of rows, you'll want to carefully choose your PCTFREE, or how ever your database does that. Other than that, there's little downside of the nullable columns.
So let's talk about all the downsides of three tables.
I'm going to assume your table looks like this;
A surrogate PK column with a unique index.
A candidate key column with a unique index.
a few foreign keys to 'lookup' tables.
Several data fields.
the 5-7 nullable columns that are filled if a document becomes invalid.
The first issue is that you'll have 3 PK's across all the tables to make sure that the key is unique... but there's no cross table object to guarantee uniqueness in all three combined. Unless you're painstaking in your approach to the code that moves data from one table to the next, you could have the same document twice or more. Once in each table. If you have a single table for Original, processed, and invalid, then there's no way you'll ever have that happen.
With three tables, all of your constraints are going to be validated over and over. When you do your insert into the Original table, the PK is validated, the AK is validate, the FKs are validated, the other columns are validate. Room is made in all of the indexes for these new enteries, and perhaps causing block splits. Now you process the file and delete the entry from the Original table, all of those indexes suffer deletes, leaving empty space behind. Your insert into the next table, suffers all of that cost of your first insert again. Your indexes are acted upon, maybe causing block splits, your PK, AK and FK's are all validated again. Lather rinse repeat for invalid table.
Now, what happens to your data model if you adopt this paradigm when you discover that the business needs a 4th state? You're going to add a 4th document table for those in the unsubmitted state, or sent state. After all, the new sent state has 5-7 columns unneeded by the other states.
And there are lots of queries which become hoorible to write and run with multiple tables, with a single table they are clear, concise and fast... size of a table will really only affect Full Table Scans, which we try to avoid for tables like these.
I've seen systems like these. One major operational query is, "Where is my document?"
You've got to search 3 tables to find its state. What most people do next is build a UNION ALL view of all three tables to facilitate the myriad of questions like that. IF the other poster thinks your queries slow down with other data in your table, see how they really slow down when you do a UNION ALL to accomplish the same thing. 1 index of blevel 3 as opposed to 3 indexes of blevel 2.
I work in a trading company. We execute trades with counterparties. For accounting and legal reasons our company is defined as several companies. Well call them Trading, Holding, JointVenture. Our counterparties we'll call. JonesCo, SmithBarely, GoldSax.
So if I consider that the internal companies have a unique set of columns and the counterparties have a unique set of columns. You'd say that proper normalization would force them into two tables. So let's do that.
INT_CO_T 1 Trading 2 Holding 3 JointVenture
CNTR_PTY_T 1 JonesCo 2 SmithBarely 3 GoldSax
Now I need a trade table where I map the transaction between our company(ies) and counterparties
TRADE_T (Int_co_T.ID, Ctr_pty_T.ID, other trade columns)
Great.
Whoops, Business says that the JointVenture will execute trades with Trading. BTW, This is a very common scenario, this happens all the time. Trading house would call these Book-to-Book trades.
Now I'm left with two choices. (Three really) but.
1 is that I could do something very silly and place JointVenture and Trading into the Counterparty table so that my mapping table will still work. This leads to nightmare queries which I'm sure those involved in this conversation will recognize. Or I can build a separate Mapping table.. and that too leads to some unions if I want to see all of the trades from a given company.
The third and better way is to build a single table for both counterparties and internal companies, called Trading_entities or something. Now I need one mapping table to show either internal or external trades. I can easily see net position and net exposure with one query, two tables. etc.
If you're really hung up on the nullable fields then vertically partition that table and use three tables. But the main table will have a list and most importantly a single key for either subtype of trading participant.