We have two tables: OriginalDocument and ProcessedDocument. In the first one we put the original, unprocessed document. After it's validated and processed (converted to our XM
Another thing you might want to take into consideration is the lifecycle and use cases of the rows. If the invalid documents are purged regularly, it might help to have them in a separate table. If the attributes of invalid documents stay limited while valid documents keep getting new columns, that would be a factor in favor of separate tables, too. The more the entities differ in behavior and usage, the stronger the case for separate tables.
Think of OriginalDocument as an intermediate table. It can change as your input formats change, and it will contain fields that do not apply to imported ("processed") documents, like import date or import error description. You can also clean this table periodically.
In contrast to OriginalDocument, the ProcessedDocument table will contain only documents and fields valid for your system, with all of the check constraints, indexes and associated business logic. Its structure will change as your system's internal logic changes.
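As a rough illustration of that split, here is a minimal sketch using Python's built-in sqlite3 module. The column names (raw_content, import_date, import_error, title, body) are made up for the example and are not taken from the question; your real tables will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: accepts anything and carries import bookkeeping.
    -- All column names here are hypothetical.
    CREATE TABLE original_document (
        id           INTEGER PRIMARY KEY,
        raw_content  TEXT NOT NULL,
        import_date  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        import_error TEXT              -- stays NULL unless validation fails
    );
    -- Clean table: only validated rows, guarded by constraints.
    CREATE TABLE processed_document (
        id          INTEGER PRIMARY KEY,
        original_id INTEGER NOT NULL UNIQUE REFERENCES original_document(id),
        title       TEXT NOT NULL CHECK (length(title) > 0),
        body        TEXT NOT NULL
    );
""")

# One document that validates and one that does not.
conn.execute("INSERT INTO original_document (raw_content) VALUES (?)",
             ("<doc>ok</doc>",))
conn.execute("INSERT INTO original_document (raw_content, import_error) VALUES (?, ?)",
             ("garbage", "not well-formed"))
conn.execute("INSERT INTO processed_document (original_id, title, body) VALUES (?, ?, ?)",
             (1, "ok", "<doc>ok</doc>"))

# Periodic cleanup only ever touches the staging table.
purged = conn.execute(
    "DELETE FROM original_document WHERE import_error IS NOT NULL").rowcount
remaining = conn.execute("SELECT COUNT(*) FROM original_document").fetchone()[0]
```

The point of the sketch is that the purge and the loose schema live entirely in the staging table, while the constraints live only in the processed one.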
Whether the document is valid or invalid, it is still a document, so it makes initial sense for them all to be in the same table.
However, if an invalid document is treated differently by your application to the point where it is almost forgotten (not queried, updated etc.) then split the tables. Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit.
I have a document table where valid and invalid documents are kept together but only because the app re-presents the bad document to the user and asks them to fix it.
What shape are your queries? Do you frequently wish to deal with a group of (or all) documents, irrespective of whether they're valid? Or does every query only ever concern valid (or invalid) documents?
Or do you wish to deal with groups (irrespective of validity), but frequently perform additional work with valid documents? That may point to a base table plus an additional table containing the columns that only valid documents have.
Do try to distinguish between logical and physical modelling.
Even if the difference between the two entities is only seven properties, they are logically different things with respect to those seven items. At the same time they are the same thing with respect to their other properties.
The way to represent that logically is a one-to-zero-or-one relationship between two tables: one table (the superclass) stores all the common properties, and the other (the subclass) stores only the ID from the superclass plus the properties specific to the subclass.
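To make the superclass/subclass layout concrete, here is a small sketch with Python's sqlite3 module. The table and column names (document, valid_document, filename, received, title, author) are hypothetical stand-ins; the real "seven properties" would go where title and author are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    -- Superclass: properties every document has (hypothetical columns).
    CREATE TABLE document (
        id       INTEGER PRIMARY KEY,
        filename TEXT NOT NULL,
        received TEXT NOT NULL
    );
    -- Subclass: primary key doubles as the foreign key, so each
    -- document has at most one valid_document row.
    CREATE TABLE valid_document (
        document_id INTEGER PRIMARY KEY REFERENCES document(id),
        title       TEXT NOT NULL,
        author      TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO document VALUES (1, 'a.txt', '2024-01-01')")
conn.execute("INSERT INTO document VALUES (2, 'b.txt', '2024-01-02')")
conn.execute("INSERT INTO valid_document VALUES (1, 'Report', 'Ann')")

# All documents, with the extra columns filled in only for valid ones.
rows = conn.execute("""
    SELECT d.id, d.filename, v.title
    FROM document AS d
    LEFT JOIN valid_document AS v ON v.document_id = d.id
    ORDER BY d.id
""").fetchall()
```

Queries that only care about the common properties never touch valid_document; queries that need the extra properties pay for one join on the primary key.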
In terms of performance this is not so bad:
Depending on the processes you are modelling, the frequency of these queries and other things (such as security for both entities, ownership, differences in integrity rules), you might decide to store this information in one table in the database or in two. Either can be much faster in borderline cases, and the two-table solution can also be denormalized a bit; for example, you could still store the document type in the main table to avoid the join if that kind of query is all you care about.
Or your implementation decisions might be driven by your choice of application framework, and for that reason you might prefer working with a single table, or the other way around (for example, because of automatic creation of data entry forms in frameworks such as django-admin).
Whatever you do, realize the difference between the logical and physical design. In your logical design, normalize everything; it will pay off. In the physical implementation, try different scenarios and test, test, test with your own data. Never confuse the order of the two (logical-conceptual first, then physical-practical modelling).
To me it sounds like it would make sense to have a bit column, as all documents have actually been processed; it is just that some have been determined to be invalid. And depending on the number of columns, if only 5 or so out of, say, 10-15 columns don't apply, there is no need to manage two structures for the same data.
Now, another thing you could look at is whether you need to regularly get information on both valid and invalid documents at the same time. If so, then you really do want it in one table.
If you don't ever need to query them together, or if a document is "invalid" you don't need it again except for history, then it could make sense to move it to its own table.
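For comparison, the single-table variant with a bit column could look roughly like this in Python's sqlite3 module. The names (document, is_valid, filename, title) are invented for the sketch, and the table-level CHECK is just one possible way to keep the valid-only columns honest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table for everything; columns that only apply to valid
    -- documents are nullable. All names here are hypothetical.
    CREATE TABLE document (
        id       INTEGER PRIMARY KEY,
        filename TEXT NOT NULL,
        is_valid INTEGER NOT NULL DEFAULT 0 CHECK (is_valid IN (0, 1)),
        title    TEXT,
        -- A valid document must have its valid-only columns filled in.
        CHECK (is_valid = 0 OR title IS NOT NULL)
    );
""")

conn.executemany(
    "INSERT INTO document (filename, is_valid, title) VALUES (?, ?, ?)",
    [("a.txt", 1, "Report"), ("b.txt", 0, None), ("c.txt", 1, "Memo")],
)

# Querying both kinds together needs no join at all.
all_count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]

# Queries about valid documents only have to remember the filter.
valid_titles = [row[0] for row in conn.execute(
    "SELECT title FROM document WHERE is_valid = 1 ORDER BY id")]
```

The trade-off is visible in the last query: every statement that cares about only one kind of document has to carry the is_valid filter.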