Overview
I'm working on an Emergency Services reporting and mapping application for California (kind of weird, considering the fires there right now...)
Taking the example:
First, go back to basic principles: all of the above are distinct geospatial entities, so your address is composed of a name and one or more geospatial specifiers. This tells us that we really should be storing them in a single table. The key here is to think of the data more abstractly.
So your address table needs a 1-many relationship to another table, called address_entities which is as follows:
This means that you will need a table to link the address to the address entity table above. Each geospatial entity is also inherently hierarchical. Self-referencing tables make the SQL harder, and personally I try to avoid them, but there are times when they are a good solution, and this is one of them.
The benefits are huge: even though it makes the code harder, it is worth it in the long run.
Also, even when it isn't an immediate requirement, think globally. Not all addresses in the world have a street or a state; for example, in France a valid address could be:
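As a rough illustration of the self-referencing idea, here is a minimal sketch using Python's built-in sqlite3 module. Only the table name address_entities comes from the text above; the other table names, all column names, and the sample rows (including the made-up "Fire Station 13") are assumptions for demonstration. It walks from one entity up to the root of the hierarchy with a recursive query.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a self-referencing entity table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE address_entities (
            id        INTEGER PRIMARY KEY,
            kind      TEXT NOT NULL,
            name      TEXT NOT NULL,
            -- parent_id points back into this same table and is NULL at the root
            parent_id INTEGER REFERENCES address_entities(id)
        );
        CREATE TABLE addresses (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        -- link table between an address and its geospatial entities
        CREATE TABLE address_entity_links (
            address_id INTEGER NOT NULL REFERENCES addresses(id),
            entity_id  INTEGER NOT NULL REFERENCES address_entities(id),
            PRIMARY KEY (address_id, entity_id)
        );
        INSERT INTO address_entities VALUES
            (1, 'state', 'California', NULL),
            (2, 'city', 'San Diego', 1),
            (3, 'neighborhood', 'La Jolla', 2);
        INSERT INTO addresses VALUES (1, 'Fire Station 13');
        INSERT INTO address_entity_links VALUES (1, 3);
    """)
    return conn


def ancestry(conn, entity_id):
    """Return (kind, name) pairs from the given entity up to the root."""
    return conn.execute("""
        WITH RECURSIVE chain(id, kind, name, parent_id, depth) AS (
            SELECT id, kind, name, parent_id, 0
            FROM address_entities WHERE id = ?
            UNION ALL
            SELECT e.id, e.kind, e.name, e.parent_id, c.depth + 1
            FROM address_entities e JOIN chain c ON e.id = c.parent_id
        )
        SELECT kind, name FROM chain ORDER BY depth
    """, (entity_id,)).fetchall()
```

Linking an address only to its most specific entity, as the sample data does, is one possible design choice; the rest of the hierarchy is then recovered by the recursive query rather than stored per address.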
- la Maison des Fou
- 24500 Eymet
So, bear that in mind when designing schemas.
Is this an OLTP system and a reporting system, or only a reporting system? If it's only a reporting system, you can denormalize the data in a data warehouse fashion (with or without snowflake dimensions for the hierarchies of geographic jurisdictions) and you'll find the reporting to be easier.
I would start from the results and work back, because it sounds to me like you are getting fed the data and you are trying to bring it into a database to support the reporting and mapping. In this case, the database schema being a traditional normalized system is not important because redundancy in the data is not something that will cause maintenance problems for users, etc.
If this seems appropriate, you want to look into the Kimball books.
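To make the denormalized, reporting-only idea concrete, here is a small sketch using Python's built-in sqlite3 module. The table incident_facts, its columns, and the sample rows are all hypothetical; the point is only that each row repeats its geographic hierarchy as plain text, so a report needs no joins.

```python
import sqlite3


def build_reporting_demo():
    """Create a flat, denormalized table where every row carries its geography."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE incident_facts (
            incident_id   INTEGER PRIMARY KEY,
            state         TEXT NOT NULL,
            city          TEXT NOT NULL,
            neighborhood  TEXT NOT NULL,
            incident_type TEXT NOT NULL
        );
        INSERT INTO incident_facts VALUES
            (1, 'CA', 'San Diego', 'La Jolla', 'fire'),
            (2, 'CA', 'San Diego', 'La Jolla', 'fire'),
            (3, 'CA', 'San Diego', 'Hillcrest', 'medical');
    """)
    return conn


def incidents_per_neighborhood(conn):
    """Count incidents per neighborhood without joining any lookup tables."""
    return conn.execute("""
        SELECT state, city, neighborhood, COUNT(*)
        FROM incident_facts
        GROUP BY state, city, neighborhood
        ORDER BY state, city, neighborhood
    """).fetchall()
```

The redundancy (the same state and city strings on many rows) is deliberate here, since the data is loaded from a feed rather than maintained by hand.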
As @Oddthinking noted in a comment, your problems started at:
So I changed the table to be more normalised by making the Neighborhood, City and State fields a foreign key to their own new table (eg. lookups) .. and those two fields are now NULLABLE.
So .. that all works fine. except when I try and do some SQL statements on them. Because of the NULLABLE FK's, it's a nightmare to make all these outer join queries.
This reminds me of the "Doctor, doctor, it hurts when I hit myself like this" joke.
Why exactly did you make the foreign key fields nullable? They were mandatory before, so you should keep them as mandatory, precisely to avoid the nightmares of outer join queries.
Your explanation (question) is somewhat confusing in that you list three fields (Neighborhood, City and State) and then say "those two fields are now nullable". Which two are? And why? And what is in the lookup table? Or is there more than one lookup table? There might be an argument for some sort of NeighbourhoodID number which is a foreign key to a Neighbourhood table, which defines the City and State as well as the Neighbourhood name. You might then decide that there is a closed list of cities and the cities have an ID number too, and that number determines the state too. You are probably as well off using a two-letter state code as creating a (probably 4-byte) state ID number. However, do not forget that a check constraint that ensures that the state code is one of the 50 or so valid state codes is harder to write than a foreign key that references a table of states. Since neither states nor cities change very often, I'd probably use the table of states with a foreign key, but the key column would be the state code.
That means you might have a table of Neighbourhoods with columns NeighbourhoodID, Name, CityID; a table of Cities with columns CityID, Name, State; and a table of States with columns State and Name. You can add other columns as you see fit. And your primary table would contain a NeighbourhoodID column that is a foreign key to Neighbourhoods table.
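The three tables described above could be sketched as follows with Python's built-in sqlite3 module. The Neighbourhoods, Cities and States tables and their columns follow the description; the Incidents table standing in for "your primary table", its columns, and the sample rows are assumptions. Every foreign key is NOT NULL, so the report needs only inner joins.

```python
import sqlite3


def build_lookup_demo():
    """Create States, Cities, Neighbourhoods and a primary table with mandatory FKs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE States (
            State TEXT PRIMARY KEY,   -- two-letter state code as the key
            Name  TEXT NOT NULL
        );
        CREATE TABLE Cities (
            CityID INTEGER PRIMARY KEY,
            Name   TEXT NOT NULL,
            State  TEXT NOT NULL REFERENCES States(State)
        );
        CREATE TABLE Neighbourhoods (
            NeighbourhoodID INTEGER PRIMARY KEY,
            Name            TEXT NOT NULL,
            CityID          INTEGER NOT NULL REFERENCES Cities(CityID)
        );
        -- hypothetical primary table
        CREATE TABLE Incidents (
            IncidentID      INTEGER PRIMARY KEY,
            Description     TEXT NOT NULL,
            NeighbourhoodID INTEGER NOT NULL
                REFERENCES Neighbourhoods(NeighbourhoodID)
        );
        INSERT INTO States VALUES ('CA', 'California');
        INSERT INTO Cities VALUES (1, 'San Diego', 'CA');
        INSERT INTO Neighbourhoods VALUES (1, 'La Jolla', 1);
        INSERT INTO Incidents VALUES (1, 'Brush fire', 1);
    """)
    return conn


def incident_report(conn):
    """List incidents with neighbourhood, city and state using inner joins only."""
    return conn.execute("""
        SELECT i.Description, n.Name, c.Name, c.State
        FROM Incidents i
        JOIN Neighbourhoods n ON n.NeighbourhoodID = i.NeighbourhoodID
        JOIN Cities c ON c.CityID = n.CityID
        JOIN States s ON s.State = c.State
        ORDER BY i.IncidentID
    """).fetchall()
```

With foreign keys enforced, inserting a city whose state code is not in States fails, which is the validation the foreign key gives you in place of a hand-written check on the state code.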
This is a nice place to start. A whole #$(#$-load of database schemas to check out:
http://www.databaseanswers.org/data_models/
This is a problem I've had to deal with, and RDBMSs aren't the best at storing hierarchical data. You might want to look at using an object database, since these have to deal with nested objects and are optimized for the problem.
If you need to use an RDBMS, you may have to stick with a de-normalized schema though. Having separate tables to maintain your cities, streets etc. may still be handy for tracking changes. If a city or street needs to be renamed, you can update the master record in the respective table and schedule a job to update a text copy of the string in your 'main' table. This will prevent you from having to run updates on tens or hundreds of thousands of rows during prime time, but still lets you store the most up-to-date data in the db. Of course, this makes the data duplication situation even worse, but it's the price to pay for performance.
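A minimal sketch of that rename-then-refresh pattern, using Python's built-in sqlite3 module, might look like this. The cities and incidents tables, their columns, the sample rows, and the function names are all hypothetical; how the refresh job gets scheduled (cron, a job queue, and so on) is left out.

```python
import sqlite3


def build_denormalized_demo():
    """Create a master cities table and a main table holding a text copy of the name."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cities (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE incidents (
            id        INTEGER PRIMARY KEY,
            city_id   INTEGER NOT NULL REFERENCES cities(id),
            city_name TEXT NOT NULL   -- denormalized text copy of cities.name
        );
        INSERT INTO cities VALUES (1, 'Old Name');
        INSERT INTO incidents VALUES (1, 1, 'Old Name'), (2, 1, 'Old Name');
    """)
    return conn


def rename_city(conn, city_id, new_name):
    """Cheap prime-time operation: touch only the single master record."""
    with conn:
        conn.execute("UPDATE cities SET name = ? WHERE id = ?", (new_name, city_id))


def refresh_city_names(conn):
    """Off-peak job: bring stale text copies in line with the master records.

    Returns the number of rows that were rewritten.
    """
    with conn:
        cur = conn.execute("""
            UPDATE incidents
            SET city_name = (SELECT name FROM cities WHERE cities.id = incidents.city_id)
            WHERE city_name <> (SELECT name FROM cities WHERE cities.id = incidents.city_id)
        """)
    return cur.rowcount
```

Between the rename and the refresh, the main table still shows the old name, which is the trade-off the paragraph above describes.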