Question
A good way to quickly survey the information in a database is to apply a tool that automatically creates a database diagram of all tables and all relationships between them.
In my experience, such tools use foreign keys as the relationships, which most of the databases I try them on do not contain. Sure, the data satisfies the constraints corresponding to foreign keys, but the databases do not enforce them. So I end up with a 'diagram' consisting of a bunch of unrelated tables.
So what I'm looking for is software that can compute "undeclared foreign keys" and either
- uses them as table relations in a database diagram, or
- generates SQL code for corresponding foreign key declarations
Do you know any tools, free if possible, that can already do this?
Answer 1:
Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.
I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on)
- are these in-house enterprise systems that follow some consistent naming convention or pattern?
- or are they 'in-the-wild' databases that you come across anywhere, at any time?
- what sort of assumptions are you prepared to make?
- would you prefer to get more false positives or false negatives in your result?
Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.
So I offer two approaches that I'd use in concert.
Inferring a relationship through structure / naming (symbolic analysis)
Common database design is to name a PK column after the table name (e.g. CustomerId on table Customer), or alternatively to name the PK column simply Id.
A table with a FK relationship to another often names its related column after the related table. In the Order table I'd expect a CustomerId column which refers to the CustomerId / Id column in the Customer table.
This type of analysis would include
- inspecting columns across tables for similar phrases / words
- looking for column names that are similar to the names of other tables
- checking for column names that contain the name of another column (e.g. FirstCustomerId & SecondCustomerId both refer to the CustomerId column in the Customer table)
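As a rough illustration of the naming checks listed above, here is a minimal Python sketch. It assumes you have already read the schema into a dict mapping table names to column names; the function name infer_fk_candidates and the "ends in Id / _id" rule are my own assumptions, not part of any existing tool. It deliberately ignores self-references and will produce false positives on short table names.

```python
import re

def infer_fk_candidates(schema):
    """Guess (child_table, column, parent_table) triples from names alone.

    `schema` maps each table name to a list of its column names
    (a hypothetical input format chosen for this sketch).
    """
    tables = {name.lower(): name for name in schema}
    candidates = []
    for child, columns in schema.items():
        for column in columns:
            # Only consider columns that look like <something>Id or <something>_id.
            match = re.fullmatch(r"(.+?)_?id", column, flags=re.IGNORECASE)
            if not match:
                continue
            stem = match.group(1).lower()
            # Tables whose name the stem ends with, so FirstCustomerId -> Customer.
            # The child table itself is skipped, which also skips self-references.
            hits = [t for t in tables if stem.endswith(t) and t != child.lower()]
            if hits:
                candidates.append((child, column, tables[max(hits, key=len)]))
    return candidates

schema = {
    "Customer": ["Id", "Name"],
    "Order": ["Id", "CustomerId", "FirstCustomerId", "Total"],
}
for child, column, parent in infer_fk_candidates(schema):
    print(f"{child}.{column} -> {parent}")
```

Treat the output as a list of candidates to verify against the data, not as confirmed relationships.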
Inferring a relationship through data (statistical analysis)
Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the CustomerId column in the Order table contains values which don't exist in the Id column of the Customer table, then it's reasonable to question whether this is a valid relationship (although you never know!)
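One way to run that check is sketched below in Python against an in-memory SQLite database. The helper orphan_fraction is a name I made up, and the demo tables and rows are invented purely for illustration. Identifiers are interpolated into the SQL directly, so only pass names you trust.

```python
import sqlite3

def orphan_fraction(conn, child, child_col, parent, parent_col):
    """Fraction of non-NULL child values with no matching parent value."""
    sql = f'''
        SELECT COUNT(*),
               SUM(CASE WHEN p.v IS NULL THEN 1 ELSE 0 END)
        FROM "{child}" AS c
        LEFT JOIN (SELECT DISTINCT "{parent_col}" AS v FROM "{parent}") AS p
               ON c."{child_col}" = p.v
        WHERE c."{child_col}" IS NOT NULL
    '''
    total, orphans = conn.execute(sql).fetchone()
    # An empty child column tells us nothing; report 0.0 rather than divide by zero.
    return 0.0 if not total else orphans / total

# Invented demo data: one of the four orders points at a missing customer.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE Customer (Id INTEGER);
    CREATE TABLE "Order" (Id INTEGER, CustomerId INTEGER);
    INSERT INTO Customer VALUES (1), (2), (3);
    INSERT INTO "Order" VALUES (10, 1), (11, 2), (12, 2), (13, 99);
''')
print(orphan_fraction(conn, "Order", "CustomerId", "Customer", "Id"))
```

A fraction of zero supports the candidate relationship; a small non-zero fraction may just mean dirty data, so where you set the threshold is a judgment call.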
A simple form of data analysis uses dates and times. Rows that were created in close proximity to one another are more likely to be related. If, for every Order row that was created, there also exist between 2 and 5 Item rows created within a few seconds, then a relationship between the two is likely.
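The timestamp idea could be prototyped along these lines. This is only a sketch: it assumes creation times are available as plain numbers of seconds, and nearby_counts and the 5-second window are my own choices for illustration.

```python
from bisect import bisect_left, bisect_right

def nearby_counts(parent_times, child_times, window_seconds=5.0):
    """For each parent timestamp, count child timestamps in [t, t + window]."""
    child_sorted = sorted(child_times)
    counts = []
    for t in parent_times:
        lo = bisect_left(child_sorted, t)
        hi = bisect_right(child_sorted, t + window_seconds)
        counts.append(hi - lo)
    return counts

def fraction_in_range(counts, low=2, high=5):
    """Share of parents that have between `low` and `high` nearby children."""
    if not counts:
        return 0.0
    return sum(low <= c <= high for c in counts) / len(counts)

# Invented timestamps (seconds): two orders, each followed by a burst of items.
order_times = [0.0, 100.0]
item_times = [0.5, 1.0, 2.0, 100.2, 101.0, 300.0]
counts = nearby_counts(order_times, item_times)
print(counts, fraction_in_range(counts))
```

A high fraction only hints at a relationship; bulk loads and batch jobs can make unrelated tables look correlated in time.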
A more detailed analysis might look at the range and distribution of used values.
For example, if your Order table has a St_Id column, you might infer using symbolic analysis that the column is likely to relate to either a State table or a Status table. Suppose the St_Id column has 6 discrete values, and 90% of the records are covered by 2 values. The State table has 200 rows, and the Status table has 9 rows. You could quite reasonably infer that the St_Id column relates to the Status table: it gives greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in the State table would be used).
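The coverage comparison in that example is simple arithmetic, shown here as a small Python sketch using the figures from the text. The function name coverage and the ranking step are assumptions of mine.

```python
def coverage(distinct_values_used, candidate_row_count):
    """Share of the candidate parent's rows that the child column could reference."""
    return distinct_values_used / candidate_row_count

# Figures from the example: St_Id has 6 distinct values,
# State has 200 rows and Status has 9 rows.
distinct_in_st_id = 6
candidates = {"State": 200, "Status": 9}

ranked = sorted(
    candidates,
    key=lambda table: coverage(distinct_in_st_id, candidates[table]),
    reverse=True,
)
for table in ranked:
    print(table, round(coverage(distinct_in_st_id, candidates[table]), 2))
```

A candidate with fewer rows than the column has distinct values would score above 1 and can be ruled out immediately.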
If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns to emerge that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely that this column relates to a table with a correspondingly small number of rows.
In summary
Best of luck. It's an interesting problem. I've just thrown some ideas out there, but this is very much a trial-and-error, data-gathering and performance-tuning situation.
Answer 2:
This is a non-trivial exercise in most cases. If you are lucky enough to be analysing a schema for a modern framework, such as Ruby on Rails, or CakePHP or similar, and the developers have been stringent about following column conventions, then you have a reasonable chance of finding many, but not all, of the implied relationships.
I.e. if your tables use columns like user_id to refer to entries in the users table.
Be aware: some entity names may pluralise irregularly (entity being a good example: entities, not entitys) and these are harder to catch (but still possible). However, keys such as admin_id, which the developers join with the users table on user.id, can't be inferred. You would need to handle those cases manually.
You didn't specify an RDBMS, but I use MySQL a lot, and I'm currently working on this problem for myself.
The following MySQL script will infer most relationships implied by column names. It then lists any relationships that it could not find table names for, so at least you know which ones you're missing. The inferred parent and child are listed, along with singular and plural names, plus the implied relationship:
-- this DB is where MySQL keeps schema information
use information_schema;
-- change this to the DB you want to analyse
set @db_name = "example_DB";
-- infer relationships
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural
, C.TABLE_NAME as child_table
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from COLUMNS C
JOIN TABLES T on C.TABLE_NAME = T.TABLE_NAME
and C.TABLE_SCHEMA = T.TABLE_SCHEMA
and T.TABLE_TYPE != "VIEW" -- filter out views; comment this line if you want to include them
where COLUMN_NAME like "%\_id" -- look for columns of the form <name>_id (the backslash stops _ acting as a single-character wildcard)
and C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name
-- and C.TABLE_NAME not like "wwp%" -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;
This will return results like this:
Then you can examine any naming convention exceptions detected with:
-- Now list any inferred parents that weren't real tables to see why (irregular plurals and columns not named according to convention)
select LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
, CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural
, C.TABLE_NAME as child_table
from COLUMNS C
JOIN TABLES T on C.TABLE_NAME = T.TABLE_NAME
and C.TABLE_SCHEMA = T.TABLE_SCHEMA
and T.TABLE_TYPE != "VIEW" -- filter out views, comment this line if you want to include them
where COLUMN_NAME like "%\_id" -- the backslash stops _ acting as a single-character wildcard
and C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name
-- and C.TABLE_NAME not like "wwp%" -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;
This will return results like this, which you can process manually:
You can modify these scripts to spit out whatever is useful to you, including foreign key create statements, if you want to. Here, the final column is a simple 'has many' relationship statement. I use this in a tool I've built called pidgin, which is a rapid modelling tool that draws relationship diagrams on the fly based on relationship statements written in a very simple syntax (also called 'pidgin'). You can check it out at http://pidgin.gruffdavies.com
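If you would rather post-process the inferred relationships outside the database, a sketch like the following could turn them into foreign key declarations. It assumes MySQL-style backtick quoting and that every parent's key column is called id; the function name fk_statements and the fk_<child>_<column> constraint naming are my own inventions.

```python
def fk_statements(relations, parent_key="id"):
    """Yield ALTER TABLE statements for (parent_table, child_table, column) triples.

    Assumes MySQL-style quoting and a parent key column named `parent_key`.
    """
    for parent, child, column in relations:
        constraint = f"fk_{child}_{column}"
        yield (
            f"ALTER TABLE `{child}` ADD CONSTRAINT `{constraint}` "
            f"FOREIGN KEY (`{column}`) REFERENCES `{parent}` (`{parent_key}`);"
        )

# Hypothetical inferred relationships, e.g. copied from the query output.
inferred = [("users", "orders", "user_id"), ("orders", "items", "order_id")]
for statement in fk_statements(inferred):
    print(statement)
```

Review the generated statements before running them: adding a constraint will fail if the child column holds values missing from the parent, and column types must match.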
I've run the MySQL scripts above on a demo DB to show you the sort of results you can expect:
I haven't catered for the irregular plurals in my script, but I might have a go at that too, at least for the case of entities ending in -y. If you want to have a try at that yourself, I'd recommend writing a stored function that takes <name>_id column names as a parameter, strips the _id part and then applies some heuristics to attempt to pluralise correctly.
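To give an idea of what such heuristics might look like, here is a Python sketch rather than a MySQL stored function, so you would need to port the rules yourself. The function names pluralise and parent_table_for are made up, and the rules only cover a few common English patterns.

```python
def pluralise(singular):
    """Apply a few common English pluralisation rules (far from complete)."""
    lower = singular.lower()
    # consonant + y -> ies (entity -> entities), but day -> days
    if len(lower) > 1 and lower.endswith("y") and lower[-2] not in "aeiou":
        return singular[:-1] + "ies"
    # sibilant endings take -es (bus -> buses, box -> boxes)
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return singular + "es"
    return singular + "s"

def parent_table_for(column_name):
    """Strip a trailing _id and pluralise what is left; None if not <name>_id."""
    if len(column_name) <= 3 or not column_name.lower().endswith("_id"):
        return None
    return pluralise(column_name[:-3])

for column in ["user_id", "category_id", "bus_id", "entity_id", "id"]:
    print(column, "->", parent_table_for(column))
```

Truly irregular plurals (person/people) and columns like admin_id that point at users still need a manual lookup table.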
Hope that's useful!
Answer 3:
The following products all claim to provide foreign key discovery:
ERwin http://www.ascent.co.za/products/ca_erwin_data_profiler.html
Informatica https://community.informatica.com/onlinehelp/analyst/961/en/index.htm#page/data-discovery-guide/GUID-33EAF039-ECFC-49FD-96F4-A2C2A4EB857F.1.148.html
and XCaseForI http://xcasefori.com/discovering/index.html
Statistical methods that provide a kind of similarity rank, such as range distribution and creation time as suggested by Kirk, seem to be the right way. I'd need to implement this using SAS EG or any free tool.
Answer 4:
I don't know of any software that does what you require, but the following query will help to get you started. It lists all declared foreign key relationships within the current database.
SELECT
FK_Table = FK.TABLE_NAME,
FK_Column = CU.COLUMN_NAME,
PK_Table = PK.TABLE_NAME,
PK_Column = PT.COLUMN_NAME,
Constraint_Name = C.CONSTRAINT_NAME
FROM
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
INNER JOIN (
SELECT
i1.TABLE_NAME,
i2.COLUMN_NAME
FROM
INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
WHERE
i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
) PT
ON PT.TABLE_NAME = PK.TABLE_NAME
Hope this helps.
Source: https://stackoverflow.com/questions/7031203/tools-for-discovering-de-facto-foreign-keys-in-databases