Question
The following tables are given:
--- player ---
id SERIAL
name VARCHAR(100)
birthday DATE
country VARCHAR(3)
PRIMARY KEY id
--- club ---
id SERIAL
name VARCHAR(100)
country VARCHAR(3)
PRIMARY KEY id
--- playersinclubs ---
id SERIAL
player_id INTEGER (with INDEX)
club_id INTEGER (with INDEX)
joined DATE
left DATE
PRIMARY KEY id
Every player has a row in table player (with their attributes). Likewise, every club has a row in table club. For every station in their career, a player has a row in table playersinclubs (n:m) with the date the player joined and, optionally, the date the player left the club.
My main problem is the performance of these tables. Table player has over 10 million rows. To display the history of a club with all players who ever played for it, my SELECT looks like this:
SELECT * FROM player
JOIN playersinclubs ON player.id = playersinclubs.player_id
JOIN club ON club.id = playersinclubs.club_id
WHERE club.id = 3;
But because of the huge number of players, a sequential scan is executed on table player, and the query takes a long time.
Before I added some new features to my app, every player had exactly one team (only today's teams and players), so I didn't have the table playersinclubs. Instead, I had a team_id column in table player and could select the players of a team directly from player with the WHERE clause team_id = 3.
Does anyone have performance tips for my database structure to speed up these queries?
Answer 1:
Most importantly, you need an index on playersinclubs(club_id, player_id). The rest is details (that may still make quite a difference).
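As a minimal sketch, the index could be created like this. The table and column names come from the question; the index name is made up for illustration.

```sql
-- Multicolumn index: club_id comes first so that lookups by club can use it;
-- player_id comes second so the join column can be read from the index.
-- The index name is hypothetical.
CREATE INDEX playersinclubs_club_player_idx
    ON playersinclubs (club_id, player_id);
```

On a table that is being written to in production, CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built, at the cost of a slower build.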
You need to be precise about your actual goals. You write:
"all his players played for this club"
You don't need to join to club for this at all:
SELECT p.*
FROM playersinclubs pc
JOIN player p ON p.id = pc.player_id
WHERE pc.club_id = 3;
And you don't need the columns of playersinclubs in the output either, which is a small gain for performance. If leaving them out allows an index-only scan on playersinclubs, the gain may be substantial.
- How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
You probably don't need all columns of player in the result, either. Only SELECT the columns you actually need.
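For example, if the club history page only shows names and birthdays (an assumption about the application, not something the question states), the query could be narrowed to:

```sql
-- Return only the player columns the page displays (assumed: id, name, birthday).
SELECT p.id, p.name, p.birthday
FROM   playersinclubs pc
JOIN   player p ON p.id = pc.player_id
WHERE  pc.club_id = 3;
```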
The PK on player provides the index you need on that table.
You need an index on playersinclubs(club_id, player_id), but do not make it unique unless players are not allowed to join the same club a second time.
If players can join multiple times and you just want a list of "all players", you also need to add a DISTINCT step to fold duplicate entries. You could just:
SELECT DISTINCT p.* ...
But since you are trying to optimize performance: it's cheaper to eliminate dupes early:
SELECT p.*
FROM (
SELECT DISTINCT player_id
FROM playersinclubs
WHERE club_id = 3
) pc
JOIN player p ON p.id = pc.player_id;
Maybe you really want all entries in playersinclubs and all columns of the table, too. But your description says otherwise. Query and indexes would be different.
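A rough sketch of that variant, assuming the goal is one row per stint at the club with its join and leave dates. It also assumes the column really is named left: that is a reserved word in PostgreSQL, so it has to be double-quoted. The alias left_on is illustrative.

```sql
-- One row per stint, not per player; no DISTINCT step here.
SELECT p.id, p.name, pc.joined, pc."left" AS left_on
FROM   playersinclubs pc
JOIN   player p ON p.id = pc.player_id
WHERE  pc.club_id = 3
ORDER  BY pc.joined;
```

Because this query reads joined and left from the table rows, an index on (club_id, player_id) cannot answer it with an index-only scan. Whether an index such as (club_id, joined) serves it better is a guess that would need to be checked with EXPLAIN.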
Closely related answer:
- Find overlapping date ranges in PostgreSQL
Answer 2:
The tables look fine and so does the query. So let's see what the query is supposed to do:
- Select the club with ID 3. One record that can be accessed via the PK index.
- Select all playersinclub records for club ID 3. So we need an index starting with this column. If you don't have it, create it.
I suggest:
create unique index idx_playersinclubs on playersinclubs(club_id, player_id, joined);
This would be the table's unique business key. I know that in many databases with technical IDs these unique constraints are not established, but I consider this a flaw in those databases and would always create these constraints/indexes.
- Use the player IDs obtained this way to select the players. We can get the player ID from the playersinclubs records, but it is also the second column in our index, so the DBMS may choose either one to perform the join. (It will probably use the column from the index.)
So maybe it is simply that the above index does not exist yet.
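To check whether such an index already exists and whether the planner uses it, something like the following can be run in psql. pg_indexes is a standard PostgreSQL system view; the second statement is the three-table join from the question, written against the id column of club.

```sql
-- List the indexes that currently exist on the join table.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'playersinclubs';

-- Show the actual execution plan. Look for an index scan on playersinclubs
-- rather than a sequential scan on player.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   player
JOIN   playersinclubs ON player.id = playersinclubs.player_id
JOIN   club ON club.id = playersinclubs.club_id
WHERE  club.id = 3;
```

Note that EXPLAIN with the ANALYZE option actually executes the query, so on a slow query it takes as long as the query itself.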
Source: https://stackoverflow.com/questions/46301784/many-to-many-table-performance-is-bad