Question
I need to connect two tables in a query whose result I will insert into a third table (used in the future to join the two). I will mention only the relevant columns of these tables.
PostgreSQL version 9.0.5
Table 1: data_table
migrated data, ca. 10k rows, relevant columns:
id (primary key),
address (beginning of an address, string that I need to match with the second table. This address has varying length.)
Table 2: dictionary
dictionary, ca. 9 million rows, relevant columns:
id (primary key),
address (full address, string that I need to match with the first table, varying length as well.)
What exactly do I need
I need to correctly connect these tables in a select statement and then insert the result into a third table. All I need is a way to successfully connect these tables.
The way I want to do it is to take each address from data_table and join it with the first address (edit: order by address asc) from dictionary that begins with data_table.address. This must not multiply records, since many addresses in dictionary begin with each data_table.address.
Also, addresses in both tables contain a lot of irregular spaces, so we probably need to apply replace(address, ' ', '') to both of them (any alternative ideas welcome). There might also be some performance issues, since dictionary has 9 million rows and the server is rather slow.
I see the result as some variation of the following query:
select
    data_table.id, dictionary.id
from
    data_table, dictionary
where
    -conditions-
Answer 1:
SELECT DISTINCT ON (1)
       t.id, d.address, d.id
FROM   data_table t
JOIN   dictionary d ON replace(d.address, ' ', '')
                  LIKE (replace(t.address, ' ', '') || '%')
ORDER  BY t.id, d.address, d.id
(ORDER BY updated after the question was edited.) Without ORDER BY, it picks an arbitrary match.
Explanation for the technique in this related answer:
Select first row in each GROUP BY group?
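To try the query above without touching the real tables, a self-contained sketch like the following may help. The sample rows and the output aliases are hypothetical; the CTEs merely stand in for data_table and dictionary.

```sql
-- Hypothetical sample rows standing in for the real tables.
WITH data_table (id, address) AS (
    VALUES (1, 'Main  St 1'),
           (2, 'Oak Ave')
), dictionary (id, address) AS (
    VALUES (10, 'Main St 12, Springfield'),
           (11, 'Main St 1 A, Springfield'),
           (12, 'Oak  Ave 7, Shelbyville')
)
SELECT DISTINCT ON (t.id)              -- keep one row per data_table.id
       t.id      AS data_table_id,
       d.id      AS dictionary_id,
       d.address AS matched_address
FROM   data_table t
JOIN   dictionary d
       ON replace(d.address, ' ', '') LIKE (replace(t.address, ' ', '') || '%')
ORDER  BY t.id, d.address, d.id;       -- first address in ascending order wins
```

Here data_table row 1 has two candidates (dictionary rows 10 and 11) once spaces are stripped. Which one is kept depends on how the database's collation sorts d.address, so it is worth checking that order on real data.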
A functional index on your dictionary would make this fast:
CREATE INDEX dictionary_address_text_pattern_ops_idx
ON dictionary (replace(address, ' ', '') text_pattern_ops);
More explanation for that in the answer I provided to the preceding question.
One might debate if that gets you the "best" match. One alternative would be a similarity match with a trigram index. Details in the first of the links I added to your last question.
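The similarity alternative could look roughly like the sketch below. It assumes the pg_trgm contrib module is already installed in the database (on 9.0 it comes from the contrib SQL script, since CREATE EXTENSION only arrived in 9.1). The index name and the output aliases are hypothetical.

```sql
-- Assumption: pg_trgm is installed; the index name is made up for this sketch.
CREATE INDEX dictionary_address_trgm_idx
    ON dictionary USING gist (address gist_trgm_ops);

SELECT DISTINCT ON (t.id)
       t.id                             AS data_table_id,
       d.id                             AS dictionary_id,
       similarity(d.address, t.address) AS sim
FROM   data_table t
JOIN   dictionary d ON d.address % t.address   -- % is true above the similarity threshold
ORDER  BY t.id, sim DESC, d.id;                -- most similar candidate first
```

This is not a drop-in replacement for the prefix match. A short data_table.address compared with a much longer dictionary address can fall below the default threshold of 0.3 (adjustable with set_limit()), so the results should be checked against the LIKE version before relying on it.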
Answer 2:
The solution that our architect came up with was writing a function to find the first match.
The function:
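CREATE OR REPLACE FUNCTION pick_one_address(text)
  RETURNS text AS
$BODY$
DECLARE
   address_query text;
   toFind        text;
   found_id      text;  -- renamed from "found" so it does not shadow plpgsql's FOUND
BEGIN
   -- strip spaces and append the LIKE wildcard for a prefix match
   toFind := replace($1, ' ', '') || '%';
   -- quote_literal() escapes quotes inside the address;
   -- there is no ORDER BY, so LIMIT 1 returns an arbitrary match
   address_query := 'select al.id from dictionary al where replace(al.address, '' '', '''') like '
                    || quote_literal(toFind) || ' limit 1';
   EXECUTE address_query INTO found_id;
   RETURN found_id;
END
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;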
The code might seem strange, since I changed the table names to protect my company's privacy and, to simplify the question, didn't mention the third table I used. I guess it should still be enough to understand the mechanism.
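For illustration, calling the function for every row and storing the result might look like the sketch below. The link table address_link and its columns are hypothetical, since the real third table is not shown, and integer keys are an assumption about data_table.id and dictionary.id.

```sql
-- Hypothetical link table; the real third table is not part of this thread.
CREATE TABLE address_link (
    data_table_id integer NOT NULL,
    dictionary_id integer            -- stays NULL when no dictionary row matches
);

INSERT INTO address_link (data_table_id, dictionary_id)
SELECT t.id,
       pick_one_address(t.address)::integer  -- the function returns the id as text
FROM   data_table t;
```

The function runs one lookup per data_table row, so roughly 10k single-row queries against dictionary in total.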
Thanks for your input @ErwinBrandstetter, @CraigRinger
Source: https://stackoverflow.com/questions/16570597/joining-two-tables-in-a-complex-query-not-uniform-data