Joining two tables in a complex query (not uniform data)

Question

I need to connect two tables in a query whose result I will insert into a third table (which will later be used to join the two). I will mention only the relevant columns of these tables.

PostgreSQL version 9.0.5

Table 1: data_table

Migrated data, ca. 10k rows. Relevant columns:

id (primary key),

address (the beginning of an address; a string that I need to match against the second table. Its length varies.)

Table 2: dictionary

Dictionary, ca. 9 million rows. Relevant columns:

id (primary key),

address (the full address; a string that I need to match against the first table. Its length varies as well.)

What exactly do I need

I need to connect these tables correctly in a SELECT statement and then insert the result into a third table. All I need is a way to connect the tables successfully.

What I want is to take each address from data_table and join it with the first address (edit: ordered by address asc) from dictionary that begins with data_table.address, without multiplying records, since many addresses in dictionary begin with each data_table.address.

Also, the addresses in both tables contain a lot of irregular spaces, so we probably need to apply

replace(address, ' ', '')

to both of them (alternative ideas are welcome). There might also be performance issues, since dictionary has 9 million rows and the server is rather slow.

I see the result as some variation of the following query:

select 
data_table.id, dictionary.id
from
data_table, dictionary
where
-conditions-

Answer 1:

SELECT DISTINCT ON (1)
       t.id, d.address, d.id
FROM   data_table t
JOIN   dictionary d ON replace(d.address, ' ', '')
                 LIKE (replace(t.address, ' ', '') || '%')
ORDER  BY t.id, d.address, d.id

(ORDER BY updated after the question was edited.) Without ORDER BY, DISTINCT ON picks an arbitrary match for each t.id.
An explanation of the technique is in this related answer:
Select first row in each GROUP BY group?
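
Since the stated goal is to fill a third table, the query above can be wrapped in an INSERT ... SELECT. This is only a minimal sketch: the link table address_link and its columns data_id and dictionary_id are hypothetical names that do not appear in the question, and the integer types are an assumption about the two id columns.

```sql
-- Hypothetical link table; its name, columns and types are assumptions.
CREATE TABLE address_link (
    data_id       integer PRIMARY KEY,  -- assumes data_table.id is an integer
    dictionary_id integer NOT NULL      -- assumes dictionary.id is an integer
);

-- Insert exactly one dictionary match per data_table row:
-- the match whose address sorts first, with d.id as a tie-breaker.
INSERT INTO address_link (data_id, dictionary_id)
SELECT DISTINCT ON (t.id)
       t.id, d.id
FROM   data_table t
JOIN   dictionary d ON replace(d.address, ' ', '')
                 LIKE (replace(t.address, ' ', '') || '%')
ORDER  BY t.id, d.address, d.id;
```

Because this is an inner join, rows of data_table without any matching dictionary address are simply not inserted.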

A functional index on your dictionary would make this fast:

CREATE INDEX dictionary_address_text_pattern_ops_idx
ON dictionary (replace(address, ' ', '') text_pattern_ops);

There is more explanation of that in the answer I provided to the preceding question.

One might debate whether that gets you the "best" match. One alternative would be a similarity match with a trigram index. Details are in the first of the links I added to your last question.
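
A rough sketch of what such a trigram-based match could look like. It assumes the pg_trgm contrib module is already installed in the database (from PostgreSQL 9.1 on via CREATE EXTENSION pg_trgm; on 9.0 by running the contrib SQL script). The index name is made up for illustration, and whether similarity gives better matches than a prefix match for this data is untested here.

```sql
-- Assumes the pg_trgm contrib module is already installed in this database.
-- The index name is hypothetical.
CREATE INDEX dictionary_address_trgm_idx
ON dictionary USING gist (address gist_trgm_ops);

-- For each data_table row, keep the dictionary row with the highest similarity.
-- The % operator is true when similarity exceeds pg_trgm's current threshold.
SELECT DISTINCT ON (t.id)
       t.id, d.id AS dictionary_id,
       similarity(d.address, t.address) AS sim
FROM   data_table t
JOIN   dictionary d ON d.address % t.address
ORDER  BY t.id, sim DESC, d.address;
```

A short address prefix compared with a long full address may score below the default threshold, so the threshold might have to be lowered with set_limit() before this returns useful matches.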

Answer 2:

The solution that our architect came up with was writing a function to find the first match.

The function:

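CREATE OR REPLACE FUNCTION pick_one_address(text)
  RETURNS text AS
$BODY$
DECLARE
  address_query text;
  to_find text;
  found_id text;  -- not named "found", to avoid shadowing PL/pgSQL's special FOUND variable
BEGIN
  -- Strip spaces from the input and turn it into a left-anchored LIKE pattern.
  to_find := replace($1, ' ', '') || '%';
  -- Build the statement dynamically so that the pattern is a literal;
  -- quote_literal() escapes any apostrophes in the address.
  address_query := 'select al.id from dictionary al'
                || ' where replace(al.address, '' '', '''') like ' || quote_literal(to_find)
                || ' order by al.address limit 1';
  EXECUTE address_query INTO found_id;
  RETURN found_id;  -- NULL when no dictionary address matches
END $BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;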

The code might seem strange because I changed the table names to protect my company's privacy, and because, to simplify the question, I did not mention a third table I used. I guess it should still be enough to understand the mechanism.
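
For illustration, the function could be called once per row of data_table, roughly like this. The column alias dictionary_id is made up for the example.

```sql
-- One lookup per data_table row.
-- dictionary_id is NULL when no dictionary address matches.
SELECT t.id,
       pick_one_address(t.address) AS dictionary_id
FROM   data_table t;
```

The function returns text, so the result may need a cast (for example to integer) before it is inserted into a table whose id column has another type.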

Thanks for your input @ErwinBrandstetter, @CraigRinger

Source: https://stackoverflow.com/questions/16570597/joining-two-tables-in-a-complex-query-not-uniform-data

Tags

postgresql

join

insert

data-migration