Partition a very large INNER JOIN SQL query

Question

The SQL query is a fairly standard inner join. For example, comparing n tables to see which customer IDs exist in all n tables would be a basic WHERE ... AND query.

The problem is that each table holds more than 10 million records. The database is denormalized, and normalization is not an option. The query either takes too long to complete or never completes.

I'm not sure if it's relevant, but we are using Spring XD job modules for other types of queries.

I'm not sure how to partition this sort of job so that it can run in parallel and take less time, and so that if a step or subsection fails it can continue from where it left off.

Other posts about similar problems suggest alternatives to the database engine, such as implementing a loop join in code or using MapReduce or Hadoop. Having never used either, I'm unsure whether they are worth looking into for this use case.

What is the standard approach to this sort of operation? I'd expect it to be fairly common. I might be using the wrong search terms, because I haven't come across any stock solutions or clear directions.

The rather cryptic original requirement was:

Compare party_id column in the three very large tables to identify the customer available in three table i.e if it is AND operation between three. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID AND SAMPLE3.PARTY_ID

If the operation is OR then pick all the customers available in the three tables. SAMPLE1.PARTY_ID OR SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID

AND / OR are used between tables then performed the comparison as required. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID

I set up four test tables, each with this definition:

CREATE TABLE `TABLE1` (
  `CREATED` datetime DEFAULT NULL,
  `PARTY_ID` varchar(45) NOT NULL,
  `GROUP_ID` varchar(45) NOT NULL,
  `SEQUENCE_ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`SEQUENCE_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=978536 DEFAULT CHARSET=latin1;

Then I added 1,000,000 records to each table, using random numbers in a range that should produce join matches.

I used the following test query:

SELECT `TABLE1`.`PARTY_ID` AS `pi1`,
       `TABLE2`.`PARTY_ID` AS `pi2`,
       `TABLE3`.`PARTY_ID` AS `pi3`,
       `TABLE4`.`PARTY_ID` AS `pi4`
FROM `devt1`.`TABLE2` AS `TABLE2`,
     `devt1`.`TABLE1` AS `TABLE1`,
     `devt1`.`TABLE3` AS `TABLE3`,
     `devt1`.`TABLE4` AS `TABLE4`
WHERE `TABLE2`.`PARTY_ID` = `TABLE1`.`PARTY_ID`
  AND `TABLE3`.`PARTY_ID` = `TABLE2`.`PARTY_ID`
  AND `TABLE4`.`PARTY_ID` = `TABLE3`.`PARTY_ID`

It's supposed to complete in under 10 minutes, and for tables 10x larger. My test query still hasn't completed after running for 15 minutes.

Answer 1:

The following may perform better than the existing join-based query:

select party_id from
(select distinct party_id from SAMPLE1 union all
 select distinct party_id from SAMPLE2 union all
 select distinct party_id from SAMPLE3) as ilv
group by party_id 
having count(*) = 3

Amend the count(*) condition to match the number of tables being queried.
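
As a small illustration of amending the count, here is a sketch that runs the same query shape against four tables named like the question's test tables. It uses Python's built-in sqlite3 module with an in-memory database purely so it can be run anywhere; the helper name ids_in_all and the sample rows are made up for the example, and timings on a real MySQL server with millions of rows are not something this sketch can tell you.

```python
import sqlite3


def ids_in_all(conn, tables):
    """Return party_id values that occur in every table in `tables`."""
    # Table names cannot be bound as parameters, so they are interpolated;
    # only pass trusted, hard-coded names here.
    branches = " union all ".join(
        f"select distinct party_id from {t}" for t in tables
    )
    # Each branch contributes an id at most once, so an id present in all
    # tables shows up exactly len(tables) times in the combined rows.
    sql = (
        f"select party_id from ({branches}) as ilv "
        "group by party_id having count(*) = ? order by party_id"
    )
    return [row[0] for row in conn.execute(sql, (len(tables),))]


# Hypothetical sample data: only 'p2' appears in all four tables.
conn = sqlite3.connect(":memory:")
sample = {
    "TABLE1": ["p1", "p2", "p2", "p3"],
    "TABLE2": ["p2", "p3"],
    "TABLE3": ["p2", "p3", "p4"],
    "TABLE4": ["p2", "p5"],
}
for name, ids in sample.items():
    conn.execute(f"create table {name} (party_id varchar(45) not null)")
    conn.executemany(f"insert into {name} values (?)", [(i,) for i in ids])

all_four = ids_in_all(conn, list(sample))
print(all_four)
```

The select distinct in each branch matters: without it, the duplicate 'p2' row in TABLE1 would inflate the count and the having test would no longer mean "present in every table".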

If you want to return party_id values that are present in any table rather than all, then omit the final having clause.
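
The question also mentions mixed conditions such as SAMPLE1 AND SAMPLE2 OR SAMPLE3. One possible extension of the same idea, which goes beyond the query given above and is offered only as a sketch, is to tag each branch with a table number and filter on per-table membership flags. The example below uses Python's built-in sqlite3 module so it is runnable on its own; the helper name ids_matching, the flag names in1, in2, in3 and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def ids_matching(conn, tables, condition):
    """Return party_id values whose table membership satisfies `condition`.

    `condition` is a SQL boolean expression over in1, in2, ..., where inN
    is 1 when the id occurs in the N-th table and 0 otherwise.
    """
    # Table names and the condition are interpolated into the SQL text,
    # so only pass trusted, hard-coded values here.
    branches = " union all ".join(
        f"select distinct party_id, {n} as src from {t}"
        for n, t in enumerate(tables, start=1)
    )
    # max(src = N) is 1 if any row for this id came from table N.
    flags = ", ".join(
        f"max(src = {n}) as in{n}" for n in range(1, len(tables) + 1)
    )
    sql = (
        "select party_id from ("
        f"select party_id, {flags} from ({branches}) as ilv group by party_id"
        f") as membership where {condition} order by party_id"
    )
    return [row[0] for row in conn.execute(sql)]


# Hypothetical sample data for three tables.
conn = sqlite3.connect(":memory:")
sample = {
    "SAMPLE1": ["a", "b", "e"],
    "SAMPLE2": ["b", "c", "e"],
    "SAMPLE3": ["c", "d", "e"],
}
for name, ids in sample.items():
    conn.execute(f"create table {name} (party_id varchar(45) not null)")
    conn.executemany(f"insert into {name} values (?)", [(i,) for i in ids])

names = list(sample)
in_all = ids_matching(conn, names, "in1 and in2 and in3")
in_any = ids_matching(conn, names, "in1 or in2 or in3")
mixed = ids_matching(conn, names, "(in1 and in2) or in3")
print(in_all, in_any, mixed)
```

The parentheses in the mixed condition are written out explicitly; SQL gives AND higher precedence than OR, so "in1 and in2 or in3" would be read the same way, but whether that grouping is what the requirement intends is worth confirming.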

Source: https://stackoverflow.com/questions/32620437/partition-a-very-large-inner-join-sql-query

Tags

mysql

Hadoop

join

bigdata

spring-xd