MySQL huge tables JOIN makes database collapse

借酒劲吻你 2020-12-21 08:53

Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generating tables.

I have two tabl…

3 Answers
  • 2020-12-21 09:30

    You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.

    Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, max (or min), substring, cast. You may need to adjust the date convert function (this one works in MS-SQL), but this idea will work anywhere:

    SELECT customer,
           count(sale),
           max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
    FROM sales a
    GROUP BY customer
    

    Voilà. If you need more result columns, do:

    SELECT yourkey
                , maxval = left(val, N1)                  --you often won't need this
                , result1 = substring(val, N1+1, N2)
                , result2 = substring(val, N1+N2+1, N3)   --etc. for more values
    FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
                                   + cast(resultCol1 as char(N2))
                                   + cast(resultCol2 as char(N3)) )
           FROM yourtable GROUP BY yourkey ) t
    

    Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.

    More on this very common challenge here.
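    The trick above can be sketched end to end. This is a minimal, hypothetical example using Python's sqlite3 as a stand-in engine (the sample rows and column widths are made up): the datetime is stored as a sortable 19-character string, the sale is zero-padded to a fixed 12 characters so string MAX picks the latest date, and the value is sliced back out and cast.

    ```python
    import sqlite3

    # Hypothetical sample data; MS-SQL's convert/str become sqlite's
    # native date strings and printf here.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, dates TEXT, sale REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("alice", "2012-01-05 10:00:00", 100.0),
        ("alice", "2012-03-09 12:30:00", 250.0),   # latest for alice
        ("bob",   "2012-02-01 09:15:00", 80.0),
    ])

    # Concatenate the 19-char sortable datetime with a fixed-width sale,
    # take MAX of the string, then slice the sale back off the winner.
    rows = con.execute("""
        SELECT customer,
               COUNT(sale) AS n,
               CAST(substr(MAX(dates || printf('%012.2f', sale)), 20) AS REAL) AS max_sale
        FROM sales
        GROUP BY customer
        ORDER BY customer
    """).fetchall()
    print(rows)   # [('alice', 2, 250.0), ('bob', 1, 80.0)]
    ```

    Zero-padding (`%012.2f`) matters: without a fixed width, the string comparison could pick the wrong row when values have different lengths.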

  • 2020-12-21 09:37

    I think you should try adding an index on sales(customer, dates). The subquery is probably the performance bottleneck.

  • 2020-12-21 09:43

    300k rows is not a huge table. We frequently see 300 million row tables.

    The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.

    It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:

    • Easier to code.
    • Easier to optimize.
    • Easier to debug.
    • Easier to read.
    • Easier to maintain if/when you have to implement new requirements.

    Number of Purchases

    SELECT customer, COUNT(sale) AS number_of_purchases
    FROM sales 
    GROUP BY customer;
    

    An index on sales(customer,sale) would be best for this query.
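    As a quick sanity check, here is the same aggregation run on toy data via Python's sqlite3 (the rows and index name are made up for illustration):

    ```python
    import sqlite3

    # Toy table: three sales across two customers.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, sale REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("alice", 10.0), ("alice", 20.0), ("bob", 5.0)])
    # Covering index as suggested above (hypothetical name).
    con.execute("CREATE INDEX idx_customer_sale ON sales(customer, sale)")

    rows = con.execute("""
        SELECT customer, COUNT(sale) AS number_of_purchases
        FROM sales
        GROUP BY customer
        ORDER BY customer
    """).fetchall()
    print(rows)   # [('alice', 2), ('bob', 1)]
    ```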

    Last Purchase Value

    This is the greatest-n-per-group problem that comes up frequently.

    SELECT a.customer, a.sale as max_sale
    FROM sales a
    LEFT OUTER JOIN sales b
     ON a.customer=b.customer AND a.dates < b.dates
    WHERE b.customer IS NULL;
    

    In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.

    An index on sales(customer,dates,sale) would be best for this query.

    If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.

    SELECT a.customer, a.sale as max_sale
    FROM sales a
    LEFT OUTER JOIN sales b
     ON a.customer=b.customer AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
    WHERE b.customer IS NULL;
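
    The anti-join with the id tie-breaker can be exercised on a small sample. This sketch uses Python's sqlite3 with made-up rows, including two sales on the same date so the auto-increment id decides the winner:

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE sales (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer TEXT, dates TEXT, sale REAL)""")
    con.executemany("INSERT INTO sales (customer, dates, sale) VALUES (?, ?, ?)", [
        ("alice", "2012-01-05", 100.0),
        ("alice", "2012-03-09", 250.0),
        ("alice", "2012-03-09", 300.0),   # same date, higher id wins
        ("bob",   "2012-02-01", 80.0),
    ])

    # Row a survives only if no row b exists with a later date, or the
    # same date and a larger id.
    rows = con.execute("""
        SELECT a.customer, a.sale AS max_sale
        FROM sales a
        LEFT OUTER JOIN sales b
          ON a.customer = b.customer
         AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
        WHERE b.customer IS NULL
        ORDER BY a.customer
    """).fetchall()
    print(rows)   # [('alice', 300.0), ('bob', 80.0)]
    ```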
    

    Total Amount of Purchases, When It Has a Positive Value

    SELECT customer, SUM(sale) AS total_purchases
    FROM sales
    WHERE sale > 0
    GROUP BY customer;
    

    An index on sales(customer,sale) would be best for this query.

    You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
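    The NULL-skipping behavior is easy to verify. A minimal sketch with invented rows, again using sqlite3 (the same semantics hold in MySQL):

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, sale REAL)")
    # One missing sale stored as NULL instead of a -1 sentinel.
    con.executemany("INSERT INTO sales VALUES (?, ?)", [
        ("alice", 100.0), ("alice", None), ("alice", 50.0),
    ])

    # SUM() and COUNT(col) silently skip the NULL row: no WHERE needed.
    total, n = con.execute(
        "SELECT SUM(sale), COUNT(sale) FROM sales WHERE customer = 'alice'"
    ).fetchone()
    print(total, n)   # 150.0 2
    ```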


    Re: your comment

    What I have now is a table with fields year, quarter, total_sale (for the pair (year, quarter)) and sale. What I want to gather is information for a certain period: this quarter, other quarters, the year 2011... The info has to be split into top customers, the ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?

    Top Five Customers for Q4 2012

    SELECT customer, SUM(sale) AS total_purchases
    FROM sales
    WHERE (year, quarter) = (2012, 4) AND sale > 0
    GROUP BY customer
    ORDER BY total_purchases DESC
    LIMIT 5;
    

    I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
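    A runnable sketch of this top-N aggregation, with invented rows, using sqlite3; MySQL's row comparison `(year, quarter) = (2012, 4)` is spelled out as two equality tests for portability:

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (customer TEXT, year INT, quarter INT, sale REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("alice", 2012, 4, 100.0),
        ("alice", 2012, 4, 200.0),
        ("bob",   2012, 4, 150.0),
        ("carol", 2012, 3, 999.0),   # wrong quarter, excluded
        ("dave",  2012, 4,  -1.0),   # non-positive sentinel, excluded
    ])

    rows = con.execute("""
        SELECT customer, SUM(sale) AS total_purchases
        FROM sales
        WHERE year = 2012 AND quarter = 4 AND sale > 0
        GROUP BY customer
        ORDER BY total_purchases DESC
        LIMIT 5
    """).fetchall()
    print(rows)   # [('alice', 300.0), ('bob', 150.0)]
    ```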

    Last Purchase for Customers with Total Purchases > 5

    SELECT a.customer, a.sale as max_sale
    FROM sales a
    INNER JOIN sales c ON a.customer=c.customer
    LEFT OUTER JOIN sales b
     ON a.customer=b.customer AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
    WHERE b.customer IS NULL
    GROUP BY a.id
    HAVING COUNT(*) > 5;
    

    As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
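    To see the HAVING filter in action, here is the combined query on made-up data (one customer with six sales, one with two, run through sqlite3). Note COUNT(*) here counts the rows joined via c, i.e. the customer's number of sales:

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE sales (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer TEXT, dates TEXT, sale REAL)""")
    data = [("alice", f"2012-01-0{i}", float(i * 10)) for i in range(1, 7)]  # 6 sales
    data += [("bob", "2012-01-01", 5.0), ("bob", "2012-01-02", 7.0)]        # only 2
    con.executemany("INSERT INTO sales (customer, dates, sale) VALUES (?, ?, ?)", data)

    # The anti-join keeps each customer's last sale; the INNER JOIN on c
    # multiplies it by the customer's sale count, which HAVING then tests.
    out = con.execute("""
        SELECT a.customer, a.sale AS max_sale
        FROM sales a
        INNER JOIN sales c ON a.customer = c.customer
        LEFT OUTER JOIN sales b
          ON a.customer = b.customer
         AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
        WHERE b.customer IS NULL
        GROUP BY a.id
        HAVING COUNT(*) > 5
    """).fetchall()
    print(out)   # [('alice', 60.0)]
    ```

    Only alice clears the six-sale threshold, so only her last sale is returned.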


    These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:

    Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
