Getting data for histogram plot

花落未央 2020-11-29 15:27

Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

select total, count(total) from faults GROUP BY total;
10 Answers
  • 2020-11-29 16:03

    This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.

    There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise. This is how you do it:

    SELECT ROUND(numeric_value, -2)    AS bucket,
           COUNT(*)                    AS COUNT,
           RPAD('', LN(COUNT(*)), '*') AS bar
    FROM   my_table
    GROUP  BY bucket;
    

    Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've put the bars on a logarithmic scale so that they don't grow too long when the counts are large.
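As a rough illustration of the logarithmic bar, here is a small Python sketch of what RPAD('', LN(COUNT(*)), '*') amounts to. It assumes the fractional length is rounded to the nearest integer, which matches the bar lengths in the sample output further down in this answer. The function name log_bar is made up for this example.

```python
import math

def log_bar(count: int) -> str:
    """Return a bar of '*' whose length is ln(count) rounded to the nearest integer."""
    # Assumption: MySQL rounds the non-integer length argument passed to RPAD.
    return "*" * round(math.log(count))

# A few counts taken from the sample output in this answer.
for count in (1, 52, 1865, 20779, 5310766):
    print(f"{count:>8} | {log_bar(count)}")
```

A count of 1 gives an empty bar because ln(1) = 0, which is why the first row of the sample output has no stars.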

    numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.

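    For example, with ROUND(numeric_value, -1) and integer values, the range [0,4] (5 values) lands in the first bucket, while [5,14] (10 values) lands in the second and [15,24] in the third. Offsetting the value, as in ROUND(numeric_value - 5, -1), moves the bucket edges to multiples of 10 so that the buckets are (nearly) equally wide; a value sitting exactly on an edge is still subject to ROUND's tie-breaking. FLOOR(numeric_value / 10) * 10 is another way to get buckets that start exactly at multiples of 10.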
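To see the uneven first bucket concretely, the following Python sketch counts how the integers 0 to 29 fall into buckets of width 10 under rounding and under flooring. It is only an emulation: round_bucket assumes that ROUND breaks ties away from zero (as MySQL does for exact-value numbers), and both function names and the sample range are invented for the example.

```python
from collections import Counter

def round_bucket(x: int, step: int = 10) -> int:
    """Round an integer to the nearest multiple of step, ties away from zero."""
    magnitude = (2 * abs(x) + step) // (2 * step) * step
    return magnitude if x >= 0 else -magnitude

def floor_bucket(x: int, step: int = 10) -> int:
    """Map x to the lower edge of its bucket, like FLOOR(x / step) * step."""
    return x // step * step

values = range(0, 30)  # made-up sample: the integers 0..29
round_counts = Counter(round_bucket(v) for v in values)
floor_counts = Counter(floor_bucket(v) for v in values)

print(dict(round_counts))  # {0: 5, 10: 10, 20: 10, 30: 5}
print(dict(floor_counts))  # {0: 10, 10: 10, 20: 10}
```

With rounding, the buckets at both ends of the data are only half as wide as the others; with flooring, every bucket covers ten values.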

    This is an example of such a query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.

    +--------+----------+-----------------+
    | bucket | count    | bar             |
    +--------+----------+-----------------+
    |   -500 |        1 |                 |
    |   -400 |        2 | *               |
    |   -300 |        2 | *               |
    |   -200 |        9 | **              |
    |   -100 |       52 | ****            |
    |      0 |  5310766 | *************** |
    |    100 |    20779 | **********      |
    |    200 |     1865 | ********        |
    |    300 |      527 | ******          |
    |    400 |      170 | *****           |
    |    500 |       79 | ****            |
    |    600 |       63 | ****            |
    |    700 |       35 | ****            |
    |    800 |       14 | ***             |
    |    900 |       15 | ***             |
    |   1000 |        6 | **              |
    |   1100 |        7 | **              |
    |   1200 |        8 | **              |
    |   1300 |        5 | **              |
    |   1400 |        2 | *               |
    |   1500 |        4 | *               |
    +--------+----------+-----------------+
    

    Some notes: Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

    I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html

  • 2020-11-29 16:03

    I made a procedure that can be used to automatically generate a temporary table for bins according to a specified number or size, for later use with Ofri Raviv's solution.

    # In the mysql client, switch the delimiter so the semicolons inside the body do not end the statement.
    DELIMITER //
    CREATE PROCEDURE makebins(numbins INT, binsize FLOAT) # binsize may be NULL for auto-size
    BEGIN
     SELECT FLOOR(MIN(colval)) INTO @binmin FROM yourtable;
     SELECT CEIL(MAX(colval)) INTO @binmax FROM yourtable;
     IF binsize IS NULL 
      THEN SET binsize = CEIL((@binmax-@binmin)/numbins); # CEIL here may prevent the creation of a very small extra bin due to rounding errors, but it is no good where float bin sizes are needed.
     END IF;
     SET @currlim = @binmin;
     WHILE @currlim + binsize < @binmax DO
      INSERT INTO bins VALUES (@currlim, @currlim+binsize);
      SET @currlim = @currlim + binsize;
     END WHILE;
     INSERT INTO bins VALUES (@currlim, @binmax); # the last bin ends at the overall maximum
    END //
    DELIMITER ;
    
    DROP TABLE IF EXISTS bins; # be careful if you have a bins table of your own.
    CREATE TEMPORARY TABLE bins (
    minval INT, maxval INT, # or FLOAT, if needed
    KEY (minval), KEY (maxval) );# keys could perhaps help if using a lot of bins; normally negligible
    
    CALL makebins(20, NULL);  # Using 20 bins of automatic size here. 
    
    SELECT bins.*, count(*) AS total FROM bins
    LEFT JOIN yourtable ON yourtable.colval BETWEEN bins.minval AND bins.maxval
    GROUP BY bins.minval;
    

    This will generate the histogram count only for the bins that are populated. David West ought to be right in his correction, but for some reason unpopulated bins do not appear in the result for me, despite the use of a LEFT JOIN; I do not understand why.
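The bin boundaries that the makebins procedure produces can be previewed outside the database. The Python sketch below follows the same steps (floor of the minimum, ceiling of the maximum, automatic size via CEIL, then a loop). The function make_bins and the sample values are assumptions made for illustration, not part of the answer above.

```python
import math

def make_bins(values, numbins, binsize=None):
    """Return (minval, maxval) pairs following the same steps as makebins."""
    binmin = math.floor(min(values))
    binmax = math.ceil(max(values))
    if binsize is None:
        # Automatic size, rounded up as in the procedure.
        binsize = math.ceil((binmax - binmin) / numbins)
    bins = []
    currlim = binmin
    while currlim + binsize < binmax:
        bins.append((currlim, currlim + binsize))
        currlim += binsize
    bins.append((currlim, binmax))  # the last bin ends at the overall maximum
    return bins

print(make_bins([3, 18, 47], numbins=5))
# [(3, 12), (12, 21), (21, 30), (30, 39), (39, 47)]
```

Neighbouring bins share an edge (12, 21, and so on), so with an inclusive BETWEEN join a value that sits exactly on an edge matches two bins.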

  • 2020-11-29 16:03
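    SELECT
        CASE
            WHEN total <= 30 THEN '0-30'
            WHEN total <= 40 THEN '31-40'
            WHEN total <= 50 THEN '41-50'
            ELSE '51+'
        END AS total_range,   -- alias differs from the column name so GROUP BY uses the CASE result
        COUNT(*) AS count
    FROM faults
    GROUP BY total_range
    ORDER BY total_range;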
    
  • 2020-11-29 16:07

    Ofri Raviv's answer is very close but incorrect. The count(*) will be 1 even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum:

    SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
      LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
    GROUP BY b.min_value;
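The difference between COUNT(*) and the conditional sum is easy to reproduce on a toy data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, so it only demonstrates the LEFT JOIN behaviour and not MySQL itself. The table and column names follow the query above; the three bins and three values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bins (min_value INTEGER, max_value INTEGER);
    CREATE TABLE a (value INTEGER);
    INSERT INTO bins VALUES (0, 9), (10, 19), (20, 29);
    INSERT INTO a VALUES (1), (2), (25);  -- nothing falls into 10..19
""")

rows = conn.execute("""
    SELECT b.min_value,
           COUNT(*)                 AS count_star,
           SUM(a.value IS NOT NULL) AS conditional_sum
    FROM bins b
    LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
    GROUP BY b.min_value
    ORDER BY b.min_value
""").fetchall()

for row in rows:
    print(row)
```

For the empty bin 10..19 the LEFT JOIN still yields one row with a NULL value, so COUNT(*) reports 1 while the conditional sum reports 0.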
    
  • 2020-11-29 16:09
    SELECT b.*,count(*) as total FROM bins b 
    left outer join table1 a on a.value between b.min_value and b.max_value 
    group by b.min_value
    

    The table bins contains the columns min_value and max_value, which define the bins. Note that BETWEEN in "JOIN ... ON x BETWEEN y AND z" is inclusive at both ends, so a value equal to a boundary shared by two bins is counted in both.

    table1 is the name of the data table

  • 2020-11-29 16:13

    In addition to the great answer https://stackoverflow.com/a/10363145/916682, you can use the phpMyAdmin chart tool to get a nice result.
