Performance of Delta E (CIE Lab) calculation and sorting in SQL

闹比i 2021-02-04 19:37

I have a database table where each row is a color. My goal: given an input color, calculate its distance to each color in the DB table, and sort the results by that distance. Or

1 answer
  •  伪装坚强ぢ
    2021-02-04 20:27

    Two things: 1) you are not using the database to its full extent and 2) your problem is a great example for a custom PostgreSQL extension. Here's why.

    You are only using the database as storage, storing colors as floats. In your current configuration, regardless of the type of query, the database always has to check every row (perform a sequential scan). This means a lot of I/O and a lot of calculation for only a few returned matches. You are trying to find the nearest N colors, so there are a few ways to avoid performing the calculation on all of the data.

    Simple improvement

    The simplest is to limit your calculations to a smaller subset of the data. You can assume the difference will be greater if the components differ more. If you can find a safe per-component difference beyond which the results are always inappropriate, you can exclude those colors altogether using a range condition in WHERE backed by btree indexes. However, due to the nature of the L*a*b* color space, this will likely worsen your results.

    First create the indexes:

    CREATE INDEX color_lab_l_btree ON color USING btree (lab_l);
    CREATE INDEX color_lab_a_btree ON color USING btree (lab_a);
    CREATE INDEX color_lab_b_btree ON color USING btree (lab_b);
    

    Then I adapted your query to include a WHERE clause that keeps only colors whose components each differ from the input by at most 20.

    Update: After another look, a threshold of 20 will very likely worsen the results, since I found at least one point in the color space for which that is the case:

    SELECT 
        c.rgb_r, c.rgb_g, c.rgb_b,
        DELTA_E_CIE2000(
            25.805780252087963, 53.33446637366859, -45.03961353720049, 
            c.lab_l, c.lab_a, c.lab_b,
            1.0, 1.0, 1.0) AS de2000
    FROM color c 
    WHERE 
        c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20 
        AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20 
        AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20 
    ORDER BY de2000 ;
    

    I filled the table with 100,000 random colors using your script and tested:

    Time without indexes: 44006.851 ms

    Time with indexes and range query: 1293.092 ms
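
    To confirm that the planner really uses the new indexes rather than falling back to a sequential scan, the plan can be inspected directly. A minimal sketch, assuming the same color table and column names as in the query above; the output depends on the data and the PostgreSQL version, so none is shown here:

    ```sql
    -- Refresh planner statistics after bulk-loading the random colors.
    ANALYZE color;

    -- Show the chosen plan together with actual timings and buffer usage.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT c.rgb_r, c.rgb_g, c.rgb_b
    FROM color c
    WHERE
        c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20
        AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20
        AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20;
    ```

    An index scan or bitmap index scan on one or more of the color_lab_*_btree indexes means the range filter is doing its job. A "Seq Scan on color" usually means the window is too wide to be selective.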

    You can add this WHERE clause to delta_e_cie1976_query too; on my random data it drops the query time from ~110 ms to ~22 ms.
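
    For Delta E CIE 1976 the range filter is also easier to reason about, because that metric is the plain Euclidean distance in L*a*b*: a color with any component more than 20 away is necessarily more than 20 away overall, so the filter cannot drop a match whose distance is 20 or less. A sketch with the distance written inline, since the exact definition of delta_e_cie1976_query is not shown here; the LIMIT 100 is an assumption taken from the limit mentioned in the question:

    ```sql
    SELECT
        c.rgb_r, c.rgb_g, c.rgb_b,
        -- Delta E CIE 1976 = Euclidean distance between the two L*a*b* points.
        sqrt(
            power(c.lab_l - 25.805780252087963, 2)
            + power(c.lab_a - 53.33446637366859, 2)
            + power(c.lab_b - (-45.03961353720049), 2)
        ) AS de1976
    FROM color c
    WHERE
        c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20
        AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20
        AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20
    ORDER BY de1976
    LIMIT 100;
    ```

    If fewer than 100 rows come back, or the last returned de1976 is greater than 20, the window was too narrow for an exact top 100 and should be widened. This guarantee applies to CIE 1976 only, not to CIE 2000.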

    BTW: I got the number 20 empirically: I tried with 10, but got only 380 records, which seems a little low and might exclude some better options since the limit is 100. With 20 the full set was 2900 rows and one can be fairly sure that the closest matches will be there. I didn't study the DELTA_E_CIE2000 or L*a*b* color space in detail so the threshold may need adjustment along different components for that to actually be true, but the principle of excluding non-interesting data holds.

    Rewrite Delta E CIE 2000 in C

    As you've already said, Delta E CIE 2000 is complex and fairly unsuitable for implementing in SQL. It currently takes about 0.4 ms per call on my laptop. Implementing it in C should speed this up considerably. PostgreSQL assigns a default cost of 100 to SQL functions and 1 to C functions. I'm guessing this is based on real experience.
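
    The language and planner cost recorded for a function can be read from the system catalogs. A small sketch, assuming the function was created with an unquoted name, so that it is stored in lowercase as delta_e_cie2000:

    ```sql
    -- procost is the planner's estimate, in units of cpu_operator_cost.
    SELECT p.proname, l.lanname, p.procost
    FROM pg_proc p
    JOIN pg_language l ON l.oid = p.prolang
    WHERE p.proname = 'delta_e_cie2000';
    ```

    Note that procost only influences planning. It does not change how long a call actually takes, so the real gain still has to come from the C implementation itself.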

    Update: Since this also scratches one of my itches, I reimplemented the Delta E functions from the colormath module in C as a PostgreSQL extension, available on PGXN. With this I see a speedup of about 150x for CIE2000 when querying every row of the 100k-row table.

    With this C function, I get query times between 147 ms and 160 ms for 100k colors. With the extra WHERE clause, query time is about 20 ms, which seems quite acceptable to me.

    Best, but advanced solution

    However, since your problem is an N-nearest-neighbor search in 3-dimensional space, you could use K-Nearest-Neighbor indexing, which has been available in PostgreSQL since version 9.1.

    For that to work, you'd put the L*a*b* components into a cube. Before PostgreSQL 9.6 the cube extension has no distance operator; from 9.6 on it provides <-> (Euclidean distance) and a few others, but none of them is Delta E CIE 2000, so you would need to implement that distance yourself as a C extension.

    This means implementing a GiST index operator class (the btree_gist extension in PostgreSQL contrib does this) to support indexing according to Delta E distances. The good part is that you could then use different operators for different versions of Delta E, e.g. <-> for Delta E CIE 2000 and <#> for Delta E CIE 1976, and queries would be very fast for a small LIMIT even with Delta E CIE 2000.
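
    For the CIE 1976 case specifically, no custom operator class is needed on PostgreSQL 9.6 or later, because the cube extension's own <-> operator is the Euclidean distance and can be served by a GiST index. A minimal sketch, assuming lab_l, lab_a and lab_b are double precision columns; the index name color_lab_cube_gist is made up for illustration:

    ```sql
    CREATE EXTENSION IF NOT EXISTS cube;

    -- Expression index: pack the three components into a zero-volume 3-D cube (a point).
    CREATE INDEX color_lab_cube_gist
        ON color USING gist ((cube(ARRAY[lab_l, lab_a, lab_b])));

    -- K-nearest-neighbor query: ordering by <-> lets the GiST index return
    -- the closest rows first, so only about LIMIT rows need to be visited.
    SELECT
        c.rgb_r, c.rgb_g, c.rgb_b,
        cube(ARRAY[c.lab_l, c.lab_a, c.lab_b])
            <-> cube(ARRAY[25.805780252087963, 53.33446637366859, -45.03961353720049]::float8[])
            AS de1976
    FROM color c
    ORDER BY de1976
    LIMIT 100;
    ```

    This does not give Delta E CIE 2000 ordering. One possible compromise, which is an approximation and not something measured here, is to fetch a larger candidate set this way and re-rank it with the CIE 2000 function.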

    In the end it may depend on what your (business) requirements and constraints are.
