Why is Solr so much faster than Postgres?

故里飘歌 2021-01-29 18:27

I recently switched from Postgres to Solr and saw a ~50x speedup in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Fi

5 Answers
  •  一个人的身影
    2021-01-29 18:52

    You didn't say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speedup on a PostgreSQL query through tuning and/or restating the query in a form that the planner optimizes better.

    Just this week at work there was a report that someone had written using Java and multiple queries; based on how far it had gotten in four hours, it was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speedup.

    Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.

    I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just loading War and Peace in text form into a test database, with each "document" being a single line of text. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits of using indexes tend to be much bigger.

    -- Create the table.
    -- In reality, I would probably make tsv NOT NULL,
    -- but I'm keeping the example simple...
    CREATE TABLE war_and_peace
      (
        lineno serial PRIMARY KEY,
        linetext text NOT NULL,
        tsv tsvector
      );
    
    -- Load from downloaded data into database.
    COPY war_and_peace (linetext)
      FROM '/home/kgrittn/Downloads/war-and-peace.txt';
    
    -- "Digest" data to lexemes.
    UPDATE war_and_peace
      SET tsv = to_tsvector('english', linetext);
    
    -- Index the lexemes using GiST.
    -- To use GIN just replace "gist" below with "gin".
    CREATE INDEX war_and_peace_tsv
      ON war_and_peace
      USING gist (tsv);
    
    -- Make sure the database has statistics.
    VACUUM ANALYZE war_and_peace;
    

    Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:

    -- Find lines with "gentlemen".
    EXPLAIN ANALYZE
    SELECT * FROM war_and_peace
      WHERE tsv @@ to_tsquery('english', 'gentlemen');
    

    84 rows, gist: 2.006 ms, gin: 0.194 ms

    -- Find lines with "ladies".
    EXPLAIN ANALYZE
    SELECT * FROM war_and_peace
      WHERE tsv @@ to_tsquery('english', 'ladies');
    

    184 rows, gist: 3.549 ms, gin: 0.328 ms

    -- Find lines with "ladies" and "gentlemen".
    EXPLAIN ANALYZE
    SELECT * FROM war_and_peace
      WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
    

    1 row, gist: 0.971 ms, gin: 0.104 ms

    Now, since the GIN index was about 10 times faster than the GiST index, you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile, the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or have a read-mostly workload.
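
    Switching between the two index types for a comparison like this comes down to dropping and recreating the index. A minimal sketch, reusing the table and index names from the setup above (the extra ANALYZE is a precaution, not something the timings above are known to depend on):

    -- Replace the GiST index with a GIN index on the same column.
    DROP INDEX war_and_peace_tsv;
    CREATE INDEX war_and_peace_tsv
      ON war_and_peace
      USING gin (tsv);

    -- Refresh planner statistics after rebuilding the index.
    ANALYZE war_and_peace;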

    Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
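
    One way to reproduce a no-index baseline without actually dropping the index is to discourage index plans for the current session. This is only a sketch of that approach; the timings quoted above may well have been taken by simply running the queries before the index existed.

    -- Session-local planner settings: steer the planner away from
    -- index and bitmap scans so it falls back to a sequential scan.
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;

    EXPLAIN ANALYZE
    SELECT * FROM war_and_peace
      WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');

    -- Restore the default planner behavior afterwards.
    RESET enable_indexscan;
    RESET enable_bitmapscan;

    The plan output should show a Seq Scan node on war_and_peace if the settings took effect.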

    The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.

    The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
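
    As a sketch of what such a trigger might look like, PostgreSQL's built-in tsvector_update_trigger function can keep tsv in step with linetext. The trigger name war_and_peace_tsv_update is made up for this example, and the text search configuration name must be schema-qualified when passed to this function:

    -- Recompute tsv from linetext whenever a row is inserted or updated.
    CREATE TRIGGER war_and_peace_tsv_update
      BEFORE INSERT OR UPDATE ON war_and_peace
      FOR EACH ROW
      EXECUTE PROCEDURE
        tsvector_update_trigger(tsv, 'pg_catalog.english', linetext);

    -- A newly inserted line is searchable right away,
    -- with no separate UPDATE of the tsv column.
    INSERT INTO war_and_peace (linetext)
      VALUES ('The ladies and gentlemen had already left.');

    On PostgreSQL 11 and later, EXECUTE FUNCTION is the preferred spelling; EXECUTE PROCEDURE is still accepted.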

    With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
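
    Gathering that information might look like the following, using the example table from this answer as a stand-in for your own table and query. The \d line is a psql meta-command rather than SQL, and the BUFFERS option is an optional extra that is not required for a useful report:

    -- In psql: show the table definition, including its indexes.
    \d war_and_peace

    -- Run the problem query and report actual timings and buffer usage.
    -- Note that EXPLAIN ANALYZE really executes the statement.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM war_and_peace
      WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');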


    UPDATE (Dec 9 '16)

    I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The single-term queries improved by only about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed, to 0.053 ms (i.e., 53 microseconds), when using the GIN (inverted) index.
