Fast fuzzy/approximate search in dictionary of strings in Ruby

渐次进展 2021-02-08 23:55

I have a dictionary of 50K to 100K strings (can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary with some "edit" distance toleranc

4 answers
  •  孤独总比滥情好
    2021-02-09 00:41

    About 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and the modification is named the "Wilbur-Khovayko algorithm".

    Basic idea: split the strings into trigrams and search for the maximal intersection scores.

    For example, take the string "hello world". It generates the trigrams hel, ell, llo, "lo ", "o_w", and so on. It also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.
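The splitting step can be sketched in Ruby roughly as follows. This is a minimal sketch, not the original C source: the method name `trigrams` is hypothetical, the input is lowercased as an assumption, and the space between words is kept as a literal space (the answer writes it as "_" in "o_w").

```ruby
# Hypothetical helper: split a string into character trigrams and add
# "$"-marked prefix/suffix trigrams for every word.
def trigrams(text)
  s = text.downcase # assumption: matching is case-insensitive
  grams = []

  # Sliding window of 3 characters over the whole string,
  # including windows that span the space between words.
  s.each_char.each_cons(3) { |chars| grams << chars.join }

  # Boundary trigrams: "$" + first two letters, last two letters + "$".
  s.split.each do |word|
    next if word.length < 2
    grams << "$" + word[0, 2]
    grams << word[-2, 2] + "$"
  end

  grams.uniq
end

puts trigrams("hello world").inspect
```

The boundary trigrams give extra weight to matches at the start and end of a word, which plain sliding windows would not distinguish from matches in the middle.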

    Then an index is built that records, for each trigram, the terms in which it is present.

    So each trigram has a list of term IDs.

    When the user submits a query string, it is also split into trigrams; the program searches for the maximal intersection scores and generates a list of the N best matches.
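The index and the lookup can be sketched in Ruby as below. This is an illustration under stated assumptions, not the Wilbur-Khovayko implementation: `trigrams`, `build_index`, and `search` are hypothetical names, and the score here is simply the number of shared trigrams, whereas the original program evidently uses a weighted score (its output shows fractional values).

```ruby
# Hypothetical splitter: character trigrams plus "$"-marked word
# prefix/suffix trigrams. Input is lowercased as an assumption.
def trigrams(text)
  s = text.downcase
  grams = s.each_char.each_cons(3).map(&:join)
  s.split.each do |word|
    next if word.length < 2
    grams << "$" + word[0, 2]
    grams << word[-2, 2] + "$"
  end
  grams.uniq
end

# Inverted index: trigram => list of IDs of the terms that contain it.
def build_index(terms)
  index = Hash.new { |hash, key| hash[key] = [] }
  terms.each_with_index do |term, id|
    trigrams(term).each { |gram| index[gram] << id }
  end
  index
end

# Score every candidate by the number of trigrams it shares with the
# query, then return the n best as [term, score] pairs.
def search(index, terms, query, n)
  scores = Hash.new(0)
  trigrams(query).each do |gram|
    next unless index.key?(gram) # do not create empty entries
    index[gram].each { |id| scores[id] += 1 }
  end
  scores.sort_by { |id, score| [-score, id] }
        .first(n)
        .map { |id, score| [terms[id], score] }
end

terms = %w[booking bowling cooking hosting]
index = build_index(terms)
results = search(index, terms, "boking", 2)
puts results.inspect
```

With this toy list, the misspelled query "boking" ranks "booking" first, because it shares the most trigrams with it. Only terms that share at least one trigram with the query are ever touched, which is what keeps the lookup fast on a large dictionary.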

    It works quickly: as I remember, on an old Sun/Solaris machine with 256 MB of RAM and a 200 MHz CPU, it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.

    You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz

    UPDATE:

    I created a new archive in which the Makefile is adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz

    Create a directory and extract the archive there:

    mkdir F2
    cd F2
    tar xvfz wilbur-khovayko.tgz
    make
    

    Go to the test directory, copy the term list file there (the name is fixed: termlist.txt), and build the index:

     cd test/
     cp /tmp/test/termlist.txt ./termlist.txt
     ./crefdb.exe 

    In this test, I used ~380,000 expired domain names:

    wc -l termlist.txt
    379430 termlist.txt
    

    Run the findtest application:

    ./findtest.exe
    
    boking  <-- this is the query: the word "booking" with a misspelling
    
    
    0001:Query: [boking]
      1:  287890 (  3.863739) [bokintheusa.com,2009-11-20,$69]
      2:  287906 (  3.569148) [bookingseu.com,2009-11-20,$69]
      3:  257170 (  3.565942) [bokitko.com,2009-11-18,$69]
      4:  302830 (  3.413791) [bookingcenters.com,2009-11-21,$69]
      5:  274658 (  3.408325) [bookingsadept.com,2009-11-19,$69]
      6:  100438 (  3.379371) [bookingresorts.com,2009-11-09,$69]
      7:  203401 (  3.363858) [bookinginternet.com,2009-11-15,$69]
      8:  221222 (  3.361689) [bobokiosk.com,2009-11-16,$69]
      . . . . 
     97:   29035 (  2.169753) [buccupbooking.com,2009-11-05,$69]
     98:  185692 (  2.169047) [box-hosting.net,2009-11-14,$69]
     99:  345394 (  2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
    100:  150134 (  2.167372) [bowlingbrain.com,2009-11-12,$69]
    
