I have a dictionary of 50K to 100K strings (can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary with some "edit" distance tolerance.
About 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, which I named the "Wilbur-Khovayko algorithm".
Basic idea: split the strings into trigrams and search for the maximal intersection scores.
For example, say we have the string "hello world". This string generates the trigrams "hel", "ell", "llo", "lo_", "o_w", and so on (here "_" stands for the space). It also produces special prefix/suffix trigrams for each word, like "$he", "$wo", "lo$", "ld$".
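The trigram splitting described above can be sketched in Python as follows. This is only an illustration, not the code from the archive: the function name `trigrams`, the lowercasing, and the use of "_" for spaces and "$" as the word-boundary marker are assumptions based on the example in the text.

```python
def trigrams(text):
    """Return the set of trigrams for a string (illustrative sketch).

    Assumptions: input is lowercased, runs of whitespace become a single
    '_', and every word also contributes a '$'-marked prefix and suffix.
    """
    words = text.lower().split()
    normalized = "_".join(words)

    # Sliding window of length 3 over the whole normalized string.
    grams = {normalized[i:i + 3] for i in range(len(normalized) - 2)}

    # Special prefix/suffix trigrams for each word, e.g. "$he" and "lo$".
    for word in words:
        grams.add("$" + word[:2])
        grams.add(word[-2:] + "$")
    return grams


print(sorted(trigrams("hello world")))
```

For "hello world" the resulting set contains the trigrams listed above, plus the remaining sliding-window ones such as "_wo", "wor", "orl" and "rld".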
Then an index is built that records, for each trigram, which terms it appears in.
So the index is a list of term_IDs for each trigram.
When the user submits a query string, it is also split into trigrams; the program searches for the maximal intersection scores and produces a list of the N best matches.
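A minimal sketch of the index and the lookup might look like the code below. It is not the original implementation: the names `build_index` and `search` are hypothetical, and the score here is simply the number of trigrams shared with the query, which is an assumption (the scores printed by the real program are fractional, so it evidently uses some other weighting).

```python
from collections import defaultdict
import heapq


def trigrams(text):
    """Split a string into trigrams; '_' replaces spaces, '$' marks word edges."""
    words = text.lower().split()
    normalized = "_".join(words)
    grams = {normalized[i:i + 3] for i in range(len(normalized) - 2)}
    for word in words:
        grams.add("$" + word[:2])
        grams.add(word[-2:] + "$")
    return grams


def build_index(terms):
    """Map each trigram to the list of term IDs (positions in `terms`) containing it."""
    index = defaultdict(list)
    for term_id, term in enumerate(terms):
        for gram in trigrams(term):
            index[gram].append(term_id)
    return index


def search(index, terms, query, n=5):
    """Return the n terms sharing the most trigrams with the query."""
    scores = defaultdict(int)
    for gram in trigrams(query):
        for term_id in index.get(gram, ()):
            scores[term_id] += 1  # assumed scoring: plain intersection count
    # Highest score first; ties are broken in favor of the lower term ID.
    best = heapq.nlargest(n, scores.items(), key=lambda item: (item[1], -item[0]))
    return [(terms[term_id], score) for term_id, score in best]


terms = ["booking", "cooking", "bowling", "hosting"]
index = build_index(terms)
print(search(index, terms, "boking", n=3))
```

Only terms that share at least one trigram with the query are ever touched, which is why this kind of lookup stays fast even on a large dictionary.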
It is fast: as I recall, on an old Sun/Solaris machine with 256 MB of RAM and a 200 MHz CPU, it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz
UPDATE:
I created a new archive in which the Makefile is adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz
Create a directory and extract the archive into it:
mkdir F2
cd F2
tar xvfz wilbur-khovayko.tgz
make
Go to the test directory, copy your term list file there (the file name is fixed: termlist.txt), and build the index:
cd test/
cp /tmp/test/termlist.txt ./termlist.txt
./crefdb.exe
In this test, I used ~380,000 expired domain names:
wc -l termlist.txt
379430 termlist.txt
Run the findtest application:
./findtest.exe
boking <-- this is the query: the word "booking", misspelled
0001:Query: [boking]
1: 287890 ( 3.863739) [bokintheusa.com,2009-11-20,$69]
2: 287906 ( 3.569148) [bookingseu.com,2009-11-20,$69]
3: 257170 ( 3.565942) [bokitko.com,2009-11-18,$69]
4: 302830 ( 3.413791) [bookingcenters.com,2009-11-21,$69]
5: 274658 ( 3.408325) [bookingsadept.com,2009-11-19,$69]
6: 100438 ( 3.379371) [bookingresorts.com,2009-11-09,$69]
7: 203401 ( 3.363858) [bookinginternet.com,2009-11-15,$69]
8: 221222 ( 3.361689) [bobokiosk.com,2009-11-16,$69]
. . . .
97: 29035 ( 2.169753) [buccupbooking.com,2009-11-05,$69]
98: 185692 ( 2.169047) [box-hosting.net,2009-11-14,$69]
99: 345394 ( 2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
100: 150134 ( 2.167372) [bowlingbrain.com,2009-11-12,$69]