How to create a spell check enabled MySQL query by leveraging SOUNDEX and Levenshtein Distance algorithms

Recently I came across a situation where I needed to perform the MySQL search in such way, that it would account for typos in user search queries. For example, if the database of words in MySQL contained only the word “assistance” and the user typed the misspelled word “asistence”, I had to be able to return the correct word “assistance” as a closest possible suggestion from the database. It seemed like a trivial task at first, but it took me quite a while to figure the SQL query that had a good balance between performance and overal quality of results. 

The solutions I found on the internet were frequently suggesting the SOUNDEX phonetic algorithm built into MySQL.  SOUNDEX was originally developed by Robert C. Russell and Margaret King in the early 1900’s, with the goal to match the words despite minor differences in spelling.

I’ve fiddled with SOUNDEX for a while and considering the results of my testing, I concluded that SOUNDEX provided suggestions that were completely unreliable. Just take a look at the following experiment when running a SOUNDEX based query on the database of almost half a million English words using a misspelled word: “asistence”

 

As you can see, SOUNDEX is completely off, even if I use the sorting by English rank which I added into the database to assist with finding the most common English words:

Considering the results, I’ve decided to abandon SOUNDEX and see if there are other options out there, which brought me to Levenshtein distance (LD) algorithm.

Levenshtein distance (LD) is a degree of the similarity between two strings. The distance is the number of deletions, insertions, or substitutions required to transform source string into target string. The Levenshtein distance algorithm has been used in: Spell checking, Speech recognition, DNA analysis and Plagiarism detection.

For example, if the source string is (s) and the target string is (t), then:

  • If s is “assistance” and t is “assistance”, then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
  • If s is “assistance” and t is “asistence”, then LD(s,t) = 2, because two substitutions or changes are required to transform s into t.

NOTE: You can use https://planetcalc.com/1721/ to calculate Levenshtein distance between two strings if you want to play with it.

Anyhow, this seemed like a brilliant solution, however, MySQL doesn’t include a native Levenshtein Distance function. That said, I found the MySQL function on Github, which I’ve added to my database. This is it:

Once I added Levenshtein function into my database, I decided to try it out, by running a query like this, also leveraging rank and Levenshtein ordering to get the results just right:

This worked great:

Except it was entirely unusable because even though the query returned a correct result, it was incredibly slow. Look at the time required to find the result:

RESULT: /* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 17.235 sec. */

RESULT: /* Affected rows: 0 Found rows: 3 Warnings: 0 Duration for 1 query: 00:01:07 minute*/

The time required to find more results is growing exponentially with the number of required results.


 

This took me back to the drawing board. The final solution used a combination of SOUNDEX and LEVENSHTEIN algorithms to get the correct result and get them fast, leveraging not only

The result was found almost immediately, within 0.29 seconds on my 8GB MySQL test box using the Limit 1:

Removing the limit completely to search through an entire database of half a million English words took only 0.31 seconds:

So, the conclusion is to use both methods. The first SOUNDEX gets you the words that sound similar and then LEVENSHTEIN does the good job of filtering these results down to a correct word that most closely matches the typo or misspelling in the user’s entry.

In my testing, the results I am getting out of this query are on par with Bing Spell Search API v7 without paying a cent for such a service and without incuring the latency cost associated with running a spell checking service from the cloud.

Enjoy!

 

Comments

comments