How to create a spell check enabled MySQL query by leveraging SOUNDEX and Levenshtein Distance algorithms

Recently I came across a situation where I needed to perform the MySQL search in such way, that it would account for typos in user search queries. For example, if the database of words in MySQL contained only the word “assistance” and the user typed the misspelled word “asistence”, I had to be able to return the correct word “assistance” as a closest possible suggestion from the database. It seemed like a trivial task at first, but it took me quite a while to figure the SQL query that had a good balance between performance and overall quality of results.  [Read more…]

Converting UTZOO Usenet archive from magnetic tapes to MySQL database using Java

Recently, I came across a post from 2001 which allowed downloading the entire collection of UTZOO NetNews Archive of earliest USENET posts. These were essentially the earliest available discussions posted to the Internet by folks working for various Universities connected to the Internet. Millions of posts created between Feb 1981 and June of 1991.

Until 2001, these early Usenet discussions were considered being lost, but miraculously Henry Spencer from the University of Toronto, Department of Zoology was backing it up onto magnetic tapes and kept them stored for all these years (apparently at the great cost).

H. Spencer had altogether 141 of these magnetic tapes, but there were of no use, so eventually, him and a couple of motivated people such as David Wiseman (who dragged 141 tapes back and forth in his a pickup truck), Lance Bailey, Bruce Jones, Bob Webber, Brewster Kahle, and Sue Thielen; embarked on a process of converting all of these tapes into regular format, accessible to everyone.

And that’s the copy I downloaded. What a treasure, right?

Well, not so fast, once I unzipped the data, I realized that the TGZ format contains literally millions of small text files (each post in its own file). While it was certainly nice to have, it wasn’t something that I or anyone else could read. Certainly not in a forum like discussion format, it wasn’t obvious which post is the one that starts the discussion or which ones are the replies to the thread. And forget about searching through these files, that was utterly not possible. Just to put things into perspective, it took me over 5 hours to just unzip the archives.

That said, it didn’t take long for me to decide to develop a Java-based converter that would attempt to convert the entire collection from millions of flat files into a fully searchable MySQL database. The following post talks about the process and also includes the Java code of the solution released as open source.

[Read more…]

MongoDB and BSON format

Recently I came across a statement that said: “MongoDB (btw. that’s MongoDB) uses the BSON format which extends the JSON model to provide additional data types” and I think this topic deserves a bit of clarification. [Read more…]

Local vs Distributed Databases

In this article, we’ll analyze the situations in which it is more appropriate to implement a distributed database instead of a centralized (local) database. But before we do so, let’s see how the two of the database concepts differ. [Read more…]