How to dump and convert Aspell dictionary to wordlist or searchable MySQL/MariaDB database

The following post outlines the process of converting Aspell language dictionary for any language into Mysql or MariaDB database.

Step 1 – Install Aspell + Language Dictionary

I will illustrate the entire process by converting the Slovak Aspell dictionary into MySQL database.

The very first step is to install Aspell and Slovak Aspell dictionary onto our Linux distro. That is as easy as running the following command:

yum install aspell-sk

The result should look something like this and we’ll see the message that Aspell-SK dictionary, as well as Aspell itself, were installed.

If the Aspell dictionary is not available, you can download it from https://ftp.gnu.org/gnu/aspell/dict/0index.html

Then drop it into a folder on your machine and untar it by using a following command on your tar.bz2 file:

tar xvjf aspell6-en-2017.08.24-0.tar.bz2

Once un-tared, navigate to the folder and run the following commands:

 chmod 777 *
 ./configure
 make 
 make install

Once done, the language of your choice will be added to Aspell and you can continue on your way to extract the wordlist file as explained below:

Step 2 – Dump Aspell into Wordlist file

Now that Aspell is installed, let’s dump the Slovak dictionary to disk using the following Aspell command:

aspell -d sk dump master | aspell -l sk expand > words.sk.txt

The command for other languages would essentially be the same, except the language abbreviations. So for example, for English language we would use ‘en’ instead of ‘sk’, something like this:

aspell -d en dump master | aspell -l en expand > words.en.txt

Not so hard, eh? :)

 

Step 3 – Aspell Wordlist PostProcessing

Now, we should have the entire dictionary file dumped into a text file named ‘words.sk.txt’ in UTF-8 format.

Let’s open it in Notepad++, this is what we should see:

We can immediately see a problem… In certain languages, a single word can be spelled out in a variety of formats that all stem from the original simple form of the word. Note: this isn’t a problem in English dictionary, so if this is what you were after, by now you should have a perfectly well working English wordlist.

For those whose export dictionary file includes the long lines of different word forms, we need to somehow fix it, because the idea behind wordlist file is to have each word on its own separate line.

That said, we need to do another conversion.

Under Linux, this can easily be achieved by running the following command:

sed -E -e 's/\s+/\n/g' words.sk.txt > words.sk-new.txt

You’ll end up with a new file, that has all words broken on a separate line.

In Windows, Notepad++ is a perfect tool for this, albeit it’s not as easy as in Linux. The following are Windows instructions…

So, let’s open the words.sk.txt in Notepad++ once again and do a search and replace. We will replace all spaces found in the file, with the new line.

In Notepad++ go to Search / Replace or simply just press CTRL-H, you’ll see the following box.

  • Into ‘Find what’ field insert a single empty space.
  • Into ‘Replace with’ field insert a newline character: ‘\n’
  • Further below, in the Search Mode section, select ‘Extended’ mode

Once all is done, press ‘Replace All’ option and wait.

Note: this may take anywhere from 5 to 30 minutes, depending on the length of the dictionary and your computer resources (processor and available memory).

Once done, you should see the file with each word on its own line:

Save the file in Notepad++ by pressing CTRL-S or File/Save option.

This concludes the section on how to convert Aspell dictionary into wordlist.

Step 4 – Converting Aspell Wordlist Into MySQL/MariaDB Database

Now, for those who’d like to be able to use files as a database, all that needs to be done, is to create a UTF8 database and table, into which we can import the text file we created. All we really need is a table with ‘id’ and ‘word’ columns, where ‘word’ will have the full-text searchable index and ‘id’ will auto increment.

So, let’s go to MySQL and create an empty database called: ‘dictionaries’.

CREATE DATABASE `dictionaries`;

Now, let’s open the ‘dictionaries’ database we’ve created and we will create a new table called: ‘words.sk.aspel’. We will do so, by using the following SQL command:

CREATE TABLE `words.sk.aspel` (
	`id` INT(11) NOT NULL AUTO_INCREMENT,
	`word` VARCHAR(50) NULL DEFAULT NULL,
	PRIMARY KEY (`id`),
	FULLTEXT INDEX `word` (`word`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM
;

Once done, this is what it should look like. We have an ‘id’ that will increment with each imported word and we have a VARCHAR(50) column where we will hold all the imported wordsAdditionallyly, the primary key is on ‘id’ and fulltext key is on ‘word’ column.

At this point, all we need to do is to import the word.sk.txt file into the database table ‘words.sk.aspel’.

That I found is to be a bit complicated, but HeidiSQL includes the functionality.

So unless there is a better way (let me know), let’s go and install HeidiSQL. You can download it from https://www.heidisql.com/download.php

Once installed, connect to dictionaries database, click on the words.sk.aspel table and you’ll see that it’s empty:

Now, we can go ahead and import the wordlist file.

In HeidiSQL, make sure you’ve selected your table first (in my case I’ve simply clicked on ‘words.sk.aspel’.

Then go to Tools at the top menu and select option: Import CSV file:

In the menu which will open:

  • Select the wordlist file, in my case the location of words.sk.txt file on disk.
  • For encoding, select: “let server/database decide (utf8).
  • Leave the options on Low Priority
  • Handling of duplicate words on ‘Replace’
  • Method: Server parses file contents (LOAD DATA)
  • Control characters section is crucial to get right, here make sure all fields are empty, except: ‘Lines terminated by’, select ‘\n’ in this field.
  • Destination section should have our ‘dictionaries’ database and also table ‘words.sk.aspel’ selected.
  • In the collumns, uncheck ‘id’ and only check ‘word’ column.

Press Import! button and wait. The entire words.sk.txt file will be imported into the database. It shouldn’t take too long, a minute or two.

And voila, the wordlist we’ve exported from Aspell in txt format should now be imported into DB and be a fulltext searchable database.

Let’s run a test command to show the first 20 words of the Slovak language file:

SELECT * FROM `dictionaries`.`words.sk.aspel` LIMIT 20;

Result:

Or something more complex?

How about showing all words in Slovak language that can be made from the following letters: ‘ABCDEFGH’

SELECT word, (
    IF (LOCATE(binary('a'), binary word) > 0, 1, 0) + 
    IF (LOCATE(binary('b'), binary word) > 0, 1, 0) + 
    IF (LOCATE(binary('c'), binary word) > 0, 1, 0) + 
    IF (LOCATE(binary('d'), binary word) > 0, 1, 0) + 
    IF (LOCATE(binary('e'), binary word) > 0, 1, 0) +
    IF (LOCATE(binary('f'), binary word) > 0, 1, 0) +
    IF (LOCATE(binary('g'), binary word) > 0, 1, 0) + 
    IF (LOCATE(binary('h'), binary word) > 0, 1, 0) 
) AS chars_present,
    CHAR_LENGTH(word) as word_len
FROM `words.sk.aspel`
WHERE CHAR_LENGTH(word) > 1
AND binary word not like '%-%'
AND binary word not like '% %'
HAVING chars_present > 2
AND chars_present >= word_len
order by word_len desc, chars_present asc

Voila, it works :)

I hope this article was helpful. Leave me a comment if you liked it or if you have any suggestions on how to make the process better.