This article presents a program that automatically adds diacritics to unaccented words typed on a regular English keyboard. There are solutions for Chrome or Firefox, but I wanted this option everywhere, in any program on any OS, so I decided to code it in Python. In this short article, I demonstrate how to use global keyboard hooks and an SQLite database in Python to automatically add Slovak accents (diacritics) to text typed into any program running under Windows or Linux. The example can easily be adapted to any other language.
Creating the Database
First I needed a database of all words. This can be obtained by dumping the words from the Aspell dictionaries, which exist for practically any known language. In my case, I downloaded the Slovak dictionary and converted it into a searchable database. I explain how to do this in the following article: How to dump and convert Aspell dictionary to wordlist or searchable MySQL/MariaDB database
Once I had the list of all words, I created an SQLite database called ‘sk.db’, loaded the words into it, and arranged them into two columns: one called ‘word’, which holds the Slovak words with their accents, and a second one, which holds the same words without accents. Note: unfortunately SQLite (unlike MySQL) has no built-in accent-insensitive search, which is why two columns were necessary.
Once done, the database contained 1,232,209 unique Slovak words.
Note: the Slovak database of words in SQLite can be downloaded from my Github: https://github.com/JozefJarosciak/slovak-diacritics-python/blob/master/sk.db
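The two-column layout described above can be sketched as follows. This is only a minimal illustration, not the script that produced sk.db: the table name sk_aspell and the column name worddf are taken from the queries later in this article, the helper names strip_accents and build_word_table are hypothetical, and accents are stripped here with the standard unicodedata module rather than unidecode so that the sketch has no third-party dependencies.

```python
import sqlite3
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_word_table(conn: sqlite3.Connection, words) -> None:
    """Create the two-column table and fill it from an iterable of accented words."""
    conn.execute("CREATE TABLE IF NOT EXISTS sk_aspell (word TEXT, worddf TEXT)")
    conn.executemany(
        "INSERT INTO sk_aspell (word, worddf) VALUES (?, ?)",
        ((w, strip_accents(w)) for w in words),
    )
    conn.commit()


# Tiny in-memory demonstration; a real database would be a file such as sk.db.
conn = sqlite3.connect(":memory:")
build_word_table(conn, ["štát", "stáť", "mačka"])
rows = conn.execute("SELECT word, worddf FROM sk_aspell ORDER BY rowid").fetchall()
print(rows)
```

Each row pairs the accented word with its unaccented key, so a lookup only ever has to compare plain ASCII strings.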
Python Code
The whole source code is only about 70 lines long.
To start we need to import some libraries, most importantly the following:
import re  # standard library, used later to filter the typed characters
import sqlite3
import keyboard
import unidecode
sqlite3 – Allows our Python code to communicate with the SQLite database of words we created above.
keyboard – This small Python library allows us to take full control of the keyboard, hook global events in any program, register hotkeys, simulate key presses and much more.
unidecode – We’ll use this library to strip diacritics from the search terms where the user might run a search with some but not all accent marks.
Then I coded the function that globally captures all key presses (it even handles corrections the user makes with backspace) and bundles the typed keys into words that can be searched in the database. The code looks like this and is pretty self-explanatory:
def on_press_reaction(event):
    global words, found_words, equal_sign_press_counter
    if event.name in ('space', '.', ',', ';', 'enter'):
        # A word boundary: start collecting a new word
        words = ''
    elif event.name == 'backspace':
        # The user corrected a typo: drop the last collected character
        if len(words) > 0:
            words = words[:-1]
    elif event.name == 'delete':
        found_words = []
    elif event.name == '=':
        if len(str(words)) > 0:
            # First '=' after a word: look up its accented variants
            search = unidecode.unidecode(words)
            found_words = [search]
            equal_sign_press_counter = 1
            db_search_to_list(search)
            words = ''
        if len(found_words) > 1:
            # Every '=' press replaces the text with the next variant
            print(str(equal_sign_press_counter) + " - " + found_words[equal_sign_press_counter])
            place_word(str(found_words[equal_sign_press_counter]), str(event.name))
            equal_sign_press_counter = equal_sign_press_counter + 1
            if equal_sign_press_counter > len(found_words) - 1:
                equal_sign_press_counter = 0
    else:
        # Collect only single characters that can be part of a word
        if len(str(event.name)) == 1:
            if re.match("^[A-Za-z0-9_-]*$", event.name):
                words = words + str(event.name)
The above function is called from the global keyboard hook and keeps capturing keys and doing database searches for as long as the Python program is running:
keyboard.on_press(on_press_reaction)
keyboard.wait()  # block forever; avoids the CPU-hungry "while True: pass" loop
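The word-buffering rules can be tried out without installing a keyboard hook. The following is a minimal sketch of the same idea as a pure function: update_buffer, WORD_BREAKS and WORD_CHAR are hypothetical names that do not appear in the program itself, and the key names are assumed to be the strings the hook reports, such as 'space' or 'backspace'.

```python
import re

# Keys that end the current word (same set the handler checks for)
WORD_BREAKS = {"space", ".", ",", ";", "enter"}
# A single character that may be part of a word
WORD_CHAR = re.compile(r"^[A-Za-z0-9_-]$")


def update_buffer(buffer: str, key_name: str) -> str:
    """Return the new word buffer after one key press (hypothetical helper)."""
    if key_name in WORD_BREAKS:
        return ""                 # word finished, start over
    if key_name == "backspace":
        return buffer[:-1]        # user corrected a typo
    if WORD_CHAR.match(key_name):
        return buffer + key_name  # ordinary word character
    return buffer                 # ignore shift, ctrl, arrows and similar keys


buffer = ""
for key in ["s", "t", "a", "x", "backspace", "t", "shift"]:
    buffer = update_buffer(buffer, key)
print(buffer)  # stat
```

Keeping the buffer logic free of global state like this makes it easy to check by hand what the program believes the current word is after any sequence of keys.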
Note that some unaccented words in the Slovak language have more than one accented counterpart, for example:
select * from sk_aspell where worddf='stat'
Returns:
stať
stáť
sťať
štat
štát
šťať
Or the word ‘zatazi’ can be any of these words:
zaťaží
zátaži
zátaží
záťaži
záťaží
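To get a feel for how often this ambiguity occurs, the unaccented column can be grouped and counted. The snippet below is a small sketch on an in-memory table that contains only the example words from above plus one unambiguous word; it assumes the table and column names used in the query shown earlier (sk_aspell, word, worddf), and the counts it prints describe this toy data, not the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sk_aspell (word TEXT, worddf TEXT)")

# Example data only: the variants listed above plus one unambiguous word
sample = (
    [(w, "stat") for w in ["stať", "stáť", "sťať", "štat", "štát", "šťať"]]
    + [(w, "zatazi") for w in ["zaťaží", "zátaži", "zátaží", "záťaži", "záťaží"]]
    + [("mačka", "macka")]
)
conn.executemany("INSERT INTO sk_aspell (word, worddf) VALUES (?, ?)", sample)

# Unaccented forms that map to more than one accented word
ambiguous = conn.execute(
    "SELECT worddf, COUNT(*) FROM sk_aspell "
    "GROUP BY worddf HAVING COUNT(*) > 1 "
    "ORDER BY COUNT(*) DESC"
).fetchall()
print(ambiguous)
```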
The database search is therefore done inside another function, db_search_to_list, which collects all matching accented words in a list. If one unaccented word has multiple accented variants, we can cycle through them by repeatedly pressing the ‘=’ key after the word, which makes it easy to select the correctly accented one. The DB search is done in this function:
def db_search_to_list(search_term):
    global found_words
    conn = sqlite3.connect('sk.db')
    # Bind the search term as a parameter instead of concatenating it into the SQL
    cursor = conn.execute("select word from sk_aspell where lower(worddf) = ?", (search_term.lower(),))
    for row in cursor:
        # Preserve the capitalization of the typed word
        if search_term[0].isupper():
            found_words.append(str(row[0]).capitalize())
        else:
            found_words.append(str(row[0]))
    conn.close()
    if len(found_words) > 1:
        print(found_words)
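The cycling behaviour is easier to see in isolation. In the sketch below, next_candidate is a hypothetical helper that mirrors the counter handling of the key handler: the list starts with the unaccented word at index 0, the first press shows index 1, and the counter wraps back to 0 after the last variant, so the original spelling is one of the choices.

```python
def next_candidate(found_words, counter):
    """Return the word at `counter` and the index to use on the next '=' press."""
    word = found_words[counter]
    counter += 1
    if counter > len(found_words) - 1:
        counter = 0  # wrap around to the unaccented original
    return word, counter


# Index 0 is the typed (unaccented) word, the rest come from the database
found_words = ["stat", "stať", "stáť", "štát"]
counter = 1
shown = []
for _ in range(5):  # five presses of '='
    word, counter = next_candidate(found_words, counter)
    shown.append(word)
print(shown)
```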
Once the word is found by pressing the ‘=’ key, we want to replace it automatically wherever we currently are, be it Notepad, Outlook, a browser, or any other Windows or Linux program. To do so, I used a little bit of trickery: I take the length of the word, simulate that many backspace presses plus one more for the ‘=’ character itself, and then simulate typing the found word on the keyboard. That way, the word in question gets replaced in the text. If I want to try a different variant, I simply keep pressing the ‘=’ key to cycle through the alternatives found in the database.
def place_word(found_word, trigger):
    # Erase the word on screen plus the '=' character that triggered the replacement
    for x in range(len(found_word) + 1):
        keyboard.send('backspace')
    keyboard.write(found_word)
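The effect of those simulated key presses can be modelled on a plain string. This is only a sketch: apply_replacement is a hypothetical helper, and it assumes that the accented word has the same number of characters as the unaccented one (true when each accented letter is a single precomposed character), which is also what the backspace count in place_word relies on.

```python
def apply_replacement(screen_text: str, found_word: str) -> str:
    """Model what the target program ends up showing after a replacement."""
    backspaces = len(found_word) + 1  # the word itself plus the '=' trigger
    return screen_text[:-backspaces] + found_word


print(apply_replacement("Toto je stat=", "štát"))  # Toto je štát
```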
Conclusion
This program leaves a lot to be desired, but it’s a good starting point and works well enough for my current needs. Even with 1.2 million words in the database, it responds almost instantly, and it works offline without any Internet connection.
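If lookups ever become slow on a larger word list, one thing worth checking is whether the query can use an index. The sketch below is an assumption-laden illustration rather than a description of sk.db: the index name is made up, and nothing in this article says whether the published database has any index at all. Because the search compares lower(worddf), the index is created on that same expression, which SQLite supports since version 3.9.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sk_aspell (word TEXT, worddf TEXT)")
conn.executemany(
    "INSERT INTO sk_aspell (word, worddf) VALUES (?, ?)",
    [("štát", "stat"), ("mačka", "macka")],
)

# Hypothetical index on the exact expression used in the WHERE clause
conn.execute("CREATE INDEX idx_sk_aspell_worddf_lower ON sk_aspell (lower(worddf))")

# Ask SQLite how it would run the lookup; the last column holds the description
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT word FROM sk_aspell WHERE lower(worddf) = ?",
    ("stat",),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

When the plan mentions the index name, the lookup is an index search rather than a scan of every row.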
Full Source Code
The entire code is available here: https://github.com/JozefJarosciak/slovak-diacritics-python/
import re
import sqlite3
import keyboard
import unidecode

words = ''
found_words = []
equal_sign_press_counter = 0


def db_search_to_list(search_term):
    global found_words
    conn = sqlite3.connect('sk.db')
    # Bind the search term as a parameter instead of concatenating it into the SQL
    cursor = conn.execute("select word from sk_aspell where lower(worddf) = ?", (search_term.lower(),))
    for row in cursor:
        # Preserve the capitalization of the typed word
        if search_term[0].isupper():
            found_words.append(str(row[0]).capitalize())
        else:
            found_words.append(str(row[0]))
    conn.close()
    if len(found_words) > 1:
        print(found_words)


def place_word(found_word, trigger):
    # Erase the word on screen plus the '=' character that triggered the replacement
    for x in range(len(found_word) + 1):
        keyboard.send('backspace')
    keyboard.write(found_word)


def on_press_reaction(event):
    global words, found_words, equal_sign_press_counter
    if event.name in ('space', '.', ',', ';', 'enter'):
        # A word boundary: start collecting a new word
        words = ''
    elif event.name == 'backspace':
        # The user corrected a typo: drop the last collected character
        if len(words) > 0:
            words = words[:-1]
    elif event.name == 'delete':
        found_words = []
    elif event.name == '=':
        if len(str(words)) > 0:
            # First '=' after a word: look up its accented variants
            search = unidecode.unidecode(words)
            found_words = [search]
            equal_sign_press_counter = 1
            db_search_to_list(search)
            words = ''
        if len(found_words) > 1:
            # Every '=' press replaces the text with the next variant
            print(str(equal_sign_press_counter) + " - " + found_words[equal_sign_press_counter])
            place_word(str(found_words[equal_sign_press_counter]), str(event.name))
            equal_sign_press_counter = equal_sign_press_counter + 1
            if equal_sign_press_counter > len(found_words) - 1:
                equal_sign_press_counter = 0
    else:
        # Collect only single characters that can be part of a word
        if len(str(event.name)) == 1:
            if re.match("^[A-Za-z0-9_-]*$", event.name):
                words = words + str(event.name)


keyboard.on_press(on_press_reaction)
keyboard.wait()  # block forever; avoids the CPU-hungry "while True: pass" loop