How to import the entire Wikipedia into your own MySQL database. Steps for creating a MySQL database out of a Wikipedia XML dump…

This article illustrates how to create a MySQL database containing the entire Wikipedia from a downloaded Wikipedia XML dump (enwiki-latest-pages-articles-multistream.xml).


Prerequisites:
– Make sure you have at least 100 GB of free space on your hard drive, plus another 60 GB of free space for the MySQL database.
– Download enwiki-latest-pages-articles-multistream.xml.bz2 from http://dumps.wikimedia.org/enwiki/latest/
– Extract the >10 GB enwiki-latest-pages-articles-multistream.xml.bz2; you'll end up with a file called enwiki-latest-pages-articles-multistream.xml of approximately 45 GB (an example extraction command is shown after this list).
– Delete enwiki-latest-pages-articles-multistream.xml.bz2; you don't need it anymore.
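
For example, on Linux the extraction can be done with bzip2 (on Windows a tool such as 7-Zip works too); the -k switch keeps the original archive until you've verified the extracted file:

bzip2 -dk enwiki-latest-pages-articles-multistream.xml.bz2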

1. Download mwdumper.jar from http://dumps.wikimedia.org/tools/; we'll use it to convert the XML dump to SQL.

2. Place mwdumper.jar in the same folder where you extracted enwiki-latest-pages-articles-multistream.xml.

3. Run the following command from the Windows command prompt:
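
A typical mwdumper invocation for this step looks something like the line below, assuming the sql:1.5 output format and writing the result into enwiki-latest-pages-articles-multistream-dump.sql (the file name used in step 6); adjust the format version to match the MediaWiki schema you create in step 5:

java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles-multistream.xml > enwiki-latest-pages-articles-multistream-dump.sql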

Use the following filter (--filter=latest) if you don't care about old article revisions and only want to keep the most recent one:

This skips all but the last revision listed for each page. Note: currently this pays no attention to the timestamp or revision number, but simply goes by the order of items in the dump. This may or may not be strictly correct.
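
With the filter applied, the command would look something like this (the option order here is an assumption):

java -jar mwdumper.jar --filter=latest --format=sql:1.5 enwiki-latest-pages-articles-multistream.xml > enwiki-latest-pages-articles-multistream-dump.sql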

When the process starts, you'll see progress output similar to this:
[screenshot of mwdumper progress output]

This will run for a while, so the only requirement is patience :)

Once this step is completed, you can delete enwiki-latest-pages-articles-multistream.xml; we won't need it anymore.

4. Now it's time to create the 'wiki' database. Run the following SQL command in MySQL:
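
Something along these lines is enough; the binary character set mirrors MediaWiki's own binary columns but is optional, and a plain CREATE DATABASE wiki; works just as well:

CREATE DATABASE wiki DEFAULT CHARACTER SET binary;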

5. The next step is to create all the tables Wikipedia requires. Do so by executing the following SQL:
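
The authoritative table definitions live in the maintenance/tables.sql file shipped with MediaWiki, and exact column sets vary between MediaWiki versions, so treat the following as a minimal sketch of the three core tables mwdumper writes to (page, revision, text) rather than the exact schema:

USE wiki;

CREATE TABLE page (
  page_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  page_namespace INT NOT NULL,
  page_title VARBINARY(255) NOT NULL,
  page_restrictions TINYBLOB NOT NULL,
  page_counter BIGINT UNSIGNED NOT NULL DEFAULT 0, -- present in older schemas; some dumps expect it
  page_is_redirect TINYINT UNSIGNED NOT NULL DEFAULT 0,
  page_is_new TINYINT UNSIGNED NOT NULL DEFAULT 0,
  page_random DOUBLE UNSIGNED NOT NULL DEFAULT 0,
  page_touched BINARY(14) NOT NULL,
  page_latest INT UNSIGNED NOT NULL,
  page_len INT UNSIGNED NOT NULL,
  PRIMARY KEY (page_id),
  UNIQUE KEY name_title (page_namespace, page_title)
) ENGINE=InnoDB;

CREATE TABLE revision (
  rev_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  rev_page INT UNSIGNED NOT NULL,
  rev_text_id INT UNSIGNED NOT NULL,
  rev_comment TINYBLOB NOT NULL,
  rev_user INT UNSIGNED NOT NULL DEFAULT 0,
  rev_user_text VARBINARY(255) NOT NULL DEFAULT '',
  rev_timestamp BINARY(14) NOT NULL,
  rev_minor_edit TINYINT UNSIGNED NOT NULL DEFAULT 0,
  rev_deleted TINYINT UNSIGNED NOT NULL DEFAULT 0,
  rev_len INT UNSIGNED,
  rev_parent_id INT UNSIGNED,
  PRIMARY KEY (rev_id),
  KEY rev_page_id (rev_page, rev_id)
) ENGINE=InnoDB;

-- The text table is the one queried in step 8 (old_id / old_text).
CREATE TABLE text (
  old_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  old_text MEDIUMBLOB NOT NULL,
  old_flags TINYBLOB NOT NULL,
  PRIMARY KEY (old_id)
) ENGINE=InnoDB;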

6. Now you can execute the SQL commands from the file we created earlier. Run the following command from the command prompt:

mysql -p -u your_mysql_username wiki < enwiki-latest-pages-articles-multistream-dump.sql

7. You're done :), the entire Wikipedia is now loaded into your MySQL database.

8. If you need to query MySQL, you can do so with commands similar to this one:

SELECT CONVERT(old_text USING utf8) FROM text WHERE old_id = x;

where x is the numeric ID of the article in the text table. So for example, for id 571517241 the output would look something like this:
[screenshot of query output]

9. Additionally, if you want to import the SQL into a Windows-based MySQL installation, it's pretty fast to execute the mysql command from Windows PowerShell.
To do so, just open Windows Explorer, click the address bar at the top, and type: powershell
Then navigate to your MySQL installation:
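
For example (the install path below is just a hypothetical default; adjust it to your MySQL version and location):

cd "C:\Program Files\MySQL\MySQL Server 5.6\bin"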

Then execute something similar to the following:
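
PowerShell doesn't support the < input redirection used in step 6, so one option is to hand the dump to mysql's source command instead (the dump path below is hypothetical; point it at wherever your .sql file actually lives):

.\mysql.exe --force -p -u your_mysql_username wiki -e "source C:\wiki\enwiki-latest-pages-articles-multistream-dump.sql"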

--force – the import will continue even if there are errors in the dump.
If you have multiple SQL dump files, use it like this (in Linux):
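
Something along these lines should do it, concatenating the dumps and piping them into mysql:

cat *.sql | mysql --force -p -u your_mysql_username wiki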
