Monday, April 18

shovelling data like a mole on steroids

We're on part 2 of our great Wikipedia experiment. Part one is here.

My motivation was simple. I wanted a reasonably current snapshot of the Wikipedia data. Idle clicking around or the odd focused search could then happen from my own machine, not subject to the vagaries of network conditions or the health of the Wikipedia data center. I am probably a compulsive hoarder, but there are limits to the whole pack rat mentality, so I followed some basic guidelines when downloading the databases.

Only current snapshots. No revision history, no metadata. This is a work in progress, so I was prepared to live with the odd dead link as well. Furthermore, I only downloaded the English version of Wikipedia. It has the largest number of entries but, perhaps more pertinently, it is the only language out of that lot that I understand well enough to read encyclopedia articles.

The Wikimedia Foundation dumps its databases (roughly) once a month. The most recent dump, which I used, happened on the 6th of April. The dump files, sans much explanation, can be found here. Note that other projects by the Wikimedia Foundation can also be found here; I also have local copies of Wikiquote and Wiktionary running on my machine. A simple MediaWiki cloning technique (discussed a bit later) made this a snap. In fact, I had a trial run first by importing the much smaller Wikiquote database.

But first, find the download section for Wikipedia; as of writing, it's 4th from the top. The subsection that we need is en.wikipedia, or English Wikipedia. All the other two- and three-letter codes represent different language versions. There are several hyperlinks; the one we need is cur (I'm not making that link clickable). Start downloading and get on with your life. At the current dump, this is 806 MB of compressed text, which decompresses to 2.3 GB of SQL. It might take a while to download. In the meantime, we'll set up our local copy of MediaWiki.
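If you want to know how much disk the dump will eat before unpacking it, a few lines of Python can stream through the archive and count the bytes without writing anything out. Treat this as a rough sketch: it assumes the file is gzip or bzip2 compressed and guesses which from the extension, and decompressed_size, demo and the sample file name are all made up for the illustration.

```python
import bz2
import gzip
import os
import tempfile


def decompressed_size(path, chunk_size=1024 * 1024):
    """Return the decompressed size in bytes, reading one chunk at a time."""
    # Assumption: a .bz2 extension means bzip2, anything else is treated as gzip.
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    total = 0
    with opener(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def demo():
    """Build a tiny stand-in for the real dump and measure it."""
    payload = b"INSERT INTO cur VALUES (1, 'Example');\n" * 1000
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, not the real dump's name.
        path = os.path.join(tmp, "sample_cur_table.sql.gz")
        with gzip.open(path, "wb") as fh:
            fh.write(payload)
        return os.path.getsize(path), decompressed_size(path), len(payload)
```

Reading in fixed-size chunks keeps memory use flat, which matters once the real thing weighs in at a couple of gigabytes.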

First, run an administration tool for MySQL (MySQLCC or anything else that is handy) and CREATE a fresh database. I called mine Wikipedia, but really, the choice of name is yours. For an illusion of better security, a separate database user who can only access this database is also recommended. I called mine the fairly obvious wikip_user. You will need this information before starting the install of the wiki. Testing now that this user can actually log in and access the intended database is useful if something goes wrong later. It usually will. A guy named Murphy said that.
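If you would rather type the statements yourself than click through an admin tool, the database and user setup boils down to roughly three lines of SQL. Here is a small sketch that builds them. The helper name setup_statements, the placeholder password and the localhost host are assumptions for illustration, and the CREATE USER form follows recent MySQL versions, so an older server may want different syntax.

```python
import re

# Only plain names are accepted so nothing surprising ends up inside the SQL.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def setup_statements(database, user, password, host="localhost"):
    """Return SQL that creates a database and a user limited to it."""
    for name in (database, user, host):
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsafe name: {name!r}")
    if "'" in password or "\\" in password:
        raise ValueError("password must not contain quotes or backslashes")
    account = f"'{user}'@'{host}'"
    return [
        f"CREATE DATABASE `{database}`;",
        f"CREATE USER {account} IDENTIFIED BY '{password}';",
        # Privileges are granted on this one database only.
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO {account};",
    ]


def demo():
    # "change-me" is a placeholder password, not a recommendation.
    return setup_statements("Wikipedia", "wikip_user", "change-me")
```

Paste the output into any MySQL client while logged in as an administrator. To check the result, connecting with the mysql command line client as the new user (mysql -u wikip_user -p Wikipedia) should drop you straight into the new, empty database.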

Download the MediaWiki distribution (download link here, but do read the documentation). Decompress the contents (making sure you preserve subdirectories) into a directory named Wikipedia in your webroot. By default, Apache calls this webroot directory htdocs. Point your browser at this directory's URL and the installation process should begin. It's nice and easy, with informative error messages and so on. One detail that I would recommend is using a prefix for the MediaWiki tables. The default mw_ works as well as any other. This serves to distinguish the MediaWiki tables from any others that may be in the same database. It also has another use (explained later) for all those intrepid souls attempting a Wikipedia clone. At the end of this installation, you should have a nice local install of MediaWiki chugging away on your own machine. If all you need is a wiki, stop here. Your job is done.

Some technical notes: the MediaWiki install really is simple. All the installer does is create and populate a file named LocalSettings.php (which it instructs you to move to the main directory when installation is complete). LocalSettings.php contains the full path to your MediaWiki install, the web path and the database details as variables. So, if you want another wiki, you don't need to go through the whole installation procedure again; simply copy the entire Wikipedia directory over and make the appropriate changes to LocalSettings.php. That's it. Begin, the cloning of wikis can.
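The copy-and-edit routine is easy to script if you plan on cloning more than once. The sketch below copies the wiki directory and rewrites a few variables in the copy's LocalSettings.php. It assumes each setting is a simple one-line quoted string assignment such as $wgDBname = "Wikipedia"; and that the variables are called $IP, $wgScriptPath and $wgDBname, so check your own file first. clone_wiki and set_php_variable are hypothetical helpers, not part of MediaWiki.

```python
import os
import re
import shutil


def set_php_variable(source, name, value):
    """Replace the quoted string assigned to $name in PHP source text."""
    # Assumption: the assignment sits on one line, e.g. $wgDBname = "Wikipedia";
    pattern = re.compile(
        r"(\$" + re.escape(name) + r"\s*=\s*)([\"']).*?\2(\s*;)"
    )
    new_source, count = pattern.subn(
        lambda m: m.group(1) + '"' + value + '"' + m.group(3),
        source,
        count=1,
    )
    if count == 0:
        raise KeyError(f"${name} not found")
    return new_source


def clone_wiki(source_dir, target_dir, overrides):
    """Copy a wiki directory and patch LocalSettings.php in the copy only."""
    shutil.copytree(source_dir, target_dir)
    settings_path = os.path.join(target_dir, "LocalSettings.php")
    with open(settings_path, encoding="utf-8") as fh:
        text = fh.read()
    for name, value in overrides.items():
        text = set_php_variable(text, name, value)
    with open(settings_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return settings_path
```

A call might look like clone_wiki("htdocs/Wikipedia", "htdocs/wikiquote", {"IP": "/full/path/to/htdocs/wikiquote", "wgScriptPath": "/wikiquote", "wgDBname": "wikiquote"}), where the paths and the database name are placeholders. The script only touches files; a database called wikiquote would still have to exist before the clone can use it.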

Part 2 is done. We now have:
a shiny PHP-enabled webserver and MySQL database.
our own copy of MediaWiki running off our local webserver.
a download-in-progress of the English Wikipedia cur table dump.

On to the hairy part: importing the Wikipedia data. That's part 3.

