Tuesday, April 19

Alexandria ... or something close

The library of Alexandria was one of the largest on Earth at one time. These days, it's possible for a slim 2.5 inch hard drive to hold enough information to rival the great library. This is the final step in my attempt at getting a publicly available encyclopedia in a form that is relatively easy to access from my own computer. Parts one and two set up the environment previously.

One very important initial step requires changing a MySQL database parameter. Find the database configuration file (usually named my.cnf), open it in a text editor and find the line which sets max_allowed_packet in the [mysqld] section. This must be changed from the default 1M (1 megabyte maximum) to a higher value. I chose 10M. Perhaps a slightly lower value is possible, but why take chances, huh? If this step is forgotten, the import will fail somewhere around the 600,000 row mark, i.e. with only 33.3% of the job done. I discovered this, obviously, by having the job fail on me the first time. By the way, a configuration change of this nature requires that the MySQL server be restarted.
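
For reference, the relevant part of my.cnf might end up looking something like the fragment below. It's only a sketch: the comment is mine, and whatever else your installation already has in its [mysqld] section (not shown here) stays as it is.

```ini
[mysqld]
# Raised from the 1M default so the big INSERT statements in the dump fit in one packet
max_allowed_packet = 10M
```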

Wait for the download of the cur table dump to complete. An 806 megabyte compressed SQL file should be somewhere on your hard disk. Next step: get that SQL fed into your MySQL database. This is the point where a bit of command line MySQL knowledge comes in handy. The problem with a file this large is that most GUI (or web) tools will simply choke, curl up and die at the sheer size. It's possible (theoretically) to get this done via GUI by splitting the file into manageable chunks or something similar, but err... why bother?

First, get a Win32 port of gzip and use it to decompress the file. Then find the MySQL command line client, which is called (perhaps obviously) mysql, and use it to import the SQL. Run the following command:
mysql -u wikip_user -p < 20050406_cur_table.sql

For bonus points, there is also a way, with some command line trickery and pipes, to manage the entire ungzip/import process in one line. It's even possible on Windows, although a bit dodgy if you're doing it at a Windows cmd prompt. Of course, you could still use the GUI tools for everything. They'd just be much, much slower.
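
Here's roughly what that pipe trickery looks like on a Unix-style shell, wrapped in a small function. Treat it as a sketch rather than exactly what I ran: import_dump is a name I made up for illustration, and the .gz file name in the usage line further down is an assumption based on the uncompressed file name. gzip -dc writes the decompressed SQL to standard output and the pipe hands it straight to mysql, so the uncompressed file never has to sit on the disk.

```shell
#!/bin/sh
# import_dump is a hypothetical helper name, not part of MySQL or gzip.
# $1 = path to the gzipped SQL dump
# $2 = MySQL user to log in as
import_dump() {
    # -d decompresses, -c sends the result to stdout instead of writing a file;
    # mysql then reads the SQL statements from its standard input.
    gzip -dc "$1" | mysql -u "$2" -p
}
```

Call it as import_dump 20050406_cur_table.sql.gz wikip_user. It takes the same options as the plain import command, so if your client needs a database name on the command line, add it after -p.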

Enter the password for wikip_user when prompted and that's it. The hard disk will start churning and anything from 20 minutes to an hour later (depending on the speed of your hard disk and a few other factors), the Wikipedia data will be in your database. Celebrate? Nope. Not just yet. See, the cur table in the Wikipedia dump does not have a prefix. So, looking at your database, you'll see both a mw_cur table and a cur table. Simply type in the following commands at the mysql prompt after logging in as user wikip_user:
drop table mw_cur;
alter table cur rename to mw_cur;


The first drops the existing (installed by default) mw_cur table from your database. The second renames the cur table (with all the freshly imported Wikipedia data) to mw_cur.

Now, open a browser, point it to http://127.0.0.1/Wikipedia and see if you're greeted by the Wikipedia start page. If all has gone well, you should indeed see what looks like a Wikipedia start page. Notice that the page does not look identical to the main Wikipedia page. Be a bit disappointed. This is because the Wikipedia dump does not include ANY images. Some might argue that the images are part (or most) of the fun of a Wikipedia article. Unfortunately, various copyright laws prevent bundling them with the standard database dump. Some of the images in use can be found in the Commons project. Of course, some images are only included under the fair use doctrine, which applies only in the US. Your mileage may vary. Later, I'll describe how I got some of the images included in my local Wikipedia copy, but for now, just click on your very own Wikipedia random page (http://127.0.0.1/wikipedia/Special:Randompage), follow links and enjoy. By the way, it's possible to change how your MediaWiki (and therefore, your Wikipedia page) looks by changing the skin from the not-so-nice default to something a bit cleaner. I recommend the cologneblue skin myself.

And that's all, folks. With this, you'll never need to have another dull moment. You can become the trivia master, amaze your friends and stun your enemies with your encyclopedic grasp of proboscis monkey digestive habits and other pieces of information that may ... in time ... earn you fame and fortune in Jeopardy. Or not.
