Alfio Puglisi's original wiki2static.pl generates its static web pages in a two-level directory structure, using the first two letters of each HTML file name as the ordering criterion. This means that the file “zaurus.html” can be found in the directory wikipedia/z/za. These HTML files contain quite a bit of redundancy: first, because HTML-encoded text always does, and second, because the files resemble each other. All of them contain, for example, the links to the main, search and edit pages as well as a lot of similar markup.
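The directory rule described above can be sketched in a few lines of Python. This is only an illustration, not code taken from wiki2static.pl: the function name article_dir and the lowercasing of the file name are assumptions, and file names shorter than two characters are not handled.

```python
import posixpath


def article_dir(filename, root="wikipedia"):
    """Return the two-level directory for an article file name.

    Assumption: names are lowercased before the letters are taken.
    """
    name = filename.lower()
    # First level: the first letter; second level: the first two letters.
    return posixpath.join(root, name[0], name[:2])


print(article_dir("zaurus.html"))  # wikipedia/z/za
```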
Exploiting only the first kind of redundancy (the one inside each file) is an interesting option, since compressed HTML files (i.e. html.gz files) can be understood and displayed by some web browsers. There is also an Apache module that decompresses them on the fly for the other browsers. This would allow a compressed static version that needs nothing but a web server and a browser.
Alas, with that choice compression suffers because not all available redundancy is exploited, and even more because of the storage lost in partially filled file system blocks. Since most file systems have a fixed block size and allocate only complete blocks to files, any file whose size is not a multiple of the block size wastes storage (a notable exception is reiserfs, which tries to use the unused bytes in these blocks for other files). On average, every file wastes half a file system block (FAT calls them 'clusters', by the way). With 180000 HTML files in the English wikipedia and a flash disk with a block size of 2KB (most FAT-formatted flash cards use bigger blocks!), this adds up to a loss of 180000KB, or ca. 175MB.
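The estimate above is simple arithmetic and can be checked with a short Python sketch. The function name expected_slack_bytes is made up for this illustration. The half-block-per-file figure is the average assumed in the text, not a measurement.

```python
def expected_slack_bytes(file_count, block_size):
    """Expected wasted bytes if every file wastes half a block on average."""
    return file_count * block_size // 2


loss = expected_slack_bytes(180_000, 2 * 1024)  # 180000 files, 2KB blocks
print(loss // 1024, "KB")              # 180000 KB
print(round(loss / 1024 ** 2, 1), "MB")  # 175.8 MB
```

With larger clusters the loss grows linearly: the same call with a 16KB cluster size gives eight times as much.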
Therefore wiki2zaurus tars and compresses the complete leaf directories, so that instead of 180000 HTML files the wikipedia consists of only ca. 720 tar files. Zaurus.html can be found in the file wikipedia/z/za.tgz. This is no longer directly understandable to a web browser, so a CGI script is introduced (wpg-de resp. wpg-en) whose only task is to take the leading letters of the passed file name, build the path to the tgz file and extract the wanted article. In addition, some HTML is the same for all files. It is removed from the individual files and printed by the CGI script instead. Thus the files are no longer complete HTML files, and their suffix was changed from “.html” to “.ht”.
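The lookup done by the CGI script can be sketched as follows. This is a minimal Python illustration, not the actual wpg-de/wpg-en code. The helper names, the HEADER and FOOTER strings, the lowercasing of names and the layout inside the archive (members stored as plain “name.ht” at the top level) are all assumptions made for the example.

```python
import io
import os
import tarfile
import tempfile

# Hypothetical shared markup that the script prints around every article.
HEADER = b"<html><body>"
FOOTER = b"</body></html>"


def archive_path(article, root="wikipedia"):
    """Build the path to the tgz file from the first two letters."""
    name = article.lower()
    return os.path.join(root, name[0], name[:2] + ".tgz")


def read_article(article, root="wikipedia"):
    """Extract one '.ht' fragment from its archive and wrap it in full HTML."""
    member = article.lower() + ".ht"  # assumed member name inside the archive
    with tarfile.open(archive_path(article, root), "r:gz") as tar:
        fragment = tar.extractfile(member).read()  # KeyError if not present
    return HEADER + fragment + FOOTER


def demo():
    """Create a tiny za.tgz in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "z"))
        body = b"<p>Zaurus article text</p>"
        with tarfile.open(os.path.join(root, "z", "za.tgz"), "w:gz") as tar:
            info = tarfile.TarInfo("zaurus.ht")
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
        return read_article("Zaurus", root)


print(demo())
```

Only one archive is opened per request, so the cost of a page view is decompressing a single leaf directory rather than the whole tree.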
The same arguments hold for the search index files, which are typically located in wikipedia/search and are called aa.js, ab.js, ... . They are compressed into single-letter archives (i.e. a.tgz, b.tgz, ...), and the wpg-de-s resp. wpg-en-s CGI scripts are responsible for extracting them and inserting them between the parts of the JavaScript-based search page.
This requires changes to the URLs used in links between articles, especially since another aim was to keep several languages next to each other, sharing the same images and TeX graphics. But the result is compression by a factor of about 5, which makes it possible to browse the wikipedia offline without several GB of storage.
Coming from a database background, my first idea for bringing wikipedia to the Zaurus (which, after all, is a complete UNIX computer) was to build the databases from the official dump, remove all content not needed for an offline copy (e.g. talk pages, ...), compress the database (MySQL allows this) and then use slightly adapted versions of the official PHP code to generate the wikipedia on the Zaurus the same way it is generated on the original website. It was not only an idea: I actually wrote a script to clean and compress the database (put it into the maintenance directory before trying) and changed the PHP code (overwrite your htdocs/wiki directory with it) to get a rudimentary wikipedia running on the desktop with a greatly reduced database size.
Although the above scripts are alpha at best, things looked lovely on the desktop. Alas, there are some problems bringing it to the Zaurus:
Apache for the Zaurus ships a PHP version that is too old,
There is no texvc (which requires OCaml) on the Zaurus, and most importantly
MySQL 3.23.52 on the PC already seemed buggy with respect to table compression (myisampack followed by myisamchk loses some columns). The newest version for the Zaurus is even older (3.22).
Because of this, a switch of approach seemed appropriate. But feel free to try it yourself, using the above files as a starting point.