[libvoikko] Zip specification

Tue Sep 28 15:58:17 EEST 2010

On Friday 24 September 2010, Harri Pitkänen wrote:
> I'd suggest the following limitation to the format:
> 
> "The ZIP file must not be encrypted and it must be compatible with ZIP 
> format version 2.0 as defined by PKWare application note 6.2.0: 
> http://www.pkware.com/support/application-note-archives"

The ZIP format does have some problems. If the data is compressed, it must be 
uncompressed before use and this makes it hard to share the data between 
processes. This may not be a problem on a typical single user workstation but 
almost everywhere else it is an issue:

1) Embedded systems may have slow CPUs, so decompression can add a noticeable 
startup delay when the spell checker is started.

2) Low end virtual servers don't have a lot of memory. Additionally swapping 
out to (virtual) disk is slow. Using file backed memory for dictionary data 
would allow the operating system to skip the swap-out phase if the memory is 
needed somewhere else. With compressed ZIP it won't be possible to use file 
backed memory.

3) LTSP systems can have many desktop users simultaneously. We waste a lot of 
memory if there are multiple copies of the same data in memory. Let's 
assume separate transducers for spell checkers and hyphenators, each using 10 
MB of memory. Each user has Firefox (spell checker) and OOo (spell checker + 
hyphenator) in use. That would make 30 MB of memory for each user. If we have 
20 simultaneous users, that's 600 MB for just spellers and hyphenators. For 
reference, the total memory recommended [1] by people who build such systems 
for Finnish schools is 500 MB + 20 * 128 MB = 3060 MB, so we would 
be using about 20% of the total system memory at that point. I suspect that 
this recommendation may be a bit outdated, but using even 10% of total memory 
for spell checking would seem a bit too much.
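
The arithmetic above can be checked with a few lines of Python. This is only 
a sketch of the estimate in this mail: the 10 MB transducer size and the 20 
users are the assumptions stated above, and the "shared" figure additionally 
assumes that Firefox and OOo would map the same speller file, which is my 
illustration and not a measurement.

```python
# Rough memory estimate for the LTSP scenario (all figures in MB).
TRANSDUCER_MB = 10   # assumed size of one speller or hyphenator transducer
USERS = 20           # assumed number of simultaneous desktop users

# Firefox speller + OOo speller + OOo hyphenator, each a private copy.
per_user_mb = 3 * TRANSDUCER_MB
total_mb = per_user_mb * USERS

# Recommended total system memory: 500 MB base + 128 MB per user.
recommended_mb = 500 + USERS * 128
share = total_mb / recommended_mb

# Hypothetical best case with file backed, shared mappings: one copy of the
# speller and one copy of the hyphenator for the whole machine.
shared_mb = 2 * TRANSDUCER_MB

print(f"private copies: {total_mb} MB ({share:.1%} of {recommended_mb} MB)")
print(f"shared mappings: {shared_mb} MB")
```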

The first problem can be solved by using uncompressed ZIP files when we know 
that the target has a slow CPU. The last problem could be (painfully) worked 
around by using other IPC mechanisms. Solving the virtual server issue would 
require using uncompressed ZIP and a transducer format that is suitable for 
memory mapped use. I assume the current optimized lookup format is not?
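
To illustrate why uncompressed ZIP is enough for the container side, here is 
a minimal Python sketch that finds the raw bytes of a stored entry inside the 
archive and reads them through a read-only memory map. It is not libvoikko 
code; the names stored_entry_range, demo, dict.zip and spell.bin are made up 
for the example. It relies only on the ZIP local file header layout (30 fixed 
bytes followed by the file name and the extra field).

```python
import mmap
import os
import struct
import tempfile
import zipfile


def stored_entry_range(zip_path, name):
    """Return (offset, size) of the raw bytes of an uncompressed entry."""
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.getinfo(name)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("entry is compressed and cannot be mapped as is")
        header_offset = info.header_offset
        size = info.file_size
    with open(zip_path, "rb") as f:
        f.seek(header_offset)
        header = f.read(30)
    # Local file header: 4-byte signature, 22 bytes we skip, then the
    # file name length and extra field length as little-endian 16-bit ints.
    signature, name_len, extra_len = struct.unpack("<4s22xHH", header)
    if signature != b"PK\x03\x04":
        raise ValueError("local file header not found")
    return header_offset + 30 + name_len + extra_len, size


def demo():
    """Build a small stored ZIP and read one entry through a memory map."""
    payload = b"transducer-bytes" * 4  # stand-in for real dictionary data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dict.zip")
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
            zf.writestr("spell.bin", payload)
        offset, size = stored_entry_range(path, "spell.bin")
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # Slicing copies the bytes; done here only for the comparison.
                # A real consumer would read the mapped pages in place.
                return mm[offset:offset + size] == payload
```

Calling demo() should return True. The mapping is backed by the ZIP file 
itself, so the operating system can drop and reload the pages without 
touching swap and can share them between processes.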

In my opinion the ideal format would allow efficient memory mapped access when 
used on little endian platforms which are most commonly used these days. On 
big endian platforms we could either translate the data during load (and 
lose the memory efficiency) or translate during lookup (which would presumably 
be quite slow). The container could be uncompressed ZIP but the transducer 
format might need to be different. This would work for all three use cases 
mentioned above and would also have some performance benefits on normal 
desktop systems.
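
The two big endian options can be sketched as follows. The layout used here 
(a flat array of little-endian 32-bit words) is purely hypothetical and only 
stands in for whatever the real transducer format would contain; the function 
names are mine as well.

```python
import struct
import sys


def lookup_translating(buf, index):
    """Translate during lookup: decode one little-endian word on demand."""
    return struct.unpack_from("<I", buf, index * 4)[0]


def load_translating(buf):
    """Translate during load: decode everything into a private copy."""
    count = len(buf) // 4
    return struct.unpack_from("<%dI" % count, buf, 0)


def native_view(buf):
    """No translation: reinterpret the bytes in place (little endian only).

    Assumes the platform's C unsigned int is 4 bytes wide.
    """
    if sys.byteorder != "little":
        raise RuntimeError("native view needs a little endian platform")
    return memoryview(buf).cast("I")


# Three words as they would appear in the (hypothetical) file.
data = struct.pack("<3I", 1, 2, 258)
```

Only native_view keeps the data shareable, since it reads the mapped bytes 
directly. load_translating gives up the sharing, and lookup_translating pays 
a decoding cost on every access.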

Harri

[1] http://eduwiki.coss.fi/index.php/LTSP-järjestelmän_asentaminen_ja_hallinnointi