[libvoikko] Zip specification
Harri Pitkänen
hatapitk at iki.fi
Tue Sep 28 15:58:17 EEST 2010
On Friday 24 September 2010, Harri Pitkänen wrote:
> I'd suggest the following limitation to the format:
>
> "The ZIP file must not be encrypted and it must be compatible with ZIP
> format version 2.0 as defined by PKWare application note 6.2.0:
> http://www.pkware.com/support/application-note-archives"
The ZIP format does have some problems. If the data is compressed, it must be
uncompressed before use and this makes it hard to share the data between
processes. This may not be a problem on a typical single user workstation but
almost everywhere else it is an issue:
1) Embedded systems may have slow CPUs which can add a noticeable startup
delay when the spell checker is started.
2) Low end virtual servers don't have a lot of memory. Additionally swapping
out to (virtual) disk is slow. Using file backed memory for dictionary data
would allow the operating system to skip the swap-out phase if the memory is
needed somewhere else. With compressed ZIP it won't be possible to use file
backed memory.
3) LTSP systems can have many desktop users simultaneously. We waste a lot of
memory if there are multiple copies of the same data in the memory. Let's
assume separate transducers for spell checkers and hyphenators, each using 10
MB of memory. Each user has Firefox (spell checker) and OOo (spell checker +
hyphenator) in use. That would make 30 MB of memory for each user. If we have
20 simultaneous users, that's 600 MB for just spellers and hyphenators. For
reference, the amount of total memory that people who build such systems for
Finnish schools recommended[1] is 500 MB + 20 * 128 MB = 3060 MB so we would
be using about 20% of the total system memory at that point. I suspect that
this recommendation may be a bit outdated but using even 10% of total memory for
spell checking would seem a bit too much.
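
As a quick check of the arithmetic in 3), here is a small Python sketch. The
sizes and user count are the ones assumed above. The "shared" figure is my own
assumption: it supposes that file backed mappings let all processes share one
resident copy of each distinct transducer (one speller, one hyphenator).

```python
# Figures from the estimate above; all sizes are in MB.
TRANSDUCER_MB = 10      # one speller or one hyphenator transducer
COPIES_PER_USER = 3     # Firefox speller + OOo speller + OOo hyphenator
USERS = 20
RECOMMENDED_TOTAL_MB = 500 + USERS * 128   # recommendation cited in [1]

per_user_mb = TRANSDUCER_MB * COPIES_PER_USER
private_total_mb = per_user_mb * USERS
share_percent = round(100 * private_total_mb / RECOMMENDED_TOTAL_MB, 1)

# Assumption: with shared, file backed pages only one copy of each distinct
# transducer stays resident, regardless of the number of users.
shared_total_mb = TRANSDUCER_MB * 2

print(per_user_mb, private_total_mb, RECOMMENDED_TOTAL_MB, share_percent)
print(shared_total_mb)
```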
The first problem can be solved by using uncompressed ZIP files when we know
that the target has a slow CPU. The last problem could be (painfully) worked
around by using other IPC mechanisms. Solving the virtual server issue would
require using uncompressed ZIP and a transducer format that is suitable for
memory mapped use. I assume the current optimized lookup format is not?
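
To illustrate what "uncompressed ZIP plus memory mapping" could look like, here
is a rough Python sketch (not libvoikko code). It checks that a member is
stored without compression or encryption, finds where its bytes start by
reading the 30 byte local file header, and maps the archive read-only. The
function name, the archive name "dict.zip" and the member name "spell.bin" are
made up for the example.

```python
import mmap
import os
import struct
import tempfile
import zipfile


def map_stored_member(path, name):
    """Map the archive read-only; return (mapping, data offset, data size)."""
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed and cannot be mapped directly")
    if info.flag_bits & 0x1:
        raise ValueError("member is encrypted")
    with open(path, "rb") as f:
        f.seek(info.header_offset)
        header = f.read(30)  # fixed part of the local file header
        if header[:4] != b"PK\x03\x04":
            raise ValueError("bad local file header signature")
        # File name length and extra field length sit at offsets 26 and 28.
        name_len, extra_len = struct.unpack_from("<HH", header, 26)
        data_offset = info.header_offset + 30 + name_len + extra_len
        # Mapping from offset 0 avoids page alignment constraints on the offset.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mapped, data_offset, info.file_size


def demo():
    payload = b"hypothetical transducer bytes"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dict.zip")
        with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as zf:
            zf.writestr("spell.bin", payload)
        mapped, offset, size = map_stored_member(path, "spell.bin")
        try:
            return bytes(mapped[offset:offset + size])
        finally:
            mapped.close()


print(demo())
```

Because the mapping is read-only and file backed, the operating system can
drop its pages under memory pressure and read them back from the file later,
and several processes mapping the same file can share the same physical pages.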
In my opinion the ideal format would allow efficient memory mapped access when
used on little endian platforms, which are the most commonly used these days. On
big endian platforms we could either translate the data during load (and
lose the memory efficiency) or translate during lookup (which would presumably
be quite slow). The container could be uncompressed ZIP but the transducer
format might need to be different. This would work for all three use cases
mentioned above and would also have some performance benefits on normal
desktop systems.
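
A minimal Python sketch of the "translate during lookup" option follows. The
record layout (16 bit input symbol, 16 bit output symbol, 32 bit target state)
is purely hypothetical and is not the optimized lookup format. The point is
only that fixed width little endian records can be read in place from a mapped
buffer, with the byte swap happening per lookup on big endian hosts.

```python
import struct
import sys

# Hypothetical fixed width transition record, always stored little endian:
# input symbol (uint16), output symbol (uint16), target state (uint32).
RECORD = struct.Struct("<HHI")


def read_transition(buf, index):
    """Decode record number `index` directly from the buffer, without copying
    the whole table. The "<" prefix forces little endian decoding, so on a big
    endian host the bytes are swapped on every lookup."""
    return RECORD.unpack_from(buf, index * RECORD.size)


def native_layout_matches():
    """True if the host byte order equals the on-disk byte order, in which case
    a C or C++ implementation could use the mapped data as it is."""
    return sys.byteorder == "little"


# A tiny two record table standing in for memory mapped transducer data.
table = RECORD.pack(1, 2, 7) + RECORD.pack(3, 4, 9)
print(read_transition(table, 1), native_layout_matches())
```

The same function works on an mmap object or a memoryview, since
unpack_from accepts any object that supports the buffer protocol.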
Harri
[1] http://eduwiki.coss.fi/index.php/LTSP-järjestelmän_asentaminen_ja_hallinnointi