The builder requires a Java™ Runtime Environment (JRE) version 1.4+.
A powerful machine is recommended. The compression algorithm used by the builder is very memory-intensive: you should have at least 128 MB of memory (or, at a pinch, of swap space) on your machine. The dictbuilder script sets the memory limit to a high value; in exceptional cases, you might have to set it higher. As an indication, compiling our big French word list (800,000 words, 9.8 MB) takes 30 seconds and 87 MB of memory on a 1 GHz Pentium III.
The builder is no longer included in the SDK. It must be downloaded separately from www.xmlmind.com/spellchecker/.
In all cases, the builder is a command-line utility: a shell script named dictbuilder on Unix or MacOS, and a batch file named dictbuilder.bat on Windows.
General form of the command line:
dictbuilder ?options? word_list ... word_list ?-sub word_list ... word_list?
It is also possible to use a compiled dictionary as input. This is the way to create a new version of an existing dictionary if you do not possess the source word list.
General options:
-cs character_encoding
Encoding used in word lists, frequent word list and hints files. This must be an encoding supported by the Java™ runtime.
This option must be placed before the files it applies to.
-hints hints_file
Specifies the hints file.
A hints file is almost always needed, because it specifies which characters may be used to form a word.
The hints files used to build XMLmind's en, fr, de, and es dictionaries are found here: en.hints, fr.hints, de.hints, es.hints. Note that the encoding of all these hints files is ISO-8859-1.
-freq word_list
List of frequent words.
-prefixes word_list
List of standard prefixes.
-sub word_list ... word_list
Every word list whose path follows this option is subtracted from the resulting dictionary instead of being merged into it: every word belonging to such a word list will be absent from the result. This option should be placed after the input word lists.
-o output_file
Specifies the compiled dictionary output file. The convention is to use a .cdi extension, but this is not required.
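Because the -cs encoding covers the hints file as well as the word lists, a word list saved in another encoding is easiest to handle once it matches the hints file. The following is a minimal sketch, assuming a UTF-8 word list and the standard iconv utility (iconv is not part of the builder, and the file names are hypothetical), that converts such a list to ISO-8859-1 before it is passed to dictbuilder:

```shell
# Hypothetical file names; iconv is a standard Unix utility, not part of the builder.
workdir=$(mktemp -d)

# Sample UTF-8 word list: "café" and "naïve" (each accented letter takes two bytes).
printf 'caf\303\251\nna\303\257ve\n' > "$workdir/mywords.utf8.txt"

# Convert to ISO-8859-1, where each accented letter is a single byte.
iconv -f UTF-8 -t ISO-8859-1 "$workdir/mywords.utf8.txt" > "$workdir/mywords.txt"

# Compare file sizes to confirm that the conversion took place.
utf8_bytes=$(wc -c < "$workdir/mywords.utf8.txt" | tr -d ' ')
latin1_bytes=$(wc -c < "$workdir/mywords.txt" | tr -d ' ')
echo "UTF-8: $utf8_bytes bytes, ISO-8859-1: $latin1_bytes bytes"
```

iconv reports an error if the word list contains a character that has no ISO-8859-1 equivalent, which is a useful early check before running the builder.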
Other options:
-verbose
Explain what is being done.
-dump out_word_list
After merging all the compiled and textual word lists specified on the command line, and after subtracting words if the -sub option is used, output the resulting word list to the specified text file. As always, the encoding of the generated text file is specified using the -cs option.
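The merge-then-subtract behavior described above can be pictured with ordinary set operations on plain-text word lists. The sketch below is an illustration only: it uses the standard sort and comm utilities on hypothetical sample files, and it says nothing about how the builder works internally.

```shell
# Illustration only: plain-text set operations, not the builder's algorithm.
workdir=$(mktemp -d)
printf 'apple\nbanana\n' > "$workdir/mywords.txt"
printf 'banana\ncherry\n' > "$workdir/extrawords.txt"
printf 'banana\n' > "$workdir/removed_words.txt"

# Merge: the union of the input word lists, one entry per word.
LC_ALL=C sort -u "$workdir/mywords.txt" "$workdir/extrawords.txt" > "$workdir/merged.txt"

# Subtract: keep only the merged words that are absent from the -sub list.
LC_ALL=C sort -u "$workdir/removed_words.txt" > "$workdir/sub.txt"
LC_ALL=C comm -23 "$workdir/merged.txt" "$workdir/sub.txt" > "$workdir/result.txt"

# Prints the words that would remain after the subtraction.
cat "$workdir/result.txt"
```

Here banana appears in both input lists but only once in the merged list, and it disappears entirely once the subtracted list is applied.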
Example 1: Create compiled dictionary mylang.cdi out of word lists mywords.txt and extrawords.txt. The encoding of all text files specified in the command line is ISO-8859-2. Hints file is mylang.hints. Frequent words are contained in frqw.txt. Standard prefixes are contained in myprefixes.txt.
dictbuilder -cs ISO-8859-2 -hints mylang.hints -freq frqw.txt -prefixes myprefixes.txt \
mywords.txt extrawords.txt -o mylang.cdi
Example 2: Add words contained in added_words.txt to compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.
dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi added_words.txt -o new_de.cdi
Example 3: Subtract words contained in removed_words.txt from compiled dictionary de.cdi. Compile the resulting word list as new_de.cdi.
dictbuilder -cs ISO-8859-1 -hints de.hints de.cdi \
-sub removed_words.txt -o new_de.cdi
Example 4: Output in text file de.txt all the words contained in compiled dictionary de.cdi.
dictbuilder -verbose -cs ISO-8859-1 -hints de.hints de.cdi -dump de.txt
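Examples 2 and 3 can in principle be combined, since the general form of the command line allows input word lists followed by a -sub list. The sketch below is a hypothetical helper (update_dict_cmd is not part of the distribution) that only prints such a command line so that the option order can be reviewed before anything is run. It assumes that adding and subtracting in a single invocation behaves like the two separate runs, and that no file name contains spaces.

```shell
# Hypothetical helper: prints a dictbuilder command line instead of running it.
# Arguments: encoding, hints file, existing dictionary, words to add,
#            words to remove, output dictionary.
update_dict_cmd() {
  # -cs comes first so that it applies to every text file that follows;
  # -sub comes after the input word lists, as the option description requires.
  echo "dictbuilder -cs $1 -hints $2 $3 $4 -sub $5 -o $6"
}

cmd=$(update_dict_cmd ISO-8859-1 de.hints de.cdi added_words.txt removed_words.txt new_de.cdi)
echo "$cmd"
```

Printing the command rather than executing it keeps the sketch safe to try on a machine where the builder is not installed.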