ngram-class

NAME
       ngram-class - induce word classes from N-gram statistics

SYNOPSIS
       ngram-class [-help] option ...

DESCRIPTION
       ngram-class induces word classes from distributional statistics, so
       as to minimize the perplexity of a class-based N-gram model given the
       provided word N-gram counts. Presently, only bigram statistics are
       used, i.e., the induced classes are best suited for a class-bigram
       language model.

       The program generates the class N-gram counts and class expansions
       needed by ngram-count(1) and ngram(1), respectively, to train and to
       apply the class N-gram model.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -debug level
              Set debugging output at level. Level 0 means no debugging.
              Debugging messages are written to stderr. A useful level to
              trace the formation of classes is 2.

   Input Options
       -vocab file
              Read a vocabulary from file. Subsequently, out-of-vocabulary
              words in both counts and text are replaced with the
              unknown-word token. If this option is not specified, all
              words found are implicitly added to the vocabulary.

       -tolower
              Map the vocabulary to lowercase.

       -counts file
              Read N-gram counts from file. Each line contains an N-gram of
              words, followed by an integer count, all separated by
              whitespace. Repeated counts for the same N-gram are added.
              Counts collected by -text and -counts are additive as well.
              Note that the input should contain consistent lower- and
              higher-order counts (i.e., unigrams and bigrams), as would be
              generated by ngram-count(1).

       -text textfile
              Generate N-gram counts from textfile, which should contain
              one sentence unit per line. Begin/end sentence tokens are
              added if not already present. Empty lines are ignored.

   Class Merging
       -numclasses C
              The target number of classes to induce. A zero argument
              suppresses automatic class merging altogether (e.g., for use
              with -interact).

       -full  Perform full greedy merging over all classes, starting with
              one class per word. This is the O(V^3) algorithm described in
              Brown et al. (1992).

       -incremental
              Perform incremental greedy merging, starting with one class
              each for the C most frequent words, and then adding one word
              at a time. This is the O(V*C^2) algorithm described in Brown
              et al. (1992); it is the default.

       -interact
              Enter a primitive interactive interface when done with
              automatic class induction, allowing manual specification of
              additional merging steps.

       -noclass-vocab file
              Read a list of vocabulary items from file that are to be
              excluded from classes. These words or tags do not undergo
              class merging, but their N-gram counts still affect the
              optimization of model perplexity. The default is to exclude
              the sentence begin/end tags (<s> and </s>) from class
              merging; this can be suppressed by specifying
              -noclass-vocab /dev/null.

   Output Options
       -class-counts file
              Write class N-gram counts to file when done. The format is
              the same as for word N-gram counts, and can be read by
              ngram-count(1) to estimate a class N-gram model.

       -classes file
              Write class definitions (member words and their
              probabilities) to file when done. The output format is the
              same as required by the -classes option of ngram(1).

       -save S
              Save the class counts and/or class definitions every S
              iterations during induction. The filenames are obtained from
              the -class-counts and -classes options, respectively, by
              appending the iteration number. This is convenient for
              producing sets of classes at different granularities during
              the same run. S=0 (the default) suppresses saving.

SEE ALSO
       ngram-count(1), ngram(1).
       P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and
       R. L. Mercer, ``Class-Based n-gram Models of Natural Language,''
       Computational Linguistics 18(4), 467-479, 1992.

BUGS
       Classes are optimized only for bigram models at present.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1999-2004 SRI International
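
EXAMPLE
       The following Python sketch illustrates the idea behind full greedy
       merging on a toy bigram count list in the -counts format. It is not
       the ngram-class implementation: all function and variable names are
       hypothetical, and every candidate merge is rescored from scratch,
       which is far more expensive than the O(V^3) bookkeeping of Brown et
       al. (1992). It assumes that only the class-dependent part of the
       class-bigram log-likelihood needs to be maximized, that the sentence
       begin/end tags are excluded from merging as in the default
       described above, and that the target number counts only the
       mergeable classes (an assumption of this sketch).

```python
import math
from collections import Counter
from itertools import combinations


def read_bigram_counts(lines):
    """Parse 'w1 w2 count' lines; repeated bigrams are summed."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # skip blank lines and non-bigram entries
            continue
        w1, w2, n = fields
        counts[(w1, w2)] += int(n)
    return counts


def class_objective(bigrams, word2class):
    """Class-dependent part of the class-bigram log-likelihood:
    sum of N(c1,c2) * log(N(c1,c2) / (N(c1,*) * N(*,c2)))."""
    pair, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum(n * math.log(n / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items() if n > 0)


def greedy_merge(bigrams, num_classes, noclass=("<s>", "</s>")):
    """Naive full greedy merging down to num_classes mergeable classes."""
    words = sorted({w for bigram in bigrams for w in bigram})
    word2class = {w: w for w in words}  # each class is named after a member
    if num_classes <= 0:  # zero: no automatic merging
        return word2class
    classes = {w for w in words if w not in noclass}
    while len(classes) > num_classes:
        best = None
        for a, b in combinations(sorted(classes), 2):
            # Tentatively fold class b into class a and rescore from scratch.
            trial = {w: (a if c == b else c) for w, c in word2class.items()}
            score = class_objective(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged_away, word2class = best
        classes.discard(merged_away)
    return word2class


# Hypothetical toy data: three pairs of words with identical contexts.
toy_counts = read_bigram_counts([
    "<s> the 2", "<s> a 2",
    "the cat 1", "the dog 1", "a cat 1", "a dog 1",
    "cat runs 1", "cat sleeps 1", "dog runs 1", "dog sleeps 1",
    "runs </s> 2", "sleeps </s> 2",
])
toy_classes = greedy_merge(toy_counts, num_classes=3)
for word, cls in sorted(toy_classes.items()):
    print(word, cls)
```

       On this toy input, words that occur in identical contexts (a and
       the, cat and dog, runs and sleeps) can be merged without lowering
       the objective, so they end up sharing a class, while <s> and </s>
       keep their own.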