training-scripts

NAME
       training-scripts, compute-oov-rate, continuous-ngram-count,
       get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm,
       make-diacritic-map, make-google-ngrams, make-gt-discounts,
       make-kn-counts, make-kn-discounts, merge-batch-counts,
       replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams,
       reverse-text, uniform-classes, vp2text - miscellaneous conveniences
       for language model training

SYNOPSIS
       get-gt-counts max=K out=name [counts...]
       make-abs-discount gtcounts
       make-gt-discounts min=min max=max gtcounts
       make-kn-counts order=N max_per_file=M output=file [ no_max_order=1 ]
       make-kn-discounts min=min gtcounts
       make-batch-counts file-list [batch-size [filter [count-dir [options...]]]]
       merge-batch-counts count-dir [file-list|start-iter]
       make-google-ngrams [ dir=DIR ] [ per_file=N ] [ gzip=0 ] [counts-file...]
       continuous-ngram-count [ order=N ] [textfile...]
       reverse-ngram-counts [counts-file...]
       reverse-text [textfile...]
       split-tagged-ngrams [ separator=S ] [counts-file...]
       make-big-lm -name name -read counts [ -trust-totals -max-per-file M ] -lm new-model [options...]
       replace-words-with-classes classes=classes [ outfile=counts normalize=0|1 addone=K have_counts=1 partial=1 ] [textfile...]
       uniform-classes classes >new-classes
       make-diacritic-map vocab
       vp2text [textfile...]
       compute-oov-rate vocab [counts...]

DESCRIPTION
       These scripts perform convenience tasks associated with the training
       of language models. They complement and extend the basic N-gram model
       estimator in ngram-count(1).

       Because these tools are implemented as scripts, they do not
       automatically read or write compressed data files, unlike the main
       SRILM tools. However, since most scripts read data from standard
       input or write it to standard output (by leaving out the file
       argument, or specifying it as "-"), it is easy to combine them with
       gunzip(1) or gzip(1) on the command line.
       Also note that many of the scripts take their options with the gawk(1)
       syntax option=value instead of the more common -option value.

       get-gt-counts computes the counts-of-counts statistics needed in
       Good-Turing smoothing. The frequencies of counts up to K are computed
       (default is 10). The results are stored in a series of files with
       root name: name.gt1counts, name.gt2counts, ..., name.gtNcounts. It is
       assumed that the input counts have been properly merged, i.e., that
       there are no duplicated N-grams.

       make-gt-discounts takes one of the output files of get-gt-counts and
       computes the corresponding Good-Turing discounting factors. The output
       can then be passed to ngram-count(1) via the -gtn options to control
       the smoothing during model estimation. Precomputing the GT discounting
       in this fashion has the advantage that the GT statistics are not
       affected by restricting N-grams to a limited vocabulary. Also,
       get-gt-counts/make-gt-discounts can process arbitrarily large count
       files, since they do not need to read the counts into memory (unlike
       ngram-count).

       make-abs-discount computes the absolute discounting constant needed
       for the ngram-count -cdiscountn options. Input is one of the files
       produced by get-gt-counts.

       make-kn-discounts computes the discounting constants used by the
       modified Kneser-Ney smoothing method. Input is one of the files
       produced by get-gt-counts.

       make-batch-counts performs the first stage in the construction of very
       large N-gram count files. file-list is a list of input text files.
       Lines starting with a '#' character are ignored. These files will be
       grouped into batches of size batch-size (default 10), each of which is
       then processed in one run of ngram-count. For maximum performance,
       batch-size should be as large as possible without triggering paging.
       Optionally, a filter script or program can be given to condition the
       input texts.
       The N-gram count files are left in directory count-dir ("counts" by
       default), where they can be found by a subsequent run of
       merge-batch-counts. All following options are passed to ngram-count,
       e.g., to control N-gram order, vocabulary, etc. (no options triggering
       model estimation should be included).

       merge-batch-counts completes the construction of large count files by
       merging the batched counts left in count-dir until a single count file
       is produced. Optionally, a file-list of count files to combine can be
       specified; otherwise all count files in count-dir from a prior run of
       make-batch-counts will be merged. A number as second argument restarts
       the merging process at iteration start-iter. This is convenient if
       merging fails to complete for some reason (e.g., a temporary lack of
       disk space).

       make-google-ngrams takes a sorted count file as input and creates an
       indexed directory structure, in a format developed by Google to store
       very large N-gram collections. The resulting directory can then be
       used with the ngram-count(1) -read-google option. Optional arguments
       specify the output directory dir and the size N of individual N-gram
       files (default is 10 million N-grams per file). The gzip=0 option
       writes plain, as opposed to compressed, files.

       continuous-ngram-count generates N-grams that span line breaks (which
       are usually taken to be sentence boundaries). To count N-grams across
       line breaks use
              continuous-ngram-count textfile | ngram-count -read -
       The order=N argument controls the order of N-grams counted (default
       3), and should match the argument of ngram-count -order.

       reverse-ngram-counts reverses the word order of N-grams in a counts
       file or stream.
       For example, to recompute lower-order counts from higher-order ones,
       but do the summation over preceding words (rather than following
       words, as in ngram-count(1)), use
              reverse-ngram-counts count-file | \
              ngram-count -read - -recompute -write - | \
              reverse-ngram-counts > new-counts

       reverse-text reverses the word order in text files, line by line.
       Start- and end-sentence tags, if present, will be preserved. This
       reversal is appropriate for preprocessing training data for LMs that
       are meant to be used with the ngram -reverse option.

       split-tagged-ngrams expands N-gram counts of word/tag pairs into mixed
       N-grams of words and tags. The optional separator=S argument allows
       the delimiting character, which defaults to "/", to be modified.

       make-big-lm constructs large N-gram models in a more memory-efficient
       way than ngram-count by itself. It does so by precomputing the
       Good-Turing or Kneser-Ney smoothing parameters from the full set of
       counts, and then instructing ngram-count to store only a subset of the
       counts in memory, namely those of N-grams to be retained in the model.
       The name parameter is used to name various auxiliary files. counts
       contains the raw N-gram counts; it may be (and usually is) a
       compressed file. Unlike with ngram-count, the -read option can be
       repeated to concatenate multiple count files, but the arguments must
       be regular files; reading from stdin is not supported. If Good-Turing
       smoothing is used and the file contains complete lower-order counts
       corresponding to the sums of higher-order counts, then the
       -trust-totals option may be given for efficiency. All other options
       are passed to ngram-count (only options affecting model estimation
       should be given). Smoothing methods other than Good-Turing and
       modified Kneser-Ney are not supported by make-big-lm. Kneser-Ney
       smoothing also requires enough disk space to compute and store the
       modified lower-order counts used by the KN method.
       This is done using the merge-batch-counts command; the -max-per-file
       option controls how many counts are stored per batch and should be
       chosen so that these batches fit in real memory.

       make-kn-counts computes the modified lower-order counts used by the KN
       smoothing method. It is invoked as a helper script by make-big-lm.

       replace-words-with-classes replaces expansions of word classes with
       the corresponding class labels. classes specifies class expansions in
       classes-format(5). Ambiguities are resolved in favor of the longest
       matching word strings. Ties are broken in favor of the expansion
       listed first in classes. Optionally, the file counts (given with
       outfile=counts) will receive the expansion counts resulting from the
       replacements. normalize=0 or 1 indicates whether the counts should be
       normalized to probabilities (default is 1). The addone option may be
       used to smooth the expansion probabilities by adding K to each count
       (default 1). The option have_counts=1 indicates that the input
       consists of N-gram counts and that replacement should be performed on
       them. Note that this will not merge counts that have been mapped to
       identical N-grams, since this is done automatically when
       ngram-count(1) reads count data. The option partial=1 prevents
       multi-word class expansions from being replaced when more than one
       space character occurs in between the words.

       uniform-classes takes a file in classes-format(5) and adds uniform
       probabilities to expansions that don't have a probability explicitly
       stated.

       make-diacritic-map constructs a map file that pairs an ASCII-fied
       version of the words in vocab with all the occurring non-ASCII word
       forms. Such a map file can then be used with disambig(1) and a
       language model to reconstruct the non-ASCII word form with diacritics
       from an ASCII text.

       vp2text is a reimplementation of the filter used in the DARPA Hub-3
       and Hub-4 CSR evaluations to convert "verbalized punctuation" texts to
       language model training data.
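
The matching rule stated for replace-words-with-classes (longest expansion wins, ties go to the expansion listed first) can be illustrated with a small Python sketch. This is not the script's actual code: the function names are hypothetical, the class file is assumed to hold lines of the form "CLASS [probability] word1 word2 ...", and the left-to-right scan is an assumption about how competing matches at different positions are handled.

```python
def parse_classes(lines):
    """Return (expansion_words, class_label) pairs in file order.

    Assumed line layout: CLASS [probability] word1 word2 ...
    """
    expansions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        label, rest = fields[0], fields[1:]
        # Simplification: a numeric second field is taken to be the
        # optional probability and is dropped.
        try:
            float(rest[0])
            rest = rest[1:]
        except ValueError:
            pass
        if rest:
            expansions.append((tuple(rest), label))
    return expansions


def replace_with_classes(words, expansions):
    """Replace class expansions in a word list with their class labels.

    Scans left to right (an assumption). At each position the longest
    matching expansion is chosen; the strict '>' comparison keeps the
    expansion listed first when two matches have the same length.
    """
    out = []
    i = 0
    while i < len(words):
        best = None  # (length, label) of the best match at position i
        for exp, label in expansions:
            n = len(exp)
            if tuple(words[i:i + n]) == exp and (best is None or n > best[0]):
                best = (n, label)
        if best is None:
            out.append(words[i])
            i += 1
        else:
            out.append(best[1])
            i += best[0]
    return out


if __name__ == "__main__":
    classes = parse_classes([
        "CITY new york",
        "CITY new york city",
        "PLACE new york city",
        "DAY monday",
    ])
    print(" ".join(replace_with_classes(
        "i fly to new york city on monday".split(), classes)))
```

In this toy input both "CITY new york city" and "PLACE new york city" match three words, so the first-listed label CITY is used, and the shorter "new york" expansion is never preferred over them.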

       compute-oov-rate determines the out-of-vocabulary rate of a corpus
       from its unigram counts and a target vocabulary list in vocab.

SEE ALSO
       ngram-count(1), ngram(1), classes-format(5), disambig(1),
       select-vocab(1).

BUGS
       Some of the tools could be generalized and/or made more robust to
       misuse.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1995-2006 SRI International
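
EXAMPLES
       The calculation that compute-oov-rate is described as performing can
       be sketched in Python as follows. This is an illustration under
       stated assumptions, not the script itself: the function name oov_rate
       is hypothetical, count lines are assumed to hold a word followed by
       an integer count, and skipping the sentence tags <s> and </s> is an
       assumption rather than documented behavior.

```python
def oov_rate(count_lines, vocab, ignore=("<s>", "</s>")):
    """Token-level out-of-vocabulary rate from unigram count lines.

    count_lines: iterable of lines, each assumed to be "word count".
    vocab:       set of in-vocabulary words.
    ignore:      words left out of the totals (assumption: sentence tags).
    """
    total = 0
    oov = 0
    for line in count_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # keep only unigram lines: one word plus its count
        word, count = fields[0], int(fields[1])
        if word in ignore:
            continue
        total += count
        if word not in vocab:
            oov += count
    return oov / total if total else 0.0


if __name__ == "__main__":
    counts = ["the 6", "cat 3", "zyzzyva 1", "<s> 5", "the cat 2"]
    print(oov_rate(counts, {"the", "cat"}))
```

       Here the bigram line and the <s> line are skipped, leaving 10 tokens
       of which 1 is outside the vocabulary.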