training-scripts(1)

NAME
       training-scripts, compute-oov-rate, continuous-ngram-count,
       get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm,
       make-diacritic-map, make-google-ngrams, make-gt-discounts,
       make-kn-counts, make-kn-discounts, merge-batch-counts,
       replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams,
       reverse-text, uniform-classes, vp2text - miscellaneous conveniences
       for language model training

SYNOPSIS
       get-gt-counts max=K out=name [counts ...]
       make-abs-discount gtcounts
       make-gt-discounts min=min max=max gtcounts
       make-kn-counts order=N max_per_file=M output=file [ no_max_order=1 ]
       make-kn-discounts min=min gtcounts
       make-batch-counts file-list [batch-size [filter [count-dir [options ...]]]]
       merge-batch-counts count-dir [file-list|start-iter]
       make-google-ngrams [ dir=DIR ] [ per_file=N ] [ gzip=0 ] [counts-file ...]
       continuous-ngram-count [ order=N ] [textfile ...]
       reverse-ngram-counts [counts-file ...]
       reverse-text [textfile ...]
       split-tagged-ngrams [ separator=S ] [counts-file ...]
       make-big-lm -name name -read counts [ -trust-totals -max-per-file M ]
              -lm new-model [options ...]
       replace-words-with-classes classes=classes [ outfile=counts
              normalize=0|1 addone=K have_counts=1 partial=1 ] [textfile ...]
       uniform-classes classes > new-classes
       make-diacritic-map vocab
       vp2text [textfile ...]
       compute-oov-rate vocab [counts ...]

DESCRIPTION
These scripts perform convenience tasks associated with the training of language models. They complement and extend the basic N-gram model estimator in ngram-count(1).

Since these tools are implemented as scripts, they do not automatically read and write compressed data files, unlike the main SRILM tools. However, since most of the scripts read from standard input or write to standard output (when the file argument is left out, or specified as ``-''), it is easy to combine them with gunzip(1) or gzip(1) on the command line. Also note that many of the scripts take their options in the gawk(1) syntax option=value instead of the more common -option value.

get-gt-counts computes the counts-of-counts statistics needed for Good-Turing smoothing. The frequencies of counts up to K are computed (default 10). The results are stored in a series of files with root name: name.gt1counts, name.gt2counts, ..., name.gtNcounts. It is assumed that the input counts have been properly merged, i.e., that they contain no duplicate N-grams.

make-gt-discounts takes one of the output files of get-gt-counts and computes the corresponding Good-Turing discounting factors. The output can then be passed to ngram-count(1) via the -gtn options to control smoothing during model estimation. Precomputing the GT discounts in this fashion has the advantage that the GT statistics are not affected by restricting N-grams to a limited vocabulary.
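For example, Good-Turing discounts for a trigram model might be precomputed as follows (the file names and the min/max cutoffs here are illustrative only):

    get-gt-counts max=10 out=corpus corpus.counts
    make-gt-discounts min=1 max=7 corpus.gt2counts > corpus.gt2
    make-gt-discounts min=1 max=7 corpus.gt3counts > corpus.gt3
    ngram-count -read corpus.counts -order 3 \
        -gt2 corpus.gt2 -gt3 corpus.gt3 -lm corpus.lm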
Also, get-gt-counts and make-gt-discounts can process arbitrarily large count files, since they do not need to read the counts into memory (unlike ngram-count).

make-abs-discount computes the absolute discounting constant needed for the ngram-count -cdiscountn options. Input is one of the files produced by get-gt-counts.

make-kn-discounts computes the discounting constants used by the modified Kneser-Ney smoothing method. Input is one of the files produced by get-gt-counts.

make-batch-counts performs the first stage in the construction of very large N-gram count files. file-list is a list of input text files; lines starting with a `#' character are ignored. The files are grouped into batches of size batch-size (default 10), each of which is then processed in one run of ngram-count. For maximum performance, batch-size should be as large as possible without triggering paging. Optionally, a filter script or program can be given to condition the input texts. The N-gram count files are left in directory count-dir (``counts'' by default), where they can be found by a subsequent run of merge-batch-counts. All following options are passed to ngram-count, e.g., to control N-gram order, vocabulary, etc. (no options triggering model estimation should be included).

merge-batch-counts completes the construction of large count files by merging the batched counts left in count-dir until a single count file is produced. Optionally, a file-list of count files to combine can be specified; otherwise, all count files in count-dir from a prior run of make-batch-counts are merged. A number as the second argument restarts the merging process at iteration start-iter.
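For example, very large trigram counts might be built like this (the file names are illustrative, and cat serves as a no-op filter so that the count directory and ngram-count options can be specified):

    ls corpus/*.txt > file.list
    make-batch-counts file.list 20 cat counts -order 3
    merge-batch-counts counts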
Restarting is convenient if merging fails to complete for some reason (e.g., for temporary lack of disk space).

make-google-ngrams takes a sorted count file as input and creates an indexed directory structure, in a format developed by Google to store very large N-gram collections. The resulting directory can then be used with the ngram-count(1) -read-google option. Optional arguments specify the output directory dir and the size N of individual N-gram files (default 10 million N-grams per file). The gzip=0 option writes plain, as opposed to compressed, files.

continuous-ngram-count generates N-grams that span line breaks (which are usually taken to be sentence boundaries). To count N-grams across line breaks, use

    continuous-ngram-count textfile | ngram-count -read -

The order=N argument controls the order of the N-grams counted (default 3), and should match the -order argument of ngram-count.

reverse-ngram-counts reverses the word order of N-grams in a counts file or stream. For example, to recompute lower-order counts from higher-order ones, but summing over preceding words (rather than following words, as ngram-count(1) does), use

    reverse-ngram-counts count-file | \
    ngram-count -read - -recompute -write - | \
    reverse-ngram-counts > new-counts

reverse-text reverses the word order in text files, line by line. Start- and end-of-sentence tags, if present, are preserved. This reversal is the appropriate preprocessing for training data of LMs that are meant to be used with the ngram -reverse option.

split-tagged-ngrams expands N-gram counts of word/tag pairs into mixed N-grams of words and tags. The optional separator=S argument changes the delimiting character, which defaults to "/".
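For example, a "backwards" trigram model might be trained with reverse-text and then evaluated as follows (file names are illustrative):

    reverse-text corpus.txt | ngram-count -order 3 -text - -lm backwards.lm
    ngram -lm backwards.lm -reverse -ppl test.txt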
make-big-lm constructs large N-gram models in a more memory-efficient way than ngram-count by itself. It does so by precomputing the Good-Turing or Kneser-Ney smoothing parameters from the full set of counts, and then instructing ngram-count to store only a subset of the counts in memory, namely those of the N-grams to be retained in the model. The name parameter is used to name various auxiliary files. counts contains the raw N-gram counts; it may be (and usually is) a compressed file. Unlike with ngram-count, the -read option may be repeated to concatenate multiple count files, but the arguments must be regular files; reading from stdin is not supported. If Good-Turing smoothing is used and the file contains complete lower-order counts corresponding to the sums of higher-order counts, the -trust-totals option may be given for efficiency. All other options are passed to ngram-count (only options affecting model estimation should be given). Smoothing methods other than Good-Turing and modified Kneser-Ney are not supported by make-big-lm.

Kneser-Ney smoothing also requires enough disk space to compute and store the modified lower-order counts used by the KN method. This is done using the merge-batch-counts command; the -max-per-file option controls how many counts are stored per batch, and should be chosen so that the batches fit in real memory.

make-kn-counts computes the modified lower-order counts used by the KN smoothing method. It is invoked as a helper script by make-big-lm.

replace-words-with-classes replaces expansions of word classes with the corresponding class labels. classes specifies the class expansions in classes-format(5). Ambiguities are resolved in favor of the longest matching word strings.
Ties are broken in favor of the expansion listed first in classes. Optionally, the file counts will receive the expansion counts resulting from the replacements. normalize=0 or 1 indicates whether the counts should be normalized to probabilities (default 1). The addone option may be used to smooth the expansion probabilities by adding K to each count (default 1). The option have_counts=1 indicates that the input consists of N-gram counts, and that replacement should be performed on them. Note that this will not merge counts that have been mapped to identical N-grams; such merging happens automatically when ngram-count(1) reads count data. The option partial=1 prevents multi-word class expansions from being replaced when more than one space character occurs in between the words.

uniform-classes takes a file in classes-format(5) and adds uniform probabilities to expansions that don't have a probability explicitly stated.

make-diacritic-map constructs a map file that pairs an ASCII-fied version of the words in vocab with all the occurring non-ASCII word forms. Such a map file can then be used with disambig(1) and a language model to reconstruct the non-ASCII word forms with diacritics from ASCII text.

vp2text is a reimplementation of the filter used in the DARPA Hub-3 and Hub-4 CSR evaluations to convert ``verbalized punctuation'' texts to language model training data.

compute-oov-rate determines the out-of-vocabulary rate of a corpus from its unigram counts and a target vocabulary list in vocab.

SEE ALSO
       ngram-count(1), ngram(1), classes-format(5), disambig(1),
       select-vocab(1).

BUGS
       Some of the tools could be generalized and/or made more robust to
       misuse.

AUTHOR
       Andreas Stolcke.

       Copyright 1995-2006 SRI International

SRILM Tools                                    $Date: 2006/08/11 22:35:11 $