lm-scripts(1)

NAME
     lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm,
     get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm,
     remove-lowprob-ngrams, reverse-lm, sort-lm - manipulate N-gram
     language models

SYNOPSIS
     add-dummy-bows [lm-file] > new-lm-file
     change-lm-vocab -vocab vocab -lm lm-file -write-lm new-lm-file
          [-tolower] [-subset] [ngram-options ...]
     empty-sentence-lm -prob p -lm lm-file -write-lm new-lm-file
          [ngram-options ...]
     get-unigram-probs [linear=1]
     make-hiddens-lm [lm-file] > hiddens-lm-file
     make-lm-subset count-file|- [lm-file|-]
     make-sub-lm [maxorder=N] [lm-file] > new-lm-file
     remove-lowprob-ngrams [lm-file] > new-lm-file
     reverse-lm [lm-file] > new-lm-file
     sort-lm [lm-file] > sorted-lm-file

DESCRIPTION
     These scripts perform various useful manipulations on N-gram models
     in their textual representation.  Most operate on backoff N-grams in
     ARPA ngram-format(5).

     Since these tools are implemented as scripts, they do not
     automatically read or write compressed model files correctly, unlike
     the main SRILM tools.  However, since most of the scripts read from
     standard input or write to standard output (when the file argument
     is omitted or specified as ``-''), it is easy to combine them with
     gunzip(1) or gzip(1) on the command line.

     Also note that many of the scripts take their options with the
     gawk(1) syntax option=value instead of the more common -option
     value.
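     For example, compressed models can be processed by piping through
     gunzip(1) and gzip(1); the file names below are illustrative:

          # Sort a gzip-compressed ARPA model and recompress the result.
          gunzip -c big.lm.gz | sort-lm | gzip -c > big.sorted.lm.gz

          # gawk-style option syntax: truncate a model to trigrams.
          gunzip -c big.lm.gz | make-sub-lm maxorder=3 > small.lm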
     add-dummy-bows adds dummy backoff weights to N-grams, even where
     they are not required, to satisfy some broken software that expects
     backoff weights on all N-grams (except those of the highest order).

     change-lm-vocab modifies the vocabulary of an LM to be that in
     vocab.  Any N-grams containing out-of-vocabulary words are removed,
     new words receive a unigram probability, and the model is
     renormalized.  The -tolower option causes case distinctions to be
     ignored.  -subset only removes words from the LM vocabulary, without
     adding any.  Any remaining ngram-options are passed to ngram(1), and
     can be used to set the debugging level, N-gram order, etc.

     empty-sentence-lm modifies an LM so that it allows the empty
     sentence with probability p.  This is useful for modifying existing
     LMs that were trained on non-empty sentences only.  ngram-options
     are passed to ngram(1), and can be used to set the debugging level,
     N-gram order, etc.

     make-hiddens-lm constructs an N-gram model that can be used with the
     ngram -hiddens option.  The new model contains intra-utterance
     sentence boundary tags ``<#s>'' with the same probability as the
     original model had for final sentence tags </s>.  Also,
     utterance-initial words are not conditioned on <s>, and there is no
     penalty associated with utterance-final </s>.  Such a model may work
     better if the test corpus is segmented at places other than proper
     sentence boundaries.

     make-lm-subset forms a new LM containing only the N-grams found in
     the count-file, which is in ngram-count(1) format.  The result still
     needs to be renormalized with ngram -renorm (which will also adjust
     the N-gram counts in the header).

     make-sub-lm removes N-grams of order exceeding N.
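     A typical make-lm-subset workflow generates counts from adaptation
     data, restricts the LM to those N-grams, and renormalizes; the file
     names below are illustrative:

          # Collect the trigrams observed in new-domain text.
          ngram-count -order 3 -text newdomain.txt -write newdomain.counts

          # Keep only those N-grams, then renormalize the result.
          make-lm-subset newdomain.counts old.lm | \
               ngram -order 3 -lm - -renorm -write-lm new.lm

          # Map an LM onto a fixed word list, ignoring case.
          change-lm-vocab -vocab wordlist.txt -lm old.lm \
               -write-lm new.lm -tolower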
     This function is now redundant, since all SRILM tools can do this
     implicitly (with no extra memory and very little time overhead) when
     reading N-gram models with the appropriate -order parameter.

     remove-lowprob-ngrams eliminates N-grams whose probability is lower
     than the one they would receive through backoff.  This is useful
     when building finite-state networks for N-gram models.  However,
     this function is now performed much faster by ngram(1) with the
     -prune-lowprobs option.

     reverse-lm produces a new LM that assigns to each reversed sentence
     the same probability that the input model assigns to the original
     sentence.

     sort-lm sorts the N-grams in an LM into lexicographic order
     (left-most words being the most significant).  This is not a
     requirement for SRILM, but might be necessary for some other LM
     software.  (The LMs output by SRILM are sorted somewhat differently,
     reflecting the internal data structures used; that is also the order
     that should give the best cache utilization when SRILM reads models
     back in.)

     get-unigram-probs extracts the unigram probabilities in a simple
     table format from a backoff language model.  The linear=1 option
     causes probabilities to be output on a linear (instead of log)
     scale.

SEE ALSO
     ngram-format(5), ngram(1).

BUGS
     These are quick-and-dirty scripts; what do you expect?

     reverse-lm supports only bigram LMs, and can produce improper
     probability estimates as a result of inconsistent marginals in the
     input model.

AUTHOR
     Andreas Stolcke.

     Copyright 1995-2006 SRI International

SRILM Tools            $Date: 2006/11/18 22:32:45 $            lm-scripts(1)