training-scripts(1)                                       training-scripts(1)

NAME
       training-scripts, compute-oov-rate, continuous-ngram-count,
       get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm,
       make-diacritic-map, make-google-ngrams, make-gt-discounts,
       make-kn-counts, make-kn-discounts, merge-batch-counts,
       replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams,
       reverse-text, uniform-classes, vp2text - miscellaneous conveniences
       for language model training

SYNOPSIS
       get-gt-counts max=K out=name [counts ...]
       make-abs-discount gtcounts
       make-gt-discounts min=min max=max gtcounts
       make-kn-counts order=N max_per_file=M output=file [ no_max_order=1 ]
       make-kn-discounts min=min gtcounts
       make-batch-counts file-list [batch-size [filter [count-dir
              [options ...]]]]
       merge-batch-counts count-dir [file-list|start-iter]
       make-google-ngrams [ dir=DIR ] [ per_file=N ] [ gzip=0 ]
              [counts-file ...]
       continuous-ngram-count [ order=N ] [textfile ...]
       reverse-ngram-counts [counts-file ...]
       reverse-text [textfile ...]
       split-tagged-ngrams [ separator=S ] [counts-file ...]
       make-big-lm -name name -read counts [ -trust-totals -max-per-file M ]
              -lm new-model [options ...]
       replace-words-with-classes classes=classes [outfile=counts
              normalize=0|1 addone=K have_counts=1 partial=1 ]
              [textfile ...]
       uniform-classes classes > new-classes
       make-diacritic-map vocab
       vp2text [textfile ...]
       compute-oov-rate vocab [counts ...]

DESCRIPTION
       These scripts perform convenience tasks associated with the training
       of language models.  They complement and extend the basic N-gram
       model estimator in ngram-count(1).

       Since these tools are implemented as scripts they don't
       automatically input or output compressed data files correctly,
       unlike the main SRILM tools.  However, since most scripts work with
       data from standard input or to standard output (by leaving out the
       file argument, or specifying it as ``-'') it is easy to combine
       them with gunzip(1) or gzip(1) on the command line.

       Also note that many of the scripts take their options with the
       gawk(1) syntax option=value instead of the more common
       -option value.

       get-gt-counts computes the counts-of-counts statistics needed in
       Good-Turing smoothing.  The frequencies of counts up to K are
       computed (default is 10).
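Because the file argument can be omitted to read standard input, a compressed count file can be fed to get-gt-counts with a simple pipe; a minimal sketch, where the input file name is hypothetical:

```shell
# Compute counts-of-counts (frequencies of counts up to max=10) from
# compressed trigram counts read on stdin; output files are rooted at
# "corpus".  "counts.3grams.gz" is a hypothetical input file.
gunzip -c counts.3grams.gz | get-gt-counts max=10 out=corpus
```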
       The results are stored in a series of files with root name:
       name.gt1counts, name.gt2counts, ..., name.gtNcounts.  It is assumed
       that the input counts have been properly merged, i.e., that there
       are no duplicated N-grams.

       make-gt-discounts takes one of the output files of get-gt-counts
       and computes the corresponding Good-Turing discounting factors.
       The output can then be passed to ngram-count(1) via the -gtn
       options to control the smoothing during model estimation.
       Precomputing the GT discounting in this fashion has the advantage
       that the GT statistics are not affected by restricting N-grams to a
       limited vocabulary.  Also, get-gt-counts/make-gt-discounts can
       process arbitrarily large count files, since they do not need to
       read the counts into memory (unlike ngram-count).

       make-abs-discount computes the absolute discounting constant needed
       for the ngram-count -cdiscountn options.  Input is one of the files
       produced by get-gt-counts.

       make-kn-discounts computes the discounting constants used by the
       modified Kneser-Ney smoothing method.  Input is one of the files
       produced by get-gt-counts.

       make-batch-counts performs the first stage in the construction of
       very large N-gram count files.  file-list is a list of input text
       files.  Lines starting with a `#' character are ignored.
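As a sketch of this first stage (all file and directory names here are hypothetical), the file list names one input text per line, and a decompression filter can be supplied since the scripts do not decompress automatically:

```shell
# A hypothetical list of training texts; the '#' line is a comment.
cat > train.files <<'EOF'
# newswire portion
data/nyt.1994.txt.gz
data/nyt.1995.txt.gz
EOF

# Batch the texts (2 per ngram-count run), condition them through zcat,
# leave counts in ./counts, and pass -order 3 through to ngram-count.
make-batch-counts train.files 2 zcat counts -order 3
```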
       These files will be grouped into batches of size batch-size
       (default 10) that are then processed in one run of ngram-count
       each.  For maximum performance, batch-size should be as large as
       possible without triggering paging.  Optionally, a filter script or
       program can be given to condition the input texts.  The N-gram
       count files are left in directory count-dir (``counts'' by
       default), where they can be found by a subsequent run of
       merge-batch-counts.  All following options are passed to
       ngram-count, e.g., to control N-gram order, vocabulary, etc. (no
       options triggering model estimation should be included).

       merge-batch-counts completes the construction of large count files
       by merging the batched counts left in count-dir until a single
       count file is produced.  Optionally, a file-list of count files to
       combine can be specified; otherwise all count files in count-dir
       from a prior run of make-batch-counts will be merged.  A number as
       second argument restarts the merging process at iteration
       start-iter.  This is convenient if merging fails to complete for
       some reason (e.g., for temporary lack of disk space).

       make-google-ngrams takes a sorted count file as input and creates
       an indexed directory structure, in a format developed by Google to
       store very large N-gram collections.  The resulting directory can
       then be used with the ngram-count(1) -read-google option.
       Optional arguments specify the output directory dir and the size N
       of individual N-gram files (default is 10 million N-grams per
       file).  The gzip=0 option writes plain, as opposed to compressed,
       files.

       continuous-ngram-count generates N-grams that span line breaks
       (which are usually taken to be sentence boundaries).  To count
       N-grams across line breaks use

            continuous-ngram-count textfile | ngram-count -read -

       The argument N controls the order of N-grams counted (default 3),
       and should match the argument of ngram-count -order.

       reverse-ngram-counts reverses the word order of N-grams in a counts
       file or stream.  For example, to recompute lower-order counts from
       higher-order ones, but do the summation over preceding words
       (rather than following words, as in ngram-count(1)), use

            reverse-ngram-counts count-file | \
            ngram-count -read - -recompute -write - | \
            reverse-ngram-counts > new-counts

       reverse-text reverses the word order in text files, line by line.
       Start- and end-sentence tags, if present, will be preserved.  This
       reversal is appropriate for preprocessing training data for LMs
       that are meant to be used with the ngram -reverse option.

       split-tagged-ngrams expands N-gram counts of word/tag pairs into
       mixed N-grams of words and tags.  The optional separator=S argument
       allows the delimiting character, which defaults to "/", to be
       modified.
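Tying the earlier pieces together, the Good-Turing statistics can be precomputed with get-gt-counts/make-gt-discounts and then handed to ngram-count during estimation; a sketch, assuming merged trigram counts in a hypothetical file counts.3grams (a complete model would precompute discounts for the lower orders the same way):

```shell
# Counts-of-counts for the merged trigram counts ...
get-gt-counts max=10 out=corpus counts.3grams

# ... converted into Good-Turing discount factors for order 3 ...
make-gt-discounts min=1 max=7 corpus.gt3counts > corpus.gt3discounts

# ... and read back by ngram-count via -gt3 during model estimation.
ngram-count -order 3 -read counts.3grams \
    -gt3 corpus.gt3discounts -lm corpus.lm.gz
```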
       make-big-lm constructs large N-gram models in a more
       memory-efficient way than ngram-count by itself.  It does so by
       precomputing the Good-Turing or Kneser-Ney smoothing parameters
       from the full set of counts, and then instructing ngram-count to
       store only a subset of the counts in memory, namely those of
       N-grams to be retained in the model.  The name parameter is used to
       name various auxiliary files.  counts contains the raw N-gram
       counts; it may be (and usually is) a compressed file.  Unlike with
       ngram-count, the -read option can be repeated to concatenate
       multiple count files, but the arguments must be regular files;
       reading from stdin is not supported.  If Good-Turing smoothing is
       used and the file contains complete lower-order counts
       corresponding to the sums of higher-order counts, then the
       -trust-totals option may be given for efficiency.  All other
       options are passed to ngram-count (only options affecting model
       estimation should be given).  Smoothing methods other than
       Good-Turing and modified Kneser-Ney are not supported by
       make-big-lm.  Kneser-Ney smoothing also requires enough disk space
       to compute and store the modified lower-order counts used by the KN
       method.  This is done using the merge-batch-counts command; the
       -max-per-file option controls how many counts are stored per batch,
       and should be chosen so that these batches fit in real memory.

       make-kn-counts computes the modified lower-order counts used by the
       KN smoothing method.
       It is invoked as a helper script by make-big-lm.

       replace-words-with-classes replaces expansions of word classes with
       the corresponding class labels.  classes specifies class expansions
       in classes-format(5).  Ambiguities are resolved in favor of the
       longest matching word strings.  Ties are broken in favor of the
       expansion listed first in classes.  Optionally, the file counts
       will receive the expansion counts resulting from the replacements.
       normalize=0 or 1 indicates whether the counts should be normalized
       to probabilities (default is 1).  The addone option may be used to
       smooth the expansion probabilities by adding K to each count
       (default 1).  The option have_counts=1 indicates that the input
       consists of N-gram counts and that replacement should be performed
       on them.  Note this will not merge counts that have been mapped to
       identical N-grams, since this is done automatically when
       ngram-count(1) reads count data.  The option partial=1 prevents
       multi-word class expansions from being replaced when more than one
       space character occurs in between the words.

       uniform-classes takes a file in classes-format(5) and adds uniform
       probabilities to expansions that don't have a probability
       explicitly stated.

       make-diacritic-map constructs a map file that pairs an ASCII-fied
       version of the words in vocab with all the occurring non-ASCII word
       forms.
       Such a map file can then be used with disambig(1) and a language
       model to reconstruct the non-ASCII word forms with diacritics from
       an ASCII text.

       vp2text is a reimplementation of the filter used in the DARPA Hub-3
       and Hub-4 CSR evaluations to convert ``verbalized punctuation''
       texts to language model training data.

       compute-oov-rate determines the out-of-vocabulary rate of a corpus
       from its unigram counts and a target vocabulary list in vocab.

SEE ALSO
       ngram-count(1), ngram(1), classes-format(5), disambig(1),
       select-vocab(1).

BUGS
       Some of the tools could be generalized and/or made more robust to
       misuse.

AUTHOR
       Andreas Stolcke.
       Copyright 1995-2006 SRI International

SRILM Tools         $Date: 2006/08/11 22:35:11 $         training-scripts(1)
