training-scripts(1)

NAME
       training-scripts, compute-oov-rate, continuous-ngram-count,
       get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm,
       make-diacritic-map, make-google-ngrams, make-gt-discounts,
       make-kn-counts, make-kn-discounts, merge-batch-counts,
       replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams,
       reverse-text, uniform-classes, vp2text - miscellaneous conveniences
       for language model training

SYNOPSIS
       get-gt-counts max=K out=name [counts ...]
       make-abs-discount gtcounts
       make-gt-discounts min=min max=max gtcounts
       make-kn-counts order=N max_per_file=M output=file [ no_max_order=1 ]
       make-kn-discounts min=min gtcounts
       make-batch-counts file-list [batch-size [filter [count-dir [options ...]]]]
       merge-batch-counts count-dir [file-list|start-iter]
       make-google-ngrams [ dir=DIR ] [ per_file=N ] [ gzip=0 ] [counts-file ...]
       continuous-ngram-count [ order=N ] [textfile ...]
       reverse-ngram-counts [counts-file ...]
       reverse-text [textfile ...]
       split-tagged-ngrams [ separator=S ] [counts-file ...]
       make-big-lm -name name -read counts [ -trust-totals -max-per-file M ]
              -lm new-model [options ...]
       replace-words-with-classes classes=classes [ outfile=counts
              normalize=0|1 addone=K have_counts=1 partial=1 ] [textfile ...]
       uniform-classes classes > new-classes
       make-diacritic-map vocab
       vp2text [textfile ...]
       compute-oov-rate vocab [counts ...]

DESCRIPTION
These scripts perform convenience tasks associated with the training of language models. They complement and extend the basic N-gram model estimator in ngram-count(1).

Since these tools are implemented as scripts, they do not automatically read and write compressed data files, unlike the main SRILM tools. However, since most of the scripts read from standard input or write to standard output (when the file argument is left out, or specified as ``-''), it is easy to combine them with gunzip(1) or gzip(1) on the command line. Also note that many of the scripts take their options in the gawk(1) syntax option=value instead of the more common -option value.

get-gt-counts computes the counts-of-counts statistics needed for Good-Turing smoothing. The frequencies of counts up to K are computed (default 10). The results are stored in a series of files with root name: name.gt1counts, name.gt2counts, ..., name.gtNcounts. It is assumed that the input counts have been properly merged, i.e., that they contain no duplicate N-grams.

make-gt-discounts takes one of the output files of get-gt-counts and computes the corresponding Good-Turing discounting factors. The output can then be passed to ngram-count(1) via the -gtn options to control smoothing during model estimation. Precomputing the GT discounts in this fashion has the advantage that the GT statistics are not affected by restricting N-grams to a limited vocabulary.
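For example, Good-Turing discounts for a trigram model might be precomputed as follows (the file names and the min/max cutoffs here are illustrative only):

    get-gt-counts max=10 out=corpus corpus.counts
    make-gt-discounts min=1 max=7 corpus.gt2counts > corpus.gt2
    make-gt-discounts min=1 max=7 corpus.gt3counts > corpus.gt3
    ngram-count -read corpus.counts -order 3 \
        -gt2 corpus.gt2 -gt3 corpus.gt3 -lm corpus.lm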
Also, get-gt-counts and make-gt-discounts can process arbitrarily large count files, since they do not need to read the counts into memory (unlike ngram-count).

make-abs-discount computes the absolute discounting constant needed for the ngram-count -cdiscountn options. Input is one of the files produced by get-gt-counts.

make-kn-discounts computes the discounting constants used by the modified Kneser-Ney smoothing method. Input is one of the files produced by get-gt-counts.

make-batch-counts performs the first stage in the construction of very large N-gram count files. file-list is a list of input text files; lines starting with a `#' character are ignored. The files are grouped into batches of size batch-size (default 10), each of which is then processed in one run of ngram-count. For maximum performance, batch-size should be as large as possible without triggering paging. Optionally, a filter script or program can be given to condition the input texts. The N-gram count files are left in directory count-dir (``counts'' by default), where they can be found by a subsequent run of merge-batch-counts. All following options are passed to ngram-count, e.g., to control N-gram order, vocabulary, etc. (no options triggering model estimation should be included).

merge-batch-counts completes the construction of large count files by merging the batched counts left in count-dir until a single count file is produced. Optionally, a file-list of count files to combine can be specified; otherwise, all count files in count-dir from a prior run of make-batch-counts are merged. A number as the second argument restarts the merging process at iteration start-iter.
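For example, very large trigram counts might be built like this (the file names are illustrative, and cat serves as a no-op filter so that the count directory and ngram-count options can be specified):

    ls corpus/*.txt > file.list
    make-batch-counts file.list 20 cat counts -order 3
    merge-batch-counts counts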
Restarting is convenient if merging fails to complete for some reason (e.g., for temporary lack of disk space).

make-google-ngrams takes a sorted count file as input and creates an indexed directory structure, in a format developed by Google to store very large N-gram collections. The resulting directory can then be used with the ngram-count(1) -read-google option. Optional arguments specify the output directory dir and the size N of individual N-gram files (default 10 million N-grams per file). The gzip=0 option writes plain, as opposed to compressed, files.

continuous-ngram-count generates N-grams that span line breaks (which are usually taken to be sentence boundaries). To count N-grams across line breaks, use

    continuous-ngram-count textfile | ngram-count -read -

The order=N argument controls the order of the N-grams counted (default 3), and should match the -order argument of ngram-count.

reverse-ngram-counts reverses the word order of N-grams in a counts file or stream. For example, to recompute lower-order counts from higher-order ones, but summing over preceding words (rather than following words, as ngram-count(1) does), use

    reverse-ngram-counts count-file | \
    ngram-count -read - -recompute -write - | \
    reverse-ngram-counts > new-counts

reverse-text reverses the word order in text files, line by line. Start- and end-of-sentence tags, if present, are preserved. This reversal is the appropriate preprocessing for training data of LMs that are meant to be used with the ngram -reverse option.

split-tagged-ngrams expands N-gram counts of word/tag pairs into mixed N-grams of words and tags. The optional separator=S argument changes the delimiting character, which defaults to "/".
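For example, a "backwards" trigram model might be trained with reverse-text and then evaluated as follows (file names are illustrative):

    reverse-text corpus.txt | ngram-count -order 3 -text - -lm backwards.lm
    ngram -lm backwards.lm -reverse -ppl test.txt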
make-big-lm constructs large N-gram models in a more memory-efficient way than ngram-count by itself. It does so by precomputing the Good-Turing or Kneser-Ney smoothing parameters from the full set of counts, and then instructing ngram-count to store only a subset of the counts in memory, namely those of the N-grams to be retained in the model. The name parameter is used to name various auxiliary files. counts contains the raw N-gram counts; it may be (and usually is) a compressed file. Unlike with ngram-count, the -read option may be repeated to concatenate multiple count files, but the arguments must be regular files; reading from stdin is not supported. If Good-Turing smoothing is used and the file contains complete lower-order counts corresponding to the sums of higher-order counts, the -trust-totals option may be given for efficiency. All other options are passed to ngram-count (only options affecting model estimation should be given). Smoothing methods other than Good-Turing and modified Kneser-Ney are not supported by make-big-lm.

Kneser-Ney smoothing also requires enough disk space to compute and store the modified lower-order counts used by the KN method. This is done using the merge-batch-counts command; the -max-per-file option controls how many counts are stored per batch, and should be chosen so that the batches fit in real memory.

make-kn-counts computes the modified lower-order counts used by the KN smoothing method. It is invoked as a helper script by make-big-lm.

replace-words-with-classes replaces expansions of word classes with the corresponding class labels. classes specifies the class expansions in classes-format(5). Ambiguities are resolved in favor of the longest matching word strings.
Ties are broken in favor of the expansion listed first in classes. Optionally, the file counts will receive the expansion counts resulting from the replacements. normalize=0 or 1 indicates whether the counts should be normalized to probabilities (default 1). The addone option may be used to smooth the expansion probabilities by adding K to each count (default 1). The option have_counts=1 indicates that the input consists of N-gram counts, and that replacement should be performed on them. Note that this will not merge counts that have been mapped to identical N-grams; such merging happens automatically when ngram-count(1) reads count data. The option partial=1 prevents multi-word class expansions from being replaced when more than one space character occurs in between the words.

uniform-classes takes a file in classes-format(5) and adds uniform probabilities to expansions that don't have a probability explicitly stated.

make-diacritic-map constructs a map file that pairs an ASCII-fied version of the words in vocab with all the occurring non-ASCII word forms. Such a map file can then be used with disambig(1) and a language model to reconstruct the non-ASCII word forms with diacritics from ASCII text.

vp2text is a reimplementation of the filter used in the DARPA Hub-3 and Hub-4 CSR evaluations to convert ``verbalized punctuation'' texts to language model training data.

compute-oov-rate determines the out-of-vocabulary rate of a corpus from its unigram counts and a target vocabulary list in vocab.

SEE ALSO
       ngram-count(1), ngram(1), classes-format(5), disambig(1),
       select-vocab(1).

BUGS
       Some of the tools could be generalized and/or made more robust to
       misuse.

AUTHOR
       Andreas Stolcke.

       Copyright 1995-2006 SRI International

SRILM Tools                                    $Date: 2006/08/11 22:35:11 $