
training-scripts(1)

NAME
training-scripts, compute-oov-rate, continuous-ngram-count, get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm, make-diacritic-map, make-google-ngrams, make-gt-discounts, make-kn-counts, make-kn-discounts, merge-batch-counts, replace-words-with-classes, reverse-ngram-counts, split-tagged-ngrams, reverse-text, uniform-classes, vp2text - miscellaneous conveniences for language model training

SYNOPSIS
    get-gt-counts max=K out=name [counts...]
    make-abs-discount gtcounts
    make-gt-discounts min=min max=max gtcounts
    make-kn-counts order=N max_per_file=M output=file [no_max_order=1]
    make-kn-discounts min=min gtcounts
    make-batch-counts file-list [batch-size [filter [count-dir [options...]]]]
    merge-batch-counts count-dir [file-list|start-iter]
    make-google-ngrams [dir=DIR] [per_file=N] [gzip=0] [counts-file...]
    continuous-ngram-count [order=N] [textfile...]
    reverse-ngram-counts [counts-file...]
    reverse-text [textfile...]
    split-tagged-ngrams [separator=S] [counts-file...]
    make-big-lm -name name -read counts [-trust-totals] [-max-per-file M] -lm new-model [options...]
    replace-words-with-classes classes=classes [outfile=counts normalize=0|1 addone=K have_counts=1 partial=1] [textfile...]
    uniform-classes classes > new-classes
    make-diacritic-map vocab
    vp2text [textfile...]
    compute-oov-rate vocab [counts...]

DESCRIPTION
These scripts perform convenience tasks associated with the training of language models. They complement and extend the basic N-gram model estimator in ngram-count(1).

Since these tools are implemented as scripts, they do not automatically read or write compressed data files correctly, unlike the main SRILM tools. However, since most of the scripts work with data on standard input or standard output (by leaving out the file argument, or by specifying it as ``-''), it is easy to combine them with gunzip(1) or gzip(1) on the command line.

Also note that many of the scripts take their options in the gawk(1) syntax option=value, rather than in the more common -option value form.

get-gt-counts computes the counts-of-counts statistics needed in Good-Turing smoothing. The frequencies of counts up to K are computed (the default is 10). The results are stored in a series of files with root name: name.gt1counts, name.gt2counts, ..., name.gtNcounts. It is assumed that the input counts have been properly merged, i.e., that there are no duplicated N-grams.

make-gt-discounts takes one of the output files of get-gt-counts and computes the corresponding Good-Turing discounting factors. The output can then be passed to ngram-count(1) via the -gtn options to control smoothing during model estimation. Precomputing the GT discounts in this fashion has the advantage that the GT statistics are not affected by restricting the N-grams to a limited vocabulary. Also, get-gt-counts and make-gt-discounts can process arbitrarily large count files, since they do not need to read the counts into memory (unlike ngram-count).
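For example, Good-Turing discounts for the trigram level might be precomputed as follows (a minimal sketch; counts.gz, big, and big.lm are placeholder names, and similar files would normally be supplied for the lower orders via -gt1 and -gt2):

    gunzip -c counts.gz | get-gt-counts max=10 out=big
    make-gt-discounts min=1 max=7 big.gt3counts > big.gt3disc
    gunzip -c counts.gz | \
        ngram-count -read - -order 3 -gt3 big.gt3disc -lm big.lm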
make-abs-discount computes the absolute discounting constant needed for the ngram-count -cdiscountn options. Its input is one of the files produced by get-gt-counts.

make-kn-discounts computes the discounting constants used by the modified Kneser-Ney smoothing method. Its input is one of the files produced by get-gt-counts.

make-batch-counts performs the first stage in the construction of very large N-gram count files. file-list is a list of input text files; lines starting with a `#' character are ignored. These files are grouped into batches of size batch-size (default 10), each of which is then processed in one run of ngram-count. For maximum performance, batch-size should be as large as possible without triggering paging. Optionally, a filter script or program can be given to condition the input texts. The N-gram count files are left in the directory count-dir (``counts'' by default), where they can be found by a subsequent run of merge-batch-counts. All following options are passed to ngram-count, e.g., to control the N-gram order, vocabulary, etc. (no options triggering model estimation should be included).

merge-batch-counts completes the construction of large count files by merging the batched counts left in count-dir until a single count file is produced. Optionally, a file-list of count files to combine can be specified; otherwise, all count files in count-dir from a prior run of make-batch-counts are merged. A number given as the second argument restarts the merging process at iteration start-iter, which is convenient if merging fails to complete for some reason (e.g., a temporary lack of disk space).
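For instance, merged 4-gram counts for a large corpus might be built in two stages (a sketch; corpus/*.txt, file-list, and the counts directory are placeholder names, and cat is used as a pass-through filter):

    ls corpus/*.txt > file-list
    make-batch-counts file-list 10 cat counts -order 4
    merge-batch-counts counts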
make-google-ngrams takes a sorted count file as input and creates an indexed directory structure, in a format developed by Google to store very large N-gram collections. The resulting directory can then be used with the ngram-count(1) -read-google option. Optional arguments specify the output directory dir and the size N of the individual N-gram files (the default is 10 million N-grams per file). The gzip=0 option writes plain, as opposed to compressed, files.

continuous-ngram-count generates N-grams that span line breaks (which are usually taken to be sentence boundaries). To count N-grams across line breaks, use

    continuous-ngram-count textfile | ngram-count -read -

The argument N controls the order of the N-grams counted (default 3) and should match the -order argument given to ngram-count.

reverse-ngram-counts reverses the word order of N-grams in a counts file or stream. For example, to recompute lower-order counts from higher-order ones, but with the summation over preceding words (rather than following words, as in ngram-count(1)), use

    reverse-ngram-counts count-file | \
    ngram-count -read - -recompute -write - | \
    reverse-ngram-counts > new-counts

reverse-text reverses the word order in text files, line by line. Start- and end-of-sentence tags, if present, are preserved. This reversal is appropriate for preprocessing training data for LMs that are meant to be used with the ngram -reverse option.

split-tagged-ngrams expands N-gram counts of word/tag pairs into mixed N-grams of words and tags. The optional separator=S argument changes the delimiting character, which defaults to "/".

make-big-lm constructs large N-gram models in a more memory-efficient way than ngram-count by itself. It does so by precomputing the Good-Turing or Kneser-Ney smoothing parameters from the full set of counts, and then instructing ngram-count to store only a subset of the counts in memory, namely those of the N-grams to be retained in the model. The name parameter is used to name various auxiliary files. counts contains the raw N-gram counts; it may be (and usually is) a compressed file. Unlike with ngram-count, the -read option can be repeated to concatenate multiple count files, but the arguments must be regular files; reading from stdin is not supported. If Good-Turing smoothing is used and the file contains complete lower-order counts corresponding to the sums of the higher-order counts, the -trust-totals option may be given for efficiency. All other options are passed to ngram-count (only options affecting model estimation should be given). Smoothing methods other than Good-Turing and modified Kneser-Ney are not supported by make-big-lm. Kneser-Ney smoothing also requires enough disk space to compute and store the modified lower-order counts used by the KN method. This is done using the merge-batch-counts command; the -max-per-file option controls how many counts are stored per batch and should be chosen so that each batch fits in real memory.

make-kn-counts computes the modified lower-order counts used by the KN smoothing method. It is invoked as a helper script by make-big-lm.
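A typical invocation might look as follows (a sketch; counts.gz, biglm, and big.lm are placeholder names, and -order, -kndiscount, and -interpolate are examples of estimation options passed through to ngram-count):

    make-big-lm -name biglm -read counts.gz -lm big.lm \
        -order 3 -kndiscount -interpolate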
replace-words-with-classes replaces expansions of word classes with the corresponding class labels. classes specifies the class expansions in classes-format(5). Ambiguities are resolved in favor of the longest matching word strings, and ties are broken in favor of the expansion listed first in classes. Optionally, the file counts will receive the expansion counts resulting from the replacements. normalize=0 or 1 indicates whether the counts should be normalized to probabilities (default is 1). The addone option may be used to smooth the expansion probabilities by adding K to each count (default 1). The option have_counts=1 indicates that the input consists of N-gram counts and that the replacement should be performed on them. Note that this will not merge counts that have been mapped to identical N-grams; that merging happens automatically when ngram-count(1) reads count data. The option partial=1 prevents multi-word class expansions from being replaced when more than one space character occurs between the words.

uniform-classes takes a file in classes-format(5) and adds uniform probabilities to expansions that don't have a probability explicitly stated.

make-diacritic-map constructs a map file that pairs an ASCII-fied version of the words in vocab with all the occurring non-ASCII word forms. Such a map file can then be used with disambig(1) and a language model to reconstruct the non-ASCII word forms, with diacritics, from ASCII text.

vp2text is a reimplementation of the filter used in the DARPA Hub-3 and Hub-4 CSR evaluations to convert ``verbalized punctuation'' texts to language model training data.

compute-oov-rate determines the out-of-vocabulary rate of a corpus from its unigram counts and a target vocabulary list in vocab.
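For example, the OOV rate of a test set might be computed as follows (a sketch; test.txt, vocab.txt, and test.1cnt are placeholder names):

    ngram-count -text test.txt -order 1 -write test.1cnt
    compute-oov-rate vocab.txt test.1cnt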

SEE ALSO
ngram-count(1), ngram(1), classes-format(5), disambig(1), select-vocab(1).

BUGS
Some of the tools could be generalized and/or made more robust to misuse.

AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.

Copyright 1995-2006 SRI International