
ngram-class

NAME
       ngram-class - induce word classes from N-gram statistics

SYNOPSIS
       ngram-class [-help] option ...

DESCRIPTION
       ngram-class induces word classes from distributional statistics, so as
       to minimize the perplexity of a class-based N-gram model given the
       provided word N-gram counts.  Presently, only bigram statistics are
       used, i.e., the induced classes are best suited for a class-bigram
       language model.

       The program generates the class N-gram counts and class expansions
       needed by ngram-count(1) and ngram(1), respectively, to train and to
       apply the class N-gram model.

OPTIONS
       Each filename argument can be an ASCII file, a compressed file (name
       ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help
              Print option summary.

       -version
              Print version information.

       -debug level
              Set debugging output at level.  Level 0 means no debugging.
              Debugging messages are written to stderr.  A useful level to
              trace the formation of classes is 2.

   Input Options
       -vocab file
              Read a vocabulary from file.  Subsequently, out-of-vocabulary
              words in both counts and text are replaced with the
              unknown-word token.  If this option is not specified, all words
              found are implicitly added to the vocabulary.

       -tolower
              Map the vocabulary to lowercase.

       -counts file
              Read N-gram counts from a file.  Each line contains an N-gram
              of words, followed by an integer count, all separated by
              whitespace.  Repeated counts for the same N-gram are added.
              Counts collected by -text and -counts are additive as well.
              Note that the input should contain consistent lower- and
              higher-order counts (i.e., unigrams and bigrams), as would be
              generated by ngram-count(1).

       -text textfile
              Generate N-gram counts from a text file.
              textfile should contain one sentence unit per line.  Begin/end
              sentence tokens are added if not already present.  Empty lines
              are ignored.

   Class Merging
       -numclasses C
              The target number of classes to induce.  A zero argument
              suppresses automatic class merging altogether (e.g., for use
              with -interact).

       -full
              Perform full greedy merging over all classes, starting with one
              class per word.  This is the O(V^3) algorithm described in
              Brown et al. (1992).

       -incremental
              Perform incremental greedy merging, starting with one class
              each for the C most frequent words, and then adding one word at
              a time.  This is the O(V*C^2) algorithm described in Brown et
              al. (1992); it is the default.

       -interact
              Enter a primitive interactive interface when done with
              automatic class induction, allowing manual specification of
              additional merging steps.

       -noclass-vocab file
              Read a list of vocabulary items from file that are to be
              excluded from classes.  These words or tags do not undergo
              class merging, but their N-gram counts still affect the
              optimization of model perplexity.  The default is to exclude
              the sentence begin/end tags (<s> and </s>) from class merging;
              this can be suppressed by specifying -noclass-vocab /dev/null.

   Output Options
       -class-counts file
              Write class N-gram counts to file when done.  The format is the
              same as for word N-gram counts, and can be read by
              ngram-count(1) to estimate a class-N-gram model.

       -classes file
              Write class definitions (member words and their probabilities)
              to file when done.  The output format is the same as required
              by the -classes option of ngram(1).

       -save S
              Save the class counts and/or class definitions every S
              iterations during induction.
              The filenames are obtained from the -class-counts and -classes
              options, respectively, by appending the iteration number.  This
              is convenient for producing sets of classes at different
              granularities during the same run.  S=0 (the default)
              suppresses the saving actions.

SEE ALSO
       ngram-count(1), ngram(1).

       P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L.
       Mercer, ``Class-Based n-gram Models of Natural Language,''
       Computational Linguistics 18(4), 467-479, 1992.

BUGS
       Classes are optimized only for bigram models at present.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.

       Copyright 1999-2004 SRI International
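EXAMPLE
       A sketch of a typical end-to-end run, combining ngram-class with the
       ngram-count(1) and ngram(1) invocations described above.  The
       filenames (corpus.txt, vocab.txt, model.classes, class.bo, test.txt)
       and the class count of 100 are illustrative assumptions, not fixed
       names; SRILM must be installed and on PATH.

```shell
# Filenames below are hypothetical placeholders.

# 1. Collect word bigram counts from a training corpus
#    (consistent unigram and bigram counts, as -counts expects).
ngram-count -order 2 -text corpus.txt -write word.counts

# 2. Induce 100 word classes from those counts, writing both the
#    class definitions and the class N-gram counts.
ngram-class -vocab vocab.txt -counts word.counts \
    -numclasses 100 \
    -classes model.classes -class-counts class.counts

# 3. Estimate a class-bigram language model from the class counts.
ngram-count -order 2 -read class.counts -lm class.bo

# 4. Apply the class LM, expanding classes back to words.
ngram -order 2 -lm class.bo -classes model.classes -ppl test.txt
```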
