--- autoclass-3.3.6.dfsg.1.orig/debian/README.Debian +++ autoclass-3.3.6.dfsg.1/debian/README.Debian @@ -0,0 +1,57 @@ +autoclass for Debian +---------------------- + +AutoClass and multimix are both clustering programs which have been +packaged for Debian. Here is a comparison written by the Multimix +authors: + +"AutoClass is a Bayesian clustering program developed by Peter +Cheeseman and colleagues at NASA Ames Research Center. The models +fitted by AutoClass are very similar to those fitted by Multimix, +although both programs were developed independently. Two obvious +differences are + + "1. AutoClass has automated the process of model selection as well as + that of parameter estimation but Multimix leaves model-specification + to the user; + + "2. AutoClass uses Maximum Posterior estimation in place of Maximum + Likelihood estimation. + +"In fact the first is the more crucial difference, because the EM +algorithm at the basis of both programs accommodates both ML and MAP +estimation. AutoClass compares different models by calculating an +approximation to the marginal density of the observed data after the +model parameters have been integrated out. In usual EM language the +approximation used is analogous to taking observed data likelihood to +be proportional to complete data likelihood with the constant of +proportionality to be evaluated at the maximum likelihood estimates. + +"The models currently available in AutoClass for attributes within a +component are as follows. Categorical attributes are modelled by +general discrete distributions (multi-category Bernoulli) as in +Multimix. Continuous attributes may be taken to have uniform or +normal distributions, possibly after transformation. Poisson +distributions are available for count attributes. Cheeseman and Stutz +report that von Mises-Fisher distributions for circular and spherical +attributes are under development. 
At present it appears that +AutoClass does not offer facilities for modelling within cluster +dependencies, that is, all models assume within-cluster independence of +attributes. Missing values are treated as a special kind of value in +some attribute models, but there has been no implementation of the +Little and Rubin methodology for data missing at random." + +For more details, including references and comparison with Snob and +Mclust, see these articles: + "Mixture Model Clustering with the Multimix Program" by Jorgensen +and Hunt, in /usr/share/doc/PPAPER.ps.gz. + "Mixture Model Clustering using the Multimix Program" by Hunt and +Jorgensen, in /usr/share/doc/paper.ps. + +For an example problem solved by both programs, see + + /usr/share/doc/multimix/examples/simple.* + /usr/share/doc/autoclass/examples/simple.* + + -- James R. Van Zandt , Sun Dec 9 15:19:50 EST 2001 + --- autoclass-3.3.6.dfsg.1.orig/debian/README.sample +++ autoclass-3.3.6.dfsg.1/debian/README.sample @@ -0,0 +1,6 @@ +# to run the example, do this: +gunzip *.gz + +autoclass -search imports-85c.db2 imports-85c.hd2 imports-85c.model imports-85c.s-params + +autoclass -reports imports-85c.results-bin imports-85c.search imports-85c.r-params --- autoclass-3.3.6.dfsg.1.orig/debian/autoclass.1 +++ autoclass-3.3.6.dfsg.1/debian/autoclass.1 @@ -0,0 +1,1588 @@ +.\" -*-nroff-*- +.TH AUTOCLASS 1 "December 9, 2001" +.SH NAME +autoclass \- automatically discover classes in data +.SH SYNOPSIS +.ad l +.B autoclass "-search " +.I data_file header_file model_file s_param_file +.br +.B autoclass "-report " +.I results_file search_file r_params_file +.\" .br +.\" .B autoclass "-predict " +.\" .I data_file +.br +.B autoclass "-predict " +.I results_file search_file results_file +.ad b +.br +.SH "DESCRIPTION" +\fBAutoClass\fP solves the problem of automatic discovery of classes in data +(sometimes called clustering, or unsupervised learning), as distinct +from the generation of class descriptions from labeled examples
+(called supervised learning). It aims to discover the "natural" +classes in the data. \fBAutoClass\fP is applicable to observations of +things that can be described by a set of attributes, without referring +to other things. The data values corresponding to each attribute are +limited to be either numbers or the elements of a fixed set of +symbols. With numeric data, a measurement error must be provided. +.PP +\fBAutoClass\fP is looking for the best classification(s) of the data it can find. +A classification is composed of: +.IP 1) +A set of classes, each of which is described by a set of class +parameters, which specify how the class is distributed along the +various attributes. For example, "height normally distributed with +mean 4.67 ft and standard deviation .32 ft", +.IP 2) +A set of class weights, describing what percentage of cases are +likely to be in each class. +.IP 3) +A probabilistic assignment of cases in the data to these classes. +I.e. for each case, the relative probability that it is a member of +each class. +.PP +As a strictly Bayesian system (accept no substitutes!), the quality +measure \fBAutoClass\fP uses is the total probability that, had you known +nothing about your data or its domain, you would have found this set of +data generated by this underlying model. This includes the prior +probability that the "world" would have chosen this number of classes, +this set of relative class weights, and this set of parameters for each +class, and the likelihood that such a set of classes would have generated +this set of values for the attributes in the data cases. +.PP +These probabilities are typically very small, in the range of e^-30000, +and so are usually expressed in exponential notation. +.PP +When run with the \fB-search\fP command, \fBAutoClass\fP searches for +a classification. 
The required arguments are the paths to the four +input files, which supply the data, the data format, the desired +classification model, and the search parameters, respectively. +.PP +By default, \fBAutoClass\fP writes intermediate results in a binary file. +With the \fB-report\fP command, \fBAutoClass\fP generates an ASCII +report. The arguments are the full path names of the .results, .search, and .r-params files. +.PP +When run with the \fB-predict\fP command, \fBAutoClass\fP predicts the +class membership of a "test" data set based on classes found in a +"training" data set (see "PREDICTIONS" below). +.SH "INPUT FILES" +An AutoClass data set resides in two files. There is a header file +(file type "hd2") that describes the specific data format and attribute +definitions. The actual data values are in a data file (file type "db2"). +We use two files to allow editing of data descriptions without having to +deal with the entire data set. This makes it easy to experiment with +different descriptions of the database without having to reproduce the data +set. Internally, an AutoClass database structure is identified by its +header and data files, and the number of data loaded. +.PP +For more detailed information on the formats of these files, see +\fI/usr/share/doc/autoclass/preparation-c.text\fP. +.SS "DATA FILE" +The data file contains a sequence of data objects (datum or case) +terminated by the end of the file. The number of values for each data +object must be equal to the number of attributes defined in the header +file. Data objects must be groups of tokens delimited by "new-line". +Attributes are typed as REAL, DISCRETE, or DUMMY. Real attribute +values are numbers, either integer or floating point. Discrete +attribute values can be strings, symbols, or integers. A dummy +attribute value can be any of these types. Dummies are read in but +otherwise ignored -- they will be set to zeros in the internal +database.
Thus the actual values will not be available for use in +report output. To have these attribute values available, use either +type REAL or type DISCRETE, and define their model type as IGNORE in +the .model file. Missing values for any attribute type may be +represented by either "?", or another token specified in the header +file. All are translated to a special unique value after being read, +so this symbol is effectively reserved for unknown/missing values. +.PP +For example: +.nf + white 38.991306 0.54248405 2 2 1 + red 25.254923 0.5010235 9 2 1 + yellow 32.407973 ? 8 2 1 + all_white 28.953982 0.5267696 0 1 1 +.fi +.SS "HEADER FILE" +The header file specifies the data file format, and the definitions of +the data attributes. The header file functional specification +consists of two parts -- the data set format definition +specifications, and the attribute descriptors. ";" in column 1 +identifies a comment. +.PP +A header file follows this general format: +.nf + + ;; num_db2_format_defs value (number of format def lines + ;; that follow), range of n is 1 -> 5 + num_db2_format_defs n + ;; number_of_attributes token and value required + number_of_attributes + ;; following are optional - default values are specified + separator_char ' ' + comment_char ';' + unknown_token '?' + separator_char ',' + + ;; attribute descriptors + ;; + ;; + +.fi +Each attribute descriptor is a line of: +.nf + + Attribute index (zero based, beginning in column 1) + Attribute type. See below. + Attribute subtype. See below + Attribute description: symbol (no embedded blanks) or + string; <= 40 characters + Specific property and value pairs. + Currently available combinations: + + type subtype property type(s) + ---- -------- --------------- + dummy none/nil -- + discrete nominal range + real location error + real scalar zero_point rel_error + +.fi +The ERROR property should represent your best estimate of the +average error expected in the measurement and recording of that real +attribute.
Lacking better information, the error can be taken as 1/2 +the minimum possible difference between measured values. It can be +argued that real values are often truncated, so that smaller errors +may be justified, particularly for generated data. But AutoClass only +sees the recorded values. So it needs the error in the recorded +values, rather than the actual measurement error. Setting this error +much smaller than the minimum expressible difference implies the +possibility of values that cannot be expressed in the data. Worse, it +implies that two identical values must represent measurements that +were much closer than they might actually have been. This leads to +over-fitting of the classification. + +The REL_ERROR property is used for SCALAR reals when the error is +proportional to the measured value. The ERROR property is not +supported. + +AutoClass uses the error as a lower bound on the width of the +normal distribution. So small error estimates tend to give narrower +peaks and to increase both the number of classes and the +classification probability. Broad error estimates tend to limit the +number of classes. + +The scalar ZERO_POINT property is the smallest value that the +measurement process could have produced. This is often 0.0, or less +by some error range. Similarly, the bounded real's min and max +properties are exclusive bounds on the attributes generating process. +For a calculated percentage these would be 0-e and 100+e, where e is +an error value. The discrete attribute's range is the number of +possible values the attribute can take on. This range must include +unknown as a value when such values occur. + +Header File Example: +.nf + +!#; AutoClass C header file -- extension .hd2 +!#; the following chars in column 1 make the line a comment: +!#; '!', '#', ';', ' ', and '\\n' (empty line) + +;#! 
num_db2_format_defs +num_db2_format_defs 2 +;; required +number_of_attributes 7 +;; optional - default values are specified +;; separator_char ' ' +;; comment_char ';' +;; unknown_token '?' +separator_char ',' + +;; + +0 dummy nil "True class, range = 1 - 3" +1 real location "X location, m. in range of 25.0 - 40.0" error .25 +2 real location "Y location, m. in range of 0.5 - 0.7" error .05 +3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 +rel_error .001 +4 discrete nominal "Truth value, range = 1 - 2" range 2 +5 discrete nominal "Color of foobar, 10 values" range 10 +6 discrete nominal Spectral_color_group range 6 +.fi +.SS "MODEL FILE" +A classification of a data set is made with respect to a model which +specifies the form of the probability distribution function for classes in that +data set. Normally the model structure is defined in a model file (file +type "model"), containing one or more models. Internally, a model is defined +relative to a particular database. Thus it is identified by the corresponding +database, the model's model file and its sequential position in the +file. +.PP +Each model is specified by one or more model group definition lines. +Each model group line associates attribute indices with a +model term type. +.PP +Here is an example model file: +.nf + +# AutoClass C model file -- extension .model +model_index 0 7 +ignore 0 +single_normal_cn 3 +single_normal_cn 17 18 21 +multi_normal_cn 1 2 +multi_normal_cn 8 9 10 +multi_normal_cn 11 12 13 +single_multinomial default +.fi +.PP +Here, the first line is a comment. The following characters in column +1 make the line a comment: `!', `#', ` ', `;', and `\\n' (empty line). +.PP +The tokens "model_index \fIn m\fP" must appear on the first non-comment +line, and precede the model term definition lines. \fIn\fP is the +zero-based model index, typically 0 where there is only one model -- +the majority of search situations. 
\fIm\fP is the number of model term +definition lines that follow. + +The last seven lines are model group lines. Each model group line +consists of: +.br +.ad l +.nh +.HP 4 +A model term type (one of +.BR single_multinomial , +.BR single_normal_cm , +.BR single_normal_cn , +.BR multi_normal_cn ", or" +.BR ignore ). +.HP 4 +A list of attribute indices (the attribute set list), or the symbol +\fBdefault\fP. Attribute indices are zero-based. Single model terms +may have one or more attribute indices on each line, while multi model +terms require two or more attribute indices per line. An attribute +index must not appear more than once in a model list. +.ad b +.hy +.PP +Notes: +.IP 1) +At least one model definition is required (model_index token). +.IP 2) +There may be multiple entries in a model for any model term type. +.IP 3) +Model term types currently consist of: +.RS +.TP +.B single_multinomial +models discrete attributes as multinomials, with missing values. +.TP +.B single_normal_cn +models real valued attributes as normals; no missing values. +.TP +.B single_normal_cm +models real valued attributes with missing values. +.TP +.B multi_normal_cn +is a covariant normal model without missing values. +.TP +.B ignore +allows the model to ignore one or more attributes. +\fBignore\fP is not a valid default model term type. +.PP +See the documentation in models-c.text for further information about +specific model terms. +.RE +.IP 4) +\fBSingle_normal_cn\fP, \fBsingle_normal_cm\fP, and +\fBmulti_normal_cn\fP modeled data, whose subtype is \fBscalar\fP +(value distribution is away from 0.0, and is thus not a "normal" +distribution) will be log transformed and modeled with the log-normal +model. For data whose subtype is \fBlocation\fP (value distribution +is around 0.0), no transform is done, and the normal model is used. +.SH SEARCHING +AutoClass, when invoked in the "search" mode will check the validity +of the set of data, header, model, and search parameter files. 
Errors +will stop the search from starting, and warnings will ask the user +whether to continue. A history of the error and warning messages is +saved, by default, in the log file. +.PP +Once you have succeeded in describing your data with a header file +and model file that passes the AUTOCLASS -SEARCH <...> input checks, +you will have entered the search domain where \fBAutoClass\fP classifies +your data. (At last!) +.PP +The main function to use in finding a good classification of your data +is AUTOCLASS -SEARCH, and using it will take most of the computation +time. Searches are invoked with: +.nf + +autoclass -search <.db2 file path> <.hd2 file path> + <.model file path> <.s-params file path> + +.fi +All files must be specified as fully qualified relative or absolute +pathnames. File name extensions (file types) for all files are forced +to canonical values required by the AutoClass program: +.nf + + data file ("ascii") db2 + data file ("binary") db2-bin + header file hd2 + model file model + search params file s-params +.fi +.PP +The sample-run (\fI/usr/share/doc/autoclass/examples/\fP) that comes +with \fBAutoClass\fP shows some sample searches, and browsing these is +probably the fastest way to get familiar with how to do searches. The +test data sets located under \fI/usr/share/doc/autoclass/examples/\fP +will show you some other header (.hd2), model (.model), and search +params (.s-params) file setups. The remainder of this section +describes how to do searches in somewhat more detail. +.PP +The \fBbold faced\fP tokens below are generally search params file +parameters. For more information on the s-params file, see \fBSEARCH +PARAMETERS\fP below, or +\fI/usr/share/doc/autoclass/search-c.text.gz\fP. +.SS "WHAT RESULTS ARE" +\fBAutoClass\fP is looking for the best classification(s) of the data +it can find. 
A classification is composed of: +.IP 1) +a set of classes, each of which is described by a set of class +parameters, which specify how the class is distributed along the +various attributes. For example, "height normally distributed with +mean 4.67 ft and standard deviation .32 ft", +.IP 2) +a set of class weights, describing what percentage of cases are +likely to be in each class. +.IP 3) +a probabilistic assignment of cases in the data to these classes. +I.e. for each case, the relative probability that it is a member of +each class. +.PP +As a strictly Bayesian system (accept no substitutes!), the quality +measure \fBAutoClass\fP uses is the total probability that, had you known +nothing about your data or its domain, you would have found this set of +data generated by this underlying model. This includes the prior +probability that the "world" would have chosen this number of classes, +this set of relative class weights, and this set of parameters for each +class, and the likelihood that such a set of classes would have generated +this set of values for the attributes in the data cases. +.PP +These probabilities are typically very small, in the range of e^-30000, +and so are usually expressed in exponential notation. +.SS "WHAT RESULTS MEAN" +It is important to remember that all of these probabilities are GIVEN +that the real model is in the model family that \fBAutoClass\fP has +restricted its attention to. If \fBAutoClass\fP is looking for +Gaussian classes and the real classes are Poisson, then the fact that +\fBAutoClass\fP found 5 Gaussian classes may not say much about how +many Poisson classes there really are. +.PP +The relative probability between different classifications found can +be very large, like e^1000, so the very best classification found is +usually overwhelmingly more probable than the rest (and overwhelmingly +less probable than any better classifications as yet undiscovered). 
+If \fBAutoClass\fP should manage to find two classifications that are +within about exp(5-10) of each other (i.e. within 100 to 10,000 times +more probable) then you should consider them to be about equally +probable, as our computation is usually not more accurate than this +(and sometimes much less). +.SS "HOW IT WORKS" +\fBAutoClass\fP repeatedly creates a random classification and then +tries to massage this into a high probability classification through +local changes, until it converges to some "local maximum". It then +remembers what it found and starts over again, continuing until you +tell it to stop. Each effort is called a "try", and the computed +probability is intended to cover the whole volume in parameter space +around this maximum, rather than just the peak. +.PP +The standard approach to massaging is to +.IP 1) +Compute the probabilistic class memberships of cases using the class +parameters and the implied relative likelihoods. +.IP 2) +Using the new class members, compute class statistics (like mean) +and revise the class parameters. +.PP +and repeat till they stop changing. There are three available +convergence algorithms: "converge_search_3" (the default), +"converge_search_4" and "converge". Their specification is controlled +by search params file parameter \fBtry_fn_type\fP. +.SS "WHEN TO STOP" +You can tell AUTOCLASS -SEARCH to stop by: 1) giving a +\fBmax_duration\fP (in seconds) argument at the beginning; 2) giving a +\fBmax_n_tries\fP (an integer) argument at the beginning; or 3) by +typing a "q" and <return> after you have seen enough tries. The +\fBmax_duration\fP and \fBmax_n_tries\fP arguments are useful if you +desire to run AUTOCLASS -SEARCH in batch mode. If you are restarting +AUTOCLASS -SEARCH from a previous search, the value of +\fBmax_n_tries\fP you provide, for instance 3, will tell the program +to compute 3 more tries in addition to however many it has already +done. The same incremental behavior is exhibited by +\fBmax_duration\fP.
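The two-step cycle described under HOW IT WORKS can be sketched in Python for a single real attribute. This is only an illustrative maximum-likelihood sketch under stated assumptions, not the AutoClass implementation: it omits the Bayesian priors and the marginal-likelihood scoring, and the names em_try, min_sd and tol are hypothetical. The min_sd floor loosely mirrors the way the ERROR property acts as a lower bound on the width of the normal distribution.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_try(data, means, sds, weights, min_sd=0.05, max_cycles=200, tol=1e-6):
    """One illustrative 'try': alternate the two steps until the log
    likelihood changes by less than tol, or max_cycles is reached.
    The lists means, sds and weights are updated in place."""
    previous = None
    for cycle in range(1, max_cycles + 1):
        # Step 1: probabilistic class memberships of each case.
        memberships = []
        log_lik = 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sds)]
            total = sum(joint)
            log_lik += math.log(total)
            memberships.append([j / total for j in joint])
        # Step 2: class statistics from the memberships, then revised
        # class weights and parameters.
        for k in range(len(means)):
            nk = sum(row[k] for row in memberships)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(memberships, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(memberships, data)) / nk
            sds[k] = max(math.sqrt(var), min_sd)  # floor on class width
        if previous is not None and abs(log_lik - previous) < tol:
            break
        previous = log_lik
    return means, sds, weights, cycle
```

A call such as em_try([1.0, 1.2, 0.8, 5.0, 5.1, 4.9], [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]) starts from two guessed classes and returns the revised parameters together with the number of cycles used.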
+.PP +Deciding when to stop is a judgment call and it's up to you. Since the +search includes a random component, there's always the chance that if +you let it keep going it will find something better. So you need to +trade off how much better it might be with how long it might take to +find it. The search status reports that are printed when a new best +classification is found are intended to provide you information to help +you make this tradeoff. +.PP +One clear sign that you should probably stop is if most of the +classifications found are duplicates of previous ones (flagged by +"dup" as they are found). This should only happen for very small sets +of data or when fixing a very small number of classes, like two. +.PP +Our experience is that for moderately large to extremely large data +sets (~200 to ~10,000 datum), it is necessary to run \fBAutoClass\fP +for at least 50 trials. +.SS "WHAT GETS RETURNED" +Just before returning, AUTOCLASS -SEARCH will give short descriptions +of the best classifications found. How many will be described can be +controlled with \fBn_final_summary\fP. +.PP +By default AUTOCLASS -SEARCH will write out a number of files, both at +the end and periodically during the search (in case your system +crashes before it finishes). These files will all have the same name +(taken from the search params pathname [.s-params]), and differ +only in their file extensions. If your search runs are very long and +there is a possibility that your machine may crash, you can have +intermediate "results" files written out. These can be used to +restart your search run with minimum loss of search effort. See the +documentation file \fI/usr/share/doc/autoclass/checkpoint-c.text\fP. +.PP +A ".log" file will hold a listing of most of what was printed to the +screen during the run, unless you set \fBlog_file_p\fP to false to say +you want no such foolishness. 
Unless \fBresults_file_p\fP is false, a +binary ".results-bin" file (the default) or an ASCII ".results" text +file, will hold the best classifications that were returned, and +unless \fBsearch_file_p\fP is false, a ".search" file will hold the +record of the search tries. \fBsave_compact_p\fP controls whether the +"results" files are saved as binary or ASCII text. +.PP +If the C global variable "G_safe_file_writing_p" is defined as TRUE in +"autoclass-c/prog/globals.c", the names of "results" files (those that +contain the saved classifications) are modified internally to account +for redundant file writing. If the search params file name is +"my_saved_clsfs" you will see the following "results" file names +(ignoring directories and pathnames for this example) +.sp +.nf + \fBsave_compact_p\fP = true -- + "my_saved_clsfs.results-bin" - completely written file + "my_saved_clsfs.results-tmp-bin" - partially written file, renamed + when complete + + \fBsave_compact_p\fP = false -- + "my_saved_clsfs.results" - completely written file + "my_saved_clsfs.results-tmp" - partially written file, renamed + when complete +.fi +.sp +If check pointing is being done, these additional names will appear +.sp +.nf + \fBsave_compact_p\fP = true -- + "my_saved_clsfs.chkpt-bin" - completely written checkpoint file + "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file, + renamed when complete + \fBsave_compact_p\fP = false -- + "my_saved_clsfs.chkpt" - completely written checkpoint file + "my_saved_clsfs.chkpt-tmp" - partially written checkpoint file, + renamed when complete +.fi +.sp +.SS "HOW TO GET STARTED" +The way to invoke AUTOCLASS -SEARCH is: +.nf + +autoclass -search <.db2 file path> <.hd2 file path> + <.model file path> <.s-params file path> + +.fi +To restart a previous search, specify that \fBforce_new_search_p\fP +has the value false in the search params file, since its default is +true. 
Specifying false tells AUTOCLASS -SEARCH to try to find a +previous compatible search (<...>.results[-bin] & <...>.search) to +continue from, and will restart using it if found. To force a new +search instead of restarting an old one, give the parameter +\fBforce_new_search_p\fP the value of true, or use the default. If +there is an existing search (<...>.results[-bin] & <...>.search), the +user will be asked to confirm continuation since continuation will +discard the existing search. +.PP +If a previous search is continued, the message "RESTARTING SEARCH" will +be given instead of the usual "BEGINNING SEARCH". It is generally +better to continue a previous search than to start a new one, unless +you are trying a significantly different search method, in which case +statistics from the previous search may mislead the current one. +.SS "STATUS REPORTS" +A running commentary on the search will be printed to the screen and +to the log file (unless \fBlog_file_p\fP is false). Note that the +".log" file will contain a listing of all default search params +values, and the values of all params that are overridden. +.PP +After each try a very short report (only a few characters long) is +given. After each new best classification, a longer report is given, +but no more often than \fBmin_report_period\fP (default is 30 +seconds). +.SS "SEARCH VARIATIONS" +AUTOCLASS -SEARCH by default uses a certain standard search method or +"try function" (\fBtry_fn_type\fP = "converge_search_3"). Two others +are also available: "converge_search_4" and "converge". They are +provided in case your problem is one that may happen to benefit from +them. In general the default method will result in finding better +classifications at the expense of a longer search time. The default +was chosen so as to be robust, giving even performance across many +problems. The alternatives to the default may do better on some +problems, but may do substantially worse on others.
+.PP +"converge_search_3" uses an absolute stopping criterion +(\fBrel_delta_range\fP, default value of 0.0025) which tests the variation +of each class of the delta of the log approximate-marginal-likelihood +of the class statistics with-respect-to the class hypothesis +(class->log_a_w_s_h_j) divided by the class weight (class->w_j) +between successive convergence cycles. Increasing this value loosens +the convergence and reduces the number of cycles. Decreasing this +value tightens the convergence and increases the number of +cycles. \fBn_average\fP (default value of 3) specifies how many successive +cycles must meet the stopping criterion before the trial terminates. +.PP +"converge_search_4" uses an absolute stopping criterion +(\fBcs4_delta_range\fP, default value of 0.0025) which tests the +variation of each class of the slope for each class of log +approximate-marginal-likelihood of the class statistics +with-respect-to the class hypothesis (class->log_a_w_s_h_j) divided by +the class weight (class->w_j) over \fBsigma_beta_n_values\fP (default value +6) convergence cycles. Increasing the value of \fBcs4_delta_range\fP +loosens the convergence and reduces the number of cycles. Decreasing +this value tightens the convergence and increases the number of +cycles. Computationally, this try function is more expensive than +"converge_search_3", but may prove useful if the computational "noise" +is significant compared to the variations in the computed values. Key +calculations are done in double precision floating point, and for the +largest data base we have tested so far (5,420 cases of 93 +attributes), computational noise has not been a problem, although the +value of \fBmax_cycles\fP needed to be increased to 400. +.PP +"converge" uses one of two absolute stopping criteria which test the +variation of the classification (clsf) log_marginal (clsf->log_a_x_h) +delta between successive convergence cycles.
The largest of +\fBhalt_range\fP (default value 0.5) and (\fBhalt_factor\fP * +\fBcurrent_clsf_log_marginal\fP) is used (default value of +\fBhalt_factor\fP is 0.0001). Increasing these values loosens the +convergence and reduces the number of cycles. Decreasing these values +tightens the convergence and increases the number of cycles. +\fBn_average\fP (default value of 3) specifies how many cycles must +meet the stopping criteria before the trial terminates. This is a +very approximate stopping criterion, but will give you some feel for +the kind of classifications to expect. It would be useful for +"exploratory" searches of a data base. +.PP +The purpose of \fBreconverge_type\fP = "chkpt" is to complete an +interrupted classification by continuing from its last checkpoint. +The purpose of \fBreconverge_type\fP = "results" is to attempt further +refinement of the best completed classification using a different +value of \fBtry_fn_type\fP ("converge_search_3", "converge_search_4", +"converge"). If \fBmax_n_tries\fP is greater than 1, then in each case, +after the reconvergence has completed, \fBAutoClass\fP will perform +further search trials based on the parameter values in the +<...>.s-params file. +.PP +With the use of \fBreconverge_type\fP (default value ""), you may +apply more than one try function to a classification. Say you +generate several exploratory trials using \fBtry_fn_type\fP = +"converge", and quit the search saving .search and .results[-bin] +files. Then you can begin another search with \fBtry_fn_type\fP = +"converge_search_3", \fBreconverge_type\fP = "results", and +\fBmax_n_tries\fP = 1. This will result in the further convergence of +the best classification generated with \fBtry_fn_type\fP = "converge", +with \fBtry_fn_type\fP = "converge_search_3". When \fBAutoClass\fP +completes this search try, you will have an additional refined +classification.
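The stopping rule of the "converge" try function described above can be illustrated with a short Python sketch. It is a reading of the prose, not the AutoClass source: the function name converge_halts is hypothetical, the absolute value of the log marginal is assumed (log marginals are large negative numbers), and the n_average qualifying cycles are assumed to be consecutive.

```python
def converge_halts(log_marginals, halt_range=0.5, halt_factor=0.0001,
                   n_average=3):
    """Return the 1-based cycle at which the trial would stop, or None
    if the recorded history never satisfies the criterion.

    log_marginals holds one classification log marginal per cycle."""
    run = 0  # consecutive cycles that met the criterion (assumption)
    for i in range(1, len(log_marginals)):
        current = log_marginals[i]
        delta = abs(current - log_marginals[i - 1])
        # The larger of the fixed range and the scaled log marginal.
        threshold = max(halt_range, halt_factor * abs(current))
        run = run + 1 if delta < threshold else 0
        if run >= n_average:
            return i + 1
    return None
```

With a log marginal near -30000 the scaled term (0.0001 * 30000 = 3.0) exceeds halt_range, so it is the scaled term that decides when the trial ends; this is why the text calls the criterion very approximate.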
+.PP +A good way to verify that any of the alternate \fBtry_fn_type\fP values is +generating a well converged classification is to run \fBAutoClass\fP +in prediction mode on the same data used for generating the +classification. Then generate and compare the corresponding case or +class cross reference files for the original classification and the +prediction. Small differences between these files are to be expected, +while large differences indicate incomplete convergence. Differences +between such file pairs should, on average and modulo class deletions, +decrease monotonically with further convergence. +.PP +The standard way to create a random classification to begin a try is +with the default value of "random" for \fBstart_fn_type\fP. At this +point there are no alternative random start functions. Specifying "block" for +\fBstart_fn_type\fP produces repeatable non-random searches. That is +how the <..>.s-params files in the autoclass-c/data/.. sub-directories +are specified. This is how development testing is done. +.PP +\fBmax_cycles\fP controls the maximum number of convergence cycles that will +be performed in any one trial by the convergence functions. Its default +value is 200. The screen output shows a period (".") for each cycle +completed. If your search trials run for 200 cycles, then either your +data base is very complex (increase the value), or the \fBtry_fn_type\fP +is not adequate for the situation (try another of the available ones, and +use \fBconverge_print_p\fP to get more information on what is going on). +.PP +Specifying \fBconverge_print_p\fP to be true will generate a brief +print-out for each cycle which will provide information so that you +can modify the default values of \fBrel_delta_range\fP & +\fBn_average\fP for "converge_search_3"; \fBcs4_delta_range\fP & +\fBsigma_beta_n_values\fP for "converge_search_4"; and +\fBhalt_range\fP, \fBhalt_factor\fP, and \fBn_average\fP for +"converge".
Their default values are given in the <..>.s-params files +in the autoclass-c/data/.. sub-directories. +.SS "HOW MANY CLASSES?" +Each new try begins with a certain number of classes and may end up +with a smaller number, as some classes may drop out of the convergence. +In general, you want to begin the try with some number of classes that +previous tries have indicated look promising, and you want to be sure +you are fishing around elsewhere in case you missed something before. +.PP +\fBn_classes_fn_type\fP = "random_ln_normal" is the default way to make this +choice. It fits a log normal to the number of classes (usually called "j" +for short) of the 10 best classifications found so far, and randomly +selects from that. There is currently no alternative. +.PP +To start the game off, the default is to go down \fBstart_j_list\fP +for the first few tries, and then switch to \fBn_classes_fn_type\fP. +If you believe that the probable number of classes in your data base +is say 75, then instead of using the default value of \fBstart_j_list\fP (2, +3, 5, 7, 10, 15, 25), specify something like 50, 60, 70, 80, 90, 100. +.PP +If one wants to always look for, say, three classes, one can use +\fBfixed_j\fP and override the above. Search status reports will describe +what the current method for choosing j is. +.SS "DO I HAVE ENOUGH MEMORY AND DISK SPACE?" +Internally, the storage requirements in the current system are of +order n_classes_per_clsf * (n_data + n_stored_clsfs * n_attributes * +n_attribute_values). This depends on the number of cases, the number +of attributes, the values per attribute (use 2 if a real value), and +the number of classifications stored away for comparison to see if +others are duplicates -- controlled by \fBmax_n_store\fP (default +value = 10). The search process does not itself consume significant +memory, but storage of the results may do so. +.PP +\fBAutoClass C\fP is configured to handle a maximum of 999 attributes. 
+If you attempt to run with more than that you will get array bound +violations. In that case, change these configuration parameters in +prog/autoclass.h and recompile \fBAutoClass C\fP: +.nf + +#define ALL_ATTRIBUTES 999 +#define VERY_LONG_STRING_LENGTH 20000 +#define VERY_LONG_TOKEN_LENGTH 500 + +.fi +For example, these values will handle several thousand attributes: +.nf + +#define ALL_ATTRIBUTES 9999 +#define VERY_LONG_STRING_LENGTH 50000 +#define VERY_LONG_TOKEN_LENGTH 50000 + +.fi +Disk space taken up by the "log" file will of course depend on the +duration of the search. \fBn_save\fP (default value = 2) determines how +many best classifications are saved into the ".results[-bin]" file. +\fBsave_compact_p\fP controls whether the "results" and "checkpoint" files +are saved as binary. Binary files are faster and more compact, but +are not portable. The default value of \fBsave_compact_p\fP is true, which +causes binary files to be written. +.PP +If the time taken to save the "results" files is a problem, consider +increasing \fBmin_save_period\fP (default value = 1800 seconds or 30 +minutes). Files are saved to disk this often if there is anything +different to report. +.SS "JUST HOW SLOW IS IT?" +Compute time is of order n_data * n_attributes * n_classes * n_tries +* converge_cycles_per_try. The major uncertainties in this are the +number of basic back and forth cycles till convergence in each try, and of +course the number of tries. The number of cycles per trial is typically +10-100 for \fBtry_fn_type\fP "converge", and 10-200+ for "converge_search_3" +and "converge_search_4". The maximum number of cycles is specified by \fBmax_cycles\fP +(default value = 200). The number of trials is up to you and your +available computing resources. +.PP +The running time of very large data sets will be quite uncertain. We +advise that a few small scale test runs be made on your system to +determine a baseline. Specify \fBn_data\fP to limit how many data vectors +are read.
Given a very large quantity of data, \fBAutoClass\fP may +find its most probable classifications at upwards of a hundred +classes, and this will require that \fBstart_j_list\fP be specified +appropriately (See above section \fBHOW MANY CLASSES?\fP). If you are +quite certain that you only want a few classes, you can force +\fBAutoClass\fP to search with a fixed number of classes specified by +\fBfixed_j\fP. You will then need to run separate searches with each +different fixed number of classes. +.SS "CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE" +\fBAutoClass\fP caches the data, header, and model file pathnames in +the saved classification structure of the binary (".results-bin") or +ASCII (".results") "results" files. If the "results" and "search" +files are moved to a different directory location, the search cannot +be successfully restarted if you have used absolute pathnames. Thus +it is advantageous to invoke \fBAutoClass\fP in a parent directory +of the data, header, and model files, so that relative pathnames can +be used. Since the pathnames cached will then be relative, the files +can be moved to a different host or file system and restarted -- +providing the same relative pathname hierarchy exists. +.PP +However, since the ".results" file is ASCII text, those pathnames +could be changed with a text editor (\fBsave_compact_p\fP must be +specified as false). +.SS "SEARCH PARAMETERS" +The search is controlled by the ".s-params" file. In this file, an +empty line or a line starting with one of these characters is treated +as a comment: "#", "!", or ";". The parameter name and its value can +be separated by an equal sign, a space, or a tab: +.sp +.nf + n_clsfs 1 + n_clsfs = 1 + n_clsfs<tab>1 +.fi +.sp +Spaces are ignored if "=" or "<tab>" are used as separators. Note +there are no trailing semicolons.
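+.PP
+For illustration, a short ".s-params" file that uses each comment
+character and more than one separator might look like the following.
+The values are arbitrary examples, not recommendations; the parameter
+names are the ones described in this manual page.
+.nf
+
+  # limit the run to ten tries and to one hour
+  max_n_tries = 10
+  max_duration 3600
+  ! do not poll standard input for the quit character
+  interactive_p = false
+  ; begin with a list that brackets about 75 expected classes
+  start_j_list = 50, 60, 70, 80, 90, 100
+
+.fi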
+.PP +The search parameters, with their default values, are as follows: +.IP "\fBrel_error\fP = 0.01" +Specifies the relative difference measure used by clsf-DS-%=, when +deciding if a new clsf is a duplicate of an old one. +.IP "\fBstart_j_list\fP = 2, 3, 5, 7, 10, 15, 25" +Initially try these numbers of classes, so as not to narrow the +search too quickly. The state of this list is saved in the <..>.search +file and used on restarts, unless an override +specification of \fBstart_j_list\fP is made in the .s-params file for the +restart run. This list should bracket your expected number of +classes, and by a wide margin! +"start_j_list = -999" specifies an empty list (allowed only on restarts) +.IP "\fBn_classes_fn_type\fP = ""random_ln_normal""" +Once \fBstart_j_list\fP is exhausted, \fBAutoClass\fP will call this +function to decide how many classes to start with on the next try, +based on the 10 best classifications found so far. Currently only +"random_ln_normal" is available. +.IP "\fBfixed_j\fP = 0" +When \fBfixed_j\fP > 0, overrides \fBstart_j_list\fP and +\fBn_classes_fn_type,\fP and \fBAutoClass\fP will always use this value for +the initial number of classes. +.IP "\fBmin_report_period\fP = 30" +Wait at least this time (in seconds) since last report until reporting +verbosely again. Should be set longer than the expected run time when +checking for repeatability of results. For repeatable results, also +see \fBforce_new_search_p,\fP \fBstart_fn_type\fP and +\fBrandomize_random_p\fP. \fINOTE\fP: At least one of "interactive_p", +"max_duration", and "max_n_tries" must be active. Otherwise +\fBAutoClass\fP will run indefinitely. See below. +.IP "\fBinteractive_p\fP = true" +When false, allows run to continue until otherwise halted. +When true, standard input is queried on each cycle for the quit +character "q", which, when detected, triggers an immediate halt. +.IP "\fBmax_duration\fP = 0" +When = 0, allows run to continue until otherwise halted. 
+When > 0, specifies the maximum number of seconds to run. +.IP "\fBmax_n_tries\fP = 0" +When = 0, allows run to continue until otherwise halted. +When > 0, specifies the maximum number of tries to make. +.IP "\fBn_save\fP = 2" +Save this many clsfs to disk in the .results[-bin] and .search files. +if 0, don't save anything (no .search & .results[-bin] files). +.IP "\fBlog_file_p\fP = true" +If false, do not write a log file. +.IP "\fBsearch_file_p\fP = true" +If false, do not write a search file. +.IP "\fBresults_file_p\fP = true" +If false, do not write a results file. +.IP "\fBmin_save_period\fP = 1800" +CPU crash protection. This specifies the maximum time, in seconds, +that \fBAutoClass\fP will run before it saves the current results to +disk. The default time is 30 minutes. +.IP "\fBmax_n_store\fP = 10" +Specifies the maximum number of classifications stored internally. +.IP "\fBn_final_summary\fP = 10" +Specifies the number of trials to be printed out after search ends. +.IP "\fBstart_fn_type\fP = ""random""" +One of {"random", "block"}. This specifies the type of class +initialization. For normal search, use "random", which randomly +selects instances to be initial class means, and adds appropriate +variances. For testing with repeatable search, use "block", which +partitions the database into successive blocks of near equal size. +For repeatable results, also see \fBforce_new_search_p\fP, +\fBmin_report_period\fP, and \fBrandomize_random_p\fP. +.IP "\fBtry_fn_type\fP = ""converge_search_3""" +One of {"converge_search_3", "converge_search_4", "converge"}. +These specify alternate search stopping criteria. +"converge" merely tests the rate of change of the log_marginal +classification probability (clsf->log_a_x_h), without checking +rate of change of individual classes(see \fBhalt_range\fP and +\fBhalt_factor\fP). 
+"converge_search_3" and "converge_search_4" each monitor the ratio +class->log_a_w_s_h_j/class->w_j for all classes, and continue +convergence until all pass the quiescence criteria for \fBn_average\fP +cycles. "converge_search_3" tests differences between successive +convergence cycles (see \fBrel_delta_range\fP). This provides a +reasonable, general purpose stopping criteria. +"converge_search_4" averages the ratio over "sigma_beta_n_values" +cycles (see \fBcs4_delta_range\fP). This is preferred when +converge_search_3 produces many similar classes. +.IP "\fBinitial_cycles_p\fP = true" +If true, perform base_cycle in initialize_parameters. +false is used only for testing. +.IP "\fBsave_compact_p\fP = true" +true saves classifications as machine dependent binary +(.results-bin & .chkpt-bin). +false saves as ascii text (.results & .chkpt) +.IP "\fBread_compact_p\fP = true" +true reads classifications as machine dependent binary +(.results-bin & .chkpt-bin). +false reads as ascii text (.results & .chkpt). +.IP "\fBrandomize_random_p\fP = true" +false seeds lrand48, the pseudo-random number function with 1 +to give repeatable test cases. true uses universal time clock +as the seed, giving semi-random searches. +For repeatable results, also see \fBforce_new_search_p\fP, +\fBmin_report_period\fP and \fBstart_fn_type\fP. +.IP "\fBn_data\fP = 0" +With n_data = 0, the entire database is read from .db2. +With n_data > 0, only this number of data are read. +.IP "\fBhalt_range\fP = 0.5" +Passed to try_fn_type "converge". With the "converge" +try_fn_type, convergence is halted when the larger of halt_range +and (halt_factor * current_log_marginal) exceeds the difference +between successive cycle values of the classification log_marginal +(clsf->log_a_x_h). Decreasing this value may tighten the +convergence and increase the number of cycles. +.IP "\fBhalt_factor\fP = 0.0001" +Passed to try_fn_type "converge". 
With the "converge" +try_fn_type, convergence is halted when the larger of halt_range +and (halt_factor * current_log_marginal) exceeds the difference +between successive cycle values of the classification log_marginal +(clsf->log_a_x_h). Decreasing this value may tighten the +convergence and increase the number of cycles. +.IP "\fBrel_delta_range\fP = 0.0025" +Passed to try function "converge_search_3", which monitors the +ratio of log approx-marginal-likelihood of class statistics +with-respect-to the class hypothesis (class->log_a_w_s_h_j) +divided by the class weight (class->w_j), for each class. +"converge_search_3" halts convergence when the difference between +cycles, of this ratio, for every class, has been exceeded by +"rel_delta_range" for "n_average" cycles. Decreasing +"rel_delta_range" tightens the convergence and increases the +number of cycles. +.IP "\fBcs4_delta_range\fP = 0.0025" +Passed to try function "converge_search_4", which monitors the +ratio of (class->log_a_w_s_h_j)/(class->w_j), for each class, +averaged over "sigma_beta_n_values" convergence cycles. +"converge_search_4" halts convergence when the maximum difference +in average values of this ratio falls below "cs4_delta_range". +Decreasing "cs4_delta_range" tightens the convergence and +increases the number of cycles. +.IP "\fBn_average\fP = 3" +Passed to try functions "converge_search_3" and "converge". +The number of cycles for which the convergence criterion +must be satisfied for the trial to terminate. +.IP "\fBsigma_beta_n_values\fP = 6" +Passed to try_fn_type "converge_search_4". The number of past +values to use in computing sigma^2 (noise) and beta^2 (signal). +.IP "\fBmax_cycles\fP = 200" +This is the maximum number of cycles permitted for any one convergence +of a classification, regardless of any other stopping criteria. 
This +is very dependent upon your database and choice of model and +convergence parameters, but should be about twice the average number +of cycles reported in the screen dump and .log file +.IP "\fBconverge_print_p\fP = false" +If true, the selected try function will print to the screen values +useful in specifying non-default values for \fBhalt_range\fP, +\fBhalt_factor\fP, \fBrel_delta_range\fP, \fBn_average\fP, +\fBsigma_beta_n_values\fP, and \fBrange_factor\fP. +.IP "\fBforce_new_search_p\fP = true" +If true, will ignore any previous search results, discarding the +existing .search and .results[-bin] files after confirmation by the +user; if false, will continue the search using the +existing .search and .results[-bin] files. +For repeatable results, also see \fBmin_report_period\fP, +\fBstart_fn_type\fP and \fBrandomize_random_p\fP. +.IP "\fBcheckpoint_p\fP = false" +If true, checkpoints of the current classification will be written +every "min_checkpoint_period" seconds, with file extension +\&.chkpt[-bin]. This is only useful for very large classifications +.IP "\fBmin_checkpoint_period\fP = 10800" +If checkpoint_p = true, the checkpointed classification will be +written this often - in seconds (default = 3 hours) +.IP "\fBreconverge_type\fP = """ +Can be either "chkpt" or "results". If "checkpoint_p" = true and +"reconverge_type" = "chkpt", then continue convergence of the +classification contained in <...>.chkpt[-bin]. If "checkpoint_p " += false and "reconverge_type" = "results", continue convergence of +the best classification contained in <...>.results[-bin]. +.IP "\fBscreen_output_p\fP = true" +If false, no output is directed to the screen. Assuming +log_file_p = true, output will be directed to the log file only. +.IP "\fBbreak_on_warnings_p\fP = true" +The default value asks the user whether or not to continue, when data +definition warnings are found. 
If specified as false, then +\fBAutoClass\fP will continue, despite warnings -- the warning will +continue to be output to the terminal and the log file. +.IP "\fBfree_storage_p\fP = true" +The default value tells \fBAutoClass\fP to free the majority of its +allocated storage. This is not required, and in the case of the DEC +Alpha causes a core dump [is this still true?]. If specified as false, +\fBAutoClass\fP will not attempt to free storage. +.SS "HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS" +In some situations, repeatable classifications are required: comparing +basic \fBAutoClass C\fP integrity on different platforms, porting +\fBAutoClass C\fP to a new platform, etc. In order to accomplish this, +two things are necessary: 1) the same random number generator must be +used, and 2) the search parameters must be specified properly. +.PP +Random Number Generator. This implementation of \fBAutoClass C\fP uses the +Unix srand48/lrand48 random number generator which generates +pseudo-random numbers using the well-known linear congruential +algorithm and 48-bit integer arithmetic. lrand48() returns +non-negative long integers uniformly distributed over the interval [0, +2**31). +.PP +Search Parameters. +The following .s-params file parameters should be specified: +.nf + +force_new_search_p = true +start_fn_type "block" +randomize_random_p = false +;; specify the number of trials you wish to run +max_n_tries = 50 +;; specify a time greater than duration of run +min_report_period = 30000 + +.fi +Note that no current best classification reports will be produced. +Only a final classification summary will be output. +.SH CHECKPOINTING +With very large databases there is a significant probability of a +system crash during any one classification try. Under such +circumstances it is advisable to take the time to checkpoint the +calculations for possible restart. +.PP +Checkpointing is initiated by specifying "\fBcheckpoint_p\fP = true" +in the ".s-params" file.
This causes the inner convergence step to +save a copy of the classification onto the checkpoint file each time +the classification is updated, providing a certain period of time has +elapsed. The file extension is ".chkpt[-bin]". +.PP +Each time AutoClass completes a cycle, a "." is output to the screen +to provide you with information to be used in setting the +\fBmin_checkpoint_period\fP value (default 10800 seconds or 3 hours). +There is obviously a trade-off between frequency of checkpointing and +the probability that your machine may crash, since the repetitive +writing of the checkpoint file will slow the search process. +.PP +Restarting AutoClass Search: +.PP +To recover the classification and continue the search after rebooting +and reloading AutoClass, specify \fBreconverge_type\fP = "chkpt" in +the ".s-params" file (specify \fBforce_new_search_p\fP as false). +.PP +AutoClass will reload the appropriate database and models, provided +there has been no change in their filenames since the time they were +loaded for the checkpointed classification run. The ".s-params" file +contains any non-default arguments that were provided to the original +call. +.PP +In the beginning of a search, before \fBstart_j_list\fP has been +emptied, it will be necessary to trim the original list to what would +have remained in the crashed search. This can be determined by +looking at the ".log" file to determine what values were already used. +If the \fBstart_j_list\fP has been emptied, then an empty +\fBstart_j_list\fP should be specified in the ".s-params" file.
This +is done either by +.sp + \fBstart_j_list\fP = +.sp +or +.sp + \fBstart_j_list\fP = -9999 +.sp +Here is a set of scripts to demonstrate check-pointing: +.nf + +autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \\ + data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params + +Run 1) + ## glassc-chkpt.s-params + max_n_tries = 2 + force_new_search_p = true + ## -------------------- + ;; run to completion + +Run 2) + ## glassc-chkpt.s-params + force_new_search_p = false + max_n_tries = 10 + checkpoint_p = true + min_checkpoint_period = 2 + ## -------------------- + ;; after 1 checkpoint, ctrl-C to simulate cpu crash + +Run 3) + ## glassc-chkpt.s-params + force_new_search_p = false + max_n_tries = 1 + checkpoint_p = true + min_checkpoint_period = 1 + reconverge_type = "chkpt" + ## -------------------- + ;; checkpointed trial should finish + +.fi +.SH "OUTPUT FILES" +The standard reports are +.IP 1) +Attribute influence values: presents the relative influence or +significance of the data's attributes both globally (averaged over +all classes), and locally (specifically for each class). A +heuristic +for relative class strength is also listed; +.IP 2) +Cross-reference by case (datum) number: lists the primary class +probability for each datum, ordered by case number. When +report_mode = "data", additional lesser class probabilities +(greater than or equal to 0.001) are listed for each datum; +.IP 3) +Cross-reference by class number: for each class the primary class +probability and any lesser class probabilities (greater than or +equal to 0.001) are listed for each datum in the class, ordered by +case number. It is also possible to list, for each datum, the values +of attributes, which you select. +.PP +The attribute influence values report attempts to provide relative +measures of the "influence" of the data attributes on the classes +found by the classification.
The normalized class strengths, the +normalized attribute influence values summed over all classes, and the +individual influence values (I[jkl]) are all only relative measures +and should be interpreted with more meaning than rank ordering, but +not like anything approaching absolute values. +.PP +The reports are output to files whose names and pathnames are taken +from the ".r-params" file pathname. The report file types (extensions) +are: +.IP "\fBinfluence values report\fP" +"influ-o-text-\fIn\fP" or "influ-no-text-\fIn\fP" +.IP "\fBcross-reference by case\fP" +"case-text-\fIn\fP" +.IP "\fBcross-reference by class\fP" +"class-text-\fIn\fP" +.PP +or, if report_mode is overridden to "data": +.IP "\fBinfluence values report\fP" +"influ-o-data-\fIn\fP" or "influ-no-data-\fIn\fP" +.IP "\fBcross-reference by case\fP" +"case-data-\fIn\fP" +.IP "\fBcross-reference by class\fP" +"class-data-\fIn\fP" +.PP +where \fIn\fP is the classification number from the "results" file. +The first or best classification is numbered 1, the next best 2, etc. +The default is to generate reports only for the best classification in +the "results" file. You can produce reports for other saved +classifications by using report params keywords \fBn_clsfs\fP and +\fBclsf_n_list\fP. The "influ-o-text-\fIn\fP" file type is the +default (\fBorder_attributes_by_influence_p\fP = true), and lists each +class's attributes in descending order of attribute influence value. +If the value of \fBorder_attributes_by_influence_p\fP is overridden to +be false in the <...>.r-params file, then each class's attributes will +be listed in ascending order by attribute number. The extension of +the file generated will be "influ-no-text-\fIn\fP". This method of +listing facilitates the visual comparison of attribute values between +classes. 
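+.PP
+As a sketch of the naming scheme above, an ".r-params" file containing
+the two lines below would be expected to produce, for each of the two
+best classifications, an influence values report ordered by attribute
+number together with both cross-reference reports. The file types
+listed follow from the rules above and assume that \fBreport_type\fP
+and \fBreport_mode\fP keep their default values; they are not output
+copied from an actual run.
+.nf
+
+  n_clsfs = 2
+  order_attributes_by_influence_p = false
+
+  influ-no-text-1   case-text-1   class-text-1
+  influ-no-text-2   case-text-2   class-text-2
+
+.fi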
+.PP +For example, this command: +.sp +.nf + autoclass -reports sample/imports-85c.results-bin \\ + sample/imports-85c.search sample/imports-85c.r-params +.fi +.sp +with this line in the ".r-params" file: +.sp + xref_class_report_att_list = 2, 5, 6 +.sp +will generate these output files: +.sp +.nf + imports-85c.influ-o-text-1 + imports-85c.case-text-1 + imports-85c.class-text-1 +.fi +.PP +The \fBAutoClass C\fP reports provide the capability to compute sigma +class contour values for specified pairs of real valued attributes, +when generating the influence values report with the data option +(report_mode = "data"). Note that sigma class contours are not +generated from discrete type attributes. +.PP +The sigma contours are the two dimensional equivalent of n-sigma error +bars in one dimension. Specifically, for two independent attributes +the n-sigma contour is defined as the ellipse where +.PP +((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n +.PP +With covariant attributes, the n-sigma contours are defined +identically, in the rotated coordinate system of the distribution's +principal axes. Thus independent attributes give ellipses oriented +parallel with the attribute axes, while the axes of sigma contours of +covariant attributes are rotated about the center determined by the +means. In either case the sigma contour represents a line where the +class probability is constant, irrespective of any other class +probabilities. +.PP +With three or more attributes the n-sigma contours become +k-dimensional ellipsoidal surfaces. This code takes advantage of the +fact that the parallel projection of an n-dimensional ellipsoid, onto +any 2-dim plane, is bounded by an ellipse. In this simplified case of +projecting the single sigma ellipsoid onto the coordinate planes, it +is also true that the 2-dim covariances of this ellipse are equal to +the corresponding elements of the n-dim ellipsoid's covariances.
The +Eigen-system of the 2-dim covariance then gives the variances +w.r.t. the principal components of the ellipse, and the rotation that +aligns it with the data. This represents the best way to display a +distribution in the marginal plane. +.PP +To get contour values, set the keyword \fBsigma_contours_att_list\fP +to a list of real valued attribute indices (from .hd2 file), and +request an influence values report with the data option. For example, +.sp +.nf + report_mode = "data" + sigma_contours_att_list = 3, 4, 5, 8, 15 +.fi +.SS "OUTPUT REPORT PARAMETERS" +The contents of the output report are controlled by the ".r-params" +file. In this file, an empty line or a line starting with one of +these characters is treated as a comment: "#", "!", or ";". The +parameter name and its value can be separated by an equal sign, a +space, or a tab: +.sp +.nf + n_clsfs 1 + n_clsfs = 1 + n_clsfs<tab>1 +.fi +.sp +Spaces are ignored if "=" or "<tab>" are used as separators. Note +there are no trailing semicolons. +.PP +The following are the allowed parameters and their default values: +.IP "\fBn_clsfs\fP = 1" +number of clsfs in the .results file for which to generate reports, +starting with the first or "best". +.IP "\fBclsf_n_list\fP = " +if specified, this is a one-based index list of clsfs in the clsf +sequence read from the .results file. It overrides "n_clsfs". +For example: +.sp + clsf_n_list = 1, 2 +.sp +will produce the same output as +.sp + n_clsfs = 2 +.sp +but +.sp + clsf_n_list = 2 +.sp +will only output the "second best" classification report. +.IP "\fBreport_type\fP = \"all\"" +type of reports to generate: "all", "influence_values", "xref_case", or +"xref_class". +.IP "\fBreport_mode\fP = \"text\"" +mode of reports to generate. "text" is formatted text layout. "data" +is numerical -- suitable for further processing. +.IP "\fBcomment_data_headers_p\fP = false" +the default value does not insert # in column 1 of most +report_mode = "data" header lines.
If specified as true, the comment +character will be inserted in most header lines. +.IP "\fBnum_atts_to_list\fP = " +if specified, the number of attributes to list in influence values report. +if not specified, \fIall\fP attributes will be listed. +(e.g. "num_atts_to_list = 5") +.IP "\fBxref_class_report_att_list\fP = " +if specified, a list of attribute numbers (zero-based), whose values will +be output in the "xref_class" report along with the case probabilities. +if not specified, no attribute values will be output. +(e.g. "xref_class_report_att_list = 1, 2, 3") +.IP "\fBorder_attributes_by_influence_p\fP = true" +The default value lists each class's attributes in descending order of +attribute influence value, and uses ".influ-o-text-n" as the +influence values report file type. If specified as false, then each +class's attributes will be listed in ascending order by attribute number. +The extension of the file generated will be "influ-no-text-n". +.IP "\fBbreak_on_warnings_p\fP = true" +The default value asks the user whether to continue or not when data +definition warnings are found. If specified as false, then \fBAutoClass\fP +will continue, despite warnings -- the warning will continue to be +output to the terminal. +.IP "\fBfree_storage_p\fP = true" +The default value tells \fBAutoClass\fP to free the majority of its +allocated storage. This is not required, and in the case of the DEC +Alpha causes a core dump [is this still true?]. If specified as +false, \fBAutoClass\fP will not attempt to free storage. +.IP "\fBmax_num_xref_class_probs\fP = 5" +Determines how many lesser class probabilities will be printed for the +case and class cross-reference reports. The default is to print the +most probable class probability value and up to 4 lesser class +probabilities. Note this is true for both the "text" and "data" class +cross-reference reports, but only true for the "data" case +cross-reference report.
The "text" case cross-reference report only has the +most probable class probability. +.IP "\fBsigma_contours_att_list\fP = " +If specified, a list of real valued attribute indices (from .hd2 file) +that will be used to compute sigma class contour values, when generating +influence values report with the data option (report_mode = "data"). +If not specified, there will be no sigma class contour output. +(e.g. "sigma_contours_att_list = 3, 4, 5, 8, 15") +.SH "INTERPRETATION OF AUTOCLASS RESULTS" +.br +.sp +.SS "WHAT HAVE YOU GOT?" +Now you have run \fBAutoClass\fP on your data set -- what have you got? +Typically, the \fBAutoClass\fP search procedure finds many classifications, +but only saves the few best. These are now available for inspection +and interpretation. The most important indicator of the relative +merits of these alternative classifications is the Log total posterior +probability value. Note that since the probability lies between 1 and +0, the corresponding Log probability is negative and ranges from 0 to +negative infinity. Raising e to the power of the difference between +two of these Log probability values gives the relative probability of the +alternative classifications. So a difference of, say 100, implies +one classification is e^100 ~= 10^43 times more likely than the other. +However, these numbers can be very misleading, since they give the +relative probability of alternative classifications under the +\fBAutoClass\fP \fIassumptions\fP. +.SS "ASSUMPTIONS" +Specifically, the most important \fBAutoClass\fP assumptions are the use of normal +models for real variables, and the assumption of independence of attributes +within a class. Since these assumptions are often violated in practice, the +difference in posterior probability of alternative classifications can be +partly due to one classification being closer to satisfying the assumptions +than another, rather than to a real difference in classification quality.
+Another source of uncertainty about the utility of Log probability values is +that they do not take into account any specific prior knowledge the user may +have about the domain. This means that it is often worth looking at +alternative classifications to see if you can interpret them, but it is worth +starting from the most probable first. Note that if the Log probability value +is much greater than that for the one class case, it is saying that there is +overwhelming evidence for \fIsome\fP structure in the data, and part of this +structure has been captured by the \fBAutoClass\fP classification. +.SS "INFLUENCE REPORT " +So you have now picked a classification you want to examine, based on +its Log probability value; how do you examine it? The first thing to +do is to generate an "influence" report on the classification using +the report generation facilities documented in +\fI/usr/share/doc/autoclass/reports-c.text\fP. An influence report is +designed to summarize the important information buried in the +\fBAutoClass\fP data structures. +.PP +The first part of this report gives the heuristic class "strengths". +Class "strength" is here defined as the geometric mean probability that +any instance "belonging to" class, would have been generated from the +class probability model. It thus provides a heuristic measure of how +strongly each class predicts "its" instances. +.PP +The second part is a listing of the overall "influence" of each of the +attributes used in the classification. These give a rough heuristic +measure of the relative importance of each attribute in the +classification. Attribute "influence values" are a class probability +weighted average of the "influence" of each attribute in the classes, as +described below. +.PP +The next part of the report is a summary description of each of the +classes. The classes are arbitrarily numbered from 0 up to n, in order +of descending class weight. 
A class weight of, say, 34.1 means that the +weighted sum of membership probabilities for the class is 34.1. Note that +a class weight of 34 does not necessarily mean that 34 cases belong to +that class, since many cases may have only partial membership in that +class. Within each class, attributes or attribute sets are ordered by +the "influence" of their model term. +.SS "CROSS ENTROPY" +A commonly used measure of the divergence between two probability +distributions is the cross entropy (elsewhere often called the relative +entropy or Kullback-Leibler divergence): the sum, over all possible values x, +of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the +distributions. It ranges from zero, for identical distributions, to +infinity for distributions placing probability 1 on differing values of +an attribute. With conditionally independent terms in the probability +distributions, the cross entropy can be factored to a sum over these +terms. These factors provide a measure of the corresponding modeled +attribute's influence in differentiating the two distributions. +.PP +We define the modeled term's "influence" on a class to be the cross +entropy term for the class distribution w.r.t. the global class +distribution of the single class classification. "Influence" is thus a +measure of how strongly the model term helps differentiate the class +from the whole data set. With independently modeled attributes, the +influence can legitimately be ascribed to the attribute itself. With +correlated or covariant attribute sets, the cross entropy factor is a +function of the entire set, and we distribute the influence value +equally over the modeled attributes. +.SS "ATTRIBUTE INFLUENCE VALUES" +In the "influence" report on each class, the attribute parameters for +that class are given in order of highest influence value for the model +term attribute sets. Only the first few attribute sets usually have +significant influence values.
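The cross entropy term defined under CROSS ENTROPY can be sketched in Python for discrete attributes. The attribute names and distributions below are invented for illustration, and the function is a hypothetical helper, not part of AutoClass; it only shows the formula and how independent terms add up.

```python
import math

def cross_entropy_term(class_probs, global_probs):
    """Sum over x of P(x|c) * log(P(x|c) / P(x|g)) for one discrete attribute."""
    total = 0.0
    for p_class, p_global in zip(class_probs, global_probs):
        if p_class == 0.0:
            continue  # a value the class never produces contributes nothing
        if p_global == 0.0:
            return math.inf  # class puts probability where the global model has none
        total += p_class * math.log(p_class / p_global)
    return total

# Made-up class and global distributions for two independent attributes
colour = cross_entropy_term([0.9, 0.1], [0.5, 0.5])  # class differs from global
size = cross_entropy_term([0.5, 0.5], [0.5, 0.5])    # class matches global
print(colour, size, colour + size)
```

Here the hypothetical "colour" attribute carries all of the influence, since the class distribution of "size" is identical to the global one and its term is zero.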
If an influence value drops below about +20% of the highest value, then it is probably not significant, but all +attribute sets are listed for completeness. In addition to the +influence value for each attribute set, the values of the attribute +set parameters in that class are given along with the corresponding +"global" values. The global values are computed directly from the +data independent of the classification. For example, if the class +mean of attribute "temperature" is 90 with standard deviation of 2.5, +but the global mean is 68 with a standard deviation of 16.3, then this +class has selected out cases with much higher than average +temperature, and a rather small spread in this high range. Similarly, +for discrete attribute sets, the probability of each outcome in that +class is given, along with the corresponding global probability -- +ordered by its significance: the absolute value of (log +{ <class probability> / <global probability> }). The sign of the +significance value shows the direction of change from the global +class. This information gives an overview of how each class differs +from the average for all the data, in order of the most significant +differences. +.SS "CLASS AND CASE REPORTS" +Having gained a description of the classes from the "influence" +report, you may want to follow up to see which classes your favorite +cases ended up in. Conversely, you may want to see which cases belong +to a particular class. For this kind of cross-reference information +two complementary reports can be generated. These are more fully +documented in \fI/usr/share/doc/autoclass/reports-c.text\fP. The +"class" report lists all the cases which have significant membership +in each class and the degree to which each such case belongs to that +class. Cases whose class membership is less than 90% in the current +class have their other class membership listed as well. The cases +within a class are ordered in increasing case number.
The alternative +"cases" report states which class (or classes) a case belongs to, and +the membership probability in the most probable class. These two +reports allow you to find which cases belong to which classes or the +other way around. If nearly every case has close to 99% membership in +a single class, this means that the classes are well separated, +while a high degree of cross-membership indicates that the classes are +heavily overlapped. Highly overlapped classes are an indication that +the idea of classification is breaking down and that groups of +mutually highly overlapped classes, a kind of meta class, are probably +a better way of understanding the data. +.SS "COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS" +The class weight, given as the class probability parameter, is +essentially the sum, over all data instances, of the normalized +probability that the instance is a member of the class. It is +probably an error on our part that we format this number as an integer +in the report, rather than emphasizing its real nature. You will find +the actual real value recorded as the w_j parameter in the class_DS +structures in any .results[-bin] file. +.PP +The .case and .class reports give probabilities that cases are members +of classes. Any assignment of cases to classes requires some decision +rule. The maximum probability assignment rule is often implicitly +assumed, but it cannot be expected that the resulting partition sizes +will equal the class weights unless nearly all class membership +probabilities are effectively one or zero. With non-1/0 membership +probabilities, matching the class weights requires summing the +probabilities. +.PP +In addition, there is the question of completeness of the EM +(expectation maximization) convergence. EM alternates between +estimating class parameters and estimating class membership +probabilities. These estimates converge on each other, but never +actually meet.
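The alternation described above, and the gap between class weights and maximum probability partition sizes, can be sketched in Python for a one-dimensional mixture of two normal classes. This is plain maximum-likelihood EM on made-up data with a fixed number of iterations; it is not AutoClass's own implementation, priors, or stopping criteria, and every name in it is hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_step(data, weights, means, sds, min_sd=0.1):
    """One EM iteration; returns (weights, means, sds, memberships)."""
    # E-step: normalized membership probability of each instance in each class
    memberships = []
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(joint)
        memberships.append([j / total for j in joint])
    # M-step: re-estimate class parameters from the memberships
    n_classes = len(weights)
    sums = [sum(row[j] for row in memberships) for j in range(n_classes)]
    new_means = [sum(row[j] * x for row, x in zip(memberships, data)) / sums[j]
                 for j in range(n_classes)]
    new_sds = []
    for j in range(n_classes):
        var = sum(row[j] * (x - new_means[j]) ** 2
                  for row, x in zip(memberships, data)) / sums[j]
        new_sds.append(max(min_sd, math.sqrt(var)))  # floor keeps classes from collapsing
    new_weights = [s / len(data) for s in sums]
    return new_weights, new_means, new_sds, memberships

data = [0.0, 0.2, 0.4, 1.6, 3.0, 3.2, 3.4]  # made-up observations
weights, means, sds = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]
for _ in range(20):
    weights, means, sds, memberships = em_step(data, weights, means, sds)

# Class weight: sum of membership probabilities over all instances
class_weights = [sum(row[j] for row in memberships) for j in range(2)]
# Partition sizes under the maximum probability assignment rule
partition_sizes = [0, 0]
for row in memberships:
    partition_sizes[row.index(max(row))] += 1
print("class weights:  ", [round(w, 2) for w in class_weights])
print("partition sizes:", partition_sizes)
```

The class weights always sum to the number of instances but are generally not whole numbers, while the partition sizes are counts; the two agree only when every membership probability is effectively one or zero.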
\fBAutoClass\fP implements several convergence algorithms +with alternate stopping criteria, selected and tuned by parameters in +the .s-params file. Setting these parameters properly, to get reasonably +complete and efficient convergence, may require experimentation. +.SS "ALTERNATIVE CLASSIFICATIONS" +In summary, the various reports that can be generated give you a way +of viewing the current classification. It is usually a good idea to +look at alternative classifications even though they do not have the +highest Log probability values. These other classifications usually +have classes that correspond closely to strong classes in other +classifications, but can differ in the weak classes. The "strength" +of a class within a classification can usually be judged by how +dramatically the highest influence value attributes in the class +differ from the corresponding global attributes. If none of the +classifications seem quite satisfactory, it is always possible to run +\fBAutoClass\fP again to generate new classifications. +.SS "WHAT NEXT?" +Finally, the question of what to do after you have found an +insightful classification arises. Usually, classification is a +preliminary data analysis step for examining a set of cases (things, +examples, etc.) to see if they can be grouped so that members of the +group are "similar" to each other. \fBAutoClass\fP gives such a grouping +without the user having to define a similarity measure. The built-in +"similarity" measure is the mutual predictiveness of the cases. The +next step is to try to "explain" why some objects are more like others +than those in a different group. Usually, domain knowledge suggests +an answer. For example, a classification of people based on income, +buying habits, location, age, etc., may reveal particular social +classes that were not obvious before the classification analysis.
Collecting +further data about the members of such classes, such as number of cars, +what TV shows are watched, etc., would reveal +even more information. Longitudinal studies would give information +about how social classes arise and what influences their attitudes -- +all of which is going way beyond the initial classification. +.SH PREDICTIONS +Classifications can be used to predict class membership for new +cases. So in addition to possibly giving you some insight into the +structure behind your data, you can now use \fBAutoClass\fP directly +to make predictions, and compare \fBAutoClass\fP to other learning +systems. +.PP +This technique for predicting class probabilities is applicable to all +attributes, regardless of data type/sub_type or likelihood model term type. +.PP +In the event that the class membership of a data case does not exceed +0.0099999 for any of the "training" classes, the following message will appear +in the screen output for each such case: +.sp + xref_get_data: case_num xxx => class 9999 +.sp +Class 9999 members will appear in the "case" and "class" cross-reference +reports with a class membership of 1.0. +.PP +Cautionary Points: +.PP +The usual way of using \fBAutoClass\fP is to put all of your data in a +data_file, describe that data with model and header files, and run +"autoclass -search". Now, instead of one data_file you will have two, +a training_data_file and a test_data_file. +.PP +It is most important that both databases have the same \fBAutoClass\fP +internal representation. Should this not be true, \fBAutoClass\fP +will exit or, in some situations, possibly crash. The prediction +mode is designed to steer the user toward conforming to this +requirement. +.PP +Preparation: +.PP +Prediction requires having a training classification and a test +database.
The training classification is generated by running +"autoclass -search" on the training data_file +("data/soybean/soyc.db2"), for example: +.sp +.nf + autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2 + data/soybean/soyc.model data/soybean/soyc.s-params +.fi +.sp +This will produce "soyc.results-bin" and "soyc.search". Then create a +"reports" parameter file, such as "soyc.r-params" (see +\fI/usr/share/doc/autoclass/reports-c.text\fP), and run +\fBAutoClass\fP in "reports" mode, such as: +.sp +.nf + autoclass -reports data/soybean/soyc.results-bin + data/soybean/soyc.search data/soybean/soyc.r-params +.fi +.sp +This will generate class and case cross-reference files, and an influence +values file. The file names are based on the ".r-params" file name: +.sp +.nf + data/soybean/soyc.class-text-1 + data/soybean/soyc.case-text-1 + data/soybean/soyc.influ-text-1 +.fi +.sp +These will describe the classes found in the training_data_file. +Now this classification can be used to predict the probabilistic class +membership of the test_data_file cases ("data/soybean/soyc-predict.db2") +in the training_data_file classes. +.sp +.nf + autoclass -predict data/soybean/soyc-predict.db2 + data/soybean/soyc.results-bin data/soybean/soyc.search + data/soybean/soyc.r-params +.fi +.sp +This will generate class and case cross-reference files for the +test_data_file cases predicting their probabilistic class memberships +in the training_data_file classes. The file names are based on the +".db2" file name: +.sp +.nf + data/soybean/soyc-predict.class-text-1 + data/soybean/soyc-predict.case-text-1 +.fi +.sp +.SH "SEE ALSO" +\fBAutoClass\fP is documented fully here: +.LP +.I /usr/share/doc/autoclass/introduction-c.text +Guide to the documentation +.LP +.I /usr/share/doc/autoclass/preparation-c.text +How to prepare data for use by AutoClass +.LP +.I /usr/share/doc/autoclass/search-c.text +How to run AutoClass to find classifications.
+.LP +.I /usr/share/doc/autoclass/reports-c.text +How to examine the classification in various ways. +.LP +.I /usr/share/doc/autoclass/interpretation-c.text +How to interpret AutoClass results. +.LP +.I /usr/share/doc/autoclass/checkpoint-c.text +Protocols for running a checkpointed search. +.LP +.I /usr/share/doc/autoclass/prediction-c.text +Use classifications to predict class membership for new cases. +.PP +These provide supporting documentation: +.LP +.I /usr/share/doc/autoclass/classes-c.text +What classification is all about, for beginners. +.LP +.I /usr/share/doc/autoclass/models-c.text +Brief descriptions of the model term implementations. +.PP +The mathematical theory behind \fBAutoClass\fP is explained in these +documents: +.LP +.I /usr/share/doc/autoclass/kdd-95.ps +Postscript file containing: +P. Cheeseman, J. Stutz, "Bayesian Classification (AutoClass): +Theory and Results", in "Advances in Knowledge Discovery and +Data Mining", Usama M. Fayyad, Gregory Piatetsky-Shapiro, +Padhraic Smyth, & Ramasamy Uthurusamy, Eds. The AAAI Press, +Menlo Park, expected fall 1995. +.LP +.I /usr/share/doc/autoclass/tr-fia-90-12-7-01.ps +Postscript file containing: +R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification +Theory", Technical Report FIA-90-12-7-01, NASA Ames Research +Center, Artificial Intelligence Branch, May 1991 +(The figures are not included, since they were inserted by +"cut-and-paste" methods into the original "camera-ready" +copy.) +.SH AUTHORS +.nf +Dr. Peter Cheeseman +Principal Investigator - NASA Ames, Computational Sciences Division +cheesem@ptolemy.arc.nasa.gov + +John Stutz +Research Programmer - NASA Ames, Computational Sciences Division +stutz@ptolemy.arc.nasa.gov + +Will Taylor +Support Programmer - NASA Ames, Computational Sciences Division +taylor@ptolemy.arc.nasa.gov +.fi +.\" .PP +.\" This manual page was written by James R. Van Zandt , +.\" for the Debian GNU/Linux system (but may be used by others). 
+.SH "SEE ALSO" +.BR multimix (1). --- autoclass-3.3.6.dfsg.1.orig/debian/changelog +++ autoclass-3.3.6.dfsg.1/debian/changelog @@ -0,0 +1,199 @@ +autoclass (3.3.6.dfsg.1-2) unstable; urgency=medium + + * QA upload. + * Update Homepage. (Closes: #652867) + + -- Adrian Bunk Sun, 28 Feb 2021 12:29:53 +0200 + +autoclass (3.3.6.dfsg.1-1) unstable; urgency=low + + * QA upload. + * Set maintainer to QA group + * Remove sourceless PostScript files + * Add comment to debian/copyright + * Drop debian/autoclass.doc-base.autoclass-theory and + debian/autoclass.doc-base.autoclass-results + * Drop build-depends on gs-common, as we don't build the PDFs anymore + (Closes: #614525) + * Introduce recommended build-arch: and build-indep: targets in + debian/rules + * Bump standards to 3.9.2 (no further changes needed) + + -- Alexander Reichle-Schmehl Wed, 07 Dec 2011 10:36:07 +0100 + +autoclass (3.3.6-1) unstable; urgency=low + + * New upstream release + + * debian/control: Add dependency on ${misc:Depends}. Bump policy level + to 3.8.4 (no changes needed). Update link to home page. + + * debian/rules: call dh_prep instead of deprecated dh_clean -k. Copy + all the import-* files to the /usr/share/doc/autoclass/examples. + + * debian/compat: Bump debhelper compatibility version to 7. + + * debian/README.sample: Also run with -reports. + + * debian/simple.README: adjust awk script for new output format, which + spreads info on each variable across two lines. + + -- James R. Van Zandt Sun, 07 Feb 2010 10:52:08 -0500 + +autoclass (3.3.4-7.1) unstable; urgency=low + + * eliminate segfault in import example (Closes:Bug#554785) + + -- Tim Retout Thu, 31 Dec 2009 18:33:11 +0000 + +autoclass (3.3.4-7) unstable; urgency=low + + * *.doc-base*: Move documentation to section Science/Data Analysis to + conform with revised Debian Menu Policy. 
+ + * debian/dirs: don't package empty /usr/sbin + + * debian/rules: handle DEB_BUILD_OPTIONS flag "parallel" + + * debian/autoclass.1: fix capitalization of bogus macro ".pp" and change + line breaks so file names ".search" and ".s-params" are not + interpreted as roff macros. + + * debian/control: Add Homepage field. Bump policy level to 3.8.0. + + -- James R. Van Zandt Thu, 14 Aug 2008 22:07:47 -0400 + +autoclass (3.3.4-6) unstable; urgency=low + + * prog/utils.c: use O_NONBLOCK rather than O_NDELAY to enable building + on GNU/kFreeBSD (thanks to Cyril Brulebois + , closes:Bug#413860) + * debian/autoclass.1: escape quotes (thanks to "Nicolas François" + , closes:Bug#349817) + * debian/control: bump Debian policy level to 3.7.2 (no changes needed) + + -- James R. Van Zandt Wed, 4 Apr 2007 19:56:58 -0400 + +autoclass (3.3.4-5) unstable; urgency=low + + * debian/autoclass.1: fix manpage formatting errors, thanks to Nicolas + François + (fixes:Bug#349817) + + -- James R. Van Zandt Fri, 27 Jan 2006 20:15:56 -0500 + +autoclass (3.3.4-4) unstable; urgency=low + + * debian/control: add gs-common to Build-Depends + + -- James R. Van Zandt Sat, 24 Sep 2005 22:40:20 -0400 + +autoclass (3.3.4-3) unstable; urgency=low + + * debian/control: put in math-optional, to match overrides + + -- James R. Van Zandt Sat, 24 Sep 2005 20:08:06 -0400 + +autoclass (3.3.4-2) unstable; urgency=low + + * debian/control: policy version 3.6.2 + * build and install PDF rather than PostScript documentation. Register + documentation in section Apps/Math rather than Math + (closes:Bug#329961). + * intf-reports.c: split writes to influence_report_fp into more calls to + fprintf, so we don't exceed the ISO C89 guaranteed supported string + length. + * debian/compat, debian/control: debhelper compat level 4 + * debian/rules: call dh_installman rather than dh_installmanpages, + install files into debian/autoclass rather than debian/tmp. + + -- James R.
Van Zandt Sat, 24 Sep 2005 19:29:55 -0400 + +autoclass (3.3.4-1) unstable; urgency=low + + * New upstream release (closes:Bug#262242) + + * debian/rules: don't copy sample/imports-85c.search into + /usr/share/doc/autoclass/examples, because it's no longer included in + the upstream sources. (It's generated when the example is executed.) + Don't bother running obsolete program dh_suidregister. Update to + policy 3.6.1. + + * autoclass.doc-base.autoclass-{results,theory}: PostScript + documentation given clearer document IDs and moved from + Apps/Programming to Math. + + * debian/{{pre,post}{inst,rm},watch}.ex: delete unused examples + + -- James R. Van Zandt Sun, 19 Sep 2004 20:42:29 -0400 + +autoclass (3.3.3-5) unstable; urgency=low + + * debian/autoclass.1: Fix typo "were" -> "where". + * new maintainer email + * io-results-bin.c:597: update first_clsf->models (thanks to Nick + Mitchell , fixes:Bug#65522) + + -- James R. Van Zandt Sun, 9 Dec 2001 14:31:46 -0500 + +autoclass (3.3.3-4) unstable; urgency=low + + * debian/control: move build-depends line to source section. Update to + policy version 3.5.2. + * debian/rules: support DEB_BUILD_OPTIONS + + -- James R. Van Zandt Tue, 22 May 2001 21:43:17 -0400 + +autoclass (3.3.3-3) unstable; urgency=low + + * simple.README: Check for writable directory. Display 95% contours of + classes found by AutoClass. + * simple.c, simple.db2: increase y values by 4 to remove the symmetry. + + -- James R. Van Zandt Wed, 10 Jan 2001 22:54:56 -0500 + +autoclass (3.3.3-2) unstable; urgency=low + + * build-depends on debhelper (Closes:Bug#70323) + + -- James R. Van Zandt Sun, 3 Sep 2000 13:48:58 -0400 + +autoclass (3.3.3-1) unstable; urgency=low + + * New upstream release + * debian/rules: fix commands to assemble single changelog from separate + upstream release notes. + * simple.README: Uncompress data files if necessary. Stop if any + command fails. + + -- James R. 
Van Zandt Mon, 19 Jun 2000 17:15:53 -0400 + +autoclass (3.3.2-3) unstable; urgency=low + + * Add "see also: multimix" reference to man page. + * In man page, fix pointer to documentation files. + * Extend man page to include input file formats. + * Add data files and script for simple example. + * Update policy to 3.1.1 + + -- James R. Van Zandt Mon, 24 Jan 2000 21:15:22 -0500 + +autoclass (3.3.2-2) unstable; urgency=low + + * Amplify man page: searching, checkpoints, predictions. + + -- James R. Van Zandt Sat, 30 Oct 1999 12:31:37 -0400 + +autoclass (3.3.2-1) unstable; urgency=low + + * Initial Release. + * Remove fcntlcom.h, because it's copyright AT&T. + * In prog/utils.c, include fcntl.h rather than fcntlcom.h. + * Add man page. + * In sample/imports-85c.s-params, comment out the line + "force_new_search_p = false". The .results-bin file is not included, + so we have to start a new search. + + -- James R. Van Zandt Sat, 11 Sep 1999 15:43:11 -0400 + + --- autoclass-3.3.6.dfsg.1.orig/debian/compat +++ autoclass-3.3.6.dfsg.1/debian/compat @@ -0,0 +1 @@ +7 --- autoclass-3.3.6.dfsg.1.orig/debian/control +++ autoclass-3.3.6.dfsg.1/debian/control @@ -0,0 +1,22 @@ +Source: autoclass +Section: math +Priority: optional +Maintainer: Debian QA Group +Build-Depends: debhelper (>= 7) +Standards-Version: 3.9.2 +Homepage: http://ti.arc.nasa.gov/tech/rse/synthesis-projects-applications/autoclass/ + +Package: autoclass +Architecture: any +Depends: ${shlibs:Depends},${misc:Depends} +Description: automatic classification or clustering + AutoClass solves the problem of automatic discovery of classes in data + (sometimes called clustering, or unsupervised learning), as distinct + from the generation of class descriptions from labeled examples + (called supervised learning). It aims to discover the "natural" + classes in the data. AutoClass is applicable to observations of + things that can be described by a set of attributes, without referring + to other things. 
The data values corresponding to each attribute are + limited to be either numbers or the elements of a fixed set of + symbols. With numeric data, a measurement error must be provided. + --- autoclass-3.3.6.dfsg.1.orig/debian/copyright +++ autoclass-3.3.6.dfsg.1/debian/copyright @@ -0,0 +1,36 @@ +This package was debianized by James R. Van Zandt on +Fri, 10 Sep 1999 22:07:41 -0400. + +It was received by email from the upstream maintainer: + Will Taylor + +A slightly earlier version is available from: + http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/ + +The following files were removed from the tarball, as their preferred form +of modification is missing: +doc/tr-fia-90-12-7-01.ps +doc/kdd-95.ps + + +Upstream Authors: + + Dr. Peter Cheeseman + Principal Investigator - NASA Ames, Computational Sciences Division + cheesem@ptolemy.arc.nasa.gov + + John Stutz + Research Programmer - NASA Ames, Computational Sciences Division + stutz@ptolemy.arc.nasa.gov + + Will Taylor + Support Programmer - NASA Ames, Computational Sciences Division + taylor@ptolemy.arc.nasa.gov + +Copyright: + + AutoClass has been placed in the public domain. + +One file in the upstream sources, prog/fcntlcom-ac.h, contained an +AT&T copyright statement. That file is not necessary for Linux, and +has been omitted from the Debian sources. --- autoclass-3.3.6.dfsg.1.orig/debian/dirs +++ autoclass-3.3.6.dfsg.1/debian/dirs @@ -0,0 +1 @@ +usr/bin --- autoclass-3.3.6.dfsg.1.orig/debian/docpatch +++ autoclass-3.3.6.dfsg.1/debian/docpatch @@ -0,0 +1,43 @@ +diff -rauN ../orig/autoclass-3.3.4/debian/autoclass.1 ./autoclass-3.3.4/debian/autoclass.1 +--- ../orig/autoclass-3.3.4/debian/autoclass.1 2006-01-25 15:00:34.000000000 +0100 ++++ ./autoclass-3.3.4/debian/autoclass.1 2006-01-25 15:03:50.000000000 +0100 +@@ -764,7 +764,7 @@ + restart run. This list should bracket your expected number of + classes, and by a wide margin!
+ "start_j_list = -999" specifies an empty list (allowed only on restarts) +-.IP "\fBn_classes_fn_type\fP = "random_ln_normal"" ++.IP "\fBn_classes_fn_type\fP = ""random_ln_normal""" + Once \fBstart_j_list\fP is exhausted, \fBAutoClass\fP will call this + function to decide how many classes to start with on the next try, + based on the 10 best classifications found so far. Currently only +@@ -808,7 +808,7 @@ + Specifies the maximum number of classifications stored internally. + .IP "\fBn_final_summary\fP = 10" + Specifies the number of trials to be printed out after search ends. +-.IP "\fBstart_fn_type\fP = "random"" ++.IP "\fBstart_fn_type\fP = ""random""" + One of {"random", "block"}. This specifies the type of class + initialization. For normal search, use "random", which randomly + selects instances to be initial class means, and adds appropriate +@@ -816,7 +816,7 @@ + partitions the database into successive blocks of near equal size. + For repeatable results, also see \fBforce_new_search_p\fP, + \fBmin_report_period\fP, and \fBrandomize_random_p\fP. +-.IP "\fBtry_fn_type\fP = "converge_search_3"" ++.IP "\fBtry_fn_type\fP = ""converge_search_3""" + One of {"converge_search_3", "converge_search_4", "converge"}. + These specify alternate search stopping criteria. + "converge" merely tests the rate of change of the log_marginal +@@ -912,7 +912,7 @@ + .IP "\fBcheckpoint_p\fP = false" + If true, checkpoints of the current classification will be written + every "min_checkpoint_period" seconds, with file extension +-.chkpt[-bin]. This is only useful for very large classifications ++\&.chkpt[-bin]. 
This is only useful for very large classifications + .IP "\fBmin_checkpoint_period\fP = 10800" + If checkpoint_p = true, the checkpointed classification will be + written this often - in seconds (default = 3 hours) + +--k1lZvvs/B4yU6o8G-- + + --- autoclass-3.3.6.dfsg.1.orig/debian/rules +++ autoclass-3.3.6.dfsg.1/debian/rules @@ -0,0 +1,141 @@ +#!/usr/bin/make -f +#-*- makefile -*- +# Made with the aid of dh_make, by Craig Small +# Sample debian/rules that uses debhelper. GNU copyright 1997 by Joey Hess. +# Some lines taken from debmake, by Christoph Lameter. + +# build without debugging symbols and strip executables when installing, +# unless DEB_BUILD_OPTIONS specifies otherwise + +# Note: when changing the debhelper compatibility level, also update +# the dependency in debian/control (e.g. "debhelper (>= 7)") + +CFLAGS = -Wall -g +INSTALL = install +INSTALL_FILE = $(INSTALL) -p -o root -g root -m 644 +INSTALL_PROGRAM = $(INSTALL) -p -o root -g root -m 755 +INSTALL_SCRIPT = $(INSTALL) -p -o root -g root -m 755 +INSTALL_DIR = $(INSTALL) -p -d -o root -g root -m 755 + +ifneq (,$(findstring noopt,$(DEB_BUILD_OPTIONS))) +CFLAGS += -O0 +else +CFLAGS += -O2 +endif +ifeq (,$(findstring nostrip,$(DEB_BUILD_OPTIONS))) +INSTALL_PROGRAM += -s +endif +ifneq (,$(filter parallel=%,$(DEB_BUILD_OPTIONS))) + NUMJOBS = $(patsubst parallel=%,%,$(filter parallel=%,$(DEB_BUILD_OPTIONS))) + MAKEFLAGS += -j$(NUMJOBS) +endif + +DOC= doc/checkpoint-c.text \ + doc/classes-c.text \ + doc/interpretation-c.text \ + doc/introduction-c.text \ + doc/models-c.text \ + doc/prediction-c.text \ + doc/preparation-c.text \ + doc/reports-c.text \ + doc/search-c.text \ + read-me.text + +EXAMPLES= sample/imports-* \ + sample/read.me.c \ + sample/screenc.text \ + sample/scriptc.text \ + debian/README.sample \ + debian/simple.README \ + debian/simple.c \ + debian/simple.db2 \ + debian/simple.hd2 \ + debian/simple.model \ + debian/simple.r-params \ + debian/simple.s-params + +# Uncomment this to turn on 
verbose mode. +#export DH_VERBOSE=1 + + +build: build-arch build-indep +build-arch: build-stamp +build-indep: build-stamp +build-stamp: + dh_testdir + + # Add here commands to compile the package. + (cd prog; $(MAKE) $(MAKEFLAGS)) + + # assemble changelog in reverse chronological order + -rm changelog + for a in 9 8 7 6 5 4 3 2 1; do \ + for b in 9 8 7 6 5 4 3 2 1 0; do \ + for c in 9 8 7 6 5 4 3 2 1; do \ + if [ -f version-$$a-$$b-$$c.text ]; then \ + cat version-$$a-$$b-$$c.text >>changelog; fi; \ + done; \ + if [ -f version-$$a-$$b.text ]; then \ + cat version-$$a-$$b.text >>changelog; fi; \ + done; \ + done + + touch build-stamp + +clean: + dh_testdir + dh_testroot + rm -f build-stamp install-stamp + + # Add here commands to clean up after the build process. + rm -f doc/*.pdf prog/autoclass prog/*.o + + dh_clean + +install: install-stamp +install-stamp: build-stamp + dh_testdir + dh_testroot + dh_prep + dh_installdirs + + # Add here commands to install the package into debian/autoclass. + $(INSTALL_PROGRAM) prog/autoclass debian/autoclass/usr/bin/autoclass + + touch install-stamp + +# Build architecture-independent files here. +binary-indep: build install +# We have nothing to do by default. + +# Build architecture-dependent files here. 
+binary-arch: build install + rm -f debian/postinst.debhelper debian/prerm.debhelper +# dh_testversion + dh_testdir + dh_testroot + dh_installdocs $(DOC) + + dh_installexamples $(EXAMPLES) + +# dh_installmenu +# dh_installemacsen +# dh_installinit +# dh_installcron + dh_installman debian/autoclass.1 +# dh_undocumented + dh_installchangelogs changelog + dh_link + dh_strip + dh_compress -X.pdf + dh_fixperms +# dh_makeshlibs + dh_installdeb +# dh_perl + dh_shlibdeps + dh_gencontrol + dh_md5sums + dh_builddeb + +binary: binary-indep binary-arch +.PHONY: build clean binary-indep binary-arch binary --- autoclass-3.3.6.dfsg.1.orig/debian/simple.README +++ autoclass-3.3.6.dfsg.1/debian/simple.README @@ -0,0 +1,69 @@ +#!/bin/sh -e +# use autoclass to separate observations into two groups + +if [ ! -f simple.db2 ]; then gunzip *.gz; fi + +# is this directory writable? +if [ ! -w . ]; then + echo "permission denied." + echo 'please copy simple* to a directory where you have write permission' + exit 1 +fi + +# to see how the data file simple.db2 was created, see simple.c + +# output files must not already exist +rm -f simple.results *data-1 *text-1 + +# use autoclass to find the groups +autoclass -search simple.db2 simple.hd2 simple.model simple.s-params + +# ask autoclass to generate reports +autoclass -reports simple.results simple.search simple.r-params + +# copy the observations into two files, +# depending on the group assigned by autoclass +awk ' +/^[0-9]/{ + group_number=$2; + getline data < "simple.db2"; + print data > "group" group_number; +}' simple.case-data-1 + +# display the original data (under X windows) +gnuplot -persist < +#include <stdio.h> +#include <stdlib.h> + +double gauss(void); + +/* return approximately normally distributed random number (zero mean, + unit variance), formed as the sum of 12 uniform variates minus 6 */ +double gauss(void) +{ int i; + double sum=0; + for (i = 0; i < 12; i++) sum += rand(); + return sum/RAND_MAX - 6.; +} + +int main() +{ + int threshold = RAND_MAX/3; + int i; + double x, y; + + for (i = 0; i < 1000; i++) + { + if
(rand() < threshold) + { + x = gauss(); + y = gauss() + 4; + } + else + { + x = gauss()*3 + 5; + y = gauss()*3 + 9; + } + printf("%6.5f %6.5f\n", x, y); + } +} --- autoclass-3.3.6.dfsg.1.orig/debian/simple.db2 +++ autoclass-3.3.6.dfsg.1/debian/simple.db2 @@ -0,0 +1,1000 @@ +6.47411 8.52204 +6.03408 12.10169 +1.69021 11.53269 +6.41825 15.81013 +6.88733 6.62744 +3.03462 9.23578 +0.56917 8.68020 +0.30626 6.37398 +8.84620 6.45313 +4.57104 8.03925 +6.99860 6.31329 +0.23262 5.69547 +1.36513 3.57888 +0.82512 2.28284 +6.13593 8.66683 +4.49422 5.58086 +0.24895 3.10398 +4.07617 5.82475 +6.18622 10.21199 +-2.24980 9.84061 +10.01128 17.74503 +10.01036 9.39770 +-1.43424 4.19985 +2.48168 5.31511 +5.73595 8.54930 +0.39210 4.76598 +2.61921 4.29809 +-0.13925 4.76851 +-0.29649 9.69812 +3.57019 8.09115 +0.64643 3.21637 +2.49967 12.26122 +0.15038 3.63772 +4.78507 2.28683 +0.26773 4.96968 +2.12912 3.74354 +-0.55426 5.40707 +-0.47952 3.64264 +3.59349 5.08915 +7.01792 10.10183 +2.21268 7.31724 +6.09519 7.48424 +2.50726 12.52909 +0.68629 4.71384 +3.43719 9.61118 +7.24799 7.76748 +0.77387 1.49601 +0.42292 3.57298 +11.11190 10.65776 +0.14411 3.39466 +5.40423 7.05829 +0.09255 2.61280 +-0.11900 4.16425 +1.53801 10.19207 +-1.62041 7.13972 +1.43796 3.30501 +0.14051 3.93709 +8.29858 8.39243 +4.77123 5.57193 +0.06633 3.18669 +7.52160 10.87793 +5.75859 9.13055 +0.68408 3.29092 +1.98665 7.82992 +6.52701 6.52950 +2.07815 4.53110 +7.32296 8.10009 +5.36525 12.78335 +-1.67320 3.58650 +0.54977 4.57017 +2.64186 5.56179 +-1.08834 8.68426 +-0.43986 4.43387 +3.64356 7.45999 +7.72428 10.60420 +-1.04578 5.42338 +-1.46808 3.97641 +3.85830 11.64139 +-1.62587 4.35815 +7.93687 5.61335 +-0.90977 4.01904 +4.70511 7.34670 +5.96141 5.88889 +6.95314 13.59319 +-0.87194 2.06148 +6.63372 8.30599 +5.39415 4.87905 +-1.00848 5.20294 +0.90222 4.61660 +3.29879 10.94262 +4.57913 8.57852 +9.91470 9.65970 +5.50480 10.57314 +4.53351 5.66922 +8.21124 6.57352 +-0.75995 4.22220 +5.40859 9.00154 +6.61520 11.73744 +3.83264 12.35285 
+5.93636 7.32511 +4.86437 15.24050 +2.25077 10.55843 +-0.72045 4.41851 +3.97083 10.31395 +7.29916 6.08941 +6.31965 5.74379 +0.37006 3.61336 +0.74674 3.49445 +0.98832 8.09421 +1.62607 9.74020 +-0.68183 4.57320 +2.78784 10.99358 +1.38889 6.01387 +1.72964 4.60113 +5.57646 6.63948 +5.87227 6.66568 +0.62652 3.99227 +1.75981 12.82710 +3.85248 9.80220 +7.06130 13.15199 +-0.15854 2.98787 +4.77291 9.35392 +0.19132 4.07762 +2.52296 7.20886 +0.22038 4.43633 +8.42402 8.26641 +13.57175 13.51009 +6.28140 8.17800 +3.70933 12.14535 +1.32815 6.77294 +-0.75262 3.03546 +7.01132 8.92231 +1.36460 5.35233 +2.22751 8.30193 +-1.71300 11.77279 +2.20668 12.03863 +3.17074 9.89931 +7.86772 7.93197 +6.65042 11.76865 +0.35002 5.77554 +11.32330 6.03909 +12.77335 4.85357 +-1.85305 3.97239 +7.46470 9.13659 +0.96616 6.20164 +6.08306 12.07571 +7.16448 5.21365 +1.42140 3.22567 +3.91549 7.99391 +1.74227 7.47739 +0.88088 4.82143 +-1.82184 1.89272 +5.55561 8.31860 +4.93874 11.35378 +6.89592 13.78367 +8.23971 10.81535 +4.68516 3.55492 +3.99120 5.99647 +5.75342 8.49393 +4.16792 16.04939 +0.16654 10.01340 +10.24651 7.80585 +1.47858 10.55855 +1.84640 7.96620 +-0.68239 2.89951 +5.87707 4.90888 +0.32335 5.50398 +7.80134 11.49392 +-1.34352 5.23054 +0.58204 5.05323 +0.20920 4.98051 +5.79682 3.78756 +6.66276 12.50818 +3.61687 9.65659 +-1.10711 4.85312 +-0.00056 4.59377 +3.09309 11.76583 +-0.14830 3.48858 +-0.85244 2.71464 +0.10465 4.90049 +0.84144 4.92130 +5.10662 9.40113 +5.65690 12.12388 +0.60651 4.59373 +1.06709 3.86319 +5.26707 11.76285 +4.87034 11.71785 +1.81605 12.04935 +-0.00133 12.10707 +1.40559 2.89801 +1.65050 2.47420 +-0.09660 2.59164 +-1.92786 6.10468 +3.80592 9.42370 +0.45220 2.31552 +-1.46272 4.62638 +3.24761 11.11989 +0.33204 5.55187 +2.56618 9.87757 +2.36006 7.16720 +-0.87875 4.80005 +6.52687 6.53333 +-0.13762 3.81592 +7.49136 10.33138 +5.76797 10.03605 +10.07184 10.92285 +-0.17405 5.76490 +4.96755 10.45922 +-0.68890 4.60228 +1.60724 4.62581 +-0.77160 3.71831 +5.63572 11.85699 +4.94107 10.32617 
+8.76414 16.91912 +2.75477 8.81621 +-0.68522 4.68729 +-0.74937 4.87403 +4.33894 9.53236 +2.62042 3.96817 +-1.54588 5.75261 +8.63603 10.27094 +2.63635 16.45880 +6.64864 10.00027 +1.70891 9.31286 +8.03731 8.07862 +-0.59625 0.97494 +4.80487 13.49042 +0.85944 2.47651 +9.90420 11.22644 +7.10534 9.68209 +1.05759 7.33198 +5.64499 5.36882 +1.84376 2.70074 +3.03358 7.44989 +9.33808 12.95726 +4.87733 7.37083 +7.73317 8.47950 +5.21599 7.78141 +6.93616 7.55084 +-1.48890 3.51946 +1.61360 10.25194 +3.06914 6.43474 +-1.47134 6.10114 +0.42517 3.76606 +4.14937 10.38817 +-0.10815 3.30962 +7.04167 7.78009 +1.24850 8.64532 +-0.24527 4.25132 +0.68793 4.21304 +2.10560 7.51874 +2.50379 9.05400 +5.12609 8.84190 +2.96779 7.40688 +-0.67228 8.39180 +1.39454 8.51198 +1.30468 9.62950 +3.00464 9.47602 +2.42754 7.42288 +0.71331 4.44906 +-0.86466 8.79293 +1.24634 4.21503 +3.83550 5.05189 +0.38364 9.20138 +6.93085 6.32325 +4.04396 3.35066 +8.54944 7.60672 +2.64277 9.66549 +7.73388 4.40003 +2.98021 7.37592 +8.71465 12.90225 +-2.17667 4.22985 +11.20003 10.36956 +1.38446 8.59197 +7.09172 11.63972 +3.50158 8.18273 +6.67540 13.41419 +3.80722 10.64674 +4.29616 10.36679 +0.05656 2.87273 +0.26152 5.09858 +5.55108 10.46783 +1.25682 11.43840 +3.65723 6.33441 +0.85274 9.11738 +-1.45642 6.33397 +0.96104 2.68734 +1.67908 5.24691 +0.72219 3.82771 +11.93373 10.53648 +1.63131 4.82731 +9.93589 8.51839 +-2.17333 3.61118 +1.21456 3.14538 +8.25687 6.72309 +-0.66795 3.94966 +3.98170 3.01459 +7.88310 9.20950 +3.32671 6.04499 +0.74158 3.48540 +5.01209 7.73970 +8.60361 6.95117 +2.37944 8.89171 +1.35908 2.34799 +8.13654 8.03203 +6.48301 6.43598 +-0.51226 1.89774 +5.14129 14.59357 +4.80838 3.75899 +3.16084 11.70521 +7.27579 13.35052 +4.23683 11.87940 +1.37238 3.83117 +-0.80396 13.17319 +-0.43592 4.84612 +3.94187 10.95570 +2.61504 4.81629 +4.41527 6.17994 +1.01003 16.86741 +-0.03100 4.47419 +4.35329 7.15006 +-0.80574 4.49108 +3.95757 12.30967 +-0.30324 2.62706 +6.49703 8.49309 +0.24149 3.91828 +3.33027 10.87280 +7.17859 
10.56901 +2.35120 8.47723 +3.06392 8.41029 +5.03082 10.46895 +5.44941 10.01710 +0.40060 4.67584 +7.23482 11.08575 +7.33034 6.74993 +-0.41767 4.57349 +7.88303 9.24298 +-0.13126 5.97880 +0.56563 2.86604 +-0.24280 5.11248 +6.53261 8.16773 +-0.25994 2.66866 +0.44307 3.90338 +-1.62040 4.06410 +1.38336 14.47980 +0.77628 8.98144 +1.98415 5.35914 +-0.01117 2.88619 +2.29893 8.22343 +5.17814 7.62690 +-0.00520 4.39892 +6.58446 10.92309 +0.70497 3.08535 +5.08309 6.31064 +5.17358 9.20404 +1.56854 8.71432 +3.93654 2.82252 +3.69357 9.39769 +-0.09058 11.93809 +4.22650 8.35002 +-0.18868 4.70066 +0.64229 4.71213 +9.33914 8.13266 +2.90307 6.88872 +0.59532 3.41678 +1.00127 10.46502 +5.75712 8.40285 +0.35956 4.34263 +0.61434 9.93859 +2.20429 9.35508 +8.90771 4.90078 +13.79397 7.58955 +4.67257 14.37567 +5.28259 11.65167 +-0.20671 2.50827 +1.25101 2.74505 +2.64402 7.43420 +-2.25650 7.70927 +-0.23931 3.03651 +4.60985 14.90344 +0.62454 3.96345 +-0.22059 2.32191 +-2.19678 2.94374 +7.53457 5.71555 +0.53261 4.73960 +0.23032 2.93985 +2.48980 9.01890 +0.09544 3.67889 +5.06591 11.19512 +-0.48777 4.53258 +6.89973 6.16297 +0.34223 3.93214 +8.28918 10.06558 +2.34690 12.84834 +0.64021 8.70901 +3.83892 9.38960 +-1.23427 4.77723 +5.80469 4.55691 +4.09466 6.63581 +5.88956 11.54856 +0.11405 4.14978 +4.40090 7.02253 +-0.35999 3.90362 +-1.48020 8.50269 +0.20707 3.36630 +-0.74197 4.91508 +0.67459 3.64103 +8.15089 5.97601 +0.38955 3.50635 +1.95588 7.58896 +1.10886 5.76493 +2.29034 12.78107 +6.76764 9.58886 +-0.37106 4.03718 +5.31272 7.26390 +8.13808 5.44523 +6.23281 9.26336 +2.85999 10.59245 +5.76208 11.62269 +4.41000 9.12709 +5.16985 7.94348 +4.44808 14.19930 +5.50234 17.83445 +-0.81646 3.87765 +7.12524 9.55952 +5.81061 2.41337 +0.51187 3.60414 +5.90552 12.09094 +6.73457 7.01848 +4.67349 11.80596 +4.15686 7.38553 +5.43490 11.06468 +6.75528 13.48452 +-1.21990 3.74322 +8.96175 4.07581 +2.43102 0.70900 +3.60980 7.42365 +6.75090 9.46826 +9.22618 10.02239 +2.60045 13.79525 +3.09993 6.87375 +6.43897 6.04818 
+-0.33285 4.58367 +7.44161 8.21361 +0.23681 5.13510 +0.39275 4.90460 +7.03667 7.78028 +5.00834 9.38794 +-1.35991 2.96050 +6.74471 13.12901 +-0.49870 5.43879 +6.39673 7.25373 +0.70790 8.04538 +7.22531 6.25917 +6.41367 9.21234 +2.09860 4.35900 +6.20645 8.74587 +-0.71686 4.04390 +9.46234 8.32545 +0.07649 3.14036 +6.40917 12.67687 +3.27905 3.40478 +8.55889 4.74628 +3.08580 9.78278 +0.28099 4.21411 +-0.35010 8.99756 +6.43670 12.28036 +5.28047 2.32343 +0.12974 4.71407 +-1.14769 5.49419 +-0.29193 3.13208 +4.58976 1.70683 +0.94110 7.06626 +5.20711 12.51439 +-0.27869 3.65818 +5.18032 10.78652 +4.24458 10.87536 +-1.27305 3.76184 +0.93124 2.86309 +0.29142 3.78072 +-0.05320 3.49546 +3.34635 5.34577 +7.05656 6.19792 +0.38457 4.03908 +7.96114 12.21005 +3.56720 9.45908 +-0.07138 4.22571 +-0.44445 2.93883 +4.76478 9.39184 +3.01662 4.90551 +5.44175 9.37556 +3.29412 8.88431 +2.56890 10.20768 +6.57119 4.64936 +-1.34251 4.23964 +-0.75873 4.66831 +0.98584 3.50943 +3.79762 7.94984 +-0.66584 9.01689 +5.60303 9.81525 +4.06830 5.24926 +7.28833 5.67612 +10.22459 9.81677 +5.91226 6.75738 +7.62045 8.85179 +10.76546 4.98741 +5.55987 7.03891 +6.65649 5.76636 +0.59935 5.41545 +1.14014 4.24986 +8.18679 13.92826 +1.44450 7.51526 +4.66329 7.45037 +-0.08336 10.75524 +-1.11631 5.04979 +4.58722 11.39882 +0.59811 4.48033 +-0.18134 4.56665 +9.71454 9.12064 +-1.63814 3.50913 +6.56394 9.82903 +3.90881 7.84447 +0.63036 5.12403 +4.10738 10.20902 +3.09520 4.36193 +0.28923 4.06191 +0.11154 6.97493 +0.75240 7.14512 +0.76672 4.06281 +4.53462 10.23158 +1.41739 4.02112 +7.63820 16.81839 +-1.89670 4.07849 +9.62214 9.49660 +-1.74556 4.37050 +1.15496 3.78955 +-0.14183 3.04982 +4.17186 7.36768 +-0.45599 3.48096 +4.51331 12.88419 +1.01569 4.35072 +1.28971 6.41647 +9.63775 13.83156 +6.67495 14.47251 +-1.25674 8.55259 +0.63931 4.83431 +3.69145 13.58494 +7.94792 10.40695 +0.40616 5.71479 +0.17917 8.27295 +4.16912 8.99012 +-0.24403 3.58635 +1.25605 3.81315 +-2.31551 11.04443 +6.29139 12.93473 +7.43130 7.55005 +5.84836 
10.62618 +5.37591 6.50566 +3.19883 11.39101 +8.10192 14.83846 +7.43682 12.93534 +0.52525 5.30390 +9.48481 8.11530 +3.80394 3.80465 +-0.13932 3.91362 +7.18204 7.48858 +-0.75523 3.44702 +6.19971 7.35681 +-0.17006 3.88530 +2.68220 9.50673 +3.27987 6.29507 +9.88010 7.23969 +-1.04323 4.63606 +3.30582 7.32952 +10.82849 8.86868 +-0.76694 4.18079 +7.25683 9.04118 +6.40660 10.52814 +5.94362 10.55762 +2.13152 11.53492 +4.48419 14.36706 +-0.46903 6.02494 +8.88524 5.25895 +1.63891 2.79159 +7.97549 10.34282 +2.57088 6.85595 +2.07599 6.07788 +8.76355 12.74615 +5.10479 3.54089 +5.57808 12.65889 +0.02410 5.43370 +2.51556 7.80967 +3.94780 9.03348 +0.40143 4.51826 +4.49798 6.81030 +2.88924 10.62941 +6.26598 7.26701 +8.05218 10.83184 +6.86997 6.91292 +4.29549 9.01034 +2.81083 7.44301 +3.47525 9.25995 +5.11569 6.08596 +7.05239 8.69939 +-0.85067 3.79840 +-0.32460 3.16702 +5.12627 6.00698 +6.51097 5.32656 +5.59219 10.53330 +-0.18722 2.78644 +-2.57476 2.84465 +4.82221 6.39076 +1.60079 3.54828 +8.00480 8.81632 +4.30050 8.66000 +2.65497 4.92818 +3.76928 5.95663 +2.85799 13.50438 +7.89786 5.40029 +1.35772 4.49526 +2.43432 10.16035 +2.91737 10.20505 +-0.97770 4.46857 +6.97272 4.96341 +5.52989 9.98050 +3.57283 8.09034 +3.29525 3.96848 +-0.98713 5.44706 +-0.61774 2.94263 +10.72760 14.58675 +-1.21053 8.63321 +6.77749 14.71184 +4.79161 8.53344 +-0.95856 8.21388 +0.89599 1.96625 +7.37240 7.67318 +0.43894 2.89736 +4.75648 5.23077 +7.48337 11.06176 +5.37359 10.17689 +6.65165 6.19662 +0.45743 4.21188 +3.79156 13.90569 +7.61849 9.68925 +-0.17249 2.92461 +4.68634 6.81937 +-1.85645 2.45728 +2.89330 13.43623 +5.25998 10.81487 +-1.47000 3.06622 +7.89669 4.10882 +1.61235 10.87126 +3.03933 4.71637 +4.74883 7.24337 +4.22654 12.01462 +5.79531 8.25689 +9.06890 4.18197 +6.35887 10.81708 +2.43573 10.35323 +7.57732 6.72256 +2.59566 6.08585 +0.15717 4.37175 +-0.11123 2.49226 +4.52298 10.89616 +-0.14559 9.79119 +-2.36138 3.78428 +-1.62188 3.82054 +6.85333 12.01751 +8.80277 7.46693 +0.81142 4.64299 +-0.73006 
1.73189 +6.66033 5.05456 +4.32772 6.99601 +6.66086 12.20415 +4.56762 15.64574 +-1.22026 4.66821 +7.51775 7.33567 +-0.20220 3.85987 +-1.34261 3.77671 +6.64169 12.46195 +1.87754 8.79444 +5.86020 12.04689 +5.48065 6.83372 +0.23865 2.98051 +5.75589 6.02356 +3.77214 10.24732 +7.11973 13.21560 +8.76286 9.14646 +9.94878 8.61740 +-0.64547 3.57993 +-0.93080 9.15835 +5.30224 8.16297 +9.90909 7.26340 +2.38310 14.89811 +8.34123 8.90013 +0.71201 2.95040 +5.54602 4.93946 +0.16308 5.41759 +6.79658 12.02757 +5.93504 8.07721 +8.27176 10.83698 +8.24985 7.69036 +3.68341 4.49405 +5.92516 5.10385 +4.33672 9.34005 +4.04560 8.94778 +8.89449 13.53490 +1.65552 3.14224 +3.55185 4.21463 +-0.81517 3.60054 +5.95069 8.93766 +10.38590 8.23538 +0.22213 2.37814 +0.47315 3.64978 +-0.00279 2.78098 +4.28349 7.22368 +4.12181 9.62910 +0.04694 4.28649 +6.74817 6.36595 +0.41686 5.96614 +4.19620 11.80183 +2.04194 10.42122 +2.89339 4.08967 +5.43327 3.29510 +0.30804 14.81484 +-0.19324 10.69779 +-1.27903 3.78722 +1.25485 4.55173 +5.40043 4.74170 +4.92058 12.50586 +0.05954 4.49407 +0.07578 4.90648 +7.86481 13.45597 +-0.63451 5.52631 +-2.14971 11.19505 +0.40626 5.93345 +-1.59314 10.16276 +1.18898 4.36918 +7.68223 8.12178 +1.13650 12.26733 +-1.28774 17.48527 +9.19048 7.26835 +-0.12545 4.08947 +1.03988 4.15302 +8.64929 9.81836 +6.57674 7.62938 +-0.64894 1.54311 +3.96128 6.39132 +0.43242 7.54817 +0.47048 4.72719 +2.27834 6.99789 +-0.15000 4.21626 +2.06641 8.06694 +11.54021 8.61944 +2.61126 11.49683 +-1.17295 2.81183 +8.01204 3.65322 +11.70486 7.40255 +7.10948 9.82639 +4.97867 13.09519 +-0.87292 6.14600 +-0.17121 3.09945 +-0.15741 4.14823 +0.38712 4.62347 +1.36056 4.03969 +3.14980 13.57802 +4.81428 8.97893 +6.74483 9.11730 +-1.57327 9.78392 +2.58225 8.79425 +10.75156 6.63278 +5.13416 8.10872 +0.84871 3.29551 +5.48112 12.06046 +-3.57230 13.67344 +-0.74780 2.23625 +4.20331 9.15107 +-0.70474 4.24888 +6.42027 2.98495 +7.87599 7.73598 +4.75454 5.30726 +0.53584 5.55693 +3.77514 7.55889 +2.70380 7.56259 +-1.05790 3.73967 
+0.94426 9.65921 +-0.56050 4.33000 +5.52898 8.26188 +4.55742 -1.66959 +0.53560 4.89989 +8.61121 11.03502 +6.30260 5.13213 +3.92367 8.57877 +-0.75412 3.23816 +1.59679 4.49707 +2.90023 8.83994 +0.52841 5.19996 +3.26733 4.12801 +6.76619 14.76830 +-0.91896 6.44617 +4.41154 5.05818 +0.80980 5.04548 +0.14401 4.06338 +7.00382 9.37655 +-1.07623 2.76855 +0.57576 3.87306 +6.39529 9.46449 +-0.95918 5.16779 +4.23371 11.04769 +1.60419 3.49952 +-1.52756 5.24436 +-0.60141 4.10569 +6.03134 6.29655 +-0.93922 3.25762 +7.83627 10.45393 +4.16566 4.22947 +3.61933 6.88059 +-1.29191 2.54421 +-0.21138 3.16483 +-3.62172 7.08569 +3.64628 7.23993 +3.14326 7.17790 +4.28782 7.18677 +0.53912 3.24671 +0.26669 9.30748 +6.21176 10.47613 +5.38000 11.56842 +8.76613 11.71450 +0.36446 3.05402 +4.81059 11.50578 +2.76235 12.15753 +1.46249 12.25062 +7.37018 7.57409 +0.96676 6.40137 +6.84457 13.48688 +10.11012 8.85568 +0.44665 4.26260 +2.53925 7.61379 +7.15952 7.90175 +3.87893 6.95884 +-0.50341 5.68287 +1.03093 3.72579 +4.45990 10.97449 +4.60154 9.03428 +5.91755 12.96538 +0.08137 2.65512 +4.16764 7.53963 +10.60285 3.23072 +4.30514 12.21157 +7.23955 10.45595 +4.46414 9.99226 +3.62994 14.28462 +0.88101 3.32273 +3.25380 3.19327 +-1.06579 3.12892 +0.91007 3.38739 +11.81991 13.83720 +0.80387 9.29426 +1.26277 3.36279 +0.78554 3.99323 +1.55970 3.77415 +0.89968 6.10849 +1.15990 3.10282 +2.83425 9.64892 +4.49776 9.95772 +7.50072 12.80876 +0.36654 3.48398 +2.61200 3.81892 +6.15587 10.37147 +1.41964 5.43694 +-1.08654 3.90689 +2.02467 4.58960 +11.17210 10.27321 +4.22220 9.42267 +-0.71006 4.28869 +6.18581 9.10360 +0.77580 4.11935 +6.32222 6.27479 +0.14326 2.83728 +-0.49358 5.89144 +-0.01815 2.37280 +0.16423 4.50118 +3.91059 6.51333 +6.44999 4.98656 +1.18299 4.26762 +1.98288 4.48710 +7.16928 7.66772 +6.09560 7.72628 +-0.74467 4.91406 +-4.18353 10.80272 +5.32021 13.75570 +7.38546 1.36276 +6.41250 9.71887 +0.97320 15.35476 +-1.58659 3.38463 +9.68611 11.82949 +3.43329 11.85007 +2.67656 4.04667 +6.46654 12.26304 +7.14134 
5.64641 +6.42067 12.94545 +0.85614 3.90645 +0.92659 3.96967 +-0.72385 3.21534 +1.79552 5.45134 +5.88220 8.45948 +5.33427 9.56704 +4.32692 8.91593 +0.16273 5.38764 +0.04957 3.90401 +2.46473 8.66324 +-0.72739 4.74743 +9.30428 10.59393 +2.86438 9.42686 +-0.55061 3.81300 +5.04912 4.45385 +0.91639 7.11401 +-1.44679 2.69977 +12.47756 10.34016 +1.06786 4.04211 +7.63307 13.08870 +1.32125 3.75617 +1.04650 9.60045 +5.33899 12.16210 +2.02282 7.82414 +3.82594 8.06495 +-0.27634 3.12631 +6.69245 10.45522 +4.46439 12.06351 +7.26189 11.75900 +5.97677 15.89643 +2.67862 9.97959 +0.26084 8.56314 +0.90767 5.58926 +3.85869 9.74750 +9.08872 6.73068 +9.15839 11.72578 +7.79691 9.82449 +8.78117 10.92053 +-0.27600 1.78642 +9.45264 13.19927 +6.27099 4.65737 +1.93366 3.99563 +11.90003 16.95420 +-0.93895 3.76175 +5.81839 8.99296 +-0.18749 3.56822 +4.12213 3.86681 +8.35913 11.75641 +10.71574 12.36097 +3.72670 10.65000 +1.27611 4.05221 +0.95123 3.68666 +6.01843 13.65318 +0.89402 12.98084 +1.08531 2.95047 +7.76068 6.43291 +9.24823 6.66024 +-0.36732 4.57881 +-1.80288 4.20722 +8.01360 7.85246 +8.15571 5.73405 +5.96132 9.40802 +0.11376 3.39020 +-0.95993 2.51865 +0.03131 4.99426 +0.83125 5.05747 +0.33704 4.01321 +0.08113 4.46726 +6.22822 5.86982 +4.21350 11.28038 +5.49593 9.86692 +0.49359 8.00960 +9.34221 9.98011 +4.86881 4.54651 +0.19954 4.95170 +0.30786 2.53061 +0.37373 2.82666 +-0.73184 3.70990 +-0.51740 4.84857 +10.64590 12.21535 +2.23352 9.65201 +1.41586 12.74854 +-1.32190 6.26452 +0.25718 5.34299 +3.94091 6.71190 +-2.85483 13.45575 +4.50883 11.92996 +0.24967 4.00636 +-0.22916 3.15734 +-1.49271 3.76489 +3.14430 8.64190 +4.73807 5.18044 +1.50709 2.76883 +5.46345 10.21153 +4.74764 8.16592 +-0.92657 4.46892 +0.76716 3.85733 +2.34091 6.30231 +3.01608 11.41887 --- autoclass-3.3.6.dfsg.1.orig/debian/simple.hd2 +++ autoclass-3.3.6.dfsg.1/debian/simple.hd2 @@ -0,0 +1,12 @@ +# AutoClass C header file -- simple.hd2 + +; -- Creator/Donor: Jim Van Zandt +; -- Date: 23 January 2000 + +num_db2_format_defs 2 
+number_of_attributes 2 +separator_char ' ' + + +0 real location "longitude" error 0.1 +1 real location "latitude" error 0.1 --- autoclass-3.3.6.dfsg.1.orig/debian/simple.model +++ autoclass-3.3.6.dfsg.1/debian/simple.model @@ -0,0 +1,7 @@ +# AutoClass C model file -- simple.model + +; -- Creator/Donor: Jim Van Zandt +; -- Date: 23 January 2000 + +model_index 0 1 +single_normal_cn 0 1 --- autoclass-3.3.6.dfsg.1.orig/debian/simple.r-params +++ autoclass-3.3.6.dfsg.1/debian/simple.r-params @@ -0,0 +1,24 @@ +# AutoClass C report parameter file -- simple.r-params + +; -- Creator/Donor: Jim Van Zandt +; -- Date: 23 January 2000 + +report_mode = "data" + +comment_data_headers_p = true +! The default value does not insert # in column 1 of most +! report_mode = "data" header lines. If specified as true, the comment +! character will be inserted in most header lines. + +free_storage_p = false +! The default value tells AutoClass to free the majority of its allocated +! storage. This is not required, and in the case of DEC Alphas causes a +! core dump. If specified as false, AutoClass will not attempt to free +! storage. + +# sigma_contours_att_list = +! If specified, a list of real valued attribute indices (from .hd2 file) +! will be used to compute sigma class contour values, when generating the +! influence values report with the data option (report_mode = "data"). +! If not specified, there will be no sigma class contour output. +! (e.g. sigma_contours_att_list = 3, 4, 5, 8, 15) --- autoclass-3.3.6.dfsg.1.orig/debian/simple.s-params +++ autoclass-3.3.6.dfsg.1/debian/simple.s-params @@ -0,0 +1,15 @@ +# AutoClass C search parameters -- simple.s-params + +; -- Creator/Donor: Jim Van Zandt +; -- Date: 23 January 2000 + +max_duration = 20 +# max_n_tries = 0 + +save_compact_p = false +! true saves classifications as machine dependent binary (.results-bin & +! .chkpt-bin); false saves as ascii text (.results & .chkpt) + +# max_cycles = 200 +! passed to all try functions. 
They will end a trial if this many cycles +! have been done and the convergence criterion has not been satisfied. --- autoclass-3.3.6.dfsg.1.orig/debian/simple.script +++ autoclass-3.3.6.dfsg.1/debian/simple.script @@ -0,0 +1,19 @@ +BEGIN{ + FS="[ ()]+"; + printf("set key bottom\n"); + printf("set title \"groups found by AutoClass\"\n"); + printf("set parametric; set trange [0:6.28]\n"); + printf("plot \\\n"); +} +/^[0-9][0-9]/{ + if ($5~/latitude/) {latmean=$3; latsd=$4; ++data} + if ($5~/longitude/) {lonmean=$3; lonsd=$4; ++data} + if (data == 2) { + if (group) printf(",\\\n"); + printf("%s+2.45*%s*sin(t),%s+2.45*%s*cos(t) title \"group %d 95%% contour\",\\\n", + lonmean, lonsd, latmean, latsd, group); + printf(" \"group%d\" title \"group %d members\"", group, group); + data=0; group++; + } +} +END { printf("\n");} --- autoclass-3.3.6.dfsg.1.orig/debian/tests +++ autoclass-3.3.6.dfsg.1/debian/tests @@ -0,0 +1,110 @@ +#!/bin/sh +# +# This is a simple script that tests autoclass +# +# install as: /usr/lib/debian-test/tests/autoclass +# or else as: /usr/lib/debian-test/tests/autoclass/test-1 +# In the latter case, /usr/lib/debian-test/tests/autoclass/ can contain +# other files with test data and/or other scripts named test-2, test-3, etc. +# +# You can run this script with the command +# sh debian/tests +# After installation, you can run it with the commands +# /usr/lib/debian-test/tests/autoclass +# or +# debian-test -v autoclass +# see debian-test(1) + +. ${DEBIANTEST_LIB:-/usr/lib/debian-test/lib}/functions.sh + +if [ -f my-data-file ]; then + TESTDIR=`pwd`; +else + TESTDIR=/usr/lib/debian-test/tests/autoclass; +fi + + +## we need a scratch directory in which to execute + +TMP=/tmp/autoclass-test.$$ +test -e $TMP && rm -rf $TMP +mkdir $TMP +cd $TMP +trap "rm -rf $TMP" EXIT + +test1(){ + RESULT=0 + cp /usr/share/doc/autoclass/examples/simple* . + + # use autoclass to separate observations into two groups + + if [ ! 
-f simple.db2 ]; then gunzip *.gz; fi + + # use autoclass to find the groups + autoclass -search simple.db2 simple.hd2 simple.model simple.s-params + + # ask autoclass to generate reports + autoclass -reports simple.results simple.search simple.r-params +## FIXME + +# copy the observations into two files, +# depending on the group assigned by autoclass +awk ' +/^[0-9]/{ + group_number=$2; + getline data < "simple.db2"; + print data > "group" group_number; +}' simple.case-data-1 + +# display the original data (under X windows) +gnuplot -persist <