GENERAL DESCRIPTION
Note: This write-up contains only an outline of the options and features of the SEARCH command. Before attempting to use SEARCH, you should read about the technique in Searching for Structure (Sonquist, et al., 1974).
SEARCH is a binary segmentation procedure used to develop a predictive model for a dependent variable. It searches among a set of predictor variables for those predictors which most increase the researcher's ability to account for the variance or distribution of a dependent variable. The basis for the algorithm used in this command is the question, embedded in an iterative scheme, "What dichotomous split on which single predictor variable will give us a maximum improvement in our ability to predict values of the dependent variable?"
SEARCH divides the sample, through a series of binary splits, into a series of mutually exclusive subgroups. They are chosen so that, at each step in the procedure, the split into the two new subgroups accounts for more of the variance or distribution (reduces the predictive error more) than a split into any other pair of subgroups. The predictor variables may be ordinally or nominally scaled. The dependent variable may be continuous or categorical. SEARCH is an elaboration of the Osiris III AID and THAID programs.
DISCUSSION
(Adapted with minor changes from: http://www.isr.umich.edu/src/search/search_document.html)
Research questions are often of the type "What is the effect of X on Y?" But the answer requires answering a larger question: "What set of variables and their combinations seems to affect Y?" With SEARCH, a variable X that seems to have an overall effect may have its apparent influence disappear after a few splits, with the final groups, while varying greatly as to their levels of Y, showing no effect of X. The implication is that, given other things, X does not really affect Y.
Conversely, while X may seem to have no overall effect on Y, after splitting the sample into groups that take account of other powerful factors, there may be some groups in which X has a substantial effect. Think of economists' notion of the actor at the margin. A motivating factor might affect those not constrained or compelled by other forces. Those who, other things considered, have a 40-60 percent probability of acting, might show substantial response to some motivator. Or a group with very high or very low likelihood of acting might be discouraged or encouraged by some motivator. But if X has no effect on any of the subgroups generated by SEARCH, one has pretty good evidence that it does not matter, even in an interactive way.
The purpose of SEARCH is to allow an evaluation of many competing and probably mis-specified models. It relies on the fact that the explanatory power of any one predictor is rapidly exhausted by a few binary splits using it, so that a sequence of binary splits allowing competing predictors at each split can search data for structure without restrictive assumptions of linearity or additivity of effects. The approach is closer to analysis of variance components than to sequential regression.
SEARCH makes a sequence of binary divisions of a dataset in such a way that each split maximally reduces the error variance or increases the information (chi-square or rank correlation). It finds the best split on each predictor and takes the best of the best.
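The "best of the best" step can be sketched in Python as follows. This is a minimal illustration and not the SEARCH implementation: it assumes the means criterion and monotonic predictors only, and the names sum_sq, best_split_for_predictor and best_of_best, as well as the toy data, are hypothetical.

```python
from statistics import fmean

def sum_sq(ys):
    """Sum of squared deviations around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_for_predictor(codes, ys):
    """Best monotonic split: codes <= cut versus codes > cut.

    Returns (gain, cut), where gain is the reduction in error sum of squares.
    """
    total = sum_sq(ys)
    best = (0.0, None)
    for cut in sorted(set(codes))[:-1]:          # k-1 candidate cut points
        left = [y for c, y in zip(codes, ys) if c <= cut]
        right = [y for c, y in zip(codes, ys) if c > cut]
        gain = total - sum_sq(left) - sum_sq(right)
        if gain > best[0]:
            best = (gain, cut)
    return best

def best_of_best(predictors, ys):
    """Find the best split on each predictor, then take the best of those."""
    results = {name: best_split_for_predictor(codes, ys)
               for name, codes in predictors.items()}
    name = max(results, key=lambda n: results[n][0])
    return name, results[name]

# Toy data (hypothetical): y depends on "educ" but hardly on "region".
ys = [1.0, 2.0, 1.5, 8.0, 9.0, 8.5]
predictors = {"educ": [1, 1, 2, 3, 3, 4], "region": [1, 2, 1, 2, 1, 2]}
winner, (gain, cut) = best_of_best(predictors, ys)
print(winner, cut, gain)
```

In a full run the same step would be repeated on each resulting subgroup until a stopping rule applies.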
The process stops when additional splits are not likely to improve predictions to a fresh sample or to the population, i.e., when the null probability from that split rises above some selected level (e.g., .05, .025, .01 or .005). Of course, having tried several possibilities for each of several predictors, the null probability is clearly understated. Alternative stopping rules can be used in any combination: minimum group size, maximum number of splits, minimum reduction in explained variance relative to the original total, or maximum null probability.
SEARCH provides for four kinds of dependent or criterion variables: means, simple regressions of Y on X, classifications, and ranks. The split criterion uses some measure of reduction in uncertainty or error, not a level of significance. The reasons to use error reduction rather than significance are that 1) particularly at the start with large numbers of cases in the groups being split, almost any split that maximizes error reduction will be highly significant, and 2) a small potential split-off might be very highly significant because it is very extreme or very homogeneous, but splitting it off would do little to improve overall predictions back to the population.
With means the splitting criterion is reduction in unexplained error variance from using two means, rather than the single parent group mean. With regressions the splitting criterion is the reduction in error variance from using two simple regressions, rather than the single parent group regression. With classifications the splitting criterion is the likelihood-ratio chi-square, which fits with the variance components approach. With ranks the splitting criterion is Kendall's tau-b, a rank correlation based on all possible pairs adjusted for ties.
For each predictor one can maintain its monotonic order ("monotonic"), try each class against all the others ("select"), or reorder each time according to the criterion variable ("free"). The last should be used rarely, and only with predictors with few classes, for it involves implicitly trying many things, resulting in a bias in favor of that predictor. In addition, the combinations split off are difficult to interpret and probably idiosyncratic, as the parent groups become smaller.
One might want to reassign missing information to some large class, or, better, use a multivariate assignment procedure, for example, SEARCH itself with the chi-square option.
With monotonic predictors one tries the first class against the rest, then the first two classes against the rest, etc., making k-1 tries. With select predictors one tries each class against all the others, making k tries, but since the splitting criterion combines difference between the two new groups with both their sizes, there is an offsetting bias against the select option. The alternatives are not really independent, so the bias in favor of predictors with more classes should be small. And with at least 50 cases, adjusting the degrees of freedom would make little difference.
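The candidate splits for the three predictor constraints can be enumerated as in the following Python sketch. The function names are hypothetical, and the treatment of "free" (reorder the classes by their mean on the dependent variable, then proceed as for monotonic) is an assumption that applies to the means criterion only.

```python
def monotonic_splits(codes):
    """First class vs. rest, first two vs. rest, ...: k-1 candidate splits."""
    ordered = sorted(codes)
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

def select_splits(codes):
    """Each single class against all the others: k candidate splits."""
    ordered = sorted(codes)
    return [([c], [d for d in ordered if d != c]) for c in ordered]

def free_splits(class_means):
    """Reorder classes by their mean on the dependent variable (assumed rule),
    then split as for a monotonic predictor."""
    ordered = sorted(class_means, key=class_means.get)
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

codes = [1, 2, 3, 4, 5]
mono = monotonic_splits(codes)      # 4 tries for k = 5
sel = select_splits(codes)          # 5 tries for k = 5
free = free_splits({1: 5.0, 2: 1.0, 3: 3.0})
print(len(mono), len(sel), free)
```

Because "free" reorders at every stage, the set of partitions it can reach over a run is much larger than either of the other two, which is the source of the bias described above.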
Predictors can also be hierarchically ranked as to when they are used. Rank 0 means compute the potential gain but do not split on that predictor. Ranks 1, 2, 3, etc., mean exhaust the rank 1 variables first, then try the rank 2 ones, then the rank 3 ones, etc. Since the program will produce recode statements to generate expected values or residuals, one can also hold aside some later-stage predictors for an analysis of the residuals.
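One reading of the ranking rule is sketched below in Python. It is an interpretation, not the program's documented logic: it assumes that a rank is "exhausted" for the current group when none of its predictors offers a split with at least a given minimum gain. The function name and the gain figures are hypothetical.

```python
def choose_split(best_gains, ranks, min_gain):
    """Return the predictor to split on, or None if no split qualifies.

    best_gains: predictor -> gain of its best split in the current group
    ranks:      predictor -> rank (0 = compute statistics only, never split)
    """
    for rank in sorted({r for r in ranks.values() if r > 0}):
        eligible = {p: g for p, g in best_gains.items()
                    if ranks[p] == rank and g >= min_gain}
        if eligible:                      # this rank is not yet exhausted
            return max(eligible, key=eligible.get)
    return None

best_gains = {"V30": 0.5, "V32": 3.0, "V37": 9.0}
ranks = {"V30": 1, "V32": 2, "V37": 0}
print(choose_split(best_gains, ranks, min_gain=0.2))   # rank 1 still useful
print(choose_split(best_gains, ranks, min_gain=1.0))   # falls through to rank 2
```

Note that V37, with rank 0, is never chosen even though its potential gain is the largest.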
SEARCH also has a significance test for stopping the splitting process. Given the prior searching and the possibility of sample design effects, the test is crude. Purists will object that the more classes in a predictor, the more alternatives tried, so a bias exists toward predictors with many classes. One can think of using up degrees of freedom, or adding alternative null probabilities. But adding probabilities in a Bonferroni-type correction vastly overcorrects, since the k-1 or k alternatives are not really independent. The only serious bias would come from freeing a predictor with 5 or more classes and reordering it at each stage.
The other three stopping rules, with their defaults, are: minimum final group size (default 25), minimum reduction in error variance relative to original total (default .8%), and maximum number of splits (default 25). The error variance reduction rule can be too stringent when the first few splits greatly reduce the remaining error.
The use of weights to adjust for different sampling or response rates affects variances and tests, so the program calculates an estimate of that effect for the whole sample, based on the variance of the weights, and issues a warning. Weights should be used, because if they do not make a difference, nothing is lost, but if they do, the unweighted data are biased.
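The text does not give the formula SEARCH uses for this estimate. A common approximation based on the variance of the weights is Kish's design effect, sketched here in Python as an assumption rather than as the program's actual calculation; the function name and the weights are hypothetical.

```python
def kish_design_effect(weights):
    """Kish's approximation: deff = n * sum(w^2) / (sum(w))^2 = 1 + cv^2,
    where cv is the coefficient of variation of the weights."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)

equal = kish_design_effect([1.0] * 100)                 # equal weights: no effect
unequal = kish_design_effect([1.0] * 50 + [3.0] * 50)   # half the cases weighted 3x
effective_n = 100 / unequal                             # approximate effective sample size
print(equal, unequal, effective_n)
```

Under this approximation, variances are inflated by the design effect, so significance tests computed as if the sample were unweighted are somewhat optimistic.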
COMMAND FEATURES
Functions. SEARCH can perform the following functions:
Maximize differences in group means, group regression lines, or distributions (likelihood-ratio chi-square criterion).
Rank the predictors to give them preference in the partitioning.
Sacrifice explanatory power for symmetry.
Start after a specified partial tree structure has been generated.
Missing Data. Cases with missing data in a continuous dependent variable or a covariate are deleted automatically. Cases with missing data in a categorical dependent variable can be excluded by using a Filter or by specifying valid codes with the DEPV selection. Cases with missing data in the predictor variables are not automatically excluded. However, the Filter and the CODES list may be used to exclude missing data on predictor variables.
PRINTED OUTPUT
The major components of the printed output are specified below. For details see Searching for Structure.
Trace Printout: (Optional: See options PRINT=TRACE and PRINT=FULLTRACE). Can be voluminous.
The candidate groups for splitting
The group selected for splitting
All eligible splits for each predictor (optional)
The best split for each predictor
The split selected
Final Tables Printout:
The analysis of variance or distribution on final groups (except for ANALYSIS=TAU)
The split summary
The final group summary
Summary table of best splits for each predictor for each group (except for ANALYSIS=TAU)
The predictor summary table. You may request the first group (PRINT=FIRST), the final groups (PRINT=FINAL), or all groups (PRINT=TABLE). The tables are printed in reverse group order, i.e., last group first and first group last.
Group Tree Structure
A structure table with entries for each group, numbered in order and indented, so that one can easily see the pedigree of each final group and its detail. With relatively little word-processing one has a publishable table. Optionally, print a tree diagram.
RESIDUAL RECODE CONTROL STATEMENT OUTPUT
RECODE control statements to determine group numbers and residual values from raw data may be written to the file assigned to RESIDUAL (options GNUM and RESIDUALS). These statements may be used with LISTDATA to list the group numbers and residuals or with TRANS to create a permanent residuals dataset. They also may be used with SEARCH to perform a second stage search for structure. In running a second-stage SEARCH, place the RECODE statements generated by the first-stage SEARCH after any RECODE statements required to perform the first-stage SEARCH.
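The logic that such generated statements encode (assign each case to a final group, then subtract that group's expected value) can be sketched in Python. The RECODE syntax itself is not reproduced here. The rules below borrow the first two splits and the group means from Example 1 and treat groups 3, 4 and 5 as final purely for illustration; the function name and the sample case are hypothetical.

```python
# Illustrative final groups: (group number, rule on predictor codes, group mean).
FINAL_GROUPS = [
    (3, lambda c: c["V37"] in {0, 2, 3, 4, 5, 6, 7, 8, 9}, 9594.30),
    (4, lambda c: c["V37"] == 1 and c["V30"] == 1, 12449.9),
    (5, lambda c: c["V37"] == 1 and c["V30"] in {2, 3, 4, 5}, 5083.86),
]

def group_and_residual(case, y_name="V268"):
    """Return (final group number, residual = observed value - group mean)."""
    for number, rule, mean in FINAL_GROUPS:
        if rule(case):
            return number, case[y_name] - mean
    raise ValueError("case matches no final group")

gnum, resid = group_and_residual({"V37": 1, "V30": 2, "V268": 6000.0})
print(gnum, round(resid, 2))
```

The residuals obtained this way are what a second-stage analysis would take as its dependent variable.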
INPUT DATA
The dependent variable may be continuous or categorical. Predictor variables may be ordinal or nominal scales.
RESTRICTIONS
1. Maximum number of predictors: 200.
2. Maximum predictor value: 31.
4. Maximum number of predefined splits: 49.
5. To perform its analysis, SEARCH writes records to a scratch file with record length based on the number of predictor codes. Specifying the list of codes makes this more efficient and can save a lot of time.
CONTROL STATEMENTS
Filter (optional)
Job Title (required if using a Runfile)
Options and Parameters
ANALYSIS=MEAN|REGRESSION|CHI|TAU
Analysis type (see Searching for Structure).
MEAN Means analysis.
REGR Regression analysis.
CHI Chi-square analysis.
TAU Ranks.
Default: ANALYSIS=MEAN. Note: ANALYSIS=CHI with a single dependent variable implies the default list of codes 0-9 within missing-data tests.
COV=variable number
The covariate variable number for REGRESSION analyses.
DEPV=variable number|(variable list)|(Vn/list of codes)
The dependent variable or variables. If a list of variables is given, the analysis is done on the distribution of the variables (see Searching for Structure). A list of codes or variable list may only be supplied for ANALYSIS=CHI or ANALYSIS=TAU. If a list of codes is supplied (e.g., DEPV=V7/1,2,4-7), no missing data tests are made for the dependent variable and only the codes listed are used in analysis.
Default: none, DEPV must be specified (see note under ANALYSIS option).
ESTIMATE=variable number|variable list
Variable(s) for estimates or expected values. For a categorical dependent variable or a distribution set of dependent variables, a set of variables representing the expected distribution for the case.
EXPL=x
Minimum percentage increase in explanatory power required for a split.
Default: EXPL=0.8
GROUP=variable number
Variable number for final group number when RESIDUALS specified.
IDVAR=variable number
Identification variable to print with each case classified as an outlier.
Default: dependent variable.
MAX=n
Maximum number of partitions.
Default: MAX=25.
MIN=n
Minimum number of cases in one group.
Default: MIN=25.
NULL=n
Maximum probability that there is really no gain from the split.
Default: No significance test.
OUTDISTANCE=n
Number of standard deviations from the parent group mean defining an outlier. Outliers are reported if TRACE is specified but not excluded from the analysis. Outliers can be excluded in subsequent runs by filtering.
Default: OUTD=5.0
PRINT=(DICT|CODES,TRACE|FULLTRACE,TABLE,FIRST,FINAL,TREE)
DICT: Print the input dictionary.
CODES: Print the input dictionary and category labels.
TRACE: Print the trace of splits for each predictor for each split.
FULLTR: Print the full trace of splits for each predictor, including eligible but suboptimal splits.
TABLE: Print all the predictor summary tables.
FIRST: Print the predictor summary tables for the first group.
FINAL: Print the predictor summary tables for the final groups.
TREE: Print the hierarchical tree diagram.
RECODE=n Use RECODE n, previously entered via the RECODE command.
RESIDUALS=variable number|variable list
To generate residuals, specify the residuals variable number(s). For a multiple or categorical dependent variable, "residuals" consist of a set of variables representing the deviation of the case from the expected pattern. (Note: A two-stage analysis can be performed by using the residuals from one analysis as the dependent variable(s) for a subsequent analysis.)
SYMMETRY=n
The amount of explanatory power one is willing to lose in order to have
symmetry, expressed as a percentage.
Default: SYMMETRY=0.
WT=n Use variable n as a weight variable.
Predictor Statements
Supply one set of parameters for each group of predictors which may be described with the same parameter values.
VARS=(variable numbers)
Use the variables specified in the list. If you want RECODE R-type variables
you must list them explicitly.
Default: none, VARS must be supplied.
M|F|S The predictor constraint.
M: Predictors are considered to be "monotonic," i.e., the codes of the predictors are to be kept adjacent during the partition scan.
F: Predictor codes are considered to be "free."
S: Predictor codes will be "selected" and separated from the remaining codes in forming trial partitions.
Default: M.
CODES=maxcode|(list of codes)
The value of the largest acceptable code or a list of acceptable codes. Cases outside the range 0 to 31 are always discarded. Specifying a list can greatly improve efficiency.
Default: CODES(0-9).
RANK=n Assigned rank.
Rank 1 predictors are used before rank 2, rank 2 before rank 3, etc. A zero
rank indicates that statistics are to be computed for the predictors, but they
are not to be used in the partitioning.
Default: RANK=1.
Predefined Split Statements
If predefined splits are desired, supply one set of parameters for each predefined split.
GNUM=n
Number of the group to be split. Groups are specified in ascending order, where
the entire original sample is group 1. Each set of parameters forms two new
groups.
Default: none, GNUM must be supplied.
VAR=variable number
Predictor variable used to make the split.
Default: none, VAR must be supplied.
CODES=(list)
List of the predictor codes defining the first subgroup. All other codes will belong to the second subgroup.
Default: none, CODES must be specified.
Splitting criteria
There can be four splitting criteria, based on the dependent variable type:
Means
Regressions
Classifications
Ranks
The splitting criterion in each case is the reduction in ignorance (error variance, etc.) or increase in information. Terms like classification and regression trees should be replaced by binary segmentation or unrestricted analysis of variance components, or searching for structure. With rich bodies of data, many possible non-linearities and non-additivities, and many competing theories, the usual restrictions and assumptions that one is testing a single model are not appropriate. What does remain, however, is a systematic, pre-stated searching strategy that is reproducible, not a free ransacking.
Means. For means the splitting criterion is the reduction in error variance, that is, the sum of squares around the mean, using two subgroup means instead of one parent group mean.
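The means criterion can be written out as a short Python sketch. The function names and the data are hypothetical; the sketch also shows that the reduction in the error sum of squares equals the between-group sum of squares, n1(m1-m)^2 + n2(m2-m)^2, where m is the parent group mean.

```python
from statistics import fmean

def sum_sq(ys):
    """Sum of squares around the mean."""
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def means_gain(left, right):
    """Reduction in error sum of squares from using two subgroup means."""
    return sum_sq(left + right) - sum_sq(left) - sum_sq(right)

def between_ss(left, right):
    """The same quantity written as the between-group sum of squares."""
    m = fmean(left + right)
    return (len(left) * (fmean(left) - m) ** 2
            + len(right) * (fmean(right) - m) ** 2)

left, right = [2.0, 4.0, 6.0], [10.0, 12.0]
gain = means_gain(left, right)
print(gain, between_ss(left, right))
```

The second form needs only the subgroup sizes and means, so it can be evaluated from summary statistics without a second pass over the cases.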
Regressions. For regressions (y=a+bx) the splitting criterion is the reduction in error variance from using two regressions rather than one.
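A minimal Python sketch of the regression criterion follows, using ordinary least squares for y = a + bx. The function names and the data are hypothetical, and the handling of a subgroup with no variation in x (fall back to the mean) is an assumption.

```python
def regression_sse(xs, ys):
    """Residual sum of squares of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0:                 # no variation in x: fall back to the mean
        return syy
    return syy - sxy ** 2 / sxx

def regression_gain(left, right):
    """left and right are (xs, ys) pairs for the two candidate subgroups."""
    (lx, ly), (rx, ry) = left, right
    pooled = regression_sse(lx + rx, ly + ry)
    return pooled - regression_sse(lx, ly) - regression_sse(rx, ry)

# Hypothetical subgroups with equal means but opposite slopes.
left = ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
right = ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
gain = regression_gain(left, right)
print(gain)
```

In this example the two subgroups have identical means, so the means criterion would find no gain at all, while the regression criterion recovers the whole pooled error.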
Classifications. For classifications (categorical dependent variable), the splitting criterion is the likelihood-ratio chi-square for dividing the parent group into two subgroups.
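The likelihood-ratio chi-square for a candidate split can be sketched in Python using the standard formula G^2 = 2 * sum(O * ln(O/E)) over the cells of the 2 x c table of subgroup by dependent-variable category. The function name and the counts are hypothetical, and SEARCH's own computation may differ in detail.

```python
from math import log

def lr_chi_square(left_counts, right_counts):
    """Likelihood-ratio chi-square for a 2 x c table.

    left_counts and right_counts hold the number of cases in each category
    of the dependent variable for the two candidate subgroups.
    """
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    g2 = 0.0
    for o_l, o_r in zip(left_counts, right_counts):
        col = o_l + o_r
        for obs, row in ((o_l, n_left), (o_r, n_right)):
            if obs > 0:                       # 0 * ln(0) is taken as 0
                g2 += obs * log(obs * n / (row * col))   # E = row * col / n
    return 2.0 * g2

same = lr_chi_square([10, 20], [20, 40])      # identical distributions
diff = lr_chi_square([30, 10], [10, 30])      # clearly different distributions
print(same, diff)
```

A split whose subgroups have the same distribution on the dependent variable yields zero; the more the distributions differ, the larger the value.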
Ranks. For rankings (ordered dependent variable), the splitting criterion is Kendall's tau-b, a rank correlation measure.
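Kendall's tau-b can be computed from all pairs of cases as in the Python sketch below. Applying it to a 0/1 split indicator against the ordered dependent variable is an assumption about how the criterion is used for a binary split; the function name and the data are hypothetical.

```python
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b over all pairs, adjusted for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue                  # tied on both: excluded from both terms
            if dx == 0:
                ties_x += 1               # tied on x only
            elif dy == 0:
                ties_y += 1               # tied on y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Split indicator (0 = first subgroup, 1 = second) against an ordered dependent variable.
split = [0, 0, 0, 1, 1, 1]
rank_y = [1, 2, 2, 2, 3, 3]
tau = kendall_tau_b(split, rank_y)
print(tau)
```

The pairwise loop is quadratic in the number of cases, which is adequate for a sketch but not for large groups.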
Stopping Rules
There are four stopping rules, each with a default option:
Maximum number of splits. Default: 25.
Minimum number in any final group. Default: 25.
Minimum reduction in error, relative to the original total. Default: 0.8 percent.
Maximum null probability. Default: none, no significance test.
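The four rules can be combined as in the following Python sketch. The function and parameter names are hypothetical; the defaults mirror those listed above, and the figures in the usage lines are taken from the split of Group 12 in Example 1, which was run with expl=.1.

```python
def may_split(n_splits, n_left, n_right, gain, total_variation, null_prob=None,
              max_splits=25, min_cases=25, min_expl_pct=0.8, max_null=None):
    """Return True if a candidate split passes every active stopping rule."""
    if n_splits >= max_splits:                       # maximum number of splits
        return False
    if min(n_left, n_right) < min_cases:             # minimum group size
        return False
    if 100.0 * gain / total_variation < min_expl_pct:   # minimum error reduction
        return False
    if max_null is not None and null_prob is not None and null_prob > max_null:
        return False                                 # maximum null probability
    return True

# Group 12 in Example 1: subgroups of 28 and 41 cases, gain .571610E+08,
# original total variation .183555E+11 (about 0.31 percent).
with_expl_01 = may_split(6, 28, 41, 0.571610e08, 0.183555e11, min_expl_pct=0.1)
with_default = may_split(6, 28, 41, 0.571610e08, 0.183555e11)
print(with_expl_01, with_default)
```

The same split is accepted with EXPL=0.1 but would be rejected under the default of 0.8 percent.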
A combination of the minimum number in any final group and the minimum reduction in error is a primitive significance test, but a more formal test is possible. Assuming that the minimum in any final group is 15 or more, the degrees of freedom for any test will be over 30, large enough to assure reasonable normality, and a Z-ratio (ratio of the gain from a split relative to its standard error) would be 2.33 for a maximum probability that there is nothing there (null hypothesis) of .01. The loss from trying several splits is small if predictor order is maintained, or each class is only tried against all the others (k-1 or k).
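The Z-ratio of 2.33 quoted above is the one-sided normal critical value for a probability of .01, which can be checked with the Python standard library. The function name null_probability is hypothetical.

```python
from statistics import NormalDist

# One-sided critical value for a maximum null probability of .01.
z_crit = NormalDist().inv_cdf(1 - 0.01)

def null_probability(gain, std_error):
    """One-sided null probability for Z = gain / standard error."""
    return 1.0 - NormalDist().cdf(gain / std_error)

print(round(z_crit, 2), round(null_probability(2.33, 1.0), 4))
```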
For the tau-b option, we cannot define a minimum reduction in error, relative to the original total, so we use a minimum tau-b value for each split. Even with means, a minimum error reduction can cause difficulty if the first few splits account for a large fraction of the variance, and the "significance level," however fraudulent, is perhaps a better stopping rule.
For the means and ranks criteria, the maximum null probability stopping rule is based on Z, the ratio of the gain from a split to its standard error, using the normal distribution for the null probabilities. For the regression option, we use an F-test to get the null probabilities, and for the chi option, we use the chi-squared distribution.
We do not multiply the null probabilities by the number of alternatives tried (the Bonferroni correction), since for monotonic predictors or select predictors with fewer than 10 categories, the alternatives are few enough and not really independent. We suggest not using the "free" option with more than three or four categories.
REFERENCES
Agresti, Alan (1996), Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.
Dunn, Olive Jean, and Virginia A. Clark (1974), Applied Statistics: Analysis of Variance and Regression, New York: Holt, Rinehart and Winston.
Chow, G. (1960), "Tests of Equality between Sets of Coefficients in Two Linear Regressions," Econometrica, 28:591-605.
Gibbons, Jean Dickinson (1997), Nonparametric Methods for Quantitative Analysis, 3rd edition, Syracuse: American Sciences Press.
Hays, William (1988), Statistics, 4th edition, New York: Holt, Rinehart, & Winston.
Klem, Laura (1974), "Formulas and Statistical References," in Osiris III, Volume 5, Ann Arbor: Institute for Social Research.
Sonquist, J. A., E. L. Baker and J. N. Morgan (1974), Searching for Structure, revised edition, Ann Arbor: Institute for Social Research, The University of Michigan.
EXAMPLES
Example 1: Investigates income (V268) using ANALYSIS=MEANS
File assignments: dictin=scf.dic datain=scf.dat
Job Title: SEARCH SAMPLE SETUP for Means Analysis, Predefined split
Options and Parameters: depv=v268 outd=2 idvar=v3 analysis=means expl=.1 min=25
predictor statements: v=v32 codes=(0-8)
v=v37,v251,v30
predefined split: gnum=1 var=v37 codes=1
end
*** SEARCH - SEARCHING FOR STRUCTURE ***
ANALYSIS TYPE: MEANS
Using input dictionary: D:\PROJECTS\TESTDATA\SCF.DIC
Using input data file: D:\PROJECTS\TESTDATA\SCF.DAT
Number of variables: 6
The data are not weighted
Dependent variables: 268
Predictor variables: 32 37 251 30
The number of cases rejected is 1:
1 for code outside range
The number of cases is 326
The partitioning ends with 9 final groups
The variation explained is 38.2 percent
One-way Analysis of Final Groups
Source Variation DF
Explained .701177E+10 8
Error .113438E+11 317
Total .183555E+11 325
Split Summary Table
Group 1, N=326
Mean(Y)=10451.0, Var(Y)=.564786E+08, Variation=.183555E+11
Split on V37: RACE, Var expl=.216040E+08, Significance=.544344
Into Group 2, Codes 1
And Group 3, Codes 0,2-9
Group 2, N=299
Mean(Y)=10528.3, Var(Y)=.570540E+08, Variation=.170021E+11
Split on V30: MARITAL STATUS, Var expl=.312812E+10, Significance=0.000100
Into Group 4, Codes 1
And Group 5, Codes 2-5
Group 4, N=221
Mean(Y)=12449.9, Var(Y)=.571999E+08, Variation=.125840E+11
Split on V32: EDUC OF HEAD, Var expl=.173944E+10, Significance=0.000100
Into Group 6, Codes 1-5
And Group 7, Codes 6-8
Group 6, N=171
Mean(Y)=10932.9, Var(Y)=.430128E+08, Variation=.731217E+10
Split on V251: OCCUPATION B, Var expl=.140900E+10, Significance=0.000100
Into Group 8, Codes 0
And Group 9, Codes 1-9
Group 9, N=142
Mean(Y)=12230.1, Var(Y)=.402303E+08, Variation=.567247E+10
Split on V251: OCCUPATION B, Var expl=.423362E+09, Significance=0.001380
Into Group 10, Codes 1-3
And Group 11, Codes 4-9
Group 11, N=115
Mean(Y)=11393.4, Var(Y)=.249652E+08, Variation=.284603E+10
Split on V251: OCCUPATION B, Var expl=.495146E+08, Significance=.156284
Into Group 12, Codes 4-6
And Group 13, Codes 7-9
Group 12, N=69
Mean(Y)=11929.2, Var(Y)=.212965E+08, Variation=.144816E+10
Split on V32: EDUC OF HEAD, Var expl=.571610E+08, Significance=0.097853
Into Group 14, Codes 1-3
And Group 15, Codes 4,5
Group 5, N=78
Mean(Y)=5083.86, Var(Y)=.167531E+08, Variation=.128999E+10
Split on V251: OCCUPATION B, Var expl=.183562E+09, Significance=0.000992
Into Group 16, Codes 0
And Group 17, Codes 1,2,4-9
Final Group Summary Table
Group 3, N=27
Mean(Y)=9594.30, Var(Y)=.512249E+08, Variation=.133185E+10
Group 7, N=50
Mean(Y)=17638.2, Var(Y)=.720890E+08, Variation=.353236E+10
Group 8, N=29
Mean(Y)=4580.97, Var(Y)=.823915E+07, Variation=.230696E+09
Group 10, N=27
Mean(Y)=15793.6, Var(Y)=.924261E+08, Variation=.240308E+10
Group 13, N=46
Mean(Y)=10589.8, Var(Y)=.299634E+08, Variation=.134835E+10
Group 14, N=28
Mean(Y)=13030.6, Var(Y)=.309307E+08, Variation=.835128E+09
Group 15, N=41
Mean(Y)=11177.0, Var(Y)=.138968E+08, Variation=.555873E+09
Group 16, N=35
Mean(Y)=3383.49, Var(Y)=.515942E+07, Variation=.175420E+09
Group 17, N=43
Mean(Y)=6467.88, Var(Y)=.221668E+08, Variation=.931006E+09
Percent Total Variation Explained by Best Split for Each Group (*=Final Groups)
1 2 3* 4 5 6 7* 8* 9 10*
V32 12.00 11.90 0.00 9.48 0.86 3.62 0.00 0.00 0.68 0.00
V37 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
V251 18.12 16.90 0.00 9.14 1.00 7.68 0.00 0.00 2.31 0.00
V30 17.92 17.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Percent Total Variation Explained by Best Split for Each Group (*=Final Groups) - continued
11 12 13* 14* 15* 16* 17*
V32 0.16 0.31 0.00 0.00 0.00 0.00 0.00
V37 0.00 0.00 0.00 0.00 0.00 0.00 0.00
V251 0.27 0.01 0.00 0.00 0.00 0.00 0.00
V30 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Group TREE Structure
Group 1: All Cases
N=326, Mean(Y)=10451.0
Group 2 V37: RACE, Codes 1
N=299, Mean(Y)=10528.3
Group 4 V30: MARITAL STATUS, Codes 1
N=221, Mean(Y)=12449.9
Group 6 V32: EDUC OF HEAD, Codes 1-5
N=171, Mean(Y)=10932.9
Group 8 V251: OCCUPATION B, Codes 0
N=29, Mean(Y)=4580.97
Group 9 V251: OCCUPATION B, Codes 1-9
N=142, Mean(Y)=12230.1
Group 10 V251: OCCUPATION B, Codes 1-3
N=27, Mean(Y)=15793.6
Group 11 V251: OCCUPATION B, Codes 4-9
N=115, Mean(Y)=11393.4
Group 12 V251: OCCUPATION B, Codes 4-6
N=69, Mean(Y)=11929.2
Group 14 V32: EDUC OF HEAD, Codes 1-3
N=28, Mean(Y)=13030.6
Group 15 V32: EDUC OF HEAD, Codes 4,5
N=41, Mean(Y)=11177.0
Group 13 V251: OCCUPATION B, Codes 7-9
N=46, Mean(Y)=10589.8
Group 7 V32: EDUC OF HEAD, Codes 6-8
N=50, Mean(Y)=17638.2
Group 5 V30: MARITAL STATUS, Codes 2-5
N=78, Mean(Y)=5083.86
Group 16 V251: OCCUPATION B, Codes 0
N=35, Mean(Y)=3383.49
Group 17 V251: OCCUPATION B, Codes 1,2,4-9
N=43, Mean(Y)=6467.88
Group 3 V37: RACE, Codes 0,2-9
N=27, Mean(Y)=9594.30
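The figures in the Example 1 listing can be cross-checked with a few lines of Python. All numbers below are copied from the listing; the variable names are hypothetical. The sketch confirms that the final-group variations add up to the Error line of the one-way analysis, that the explained share is 38.2 percent, and that "Var expl" for the split of Group 2 on V30 is the between-group sum of squares.

```python
total_variation = 0.183555e11          # Group 1 "Variation"

# group number: (N, Variation), from the Final Group Summary Table
final_groups = {
    3: (27, 0.133185e10), 7: (50, 0.353236e10), 8: (29, 0.230696e09),
    10: (27, 0.240308e10), 13: (46, 0.134835e10), 14: (28, 0.835128e09),
    15: (41, 0.555873e09), 16: (35, 0.175420e09), 17: (43, 0.931006e09),
}
n_total = sum(n for n, _ in final_groups.values())
error = sum(v for _, v in final_groups.values())
pct_explained = 100.0 * (total_variation - error) / total_variation

# Group 2 (mean 10528.3) split on V30 into Group 4 and Group 5.
n4, m4, n5, m5, m2 = 221, 12449.9, 78, 5083.86, 10528.3
var_expl = n4 * (m4 - m2) ** 2 + n5 * (m5 - m2) ** 2
pct_v30 = 100.0 * var_expl / total_variation

print(n_total, error, round(pct_explained, 1))
print(var_expl, round(pct_v30, 2))
```

The last value agrees with the V30 entry for Group 2 in the table of percent total variation explained.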