SEARCH is a binary segmentation procedure used to develop a predictive model for a dependent variable. It searches among a set of predictor variables for those that most increase the researcher's ability to account for the variance or distribution of the dependent variable. The algorithm is built on one question, embedded in an iterative scheme: what dichotomous split on which single predictor variable will give the maximum improvement in our ability to predict values of the dependent variable?
SEARCH divides the sample, through a series of binary splits, into a mutually exclusive set of subgroups. At each step in the procedure, the subgroups are chosen so that the split into the two new subgroups accounts for more of the variance or distribution (reduces the predictive error more) than a split into any other pair of subgroups. The predictor variables may be ordinally or nominally scaled; the dependent variable may be continuous or categorical.
Research questions are often of the type "What is the effect of X on Y?" But the answer requires answering a larger question: "What set of variables, in what combinations, seems to affect Y?" With SEARCH, a variable X that seems to have an overall effect may lose its apparent influence after a few splits: the final groups, while differing greatly in their levels of Y, show no effect of X. The implication is that, given other things, X does not really affect Y.
Conversely, while X may seem to have no overall effect on Y, after splitting the sample into groups that take account of other powerful factors, there may be some groups in which X has a substantial effect. Think of economists' notion of the actor at the margin. A motivating factor might affect those not constrained or compelled by other forces. Those who, other things considered, have a 40-60 percent probability of acting might show a substantial response to some motivator. Or a group with a very high or very low likelihood of acting might be discouraged or encouraged by it. But if X has no effect on any of the subgroups generated by SEARCH, one has pretty good evidence that it does not matter, even in an interactive way.
SEARCH makes a sequence of binary divisions of a dataset in such a way that each split maximally reduces the error variance or increases the information (chi-square or rank correlation). It finds the best split on each predictor and takes the best of the best. The process stops when additional splits are unlikely to improve predictions for a fresh sample or for the population, i.e., when the null probability of a split rises above some selected level (e.g., .05, .025, .01, or .005). Of course, having tried several possibilities for each of several predictors, the null probability is clearly understated. Alternative stopping rules can be used in any combination: minimum group size, maximum number of splits, minimum reduction in explained variance relative to the original total, or maximum null probability.
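The "best of the best" step can be sketched as follows for the variance-reduction case. This is a minimal illustration, not SEARCH's actual implementation; the data, variable names, and the restriction to ordered "x <= cut" splits are all assumptions made for the example.

```python
# Sketch of one SEARCH iteration: for each predictor, try every dichotomous
# split and keep the one that most reduces the error sum of squares of the
# dependent variable; then take the best split over all predictors.
# Illustrative only -- predictors are treated as ordered categories here.

def sum_of_squares(ys):
    """Sum of squared deviations around the group mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (cut, reduction): the best 'x <= cut' split on one predictor."""
    parent_ss = sum_of_squares(ys)
    best = (None, 0.0)
    for cut in sorted(set(xs))[:-1]:          # candidate dichotomous splits
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        reduction = parent_ss - sum_of_squares(left) - sum_of_squares(right)
        if reduction > best[1]:
            best = (cut, reduction)
    return best

def best_of_best(predictors, ys):
    """Pick the predictor (and cut point) whose best split reduces error most."""
    return max(
        ((name, *best_split(xs, ys)) for name, xs in predictors.items()),
        key=lambda t: t[2],
    )

# Hypothetical data: a continuous dependent variable and two predictors.
y = [10, 12, 11, 30, 32, 31]
preds = {"education": [1, 1, 1, 3, 3, 3], "age": [2, 3, 1, 2, 3, 1]}
name, cut, gain = best_of_best(preds, y)
print(name, cut, gain)   # -> education 1 600.0
```

In a full run this step would be repeated on each resulting subgroup until one of the stopping rules above is triggered.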
There are four splitting criteria, depending on the type of the dependent variable:
The splitting criterion in each case is the reduction in ignorance (error variance, etc.) or the increase in information. Terms like "classification and regression trees" are better replaced by "binary segmentation," "unrestricted analysis of variance components," or "searching for structure." With rich bodies of data, many possible non-linearities and non-additivities, and many competing theories, the usual restrictions and the assumption that one is testing a single model are not appropriate. What does remain, however, is a systematic, pre-stated searching strategy that is reproducible, not a free ransacking.
Means. For means the splitting criterion is the reduction in error variance, that is, the sum of squares around the mean, using two subgroup means instead of one parent group mean.
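The means criterion can be illustrated numerically: the error reduction from using two subgroup means instead of one parent mean is exactly the between-group sum of squares. The numbers below are made up for the example.

```python
# Means criterion sketch: reduction in error variance = parent SS minus the
# two subgroup SS, which equals the between-group sum of squares.
left, right = [4.0, 6.0], [9.0, 11.0]       # illustrative subgroups
parent = left + right

def ss(ys):
    """Sum of squared deviations around the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

reduction = ss(parent) - ss(left) - ss(right)

# Equivalent closed form: n1*n2/(n1+n2) * (mean1 - mean2)^2
n1, n2 = len(left), len(right)
m1, m2 = sum(left) / n1, sum(right) / n2
between = n1 * n2 / (n1 + n2) * (m1 - m2) ** 2
print(reduction, between)   # -> 25.0 25.0
```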
Regressions. For regressions (y=a+bx) the splitting criterion is the reduction in error variance from using two regressions rather than one.
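A sketch of the regression criterion, in the spirit of Chow (1960): fit y = a + bx once in the parent group and once in each subgroup, and take the reduction in residual sum of squares. The data are fabricated so the two subgroups have clearly different slopes.

```python
# Regression criterion sketch: error reduction from two separate OLS lines
# versus one pooled line. Illustrative data only.

def rss(xs, ys):
    """Residual sum of squares of an ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

x_left,  y_left  = [1, 2, 3], [2, 4, 6]     # slope +2 in one subgroup
x_right, y_right = [1, 2, 3], [10, 9, 8]    # slope -1 in the other

reduction = (rss(x_left + x_right, y_left + y_right)
             - rss(x_left, y_left) - rss(x_right, y_right))
print(reduction)   # -> 46.5
```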
Classifications (Chi option). For classifications (categorical dependent variable), the splitting criterion is the likelihood-ratio chi-square for dividing the parent group into two subgroups.
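The likelihood-ratio chi-square for a candidate split can be computed as G-squared = 2 * sum of O*ln(O/E) over the cells of the 2 x k table of subgroup by category. This is a generic textbook formula (cf. Agresti, 1996), not SEARCH's code; the counts are illustrative.

```python
# Chi option sketch: likelihood-ratio chi-square (G^2) for dividing a parent
# group into two subgroups on a categorical dependent variable.
import math

def g2(row1, row2):
    """Likelihood-ratio chi-square for a 2 x k contingency table."""
    cols = [a + b for a, b in zip(row1, row2)]
    total = sum(cols)
    stat = 0.0
    for row in (row1, row2):
        rtot = sum(row)
        for obs, ctot in zip(row, cols):
            exp = rtot * ctot / total       # expected count under independence
            if obs > 0:
                stat += 2 * obs * math.log(obs / exp)
    return stat

# Two subgroups, two dependent-variable categories (illustrative counts):
print(round(g2([30, 10], [10, 30]), 2))   # -> 20.93
```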
Ranks (Tau option). For rankings (ordered dependent variable), the splitting criterion is Kendall's tau-b, a rank correlation measure.
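Kendall's tau-b between a candidate dichotomous split (coded 0/1) and the ranked dependent variable can be sketched as below, using the standard tie-corrected formula. The split coding and ranks are invented for the example.

```python
# Tau option sketch: Kendall's tau-b between group membership after a split
# and an ordered dependent variable, with the usual tie correction.
import math
from itertools import combinations

def tau_b(xs, ys):
    """Tie-corrected rank correlation (Kendall's tau-b)."""
    n = len(xs)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n0 = n * (n - 1) / 2
    tx = sum(xs.count(v) * (xs.count(v) - 1) / 2 for v in set(xs))  # ties in x
    ty = sum(ys.count(v) * (ys.count(v) - 1) / 2 for v in set(ys))  # ties in y
    return (concordant - discordant) / math.sqrt((n0 - tx) * (n0 - ty))

split = [0, 0, 0, 1, 1, 1]          # illustrative group membership
ranks = [1, 2, 2, 3, 4, 4]          # illustrative ordered dependent variable
print(round(tau_b(split, ranks), 3))   # -> 0.832
```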
The major components of output:
The analysis of variance or distribution on final groups (except for "analysis=tau")
The split summary
The final group summary
Summary table of best splits for each predictor for each group (except for "analysis=tau")
The predictor summary table. You may request the first group (PRINT=FIRST), the final groups (PRINT=FINAL), or all groups (PRINT=TABLE). The tables are printed in reverse group order, i.e., last group first and first group last.
Group Tree Structure
A structure table with entries for each group, numbered in order and indented, so that one can easily see the pedigree of each final group and its detail.
Agresti, Alan (1996), An Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.
Chow, G. C. (1960), "Tests of Equality Between Sets of Coefficients in Two Linear Regressions," Econometrica, 28:591-605.
Dunn, Olive Jean, and Virginia A. Clark (1974), Applied Statistics: Analysis of Variance and Regression, New York: Holt, Rinehart and Winston.
Gibbons, Jean Dickinson (1997), Nonparametric Methods for Quantitative Analysis, 3rd edition, Syracuse: American Sciences Press.
Hays, William (1988), Statistics, 4th edition, New York: Holt, Rinehart, & Winston.
Klem, Laura (1974), "Formulas and Statistical References," in Osiris III, Volume 5, Ann Arbor: Institute for Social Research.
Sonquist, J. A., E. L. Baker and J. N. Morgan (1974), Searching for Structure, revised edition, Ann Arbor: Institute for Social Research, The University of Michigan.
Example: Investigates income (V268)