SEARCH -- SEARCHING FOR STRUCTURE

Note: This write-up contains only an outline of the options and features of the SEARCH command. Before attempting to use SEARCH, you should read about the technique in Searching for Structure (Sonquist, et al., 1974).

 

GENERAL DESCRIPTION

DISCUSSION

COMMAND FEATURES

PRINTED OUTPUT

RESIDUAL RECODE CONTROL STATEMENT OUTPUT

INPUT DATA

RESTRICTIONS

CONTROL STATEMENTS

REFERENCES

EXAMPLES

Example 1: Investigates income (V268) using ANALYSIS=MEANS

Example 2:  CHI analysis on variable V46.

 

GENERAL DESCRIPTION

(Adapted with minor changes from: http://www.isr.umich.edu/src/search/search_document.html)

SEARCH is a binary segmentation procedure used to develop a predictive model for a dependent variable. It searches among a set of predictor variables for those predictors which most increase the researcher's ability to account for the variance or distribution of a dependent variable. The question, "What dichotomous split on which single predictor variable will give us a maximum improvement in our ability to predict values of the dependent variable?", embedded in an iterative scheme, is the basis for the algorithm used in this command.

SEARCH divides the sample, through a series of binary splits, into a mutually exclusive series of subgroups. After each split, every observation is a member of exactly one of these subgroups. The subgroups are chosen so that, at each step in the procedure, the split into the two new subgroups accounts for more of the variance or distribution (reduces the predictive error more) than a split into any other pair of subgroups. The predictor variables may be ordinally or nominally scaled. The dependent variable may be continuous or categorical. SEARCH is an elaboration of the Osiris III AID and THAID programs.

DISCUSSION

Research questions are often of the type "What is the effect of X on Y?" But the answer requires answering a larger question "What set of variables and their combinations seems to affect Y?" With SEARCH a variable X that seems to have an overall effect may have its apparent influence disappear after a few splits, with the final groups, while varying greatly as to their levels of Y, showing no effect of X. The implication is that, given other things, X does not really affect Y.

Conversely, while X may seem to have no overall effect on Y, after splitting the sample into groups that take account of other powerful factors, there may be some groups in which X has a substantial effect. Think of economists' notion of the actor at the margin. A motivating factor might affect those not constrained or compelled by other forces. Those who, other things considered, have a 40-60 percent probability of acting, might show substantial response to some motivator. Or a group with very high or very low likelihood of acting might be discouraged or encouraged by some motivator. But if X has no effect on any of the subgroups generated by SEARCH, one has pretty good evidence that it does not matter, even in an interactive way.

The purpose of SEARCH is to allow an evaluation of many competing and probably mis-specified models. It relies on the fact that the explanatory power of any one predictor is rapidly exhausted by a few binary splits using it, so that a sequence of binary splits allowing competing predictors at each split, can search data for structure without restrictive assumptions of linearity or additivity of effects. The approach is closer to analysis of variance components than to sequential regression.

SEARCH makes a sequence of binary divisions of a dataset in such a way that each split maximally reduces the error variance or increases the information (chi-square or rank correlation). It finds the best split on each predictor, then takes the best of the best.
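The "best split on each predictor, then the best of the best" step can be sketched as follows for the means criterion. This is a minimal illustration, not the program's code: the function names, the dictionary layout of the predictors, and the restriction to order-preserving (monotonic) splits are all assumptions made for the example.

```python
from collections import defaultdict

def best_split_for_predictor(codes, y):
    """Best order-preserving binary split on one predictor.

    Returns (gain, left_codes), where gain is the between-group sum of
    squares, i.e. the reduction in the error sum of squares.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(codes, y):
        sums[c] += v
        counts[c] += 1
    n = len(y)
    total = sum(y)
    ordered = sorted(counts)              # keep predictor codes in order
    best_gain, best_left = 0.0, None
    n_left, s_left = 0, 0.0
    for i, c in enumerate(ordered[:-1]):  # k-1 possible cut points
        n_left += counts[c]
        s_left += sums[c]
        n_right = n - n_left
        s_right = total - s_left
        gain = s_left ** 2 / n_left + s_right ** 2 / n_right - total ** 2 / n
        if gain > best_gain:
            best_gain, best_left = gain, ordered[: i + 1]
    return best_gain, best_left

def best_of_best(predictors, y):
    """predictors: hypothetical dict of name -> list of codes, one per case."""
    results = []
    for name, codes in predictors.items():
        gain, left = best_split_for_predictor(codes, y)
        results.append((name, gain, left))
    return max(results, key=lambda r: r[1])
```

Calling best_of_best on the cases of the group being split returns the winning predictor, its gain, and the codes that go to the first new subgroup; the same call would then be repeated on each resulting subgroup.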

The process stops when additional splits are unlikely to improve predictions for a fresh sample or for the population, i.e., when the null probability from that split rises above some selected level (e.g., .05, .025, .01 or .005). Of course, because several possibilities have been tried for each of several predictors, the null probability is clearly understated. Alternative stopping rules can be used in any combination: minimum group size, maximum number of splits, minimum reduction in error variance relative to the original total, or maximum null probability.

SEARCH provides for four kinds of dependent or criterion variables: means, simple regressions of Y on X, classifications, and ranks. The split criterion uses some measure of reduction in uncertainty or error, not a level of significance. The reasons to use error reduction rather than significance are that 1) particularly at the start, with large numbers of cases in the groups being split, almost any split that maximizes error reduction will be highly significant, and 2) a small potential splitoff might be very highly significant because it is very extreme or very homogeneous, but splitting it off would do little to improve overall predictions back to the population.

(The above discussion was adapted with minor changes from: http://www.isr.umich.edu/src/search/search_document.html)

With means the splitting criterion is reduction in unexplained error variance from using two means, rather than the single parent group mean. With regressions the splitting criterion is the reduction in error variance from using two simple regressions, rather than the single parent group regression. With classifications the splitting criterion is the likelihood-ratio chi-square, which fits with the variance components approach. With ranks the splitting criterion is Kendall's tau-b, a rank correlation based on all possible pairs adjusted for ties.

For each predictor one can maintain its monotonic order ("monotonic"), try each class against all the others ("select"), or reorder each time according to the criterion variable ("free"). The last should be used rarely, and only with predictors with few classes, for it involves implicitly trying many things, resulting in a bias in favor of that predictor. In addition, the combinations split off are difficult to interpret and probably idiosyncratic, as the parent groups become smaller.

One might want to reassign missing information to some large class, or, better, use a multivariate assignment procedure, for example, SEARCH itself with the chi-square option.

With monotonic predictors one tries the first class against the rest, then the first two classes against the rest, etc., making k-1 tries. With select predictors one tries each class against all the others, making k tries, but since the splitting criterion combines difference between the two new groups with both their sizes, there is an offsetting bias against the select option. The alternatives are not really independent, so the bias in favor of predictors with more classes should be small. And with at least 50 cases, adjusting the degrees of freedom would make little difference.
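The difference between the k-1 monotonic tries and the k select tries can be made concrete with a short sketch. The function names are invented for illustration, and each candidate is represented simply as a pair of code lists.

```python
def monotonic_splits(codes):
    """First class vs. rest, first two classes vs. rest, etc. (k-1 candidates)."""
    ordered = sorted(set(codes))
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

def select_splits(codes):
    """Each single class against all the others (k candidates)."""
    ordered = sorted(set(codes))
    return [([c], [d for d in ordered if d != c]) for c in ordered]
```

For a predictor with codes 1 to 4, monotonic_splits yields three candidates and select_splits yields four. The "free" option is not sketched here because it reorders the classes by the criterion variable before scanning.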

Predictors can also be hierarchically ranked as to when they are used. Rank 0 means compute the potential gain but do not split on that predictor. Ranks 1, 2, 3, etc., mean exhaust the rank 1 variables first, then try the rank 2 ones, then the rank 3 ones, etc. Since the program will produce recode statements to generate expected values or residuals, one can also hold aside some later-stage predictors for an analysis of the residuals.
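One plausible reading of the rank rule is sketched below: lower-ranked predictors are consulted first, a higher rank is tried only when no predictor of the current rank offers an acceptable split, and rank 0 predictors never split. The function name, its arguments, and the use of a single minimum-gain threshold are assumptions for illustration; the program's exact logic may differ.

```python
def choose_split(best_gain_by_predictor, ranks, min_gain):
    """Pick the predictor to split on, honoring predictor ranks.

    best_gain_by_predictor: name -> gain of that predictor's best split
    ranks: name -> rank (0 means statistics only, never used to split)
    """
    for rank in sorted({r for r in ranks.values() if r > 0}):
        eligible = {
            p: g
            for p, g in best_gain_by_predictor.items()
            if ranks[p] == rank and g >= min_gain
        }
        if eligible:
            return max(eligible, key=eligible.get)
    return None  # no ranked predictor offers an acceptable split
```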

SEARCH also has a significance test for stopping the splitting process. Given the prior searching and the possibility of sample design effects, the test is crude. Purists will object that the more classes in a predictor, the more alternatives tried, so a bias exists toward predictors with many classes. One can think of using up degrees of freedom, or adding alternative null probabilities. But adding probabilities in a Bonferroni-type correction vastly overcorrects, since the k-1 or k alternatives are not really independent. The only serious bias would come from freeing a predictor with 5 or more classes and reordering it at each stage.

The other three stopping rules, with their defaults, are: minimum final group size (default 25), minimum reduction in error variance relative to original total (default .8%), and maximum number of splits (default 25). The error variance reduction rule can be too stringent when the first few splits greatly reduce the remaining error.

The use of weights to adjust for different sampling or response rates affects variances and tests, so the program calculates an estimate of that effect for the whole sample, based on the variance of the weights, and issues a warning. Weights should be used, because if they do not make a difference, nothing is lost, but if they do, the unweighted data are biased.
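The text does not give the formula the program uses for this estimate. A common approximation based on the variance of the weights is Kish's design effect, 1 plus the squared coefficient of variation of the weights; the sketch below shows that approximation purely as an illustration of the idea, not as the program's calculation.

```python
def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2 = 1 + cv(w)^2."""
    n = len(weights)
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return n * s2 / (s1 * s1)
```

Equal weights give a value of 1 (no loss of precision); the more the weights vary, the larger the factor by which variances are inflated.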

COMMAND FEATURES

Functions. SEARCH can perform the following functions:

     Maximize differences in group means, group regression lines, or distributions (maximum likelihood chi-square criterion).

     Rank the predictors to give them preference in the partitioning.

     Sacrifice explanatory power for symmetry.

     Start after a specified partial tree structure has been generated.

Missing Data. Cases with missing data in a continuous dependent variable or a covariate are deleted automatically. Cases with missing data in a categorical dependent variable can be excluded by using a filter statement or by specifying valid codes with the DEPV keyword. Cases with missing data in the predictor variables are not automatically excluded. However, the filter statement and the CODES keyword may be used to exclude missing data on predictor variables.

PRINTED OUTPUT

The major components of the printed output are specified below. For details see Searching for Structure.

Trace Printout: (Optional: See keywords PRINT=TRACE and PRINT=FULLTRACE). Can be voluminous.

The candidate groups for splitting

The group selected for splitting

All eligible splits for each predictor (optional)

The best split for each predictor

The split selected

Final Tables Printout:

The analysis of variance or distribution on final groups (except for “analysis=tau”)

The split summary

The final group summary

Summary table of best splits for each predictor for each group (except for “analysis=tau”)

The predictor summary table. You may request the first group (PRINT=FIRST), the final groups (PRINT=FINAL), or all groups (PRINT=TABLE). The tables are printed in reverse group order, i.e., last group first and first group last.

Group Tree Structure

A structure table with entries for each group, numbered in order and indented, so that one can easily see the pedigree of each final group and its detail. With relatively little word processing one has a publishable table. It is also easy to create a branching diagram from the group summary table.

RESIDUAL RECODE CONTROL STATEMENT OUTPUT

RECODE control statements to determine group numbers and residual values from raw data may be written to the file assigned to RESIDUAL (see the keywords GNUM and RESIDUALS). These statements may be used with LISTDATA to list the group numbers and residuals or with TRANS to create a permanent residuals dataset. They also may be used with SEARCH to perform a second-stage search for structure. In running a second-stage SEARCH, place the RECODE statements generated by the first-stage SEARCH after any RECODE statements required to perform the first-stage SEARCH.
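For a means analysis, the arithmetic behind such residuals can be sketched as below, assuming that the expected value for a case is the mean of its final group. The function name and data layout are illustrative only and are unrelated to the generated RECODE statements themselves.

```python
def group_residuals(y, group, group_means):
    """Residual = observed value minus the mean of the case's final group.

    y: dependent variable values, one per case
    group: final group number for each case
    group_means: final group number -> group mean
    """
    return [v - group_means[g] for v, g in zip(y, group)]
```

With the final group means from Example 1 (9594.30 for group 3 and 17638.2 for group 7), a case in group 3 with income 10000 would get a residual of about 405.7. These residuals could then serve as the dependent variable of a second-stage run.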

INPUT DATA

The dependent variable may be continuous or categorical. Predictor variables may be ordinal or nominal scales.

RESTRICTIONS

1. Maximum number of predictors: 200.

2. Maximum predictor value: 31.

3. Maximum number of categorical variable codes: 400.

4. Maximum number of predefined splits: 49.

5. To perform its analysis, SEARCH must write records to a scratch file whose record length is based on the number of predictor codes. To make this more efficient, always specify the list of codes when it covers fewer codes than the default 0-9 (see the DEPV keyword description).

CONTROL STATEMENTS

Filter Statement (optional)

Job Title

Parameter Statement

ANALYSIS=MEAN|REGRESSION|CHI|TAU     Analysis type (see Searching for Structure).
MEAN       Means analysis.
REGR        Regression analysis
CHI           Chi analysis
TAU           Ranks
Default: ANALYSIS=MEAN. Note: ANALYSIS=CHI with a single dependent variable implies the default list of codes 0-9 within missing-data tests.

COV=variable number
The covariate variable number. Must be specified for REGR analyses.

DEPV=variable number|(variable list)|(Vn/list of codes)
The dependent variable or variables. If a list of variables is given, the analysis is done on the distribution of the variables (see Searching for Structure).
A list of codes or variable list may only be supplied for ANALYSIS=CHI or ANALYSIS=TAU. If a list of codes is supplied (e.g., DEPV= V7/1,2,4-7), no missing data tests are made for the dependent variable and only the codes listed are used in analysis.
Default: none, DEPV must be specified (see note under ANALYSIS keyword).

ESTIMATE=variable number|variable list
Variable(s) for estimates or expected values. For a categorical dependent variable or a set of dependent variables forming a distribution, a list of variables representing the expected distribution for the case.

EXPL=x           Minimum percentage increase in explanatory power required for a split.
Default: EXPL=0.8

GROUP=variable number
Variable number for final group number. Required if RESIDUALS specified; omit otherwise.

IDVAR=variable number
Identification variable to print with each case classified as an outlier.
Default: dependent variable.

MAX=n           Maximum number of partitions.
Default: MAX=25.

MIN=n            Minimum number of cases in one group.
Default: MIN=25.

NULL=n          Maximum probability that there is really no gain from the split.
Default: No significance test.

OUTDISTANCE=n           Number of standard deviations from the parent group mean defining an outlier.  Outliers are reported but not excluded from the analysis. Outliers could be excluded in subsequent runs by filtering. Only useful if PRINT=TRACE is also used.
Default: OUTD=5.0

PRINT=(DICT|CODEBK,TRACE,FULLTRACE,TABLE,FIRST,FINAL,TREE)

DICT          Print the input dictionary.

CODEBK        Print the input dictionary and codebook records.

TRACE         Print the trace of splits for each predictor for each split.

FULLTRACE     Print the full trace of splits for each predictor, including eligible but suboptimal splits.

TABLE         Print all the predictor summary tables.

FIRST         Print the predictor summary tables for the first group.

FINAL         Print the predictor summary tables for the final groups.

TREE          Print the hierarchical tree diagram.

RECODE=n     Use RECODE n, previously entered via the RECODE command.

RESIDUALS=variable number|variable list
If you want to generate residuals, specify the residuals variable number or numbers. For a multiple or categorical dependent variable, "residuals" consist of a set of variables representing the deviation of the case from the expected pattern. (Note: A two-stage analysis can be performed by using the residuals from one analysis as the dependent variable(s) for a subsequent analysis.)

SYMMETRY=n
The amount of explanatory power one is willing to lose in order to have symmetry, expressed as a percentage.
Default: SYMMETRY=0.

WTVAR=n       Use variable n as a weight variable.

Predictor Statements

Supply one set of parameters for each group of predictors which may be described with the same parameter values.

VARS=(variable numbers)
Use the variables specified in the list. If you want RECODE R-type variables you must list them explicitly.
Default: none, VARS must be supplied.

M|F|S               The predictor constraint.

M:   Predictors are considered to be "monotonic," i.e., the codes of the predictors are to be kept adjacent during the partition scan.

F:     Predictor codes are considered to be "free."

S:     Predictor codes will be "selected" and separated from the remaining codes in forming trial partitions.

Default: M.

CODES=maxcode|(list of codes)
Either the value of the largest acceptable code or a list of acceptable codes. Codes may range from 0 to 31. Cases with codes outside the range 0 to 31 are discarded.
Default: CODES=(0-9).

RANK=n       Assigned rank. Rank 1 predictors are used before rank 2, rank 2 before rank 3, etc. A zero rank indicates that statistics are to be computed for the predictors, but they are not to be used in the partitioning.
Default: RANK=1.

Predefined Split Statements

If predefined splits are desired, supply one set of parameters for each predefined split.

GNUM=n        Number of the group to be split. Groups are specified in ascending order, where the entire original sample is group 1. Each set of parameters forms two new groups.
Default: none, GNUM must be supplied.

VAR=variable number       
Predictor variable used to make the split.
Default: none, VAR must be supplied.

CODES=(list)  List of the predictor codes defining the first subgroup. All other codes will belong to the second subgroup.
Default: none, CODES must be specified.
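The effect of one predefined split on the cases of a group can be sketched as follows. The function name and the use of case positions are assumptions for illustration.

```python
def apply_predefined_split(codes, first_codes):
    """Partition case positions by predictor code.

    Cases whose code is in first_codes form the first subgroup;
    all remaining cases form the second subgroup.
    """
    first = [i for i, c in enumerate(codes) if c in first_codes]
    second = [i for i, c in enumerate(codes) if c not in first_codes]
    return first, second
```

In Example 1, the statement gnum=1 var=v37 codes=1 corresponds to first_codes={1}: cases with V37 equal to 1 become group 2 and all other cases become group 3.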

Splitting criteria

There can be four splitting criteria, based on the dependent variable type:

*      Means

*      Regressions

*      Classifications

*      Ranks

The splitting criterion in each case is the reduction in ignorance (error variance, etc.) or increase in information. Terms like classification and regression trees should be replaced by binary segmentation or unrestricted analysis of variance components, or searching for structure. With rich bodies of data, many non-linearities and non-additivities possible, and many competing theories, the usual restrictions and assumptions that one is testing a single model are not appropriate. What does remain, however, is a systematic, pre-stated searching strategy that is reproducible, not a free ransacking.

Means. For means the splitting criterion is the reduction in error variance, that is, the sum of squares around the mean, using two subgroup means instead of one parent group mean.
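As a check on the arithmetic, the reduction in the error sum of squares from a binary split equals the between-group sum of squares, which needs only the two subgroup sizes and means. The sketch below (function name invented for illustration) applies this to the first split of Example 1, using the rounded group sizes and means printed there.

```python
def between_group_ss(n1, mean1, n2, mean2):
    """Reduction in error sum of squares from using two means instead of one."""
    n = n1 + n2
    parent_mean = (n1 * mean1 + n2 * mean2) / n
    return n1 * (mean1 - parent_mean) ** 2 + n2 * (mean2 - parent_mean) ** 2

# Group 1 of Example 1 split on V37 into group 2 (N=299) and group 3 (N=27).
gain = between_group_ss(299, 10528.3, 27, 9594.30)
# Share of the original total variation (.183555E+11), in percent.
percent = 100.0 * gain / 0.183555e11
```

The gain comes out close to the printed Var expl=.216040E+08, and the percentage is close to the 0.12 shown for V37 in group 1; small differences arise because the printed means are rounded.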

Regressions. For regressions (y=a+bx) the splitting criterion is the reduction in error variance from using two regressions rather than one.
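A minimal sketch of the regression criterion, assuming ordinary least squares for each simple regression: the gain is the parent group's residual sum of squares minus the sum of the two subgroups' residual sums of squares. Function and argument names are illustrative.

```python
def sse_simple_regression(x, y):
    """Residual sum of squares of the least-squares line y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0:
        return syy  # x is constant: the fit reduces to the mean of y
    return syy - sxy * sxy / sxx

def regression_gain(x, y, in_first):
    """Error reduction from fitting separate lines in the two subgroups.

    in_first: one boolean per case, True if the case falls in the first subgroup.
    """
    x1 = [a for a, f in zip(x, in_first) if f]
    y1 = [b for b, f in zip(y, in_first) if f]
    x2 = [a for a, f in zip(x, in_first) if not f]
    y2 = [b for b, f in zip(y, in_first) if not f]
    return (sse_simple_regression(x, y)
            - sse_simple_regression(x1, y1)
            - sse_simple_regression(x2, y2))
```

A split can score well here even when the two subgroup means are equal, as long as the slopes differ, which is what distinguishes this criterion from the means criterion.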

Classifications. For classifications (categorical dependent variable), the splitting criterion is the likelihood-ratio chi-square for dividing the parent group into two subgroups.
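The likelihood-ratio chi-square for a two-subgroup by category table is 2 times the sum of observed * ln(observed / expected), with expected counts taken from the row and column totals. The sketch below shows that standard statistic; whether the program applies further adjustments is not stated here, so treat the function as an illustration only.

```python
import math

def lr_chi_square(counts_first, counts_second):
    """Likelihood-ratio chi-square for splitting a parent group in two.

    counts_first, counts_second: category code -> number of cases
    in each of the two candidate subgroups.
    """
    n1 = sum(counts_first.values())
    n2 = sum(counts_second.values())
    n = n1 + n2
    g2 = 0.0
    for cat in set(counts_first) | set(counts_second):
        o1 = counts_first.get(cat, 0)
        o2 = counts_second.get(cat, 0)
        col = o1 + o2
        for obs, row in ((o1, n1), (o2, n2)):
            if obs > 0:  # empty cells contribute nothing
                g2 += 2.0 * obs * math.log(obs * n / (row * col))
    return g2
```

Two subgroups with identical distributions give a value of 0; the more the distributions differ, the larger the statistic.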

Ranks. For rankings (ordered dependent variable), the splitting criterion is Kendall's tau-b, a rank correlation measure.
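Kendall's tau-b between a candidate split (coded 0 or 1 for the two subgroups) and the ordered dependent variable can be sketched with the textbook pair-counting definition. This is a simple O(n^2) illustration with invented names, not the program's routine.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b from all pairs of cases, adjusted for ties."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Because a binary split has many tied pairs on the split side, the tie adjustment in the denominator matters; without it the value would be understated.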

Stopping Rules

There are four stopping rules, each with a default option:

*      Maximum number of splits. Default: 25.

*      Minimum number in any final group. Default: 25.

*      Minimum reduction in error, relative to the original total. Default: 0.8 percent.

*      Maximum null probability. Default: none, no significance test.
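How the four rules combine can be sketched as a single check applied to a candidate split. The function and argument names are invented for illustration; the defaults mirror the ones listed above.

```python
def split_allowed(n_splits, n_first, n_second, gain, total_variation,
                  null_prob=None, max_splits=25, min_cases=25,
                  min_expl_pct=0.8, max_null=None):
    """Return True if a candidate split passes every active stopping rule."""
    if n_splits >= max_splits:
        return False                      # maximum number of splits reached
    if min(n_first, n_second) < min_cases:
        return False                      # a resulting group would be too small
    if 100.0 * gain / total_variation < min_expl_pct:
        return False                      # too little error reduction
    if max_null is not None and null_prob is not None and null_prob > max_null:
        return False                      # split not significant enough
    return True
```

For instance, the first split in Example 1 (groups of 299 and 27, Var expl=.216040E+08 out of a total of .183555E+11) explains about 0.12 percent, so it would fail the default 0.8 percent rule but pass with expl=.1. In that run the split was predefined, so the rules did not decide it.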

A combination of the minimum number in any final group and the minimum reduction in error is a primitive significance test, but a more formal test is possible. Assuming that the minimum in any final group is 15 or more, the degrees of freedom for any test will be over 30, large enough to assure reasonable normality, and a Z-ratio (ratio of the gain from a split relative to its standard error) would be 2.33 for a maximum probability that there is nothing there (null hypothesis) of .01. The loss from trying several splits is small if predictor order is maintained, or each class is only tried against all the others (k-1 or k).

For the tau-b option, we cannot define a minimum reduction in error relative to the original total, so we use a minimum tau-b value for each split. Even with means, a minimum error reduction can cause difficulty if the first few splits account for a large fraction of the variance, and the "significance level," however fraudulent, is perhaps a better stopping rule.

For the means and ranks criteria, the maximum null probability stopping rule is based on Z, the ratio of the gain from a split to its standard error, using the normal distribution for the null probabilities. For the regression option, we use an F-test to get the null probabilities, and for the chi option, we use the chi-squared distribution.
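The conversion from a Z-ratio to a null probability under the normal distribution can be sketched with the standard library. A one-tailed probability is assumed here because it reproduces the pairing of Z=2.33 with a probability of .01 given earlier; the function name is illustrative.

```python
import math

def null_probability(z):
    """Upper-tail probability of a standard normal variate exceeding z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = null_probability(2.33)  # close to .01
```

A split would then be rejected when this probability exceeds the NULL value chosen on the parameter statement.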

We do not multiply the null probabilities by the number of alternatives tried (the Bonferroni correction), since for monotonic predictors or select predictors with fewer than 10 categories, the alternatives are few enough and not really independent. We suggest not using the "free" option with more than three or four categories.

REFERENCES

Agresti, Alan (1996), Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.

Chow, G. (1960), "Tests of Equality between Sets of Coefficients in Two Linear Regressions," Econometrica, 28:591-605.

Dunn, Olive Jean, and Virginia A. Clark (1974), Applied Statistics: Analysis of Variance and Regression, New York: Holt, Rinehart and Winston.

Gibbons, Jean Dickinson (1997), Nonparametric Methods for Quantitative Analysis, 3rd edition, Syracuse: American Sciences Press.

Hays, William (1988), Statistics, 4th edition, New York: Holt, Rinehart, & Winston.

Klem, Laura (1974), "Formulas and Statistical References," in Osiris III, Volume 5, Ann Arbor: Institute for Social Research.

Sonquist, J. A., E. L. Baker and J. N. Morgan (1974), Searching for Structure, revised edition, Ann Arbor: Institute for Social Research, The University of Michigan.

EXAMPLES

Example 1: Investigates income (V268) using ANALYSIS=MEANS

File assignments:      dictin=scf.dic datain=scf.dat

Page title:            SEARCH SAMPLE SETUP for Means Analysis, Predefined split

parameter statement:   depv=v268 outd=2 idvar=v3 analysis=means expl=.1 min=25

predictor statements:  v=v32 codes=(0-8)

                       v=v37,v251,v30

predefined split:      gnum=1 var=v37 codes=1

                       end

 

 

                  *** SEARCH - SEARCHING FOR STRUCTURE ***

 

ANALYSIS TYPE: MEANS    

 

Using input dictionary:     D:\PROJECTS\TESTDATA\SCF.DIC

Using input data file:      D:\PROJECTS\TESTDATA\SCF.DAT

 

Number of variables:   6

 

Variables containing invalid characters will be assigned missing-data code 1

 

The data are not weighted

 

Dependent variables: 268

 

Predictor variables: 32 37 251 30

 

The number of cases rejected is 1:

 

        1 for code outside range

 

The number of cases is     326

 

 

The partitioning ends with 9 final groups

 

The variation explained is 38.2 percent

 

One-way Analysis of Final Groups

 

  Source       Variation          DF

 

  Explained  .701177E+10           8

  Error      .113438E+11         317

  Total      .183555E+11         325

 

 

Split Summary Table

 

Group 1, N=326

  Mean(Y)=10451.0, Var(Y)=.564786E+08, Variation=.183555E+11

  Split on V37: RACE, Var expl=.216040E+08, Significance=.544344

  Into Group 2, Codes 1

   And Group 3, Codes 0,2-9

 

Group 2, N=299

  Mean(Y)=10528.3, Var(Y)=.570540E+08, Variation=.170021E+11

  Split on V30: MARITAL STATUS, Var expl=.312812E+10, Significance=0.000100

  Into Group 4, Codes 1

   And Group 5, Codes 2-5

 

Group 4, N=221

  Mean(Y)=12449.9, Var(Y)=.571999E+08, Variation=.125840E+11

  Split on V32: EDUC OF HEAD, Var expl=.173944E+10, Significance=0.000100

  Into Group 6, Codes 1-5

   And Group 7, Codes 6-8

 

Group 6, N=171

  Mean(Y)=10932.9, Var(Y)=.430128E+08, Variation=.731217E+10

  Split on V251: OCCUPATION B, Var expl=.140900E+10, Significance=0.000100

  Into Group 8, Codes 0

   And Group 9, Codes 1-9

 

Group 9, N=142

  Mean(Y)=12230.1, Var(Y)=.402303E+08, Variation=.567247E+10

  Split on V251: OCCUPATION B, Var expl=.423362E+09, Significance=0.001380

  Into Group 10, Codes 1-3

   And Group 11, Codes 4-9

 

Group 11, N=115

  Mean(Y)=11393.4, Var(Y)=.249652E+08, Variation=.284603E+10

  Split on V251: OCCUPATION B, Var expl=.495146E+08, Significance=.156284

  Into Group 12, Codes 4-6

   And Group 13, Codes 7-9

 

Group 12, N=69

  Mean(Y)=11929.2, Var(Y)=.212965E+08, Variation=.144816E+10

  Split on V32: EDUC OF HEAD, Var expl=.571610E+08, Significance=0.097853

  Into Group 14, Codes 1-3

   And Group 15, Codes 4,5

 

Group 5, N=78

  Mean(Y)=5083.86, Var(Y)=.167531E+08, Variation=.128999E+10

  Split on V251: OCCUPATION B, Var expl=.183562E+09, Significance=0.000992

  Into Group 16, Codes 0

   And Group 17, Codes 1,2,4-9

 

 

Final Group Summary Table

 

Group 3, N=27

  Mean(Y)=9594.30, Var(Y)=.512249E+08, Variation=.133185E+10

 

Group 7, N=50

  Mean(Y)=17638.2, Var(Y)=.720890E+08, Variation=.353236E+10

 

Group 8, N=29

  Mean(Y)=4580.97, Var(Y)=.823915E+07, Variation=.230696E+09

 

Group 10, N=27

  Mean(Y)=15793.6, Var(Y)=.924261E+08, Variation=.240308E+10

 

Group 13, N=46

  Mean(Y)=10589.8, Var(Y)=.299634E+08, Variation=.134835E+10

 

Group 14, N=28

  Mean(Y)=13030.6, Var(Y)=.309307E+08, Variation=.835128E+09

 

Group 15, N=41

  Mean(Y)=11177.0, Var(Y)=.138968E+08, Variation=.555873E+09

 

Group 16, N=35

  Mean(Y)=3383.49, Var(Y)=.515942E+07, Variation=.175420E+09

 

Group 17, N=43

  Mean(Y)=6467.88, Var(Y)=.221668E+08, Variation=.931006E+09

 

 

Percent Total Variation Explained by Best Split for Each Group (*=Final Groups)

 

          1      2     3*      4      5      6     7*     8*      9    10*

V32   12.00  11.90   0.00   9.48   0.86   3.62   0.00   0.00   0.68   0.00

V37    0.12   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00

V251  18.12  16.90   0.00   9.14   1.00   7.68   0.00   0.00   2.31   0.00

V30   17.92  17.04   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00

 

Percent Total Variation Explained by Best Split for Each Group (*=Final Groups) - continued

 

         11     12    13*    14*    15*    16*    17*

V32    0.16   0.31   0.00   0.00   0.00   0.00   0.00

V37    0.00   0.00   0.00   0.00   0.00   0.00   0.00

V251   0.27   0.01   0.00   0.00   0.00   0.00   0.00

V30    0.00   0.00   0.00   0.00   0.00   0.00   0.00

 

Group TREE Structure

 

Group 1: All Cases

  N=326, Mean(Y)=10451.0

  Group 2 V37: RACE, Codes 1

    N=299, Mean(Y)=10528.3

    Group 4 V30: MARITAL STATUS, Codes 1

      N=221, Mean(Y)=12449.9

      Group 6 V32: EDUC OF HEAD, Codes 1-5

        N=171, Mean(Y)=10932.9

        Group 8 V251: OCCUPATION B, Codes 0

          N=29, Mean(Y)=4580.97

        Group 9 V251: OCCUPATION B, Codes 1-9

          N=142, Mean(Y)=12230.1

          Group 10 V251: OCCUPATION B, Codes 1-3

            N=27, Mean(Y)=15793.6

          Group 11 V251: OCCUPATION B, Codes 4-9

            N=115, Mean(Y)=11393.4

            Group 12 V251: OCCUPATION B, Codes 4-6

              N=69, Mean(Y)=11929.2

              Group 14 V32: EDUC OF HEAD, Codes 1-3

                N=28, Mean(Y)=13030.6

              Group 15 V32: EDUC OF HEAD, Codes 4,5

                N=41, Mean(Y)=11177.0

            Group 13 V251: OCCUPATION B, Codes 7-9

              N=46, Mean(Y)=10589.8

      Group 7 V32: EDUC OF HEAD, Codes 6-8

        N=50, Mean(Y)=17638.2

    Group 5 V30: MARITAL STATUS, Codes 2-5

      N=78, Mean(Y)=5083.86

      Group 16 V251: OCCUPATION B, Codes 0

        N=35, Mean(Y)=3383.49

      Group 17 V251: OCCUPATION B, Codes 1,2,4-9

        N=43, Mean(Y)=6467.88

  Group 3 V37: RACE, Codes 0,2-9

    N=27, Mean(Y)=9594.30

 

Example 2:  CHI analysis on variable V46.

 

File assignments:      dictin=scf.dic datain=scf.dat

Page title:            SEARCH SAMPLE SETUP, No predefined split

parameter statement:   depv=v46 outd=2 idvar=v3 analysis=chi

predictor statements:  v=v32 codes=(0-8)

                       v=v37,v251,v30 f

                       end

 

 

                *** SEARCH - SEARCHING FOR STRUCTURE ***

 

ANALYSIS TYPE: CHI      

 

Using input dictionary:     D:\PROJECTS\TESTDATA\SCF.DIC

Using input data file:      D:\PROJECTS\TESTDATA\SCF.DAT

 

Number of variables:   6

 

Variables containing invalid characters will be assigned missing-data code 1

 

The data are not weighted

 

Dependent variables: 46

 

Predictor variables: 32 37 251 30

 

The number of cases rejected is 1:

 

        1 for code outside range

 

The number of cases is     326

 

 

The partitioning ends with 2 final groups

 

The variation explained is 2.4 percent

 

One-way Analysis of Final Groups

 

  Source       Variation          DF

 

  Explained      19.1247           3

  Error          775.167         320

  Total          794.292         323

 

 

Split Summary Table

 

Group 1, N=326, Variation=794.292

  Split on V32: EDUC OF HEAD, Var expl=19.1247, Significance=0.000512

  Into Group 2, Codes 1-3

   And Group 3, Codes 4-8

 

 

Final Group Summary Table

 

Group 2, N=146, Variation=362.606

 

Group 3, N=180, Variation=412.561

 

 

Percent Total Variation Explained by Best Split for Each Group (*=Final Groups)

 

          1     2*     3*

V32    2.41   0.18   0.56

V37    0.66   0.00   0.00

 

Percent Total Variation Explained by Best Split for Each Group (*=Final Groups) - continued

 

          1     2*     3*

V251   1.53   0.45   0.52

V30    0.56   0.61   0.33

 

 

DEPENDENT VARIABLE PERCENT DISTRIBUTION FOR EACH GROUP (* = FINAL GROUPS)

 

         1     2*     3*

     25.46   9.44   0.00

  1  49.39   8.33   0.00

  2  12.58   0.00   0.00

  3  12.58   0.00   0.00

  4  15.75   0.00   0.00

  5  50.00   0.00   0.00

  6  16.44   0.00   0.00

  7  17.81   0.00   0.00

  8  33.33   0.00   0.00

  9  48.89   0.00   0.00

 

Group TREE Structure

 

Group 1: All Cases

  N=326, Code(%)= 0(0.00) 1(0.25) 2(0.00) 3(0.49) 4(0.00) 5(0.13) 6(0.00)

    7(0.00) 8(0.13) 9(0.00)

  Group 2 V32: EDUC OF HEAD, Codes 1-3

    N=146, Code(%)= 0(0.00) 1(0.16) 2(0.00) 3(0.50) 4(0.00) 5(0.16) 6(0.00)

      7(0.00) 8(0.18) 9(0.00)

  Group 3 V32: EDUC OF HEAD, Codes 4-8

    N=180, Code(%)= 0(0.00) 1(0.33) 2(0.00) 3(0.49) 4(0.00) 5(0.09) 6(0.00)

      7(0.00) 8(0.08) 9(0.00)