Predict Directive
What are associated factors?
Sometimes there is a hierarchical structure to factors which needs to be recognised.
Common examples are Genotypes grouped into Families and Locations grouped by Region.
We call these associated factors. Special care is required with associated factors,
especially if prediction is required (see !ASSOCIATE qualifier).
The key characteristic of associated factors is that they are coded
such that the levels of one are uniquely nested in the levels of another.
If one is unknown (coded as missing), all associated factors must be unknown
for that data record.
It is typically unnecessary to interact associated factors except when required to adequately define the variance structure.
Predicting with associated factors
It is necessary to correctly associate the levels of associated
factors when predicting them or averaging over them.
The qualifier !ASSOCIATE factors facilitates prediction when
the levels of one factor group or classify the levels of another.
Here factors is a list of factors in the model
which have this hierarchical relationship.
Typical examples are 1000 individually named lines representing
100 families, typically with unequal numbers of lines per family, or a
total of 100 trials conducted at 17 locations across three regions.
Declaring factors as associated allows ASReml to combine
the levels of the factors appropriately. For example, with the trials above,
when predicting a trial mean ASReml adds the effects of the location and region where the trial was conducted.
When identifying which levels are associated, ASReml checks that the association is strictly hierarchical.
That is, each location is associated with only one region, and each trial
with only one location.
If a level code is missing for one component, it must be missing for all.
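The two rules above (each level sits under exactly one parent level, and missing codes are all-or-nothing) can be checked on a data set before fitting. The following is a minimal Python sketch, not ASReml's own check: the function name check_association, the tuple layout and the example records are hypothetical, and None stands in for a missing code.

```python
def check_association(records):
    """Return a list of problems found in the level codes.

    records: iterable of tuples ordered from the outermost factor to the
    innermost, e.g. (region, location, trial); None marks a missing code.
    """
    problems = []
    parent_of = {}  # (factor position, child level) -> parent level seen first
    for row_number, codes in enumerate(records, start=1):
        missing = [code is None for code in codes]
        if any(missing):
            # Missing codes must be missing for every associated factor.
            if not all(missing):
                problems.append(
                    f"record {row_number}: some but not all codes are missing"
                )
            continue
        # Each child level must always appear under the same parent level.
        for position in range(1, len(codes)):
            key = (position, codes[position])
            parent = codes[position - 1]
            if parent_of.setdefault(key, parent) != parent:
                problems.append(
                    f"record {row_number}: level {codes[position]} of factor "
                    f"{position + 1} appears under more than one parent"
                )
    return problems


# Hypothetical (region, location, trial) codes.
good = [(1, 1, 1), (1, 1, 2), (1, 2, 3), (2, 3, 4), (None, None, None)]
bad = [(1, 1, 1), (2, 1, 2), (1, None, 3)]

print(check_association(good))
print(check_association(bad))
```

In the second list, location 1 appears under both region 1 and region 2, and the third record has only one of its codes missing, so both rules are violated.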
Averaging of associated factors will generally give differing
results depending on the order in which the averaging is performed.
We explore this with the following extended example.
Consider the mean yields from 15 trials classified by region and location
in Table 1.
Table 1. Trial means classified by region and location.
trial    |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15
region   |  1 |  1 |  1 |  1 |  1 |  1 |  2 |  2 |  2 |  2 |  2 |  2 |  2 |  2 |  2
location |  1 |  1 |  2 |  2 |  2 |  3 |  4 |  4 |  5 |  5 |  5 |  6 |  6 |  7 |  8
yield    | 10 | 12 | 11 | 12 | 13 | 13 | 11 | 13 | 11 | 12 | 13 | 10 | 12 | 10 | 10
Assuming the simplified linear model
yield ~ mu region location trial
the predict statement
predict trial !associate region location trial
will reconstruct the 15 trial means from the fitted trial, location and region
effects.
Given these trial means, it is fairly natural to form location
means by averaging the trials in each location to get
the location means in Table 2.
Table 2. Location means classified by region.
location |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8
region   |  1 |  1 |  1 |  2 |  2 |  2 |  2 |  2
yield    | 11 | 12 | 13 | 12 | 12 | 11 | 10 | 10
These are given by
predict location !associate region location trial
or equivalently
predict location !associate region location trial !ASAVERAGE trial
Note that without the !associate clause, ASReml would add the average of all
the trial effects into all of the location means, which is not appropriate.
With !associate, it knows which trials to average to form each location mean.
We use the alternate spelling !ASAVERAGE of the !AVERAGE qualifier name
to highlight that this is averaging by association and not simple averaging.
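The averaging by association described above can be reproduced by hand from the trial means. The Python sketch below is only an arithmetic illustration using the values in Table 1; it does not fit a model, and the helper name group_means is hypothetical.

```python
from collections import defaultdict

# Location code and mean yield for each of the 15 trials, copied from Table 1.
location = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
trial_yield = [10, 12, 11, 12, 13, 13, 11, 13, 11, 12, 13, 10, 12, 10, 10]


def group_means(groups, values):
    """Simple (unweighted) mean of the values within each group level."""
    buckets = defaultdict(list)
    for level, value in zip(groups, values):
        buckets[level].append(value)
    return {level: sum(vs) / len(vs) for level, vs in sorted(buckets.items())}


# Each location mean averages only the trials conducted at that location.
location_means = group_means(location, trial_yield)
for loc, value in location_means.items():
    print(f"location {loc}: {value:.2f}")
```

The printed values match the yield row of Table 2.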
However, for region means, we have a choice.
We can average the trial means in Table 1 according to region, obtaining region means
of 11.83 and 11.33, or we can average the location means in Table 2
to get region means of 12 and 11.
The former is the default in ASReml produced by
predict region !associate region location trial
or equivalently by
predict region !present region location trial
We call this base averaging.
The latter implies sequential or hierarchical averaging and is given by
predict region !assoc region location trial !ASAVE location
Similarly, an overall hierarchical mean of 11.5 is given by
predict mu !assoc region location trial !ASAVE reg locat trial
while
predict mu !assoc region location trial !ASAVE reg
gives a value of 11.58, being the average of the region means 11.83 and 11.33
obtained by averaging trials within regions from Table 1, and
predict mu !associate region location trial !ASAVE location
predicts mu as 11.38, the average of the 8 location means in Table 2.
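The different averaging orders can be checked by hand. The Python sketch below recomputes the region and overall means quoted above from the Table 1 values; it is plain arithmetic on the trial means rather than a model fit, and the helper names are hypothetical.

```python
from collections import defaultdict

# Codes and mean yields for the 15 trials, copied from Table 1.
region = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
location = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
trial_yield = [10, 12, 11, 12, 13, 13, 11, 13, 11, 12, 13, 10, 12, 10, 10]


def group_means(groups, values):
    """Simple (unweighted) mean of the values within each group level."""
    buckets = defaultdict(list)
    for level, value in zip(groups, values):
        buckets[level].append(value)
    return {level: sum(vs) / len(vs) for level, vs in sorted(buckets.items())}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Base averaging: average the trial means directly within each region.
region_base = group_means(region, trial_yield)

# Hierarchical averaging: trials -> locations -> regions.
location_means = group_means(location, trial_yield)
region_of_location = dict(zip(location, region))
region_hier = group_means(
    [region_of_location[loc] for loc in location_means],
    list(location_means.values()),
)

# Three ways of forming an overall mean.
overall_hier = mean(region_hier.values())              # regions of locations of trials
overall_from_region_base = mean(region_base.values())  # regions of trials
overall_from_locations = mean(location_means.values()) # locations of trials

print("region, base averaging:        ", {r: round(v, 2) for r, v in region_base.items()})
print("region, hierarchical averaging:", region_hier)
print(f"overall: {overall_hier:.2f} {overall_from_region_base:.2f} {overall_from_locations:.3f}")
```

The three overall values differ because each route gives a different implicit weight to every trial.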
Further discussion of associated factors
The user may specify their own weights, using file input if necessary.
The statement
predict region ... !ASAVE location {1 2 3}/6 {1 1 1 2 1}/6
would give region predictions of 12.33 and 10.83 respectively, derived from the
location predictions in Table 2.
Note that because location is nested in region, the location weights must sum
to 1.0 within levels of region.
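A weighted average of this kind, together with the sum-to-one requirement, can be sketched in Python as follows. This is an illustration only: it assumes the weights are applied to locations 1 to 8 in that order, takes the location means from Table 2, and uses a hypothetical helper name weighted_region_means.

```python
# Location means and their regions, copied from Table 2.
location_means = {1: 11, 2: 12, 3: 13, 4: 12, 5: 12, 6: 11, 7: 10, 8: 10}
region_of_location = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2}

# Weights {1 2 3}/6 for region 1 and {1 1 1 2 1}/6 for region 2,
# assumed to be listed in location order.
raw_weights = [1, 2, 3, 1, 1, 1, 2, 1]
weights = {loc: w / 6 for loc, w in zip(sorted(location_means), raw_weights)}


def weighted_region_means(means, parent, weights, tolerance=1e-9):
    """Weighted mean of child-level means within each parent level.

    Raises ValueError if the weights do not sum to 1 within a parent level.
    """
    totals = {}
    weight_sums = {}
    for level, value in means.items():
        group = parent[level]
        totals[group] = totals.get(group, 0.0) + weights[level] * value
        weight_sums[group] = weight_sums.get(group, 0.0) + weights[level]
    for group, total_weight in weight_sums.items():
        if abs(total_weight - 1.0) > tolerance:
            raise ValueError(
                f"weights within level {group} sum to {total_weight}, not 1"
            )
    return totals


region_means = weighted_region_means(location_means, region_of_location, weights)
for reg, value in region_means.items():
    print(f"region {reg}: {value:.2f}")
```

If the weights within a region did not sum to 1, the sketch would stop with an error rather than return a rescaled mean.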
The alternate form of the !AVE (!ASAVE) qualifiers allows the
weights to be read from a file which the user can create
elsewhere. Thus the code
!ASAVERAGE trial 'Tweight.csv',2
will read the weights from the second field of the file Tweight.csv.
Without the column specification, ASReml reads all the values in the file.
The user must ensure the weights are in the
coding order ASReml uses (trial order in this instance,
given in the .sln file or by using the TABULATE command).
It was noted that all !ASSOCIATE factors are included in the hyper table.
If the lowest stratum is random, it may be appropriate to ignore it.
Omitting it from the !ASSOCIATE list will allow it to reenter the
Ignore set. Specifying it with the !IGNORE qualifier will
exclude its effects from the prediction but not ignore the structural
information implied by the association.
Normally it is not necessary for any model term to involve more than one
of the associated factors. One exception is if an interaction is required
so that the variance can differ between sections. For example, fitting the
terms
at(region).trial
as random effects would allow the trials in region 1 to have a different variance component to those in region 2.
Prediction in these cases is more complicated and has only been
implemented for this specific case and the analogous region.trial
case. The associated factors must occur together in this order
for the prediction to give correct answers.
The !ASSOCIATE effect (with base averaging)
can usually be achieved with the !PRESENT
qualifier, except when the factors have so many levels that
the product of the numbers of levels exceeds about 2 147 000 000. !PRESENT fails in this case because
the KEY for identifying the cells present is a simple combination of the
levels and is stored as a normal (32-bit) integer. However, !ASSOCIATE
is preferred because it formally checks that there is an associated structure, as well as allowing averaging
at a higher level.
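Whether a set of factors is likely to hit this limit can be estimated from the numbers of levels alone. The Python sketch below simply compares the product of the level counts with the largest 32-bit signed integer, 2 147 483 647. The text does not say exactly how ASReml combines the levels into the key, so treating the product of level counts as the size of the key is an assumption, and the function name and the second set of level counts are hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def cell_key_fits_int32(level_counts):
    """Return (fits, product) for the product of the factor level counts."""
    product = 1
    for count in level_counts:
        product *= count
    return product <= INT32_MAX, product


# Three regions, 17 locations and 100 trials, as in the earlier example.
print(cell_key_fits_int32([3, 17, 100]))

# Hypothetical larger design whose product of levels exceeds the limit.
print(cell_key_fits_int32([2000, 1500, 1000]))
```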
Two !ASSOCIATE clauses may be specified, for example
PRED entry !ASSOC family entry !ASSOC reg loc trial !ASAVE reg loc
Only one member of an !ASSOCIATE list may
also appear in a !PRESENT list. If one member appears in the classify set,
only that member may appear in the !PRESENT list. For example
yield ~ region !r region.family entry
PREDICT entry !ASSOCIATE family entry !PRESENT entry region
Association averaging is used to form the cells in the PRESENT table and
PRESENT averaging is then applied.