Abstract
Variable selection in the context of a linear model or a linear mixed model is a fundamental but often contentious part of applied statistical model building. However, very little on the topic is available in the statistical literature. In the current article, we propose a new algorithm for variable selection in the context of a linear mixed model that considers investigator preference and data availability along with other statistical considerations. The performance of the new algorithm is contrasted with the available automated variable selection using backward elimination via a real data set.
Keywords
Backward elimination; Regression analysis; Linear model; Linear mixed model; Variable selection; Investigator preference; Missing proportions
Introduction
Variable selection in the context of a linear model or a linear mixed model is a fundamental but often contentious part of applied statistical model building. In most applied statistical research, investigators face the dilemma of selecting a small number of the most "important" characteristics to include in the final linear regression or logistic regression model. Differences in the characteristics selected have a direct impact on the results of the study and thus have practical consequences for how those results are interpreted and used. Studies in the social and behavioral sciences often collect a large amount of relevant information on each subject. Longitudinal studies collect repeated data on these variables at different time intervals. However, because of obvious constraints such as money and time, the sample size in these studies may not always be large enough. At the analysis stage, the investigators then have the difficult task of selecting a smaller subset of the available variables using some statistical criteria. The most common statistical methods for this purpose are forward selection, backward selection, and stepwise selection [1]. Also available are the empirical Bayes method [2], the Lasso method [3], and the Gibbs sampling method [4], to name a few. These methods use strictly statistical criteria such as AIC [5], BIC [6], or C_{p} [7] to identify the best possible set of predictors from among a much larger set. While such statistical criteria assure objectivity in model selection, they lack subjective input from the experts in the field. The subject area experts, from their experience in the area of study and the focus of the analyses, may provide useful guidance in model building and thus make the end results more practically oriented towards the goals of the study.
Statistical model building is a balanced combination of art and science, where the statistical criteria provide the science component and the subjective input from the experts serves as the art.
Missing data are a part of virtually any applied statistical study. They are a nuisance and can also be a severe constraint on statistical analyses. Even given best efforts, it is not always possible to avoid missing data, especially in a longitudinal study. There are various types of missing data: missing completely at random (MCAR), missing at random (MAR), and nonignorable missing (NINR) [8]. Various authors discuss methods of data analysis that can be employed when some data are missing. See [9] for a review of available statistical techniques and their properties. A comparison of available statistical methods for incomplete data regression models and various software implementations of such techniques is provided in [10].
One common feature of the variable selection methods mentioned above is that all of them assume a complete data set. In most cases this results in a complete case analysis where the cases with missing variables are deleted or the missing data are imputed using one of several imputation methods. Imputation methods, though attractive in some specific situations, are complicated, subject to additional assumptions about the data generation process, and difficult to implement using standard statistical software. As a result, oftentimes, researchers use only the complete data cases.
In this article, we discuss a new algorithm for variable selection in the context of a linear mixed model that considers investigator preference and data availability along with other statistical considerations in statistical model building. The focus of the new algorithm is threefold: 1) to maximize the use of available data, 2) to incorporate subjective investigator input, and 3) to maintain statistical objectivity by utilizing statistical decision rules. The rest of the paper is organized as follows: Section 2 describes the new algorithm, Section 3 discusses an implementation using a real dataset, and Section 4 contains some concluding remarks.
Weighted Backward Selection Algorithm
The weighted backward selection algorithm proposed in this article is essentially a backward selection algorithm where the initial model includes all variables of interest. This complete model is then reduced to a more parsimonious model by removing redundant variables from the initial model. The usual backward selection procedure removes such redundant variables based on strict criteria of statistical significance. In the weighted backward selection algorithm, other considerations are also used when deciding which variable to drop at each step. In the present article, we demonstrate the algorithm with two additional criteria: the amount of missing data on a variable and the investigator preference for a variable. However, one can easily incorporate other considerations in the variable selection process.
The steps of the weighted backward selection algorithm where investigator preference and amount of missing data are considered along with statistical significance are described below:
 Step A: Compute the percent of missing data for each independent variable.
 Step B: List the investigator rankings of the independent variables. The most important variable according to the investigator gets the lowest rank and so on.
 Step 1: Compute the missingness index by sorting the available variables by percent of missing observations. The variable with the lowest number of missing observations gets the lowest rank and so on.
 Step 2: Estimate the model with the available independent variables and create the p-value ranking for each variable. The variable with the smallest p-value gets the lowest rank and so on.
 Step 3: Create the combined ranks for the independent variables by combining the three sets of rankings. A variable with a high p-value rank, a high missingness rank, and low investigator preference is ranked highest.
 Step 4: Exclude the variable ranked highest in the combined ranking. In case of a tied combined rank, the variable with the higher p-value is dropped.
 Step 5: Repeat steps 1-4 until there is no other variable to exclude and/or the relative change in AIC (BIC) is minimal.
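Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give the rule used to combine the three rankings, so the sketch simply sums them, and the function names `rank_ascending` and `pick_variable_to_drop` as well as the toy inputs are hypothetical.

```python
from typing import Dict


def rank_ascending(values: Dict[str, float]) -> Dict[str, int]:
    """Rank variables by ascending value; the smallest value gets rank 1.

    Tied values share the same rank (dense ranking).
    """
    ordered = sorted(set(values.values()))
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return {name: position[value] for name, value in values.items()}


def pick_variable_to_drop(
    p_values: Dict[str, float],
    pct_missing: Dict[str, float],
    investigator_rank: Dict[str, int],
) -> str:
    """Return the variable to exclude at the current iteration."""
    p_rank = rank_ascending(p_values)        # Step 2: smallest p-value -> rank 1
    miss_rank = rank_ascending(pct_missing)  # Step 1: least missing -> rank 1
    # Step B convention: investigator rank 1 is the most important variable,
    # so a larger number means lower investigator preference.
    # ASSUMPTION: the combined rank is the plain sum of the three ranks.
    combined = {
        name: p_rank[name] + miss_rank[name] + investigator_rank[name]
        for name in p_values
    }
    # Step 4: highest combined rank is dropped; ties go to the larger p-value.
    return max(combined, key=lambda name: (combined[name], p_values[name]))


# Toy inputs (not from the depression data set).
toy_p_values = {"a": 0.01, "b": 0.60, "c": 0.30}
toy_pct_missing = {"a": 5.0, "b": 10.0, "c": 40.0}
toy_investigator_rank = {"a": 1, "b": 2, "c": 2}

to_drop = pick_variable_to_drop(toy_p_values, toy_pct_missing, toy_investigator_rank)
```

In the toy input, variables b and c tie on the combined rank, and b is selected because it has the larger p-value, which mirrors the tie-breaking rule of Step 4.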
Figure 1 shows the details of the algorithm graphically.
Each type of ranking could either be numerical or categorical. For example, the investigator may rank the available variables into two categories: low and high. Similarly, one can group the variables into one of the three categories with low, medium, and high percent of missing observations. These rankings can be easily replaced with numeric weights. However, to avoid an infinite number of possible weighting schemes, one should assign weights so that the sum of all assigned weights of each type is one.
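As a minimal sketch of the normalization described above, categorical labels can be mapped to numeric scores and rescaled so that the weights of each type sum to one. The numeric scores in `CATEGORY_SCORE` and the function names are assumptions made for illustration; the article does not prescribe particular values.

```python
from typing import Dict

# ASSUMPTION: hypothetical numeric scores for the categorical labels.
CATEGORY_SCORE = {"L": 1.0, "M": 2.0, "MH": 3.0, "H": 4.0}


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: weight / total for name, weight in weights.items()}


def category_weights(labels: Dict[str, str]) -> Dict[str, float]:
    """Convert category labels (L, M, MH, H) into weights that sum to one."""
    return normalize({name: CATEGORY_SCORE[label] for name, label in labels.items()})


# Toy missingness categories for three hypothetical variables.
missingness_weights = category_weights({"x1": "L", "x2": "H", "x3": "MH"})
# x1 -> 0.125, x2 -> 0.5, x3 -> 0.375
```

Normalizing each type of weight separately keeps the three sources of information on a comparable scale before they are combined.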
Figure 1 Weighted backward selection algorithm.
An Example
We demonstrate the weighted backward selection algorithm using partial data from a clinical trial on depression [11]. Data on n = 156 patients with major depression are available for a number of repeated visits. In the example, we are using data from 5 of the visits. Along with the baseline demographic and clinical characteristics, we have data on the Hamilton Rating Scale for Depression (HRSD) [12] and Beck's Depression Inventory (BDI) [13]. For patients with a significant relationship, we also have data on Dyadic Adjustment Scale (DYS) [14]. Our objective is to explore the relationship between HRSD and BDI adjusted for other factors.
Table 1 shows the combined rankings of all variables along with the missingness index, the p-value index, and the investigator rankings of all the variables at iteration one. In this example, we used categorical rankings for all three types of information under consideration. The investigator ranked the variables into one of two categories, Low (L) and High (H), while the p-value and missingness indices are each grouped into four categories: Low (L), Medium (M), Medium-High (MH), and High (H). Variables with a high percent of missing observations are grouped in the high missingness category, while variables with large p-values are put in the high p-value category. Thus, a variable with high missingness, a high p-value, and a low investigator rank is assigned the highest combined rank. If there is more than one variable in that category, then the variable with the higher p-value among them is assigned the higher rank. In our example, the variables DYS and length of current episode have a low investigator rank, a medium-high p-value index, and a high missingness index and are thus assigned the highest combined rank. However, DYS has the higher p-value of the two and thus would be dropped at the first iteration. The remaining variables would then be used to identify the next variable to be eliminated. The process continues until either all candidate variables are eliminated or the absolute change in AIC is smaller than a prespecified minimum.
Table 1 Combined rankings of variables at iteration 1 along with three types of rankings.
| Variable | P-value index | Missingness index | Investigator rank | Combined rank |
| --- | --- | --- | --- | --- |
| Age | L | MH | H | 3 |
| Gender | H | L | H | 7 |
| Education | M | H | L | 27 |
| Ethnicity | H | L | L | 23 |
| Employment | M | M | H | 8 |
| Paired | M | MH | H | 9 |
| RDC Primary | MH | L | L | 22 |
| RDC Endogenous | H | L | L | 23 |
| Age of Onset | L | M | L | 18 |
| Length of Current Episode | MH | H | L | 30 |
| Length of Illness | L | M | L | 18 |
| BDI | L | MH | H | 3 |
| DYS | MH | H | L | 30 |
Table 2 shows the variables eliminated at each step along with the number of missing observations and p-value of the eliminated variable, the effective sample size used, and the AIC value. It is evident that the weighted backward selection algorithm utilizes more of the available data by reevaluating the effective sample size after each elimination. It also recalculates the three indices as well as the combined rankings. The regular backward selection evaluates the effective sample size only at the beginning of the first iteration and thus uses a much smaller sample for all the iterations. Also, only the p-value rankings are used to decide which variable to eliminate at each round, leading to a completely different set of variables being eliminated. See Table 3 for a side-by-side comparison of the two algorithms in terms of variables eliminated and the effective sample size at each iteration.
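The reevaluation of the effective sample size can be illustrated with a small sketch. It assumes that the effective sample size is the number of observations with no missing value on any variable still in the model, with missing values coded as `None`; the function name and the toy rows are hypothetical and unrelated to the depression data.

```python
from typing import Dict, Optional, Sequence

Row = Dict[str, Optional[float]]


def effective_sample_size(rows: Sequence[Row], variables: Sequence[str]) -> int:
    """Count rows that are complete on every variable in `variables`."""
    return sum(
        all(row.get(name) is not None for name in variables)
        for row in rows
    )


# Toy data: x2 is missing on half of the rows.
toy_rows = [
    {"x1": 1.0, "x2": None, "x3": 3.0},
    {"x1": 2.0, "x2": 5.0, "x3": None},
    {"x1": 3.0, "x2": None, "x3": 1.0},
    {"x1": 4.0, "x2": 2.0, "x3": 2.0},
]

n_full_model = effective_sample_size(toy_rows, ["x1", "x2", "x3"])  # 1 complete row
n_after_drop = effective_sample_size(toy_rows, ["x1", "x3"])        # 3 complete rows
```

Recomputing the count after each elimination recovers rows that were incomplete only because of the dropped variable, whereas a procedure that fixes the sample at the first iteration would keep using the single complete row throughout.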
Table 2 Variables eliminated at each iteration.
| Iteration | Dropped variable | P-value | Number missing | N used | AIC |
| --- | --- | --- | --- | --- | --- |
| 01 | DYS | 0.299 | 306 | 347 | 1975 |
| 02 | Education | 0.539 | 30 | 588 | 3353 |
| 03 | Length of Current Episode | 0.997 | 20 | 617 | 3505 |
| 04 | Age of Onset | 0.431 | 10 | 637 | 3610 |
| 05 | Length of Illness | 0.832 | 7 | 647 | 3669 |
| 06 | RDC Primary | 0.119 | 5 | 653 | 3691 |
| 07 | RDC Endogenous | 0.013 | 0 | 658 | 3719 |
| 08 | Ethnicity | 0.007 | 0 | 658 | 3724 |
| 09 | Paired | 0.011 | 15 | 658 | 3732 |
| 10 | Employment | 0.003 | 12 | 673 | 3829 |
| 11 | Age | 0.000 | 15 | 680 | 3874 |
| 12 | Gender | 0.009 | 0 | 694 | 4031 |
| 13 | BDI | 0.000 | 18 | 694 | 4038 |
| 14 | … | … | … | 712 | 5872 |
Table 3 Weighted and Regular Backward Elimination Algorithms.
| Iteration | Weighted: Drop | Weighted: N used | Weighted: AIC | Regular: Drop | Regular: N used | Regular: AIC |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | DYS | 347 | 1975 | Gender | 347 | 1975 |
| 2 | Education | 588 | 3353 | RDC Endo | 347 | 1977 |
| 3 | LOCE | 617 | 3505 | Ethnicity | 347 | 1978 |
| 4 | AAO | 637 | 3610 | LOCE | 347 | 1979 |
| 5 | L of Illness | 647 | 3669 | RDC Primary | 347 | 2032 |
| 6 | RDC Primary | 653 | 3691 | Employment | 347 | 2034 |
| 7 | RDC Endo | 658 | 3719 | Paired | 347 | 2062 |
| 8 | | | | Education | 347 | 2065 |
LOCE = Length of Current Episode; AAO = Age of Onset; L of Illness = Length of Illness; RDC Endo = RDC Endogenous.
The available variable selection algorithms such as backward, forward, or stepwise selection allow the investigator to force a variable to be included in the final model without any consideration of the statistical significance of the variable in question. The weighted backward selection algorithm, on the other hand, allows the investigator to place varying levels of importance on each variable via the weighting scheme. However, in this algorithm, the final decision to include or eliminate a variable still relies on the statistical importance of the variable. In Table 1, with a low investigator ranking, DYS was eliminated in the first iteration. However, if the investigator ranking were changed from low to high, DYS would not have been eliminated until iteration eight.
Conclusions
The weighted backward elimination algorithm described here incorporates factors other than the p-value in model building. While consideration of the p-value alone brings objectivity to the model building process, it completely ignores other extraneous factors. In the weighted backward elimination algorithm, both missingness of observations and investigator preference are incorporated in the process. Other such factors could also be included by using more factors when computing the combined rank of the variables. By evaluating the effective sample size repeatedly, the proposed algorithm also maximizes the use of available data without resorting to imputing the missing data. Thus the estimated model is free of the additional assumptions required for data imputation methods. It is also free of the additional programming difficulties associated with data imputation.
Our goal in this article is to describe an automated variable selection algorithm. Thus, throughout the discussions in this article, we have assumed noninformative missing data. We have also assumed the appropriateness of the linear model as well as other simplistic assumptions required for such a model.