Bootstrapping Lasso estimators
A. Chatterjee∗
Stat-Math Unit
Indian Statistical Institute, New Delhi
S. N. Lahiri†
Department of Statistics
Texas A&M University
Abstract
In this paper, we consider bootstrapping the Lasso estimator of the regression parameter in a multiple
linear regression model. It is known that the standard bootstrap method fails to be consistent. Here, we
propose a modified bootstrap method, and show that it provides a valid approximation to the distribution of the
Lasso estimator, for all possible values of the unknown regression parameter vector, including the case where
some of the components are zero. Further, we establish consistency of the modified bootstrap method for
estimating the asymptotic bias and variance of the Lasso estimator. We also show that the residual bootstrap
can be used to consistently estimate the distribution and variance of the Adaptive Lasso estimator. Using
the former result, we formulate a novel data-based method for choosing the optimal regularization parameter
for the Lasso using the modified bootstrap. A numerical study is performed to investigate the finite sample
performance of the modified bootstrap. The methodology proposed in the paper is illustrated with a real
data example.
AMS (2000) Subject Classification: Primary: 62J07; Secondary: 62G09, 62E20.
Keywords and Phrases: Penalized Regression, bootstrap variance estimation, regularization, shrinkage.
1 Introduction
Consider the following regression model
y_i = x_i′β + ε_i,   i = 1, . . . , n,   (1.1)
∗email: cha@isid.ac.in
†email: snlahiri@stat.tamu.edu
where y_i is the response, x_i = (x_{i,1}, . . . , x_{i,p})′ is a p-dimensional covariate vector, β = (β_1, . . . , β_p)′ is the
regression parameter, and {ε_i : i = 1, . . . , n} are independent and identically distributed errors. We assume
that p is fixed. The Lasso estimator of β is defined as the minimizer of the l1-norm penalized least-squares
criterion function,
β̂_n = argmin_{u ∈ R^p} Σ_{i=1}^n (y_i − x_i′u)² + λ_n Σ_{j=1}^p |u_j|,   (1.2)
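For a concrete handle on (1.2), the criterion can be minimized by cyclic coordinate descent with soft-thresholding. The following sketch (plain numpy, simulated data; the design, the true β, and the choice of λ_n are illustrative assumptions, not taken from the paper) is a minimal solver, not the optimized algorithms of Efron et al. (2004) or Friedman et al. (2007):

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso(X, y, lam, n_sweeps=500):
    """Minimize sum_i (y_i - x_i'u)^2 + lam * sum_j |u_j|, i.e. criterion (1.2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # sum_i x_{ij}^2 for each column j
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with coordinate j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # stationarity of (1.2) in u_j gives a soft threshold at lam / 2
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 1.0, 0.0])   # two zero components
y = X @ beta_true + rng.normal(scale=0.5, size=n)
lam = 2.0 * np.sqrt(n)                       # lambda_n of order n^{1/2}, cf. (C.2)
beta_hat = lasso(X, y, lam)
```

With a regularization parameter of this order, the fitted vector typically keeps the two strong coefficients close to their true values while shrinking the two null coefficients toward (or exactly to) zero.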
where λ_n is a regularization parameter. The Lasso estimator was introduced by Tibshirani (1996) as an
estimation and variable selection method. The Lasso estimator has two nice properties, namely, (i) the
nature of regularization used in the Lasso leads to sparse solutions and (ii) it is also computationally feasible
(see Efron et al. (2004), Osborne et al. (2000), Friedman et al. (2007)). The sparse solutions obtained by
using the Lasso automatically lead to model selection. In the finite dimensional case, many authors have
studied the model-consistency properties of the Lasso and investigated conditions under which the Lasso can
recover the true sparsity pattern (see Zhao and Yu (2006), Wainwright (2006), Zou (2006)). Beyond the
linear model setup (1.1), Yuan and Lin (2007) have studied the neighborhood selection properties of the
Lasso in graphical models. Recently, Bach (2009) considered using bootstrap samples in order to improve
the model selection accuracy of the Lasso.
An important problem in this context is the estimation consistency of the Lasso. This was first studied
by Knight and Fu (2000) for the finite dimensional regression model (1.1). The asymptotic distribution was
found, and it was shown that the Lasso estimator is weakly consistent. They also showed that if λ_n is sufficiently
large, then some components of the Lasso estimate may be exactly zero. It was found that under appropriate
regularity conditions, the limiting distribution of the Lasso estimator assigns positive mass at zero for the
components where the true regression parameter has zero values. Since the limit distribution of the Lasso
estimator is complicated (cf. Knight and Fu (2000)), it is important to have alternative approximations to
the distribution of the Lasso estimator that can be used in practice to set confidence regions and to carry out
tests on the parameter vector. Knight and Fu (2000) considered using the bootstrap to generate alternative
approximations. More specifically, Knight and Fu (2000) considered the residual-based bootstrap method
(cf. Freedman (1981)) for the Lasso estimator and sketched out its asymptotic behavior. Recently, it was
investigated rigorously by Chatterjee and Lahiri (2010), who show that the asymptotic distribution
of the bootstrapped Lasso estimator is a random measure on R^p and that the bootstrap is inconsistent
whenever one or more components of the regression parameter are zero. Thus, in situations where the limit
distribution of the Lasso estimator is most complicated and alternative approximations are needed the most,
the usual bootstrap fails drastically! In this paper, we construct a suitable modification to the residual based
bootstrap method and show that under mild regularity conditions, the modified version of the bootstrap is
indeed consistent in estimating the limiting distribution of the Lasso estimator, even when some components
of β are zero.
Another important issue that has eluded a satisfactory solution to date is the problem of attaching
standard error estimates to the Lasso estimates. Initially, Tibshirani (1996) suggested an approximation
that had the drawback of providing zero standard error estimates when the estimated coefficient was zero.
Osborne et al. (2000) suggested an improved alternative but as pointed out by Knight and Fu (2000), all
these methods suffered from the drawback of considering the Lasso as an approximately linear transformation.
Other related methods (like the 'sandwich formula') for variance estimation in the penalized regression setup
were suggested by Fan and Li (2001) and Fan and Peng (2004), which only provided variance estimates
for the underlying non-zero components. One of the main contributions of this paper is to show that the
modified bootstrap method, that we propose here, gives a consistent variance estimator for the Lasso, for
both zero and nonzero parameter values. In particular, the bootstrap based variance estimate overcomes the
drawbacks of some of the earlier variance estimation techniques, like producing zero variance estimates for
estimated zero coefficients. As an application of this result, we provide bootstrap based confidence regions
for the true parameter vector.
From Knight and Fu (2000)’s work, it is known that the asymptotic distribution of the Lasso estimator
depends on the regularization parameter through λ_0, where λ_n n^{-1/2} → λ_0. In particular, the accuracy of
the Lasso estimator β̂n critically depends on the choice of the regularization parameter λn. In this context,
we formulate a new data-based method for selection of the regularization parameter. Since the modified
bootstrap is consistent for the mean squared error (MSE) of the Lasso estimator, we define a criterion
function based on the modified bootstrap estimator of the MSE as a rescaled function of λ_n. In a related
context, Hall et al. (2009) recently suggested using the m-out-of-n bootstrap to choose the optimal
regularization parameter in the Adaptive Lasso setup.
We also study the properties of the bootstrapped version of the Adaptive Lasso estimator (Zou (2006)).
We find that the residual bootstrap based version of the Adaptive Lasso estimator is consistent in estimating
the asymptotic distribution and variance of the Adaptive Lasso estimator. As with the Lasso estimator,
the question of the validity of the bootstrap for the Adaptive Lasso had been unresolved till now, and
our results show that the simple residual bootstrap can consistently estimate the distribution and provide
variance estimates for the Adaptive Lasso estimator. This is unlike the case of the Lasso, where the simple
residual bootstrap fails.
We conclude this section with a brief literature review. Knight and Fu (2000) derived the asymptotic
distribution of the Lasso estimator under model (1.1) in the case where the dimension p of the regression
is fixed. Properties of the standard bootstrap method have been investigated by Knight and Fu (2000) and
Chatterjee and Lahiri (2010), in the same set up. In the finite dimensional case, Pötscher and co-authors
have interesting results on the impossibility of estimating the distribution function of a Lasso estimator in
a uniform sense. The relevance of their results in the context of our work is discussed in Section 3.1. In the
high-dimensional case, where p is allowed to grow with n, work on estimation consistency of the Lasso is
limited; among them, Huang et al. (2008) considered the asymptotic properties of ℓ_q-norm penalized regression
estimators (0 < q < 1) for high dimensional regression models. There is a large amount of literature on the
variable selection properties of the Lasso and we do not attempt to summarize all the work. We refer to the
recent paper by Zhang et al. (2008), who describe a two-step procedure for variable selection using the Lasso
in the high dimensional setup, and also provide a clear picture of recent developments on variable selection
in the high dimensional setting.
The rest of the paper is organized as follows. In Section 2.1, we briefly review the residual based bootstrap
and motivate the intuitive reasons behind its failure. We formulate the modified bootstrap method in Section
2.2. The main results on the consistency of the modified bootstrap are stated in Section 3. In Section 4, the results on
the consistency of the residual bootstrap for the Adaptive Lasso estimator are presented. The data-based
method for the selection of the (MSE-optimal) regularization parameter is presented in Section 5. The finite
sample performance of the proposed modified bootstrapped Lasso estimator and the data-based methodology
of choosing the optimal regularization parameter is studied in Section 6 using a simulated data set. A real
data example is presented in Section 7. The proofs are provided in Section 8.
2 Formulation of the modified bootstrap method
2.1 Background and motivation
In a regression setup like (1.1), there are two approaches to bootstrapping, depending on whether the x_i's are
assumed to be random or not. In the case where the x_i's are random, it is of interest to study the joint distribution
of the covariates and the response, and hence the pairwise bootstrap is a relevant choice. In contrast, here we
assume that the xi’s are non-random. In this situation, the standard approach to bootstrapping is the
residual bootstrap (cf. Efron (1979), Freedman (1981)), which was considered by Knight and Fu (2000)
in the context of the Lasso estimator. To motivate the modified bootstrap method, we first give a brief
description of the residual bootstrap. Let β̂n denote the Lasso estimator of β given by (1.2). Define the
residuals
e_i = y_i − x_i′β̂_n,   i = 1, . . . , n.
Consider the set of centered residuals {e_i − ē_n : i = 1, . . . , n}, where ē_n = n^{-1} Σ_{i=1}^n e_i. For the residual
bootstrap, one selects a with-replacement sample of size n, {e*_i : i = 1, . . . , n}, from the set of centered
residuals and formulates the bootstrap version of (1.1) as

y*_i = x_i′β̂_n + e*_i,   i = 1, . . . , n.
Next, based on the bootstrap dataset {(y*_i, x_i) : i = 1, . . . , n}, the bootstrap version of the Lasso estimator
is defined as:

β*_n = argmin_{u ∈ R^p} Σ_{i=1}^n (y*_i − x_i′u)² + λ_n Σ_{j=1}^p |u_j|.   (2.1)
The bootstrap version of T_n ≡ n^{1/2}(β̂_n − β) is T*_n = n^{1/2}(β*_n − β̂_n). The residual bootstrap estimator
of the unknown distribution G_n (say) of T_n is the (conditional) distribution Ĝ_n(·) (say) of T*_n given the
observations {y_i : i = 1, . . . , n}, i.e.,

Ĝ_n(B) = P*(T*_n ∈ B),   B ∈ B(R^p),   (2.2)

where P* denotes conditional probability given the error variables {ε_i : i = 1, . . . , n} and B(R^p) denotes the
Borel σ-field on R^p.
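The steps leading to (2.1)-(2.2) can be sketched in code as follows; the coordinate-descent solver, the simulated data, and the number of bootstrap replicates B are illustrative assumptions:

```python
import numpy as np

def lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the Lasso criterion (1.2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ (y - X @ beta + X[:, j] * beta[j])
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return beta

def residual_bootstrap(X, y, lam, B, rng):
    """Draw B replicates of T*_n = sqrt(n) (beta*_n - beta_hat_n)."""
    n, p = X.shape
    beta_hat = lasso(X, y, lam)
    e = y - X @ beta_hat                    # residuals
    e_centered = e - e.mean()               # centering, cf. Remark 1
    T_star = np.empty((B, p))
    for b in range(B):
        e_star = rng.choice(e_centered, size=n, replace=True)
        y_star = X @ beta_hat + e_star      # bootstrap responses
        beta_star = lasso(X, y_star, lam)   # bootstrap Lasso estimator (2.1)
        T_star[b] = np.sqrt(n) * (beta_star - beta_hat)
    return T_star

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)
T_star = residual_bootstrap(X, y, lam=np.sqrt(n), B=100, rng=rng)
```

The empirical distribution of the B rows of T_star plays the role of the estimator Ĝ_n in (2.2).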
For the bootstrap approximation to be useful, one would expect Ĝn(·) to be close to Gn(·). However,
this is not the case; Chatterjee and Lahiri (2010) show that the residual bootstrap estimator Ĝn(·), instead
of converging to the deterministic limit of Gn given by Knight and Fu (2000), converges weakly to a random
probability measure and therefore, it fails to provide a valid approximation to Gn(·). To appreciate why
the residual bootstrap approximation has a random limit and why it is inconsistent, first observe that the
Lasso estimators of the non-zero components of β estimate their signs correctly with high probability, but the
estimators of the zero-components take both positive and negative values with positive probabilities, thereby
failing to capture the target sign value (which is zero for such components). A close examination of
the proof of the main result (cf. Theorem 3.1 in Chatterjee and Lahiri (2010)) shows that although the
formulation of the residual bootstrap mimics the main features of the regression model closely, it fails to
reproduce the sign of the zero-components of β with sufficient accuracy in the formulation of the bootstrap
Lasso estimation criterion (2.1), leading to the random limit.
2.2 A modified bootstrap method
Based on the discussion of the last paragraph, we now propose a modified version of the bootstrapped Lasso
estimator that more closely reproduces the sign-vector corresponding to the unknown parameter β. As
seen in Chatterjee and Lahiri (2010), the inconsistency of the standard residual bootstrap arises when some
components of β are zero. The key idea behind the modified bootstrap is to force components of the Lasso
estimator β̂n to be exactly zero whenever they are close to zero. Since the original Lasso estimator is root-n
consistent, its fluctuations are of the order n^{-1/2} around the true value. This suggests that a neighborhood
of order larger than n^{-1/2} around the true value will contain the values of the Lasso estimator with high
probability. To that end, let {an} be a sequence of real numbers such that
a_n + (n^{-1/2} log n) a_n^{-1} → 0   as n → ∞.   (2.3)
For example, a_n = c n^{-δ} satisfies (2.3) for all c ∈ (0, ∞) and δ ∈ (0, 1/2). We threshold the components of
the Lasso estimator β̂_n at a_n and define the modified Lasso estimator as
β̃_n = (β̃_{n,1}, . . . , β̃_{n,p})′,   with   β̃_{n,j} = β̂_{n,j} 1(|β̂_{n,j}| ≥ a_n),   j = 1, . . . , p,   (2.4)
where β̂_n is the usual Lasso estimator defined in (1.2) and 1(·) denotes the indicator function. Note that for
a nonzero component β_j,

|β̂_{n,j}| = |β_j| + O_p(n^{-1/2}) > |β_j|/2 ≥ a_n

for n large, with high probability, and therefore β̃_{n,j} = β̂_{n,j} for n large, with probability tending to 1.
Thus, the thresholding does not have any significant effect on the non-zero components. However, for a zero
component β_j = 0,

|β̂_{n,j}| = |β_j| + O_p(n^{-1/2}) = O_p(n^{-1/2}) < a_n

with probability tending to 1 as n → ∞, and thus

β̃_{n,j} = β̂_{n,j} 1(|β̂_{n,j}| ≥ a_n) = 0   for large n,

with probability tending to 1. In particular, the thresholding at a_n accomplishes our main objective: namely,
to capture the signs of the zero components precisely with probability tending to 1, as the sample size n
goes to infinity.
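The threshold sequence in (2.3) and the modified estimator (2.4) amount to a few lines of code; the constants c = 1 and δ = 1/4 below are one illustrative choice satisfying (2.3), and the fitted vector β̂_n is hypothetical:

```python
import numpy as np

def a_n(n, c=1.0, delta=0.25):
    """Threshold sequence a_n = c * n^{-delta}, delta in (0, 1/2), cf. (2.3)."""
    return c * n ** (-delta)

def threshold_lasso(beta_hat, n, c=1.0, delta=0.25):
    """Modified estimator (2.4): zero out components smaller than a_n."""
    a = a_n(n, c, delta)
    return np.where(np.abs(beta_hat) >= a, beta_hat, 0.0)

# numerically check (2.3): the ratio (n^{-1/2} log n) / a_n = n^{delta-1/2} log n
# shrinks as n grows, since delta < 1/2
ratios = []
for n in (10**2, 10**4, 10**6):
    ratios.append((n ** -0.5 * np.log(n)) / a_n(n))

# a hypothetical Lasso fit with two components close to (but not exactly) zero
beta_hat = np.array([1.98, 0.012, -0.95, -0.004])
beta_tilde = threshold_lasso(beta_hat, n=100)   # a_100 = 100^{-1/4} ≈ 0.316
```

The two small components fall below a_n and are set exactly to zero, while the two strong components pass through unchanged.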
Next, we define the modified residuals {r_i : i = 1, . . . , n} based on this estimator β̃_n by

r_i = y_i − x_i′β̃_n,   i = 1, . . . , n.   (2.5)

Let r̄_n = n^{-1} Σ_{i=1}^n r_i. We select a random sample {e**_1, . . . , e**_n} of size n with replacement from the centered
residuals {r_i − r̄_n : i = 1, . . . , n} and set

y**_i = x_i′β̃_n + e**_i,   i = 1, . . . , n.
Then, the modified bootstrap Lasso estimator is

β**_n := argmin_{u ∈ R^p} Σ_{i=1}^n (y**_i − x_i′u)² + λ_n Σ_{j=1}^p |u_j|.   (2.6)

Let

T**_n = n^{1/2}(β**_n − β̃_n),   n ≥ 1,
and let G̃_n(·) denote the conditional distribution of T**_n given the observations (or the error variables {ε_i : i =
1, . . . , n}), i.e., G̃_n(B) = P*(T**_n ∈ B), B ∈ B(R^p). Thus, G̃_n(·) is the modified bootstrap approximation
to the unknown distribution G_n(·) of T_n. The modified bootstrap estimator of a population parameter
θ_n = ϕ(G_n), defined through a functional ϕ(·) of G_n, is ϕ(G̃_n). For example, the modified bootstrap
estimator of E(T_n), the scaled bias of β̂_n, is E*(T**_n), and similarly, that of Var(T_n) is Var*(T**_n), where
E* and Var* denote the expectation and variance under P*.
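Combining (2.4)-(2.6), one full run of the modified bootstrap can be sketched as follows; the solver, the simulated data, the tuning constants, and the Monte Carlo approximation of E*(T**_n) and Var*(T**_n) by averages over B replicates are all illustrative assumptions:

```python
import numpy as np

def lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the Lasso criterion (1.2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ (y - X @ beta + X[:, j] * beta[j])
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return beta

def modified_bootstrap(X, y, lam, B, rng, c=1.0, delta=0.25):
    """Return B draws of T**_n = sqrt(n) (beta**_n - beta_tilde_n)."""
    n, p = X.shape
    beta_hat = lasso(X, y, lam)
    a_n = c * n ** (-delta)                                        # cf. (2.3)
    beta_tilde = np.where(np.abs(beta_hat) >= a_n, beta_hat, 0.0)  # (2.4)
    r = y - X @ beta_tilde                                         # (2.5)
    r_centered = r - r.mean()
    T = np.empty((B, p))
    for b in range(B):
        e_bb = rng.choice(r_centered, size=n, replace=True)
        y_bb = X @ beta_tilde + e_bb
        beta_bb = lasso(X, y_bb, lam)                              # (2.6)
        T[b] = np.sqrt(n) * (beta_bb - beta_tilde)
    return T

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)
T = modified_bootstrap(X, y, lam=np.sqrt(n), B=100, rng=rng)
bias_est = T.mean(axis=0)   # Monte Carlo estimate of E_*(T**_n)
var_est = T.var(axis=0)     # Monte Carlo estimate of Var_*(T**_n)
```

The rows of T approximate G̃_n, and the Monte Carlo moments approximate the bias and variance functionals discussed above.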
Remark 1: It should be noted that centering the residuals {r_i : i = 1, . . . , n} is a must for the validity of
the residual bootstrap in general, as it ensures the bootstrap analogue of the model condition E(ε_1) = 0.
Note that this is a sufficient condition to ensure E*(Σ_{i=1}^n x_i e**_i) = 0. For centered x_i's, the 'centering of
residuals' step can be bypassed, as the conditional mean of the sum Σ_{i=1}^n x_i e**_i will still be zero. For real
data sets, where the responses and covariates are already known to be centered, centering the residuals
is not a required step.
On the other hand, scaling the residuals is not as critical: the residual bootstrap is known
to be consistent for the OLS estimator (and also for the Lasso, as shown here) without any scale adjustments. In the
literature, the rescaling factor (1 − p/n)^{-1/2} is sometimes used (see Efron (1982)) to improve finite sample
accuracy, but asymptotically this has a negligible effect, as p is fixed in this case.
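The centering point of Remark 1 can be checked numerically; the residual and covariate values below are hypothetical:

```python
import numpy as np

# Hypothetical residuals from a Lasso fit; centering them makes the bootstrap
# errors e**_i conditionally mean-zero under P_*, since E_*(e**_1) is just the
# average of the values being resampled.
r = np.array([0.3, -1.1, 0.4, 0.9, -0.2])
r_centered = r - r.mean()
cond_mean = r_centered.mean()                   # E_*(e**_1) = 0

# With centered covariates, the centering step can be bypassed:
# E_*(sum_i x_i e**_i) = (sum_i x_i) * E_*(e**_1), and sum_i x_i = 0 already.
x = np.array([1.2, -0.7, 0.1, 2.0, -0.4])
x_centered = x - x.mean()
uncentered_term = x_centered.sum() * r.mean()   # zero even without centering r
```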
Remark 2: In (2.4), the thresholded version β̃_n of the Lasso estimator has been constructed by thresholding
the usual Lasso estimator β̂_n. It is possible to replace β̂_n by any other √n-consistent estimator of
β, and this fact follows from the proof of Theorem 3.1. For example, the usual least-squares estimator of β
can be used. In such a situation, β̃_n can be similarly defined by thresholding each component of the selected
√n-consistent estimator of β. In terms of computational efficiency, the choice of an alternative estimator,
like the usual least-squares estimator, does provide some computational advantage over using the Lasso
estimator β̂_n. But, in practical situations, with the existence of extremely fast computational algorithms
(Efron et al. (2004), Friedman et al. (2007)) that can be used for computing Lasso estimators, the extra
effort in computing the Lasso estimator β̂_n becomes a minor issue.
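As a sketch of the alternative suggested in Remark 2, the least-squares estimator can be thresholded in place of the Lasso; the simulated data and the threshold constants below are illustrative assumptions:

```python
import numpy as np

# Per Remark 2, any sqrt(n)-consistent estimator can be thresholded in place of
# the Lasso; here we use the ordinary least-squares estimator on simulated data.
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # sqrt(n)-consistent estimator
a_n = n ** (-0.25)                                # threshold satisfying (2.3)
beta_tilde = np.where(np.abs(beta_ols) >= a_n, beta_ols, 0.0)
```

Since the OLS fluctuations are O_p(n^{-1/2}) while a_n shrinks more slowly, the null coefficient is zeroed out with high probability, exactly as in the Lasso-based construction.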
In the next section, we show that under some mild conditions, the modified bootstrap estimators of the
distribution function of T_n and of its bias and variance functionals are consistent.
3 Bootstrapping the Lasso estimator
3.1 Consistency of the distributional approximation
The first result shows that the modified bootstrap gives a valid approximation to the distribution of T n:
Theorem 3.1 (consistency of the modified bootstrap). Suppose that the following assumptions hold:

(C.1) n^{-1} Σ_{i=1}^n x_i x_i′ → C, where C is positive definite. Further, n^{-1} Σ_{i=1}^n ‖x_i‖³ = O(1).

(C.2) λ_n n^{-1/2} → λ_0 ≥ 0.

(C.3) The errors {ε_i : i = 1, . . . , n} are independent and identically distributed with E(ε_1) = 0 and Var(ε_1) =
σ² ∈ (0, ∞).
Then

ρ(G̃_n(·), G_n(·)) → 0   as n → ∞, with probability 1,

where ρ(·, ·) denotes the Prohorov metric on the set of all probability measures on (R^p, B(R^p)).
Theorem 3.1 asserts strong consistency of the modified bootstrap distribution function estimator under
Assumptions (C.1)-(C.3). In contrast, for the standard version of the residual bootstrap, Chatterjee and
Lahiri (2010) show that under the same set of regularity assumptions, if λ_0 > 0 in Assumption (C.2) and
if β has at least one zero component, then
ρ(Ĝ_n(·), G_n(·)) ↛ 0   in probability, as n → ∞,
where Ĝ_n is the residual bootstrap estimate of G_n (cf. (2.2)). Thus, while the standard residual bootstrap
has limited success in the presence of zero components, the modified bootstrap removes this limitation of the
residual bootstrap, and provides a valid approximation to the distribution of the centered and scaled Lasso
estimator for all values of the regression parameter β.
Next let G_∞(·) denote the limit distribution of T_n (cf. Knight and Fu (2000)). Then, from Theorem 3.1,
it follows that for any Borel set B ⊂ R^p with G_∞(∂B) = 0,
P*(T**_n ∈ B) − P(T