Sharp Oracle Inequalities for Square Root Regularization


  1. Journal of Machine Learning Research 18 (2017) 1-29. Submitted 9/15; Revised 6/16; Published 8/17. Sharp Oracle Inequalities for Square Root Regularization. Benjamin Stucky (stucky@stat.math.ethz.ch) and Sara van de Geer (geer@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Rämistrasse 101, 8092 Zurich, Switzerland. Editor: Nicolas Vayatis.

Abstract: We study a set of regularization methods for high-dimensional linear regression models. These penalized estimators have the square root of the residual sum of squared errors as loss function, and any weakly decomposable norm as penalty function. This fit measure is chosen because of its property that the estimator does not depend on the unknown standard deviation of the noise. A generalized weakly decomposable norm penalty, on the other hand, is very useful for dealing with different underlying sparsity structures: we can choose a different sparsity-inducing norm depending on how we want to interpret the unknown parameter vector β. Structured sparsity norms, as defined in Micchelli et al. (2010), are special cases of weakly decomposable norms; we therefore also cover the square root LASSO (Belloni et al., 2011), the group square root LASSO (Bunea et al., 2014) and a new method called the square root SLOPE (in a similar fashion to the SLOPE of Bogdan et al., 2015). For this collection of estimators our results provide sharp oracle inequalities based on the Karush-Kuhn-Tucker conditions. We discuss some examples of estimators, and based on a simulation we illustrate some advantages of the square root SLOPE.

Keywords: square root LASSO, structured sparsity, Karush-Kuhn-Tucker, sharp oracle inequality, weak decomposability

1. Introduction and Model

The recent development of new technologies has made data gathering much less of a problem. In some sense there is more data than we can handle, or than we need; the problem has shifted towards finding useful and meaningful information in the big sea of data. An example where such problems arise is the high-dimensional linear regression model

    Y = Xβ^0 + ε.    (1.1)

Here Y is the n-dimensional response variable, X is the n × p design matrix and ε is the vector of i.i.d. noise with E(ε_i) = 0 and Var(ε_i) = σ² for all i ∈ {1,...,n}. Assume that σ is unknown, and that β^0 is the "true" underlying p-dimensional

©2017 Benjamin Stucky and Sara van de Geer. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-482.html.

  2. parameter vector of the linear regression model, with active set S_0 := supp(β^0). While trying to explain Y through other variables in the high-dimensional linear regression model, we need to set the less important explanatory variables to zero; otherwise we would overfit. This is the process of finding a trade-off between a good fit and a sparse solution: we are trying to find a solution that explains our data well, but at the same time only uses the more important variables to do so. The most famous and widely used estimator for the high-dimensional regression model is the ℓ1-regularized version of least squares, called the LASSO (Tibshirani, 1996),

    β̂_L(σ) := argmin_{β ∈ R^p} { ‖Y − Xβ‖²_n + 2λ_1 σ ‖β‖_1 }.

Here λ_1 is a constant called the regularization level, which regulates how sparse the solution should be. Note that the construction of the LASSO estimator depends on the unknown noise level σ. We let ‖a‖_1 := Σ_{i=1}^p |a_i| for any a ∈ R^p denote the ℓ1-norm, and for any a ∈ R^n we write ‖a‖²_n := Σ_{j=1}^n a_j²/n, the squared ℓ2-norm divided by n. The LASSO uses the ℓ1-norm as a measure of sparsity; this regularizer sets a number of parameters to zero. Let us rewrite the LASSO in the following form:

    β̂_L = argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ_0(β) ‖β‖_1 },   where λ_0(β) := 2λ_1 σ / ‖Y − Xβ‖_n.

Instead of minimizing with λ_0(β), a function of β, let us assume that we keep λ_0(β) a fixed constant λ. Then we get the square root LASSO method

    β̂_srL := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ ‖β‖_1 }.

So in some sense the λ of the square root LASSO is a scaled version of λ_1 from the LASSO, scaled by an adaptive estimator of σ. By the optimality conditions it holds that β̂_L(‖Y − Xβ̂_srL‖_n) = β̂_srL. The square root LASSO was introduced by Belloni et al. (2011) in order to get a pivotal method. An equivalent formulation as a joint convex optimization program can be found in Owen (2007), and the method has been studied under the name scaled LASSO in Sun and Zhang (2012). Pivotal means that the theoretical λ does not depend on the unknown standard deviation σ or on any estimated version of it; the estimator does not require estimating the unknown σ. Belloni et al. (2014) also showed that under Gaussian noise the theoretical λ can be chosen of order Φ^{-1}(1 − α/(2p))/√(n − 1), with Φ^{-1} denoting the inverse of the standard Gaussian cumulative distribution function and α some small probability. This choice is independent of σ and achieves a near oracle inequality for the prediction norm with convergence rate σ√((|S_0|/n) log p). In contrast, the theoretical penalty level of the LASSO depends on knowing σ in order to achieve similar oracle inequalities for the prediction norm.
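To make the definition concrete, the following is a minimal sketch (not the authors' implementation) of the square root LASSO as a convex program in Python with cvxpy; the helper name sqrt_lasso, the toy data and the choice of lam are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def sqrt_lasso(X, Y, lam):
    """Square root LASSO: minimize ||Y - X beta||_n + lam * ||beta||_1,
    where ||v||_n = ||v||_2 / sqrt(n)."""
    n, p = X.shape
    beta = cp.Variable(p)
    loss = cp.norm(Y - X @ beta, 2) / np.sqrt(n)   # square-root loss, pivotal in sigma
    penalty = lam * cp.norm(beta, 1)               # l1 sparsity penalty
    cp.Problem(cp.Minimize(loss + penalty)).solve()
    return beta.value

# toy usage with simulated data
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:5] = 3.0
Y = X @ beta0 + rng.standard_normal(n)
lam = np.sqrt(2 * np.log(p) / n)   # a rough stand-in for the theoretical penalty level
beta_hat = sqrt_lasso(X, Y, lam)
```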

  3. The idea of the square root LASSO was further developed in Bunea et al. (2014) into the group square root LASSO, in order to obtain a selection of groups of predictors. The group LASSO norm is another way to describe an underlying sparsity, namely when groups of parameters should be set to zero instead of individual parameters. Another extension of the square root LASSO, for the case of matrix completion, was given by Klopp (2014). In this paper we go further and generalize the idea of the square root LASSO to any sparsity-inducing norm. From now on we look at the family of norm-penalized regularization methods of the following square root type:

    β̂ := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ Ω(β) },

where Ω is any norm on R^p. This set of regularization methods will be called square root regularization methods. Furthermore, we introduce the following notation:

    ε̂ := Y − Xβ̂,   the residuals,
    Ω*(x) := max_{z: Ω(z) ≤ 1} z^T x,  x ∈ R^p,   the dual norm of the norm Ω,
    β_S := {β_j : j ∈ S},   for all S ⊂ {1,...,p} and all vectors β ∈ R^p.

Later we will see that describing the underlying sparsity with an appropriate sparsity norm can make a difference in how good the errors will be. Therefore in this paper we extend the idea of the square root LASSO with the ℓ1-penalty to more general weakly decomposable norm penalties. The theoretical λ of such an estimator will not depend on σ either. We introduce the Karush-Kuhn-Tucker conditions for these estimators and give sharp oracle inequalities. In the last two sections we give some examples of different norms and simulations comparing the square root LASSO with the square root SLOPE.

2. Karush-Kuhn-Tucker Conditions

As we have already seen, these estimators require calculating a minimum over β. The Karush-Kuhn-Tucker (KKT) conditions characterize this minimum. In order to formulate these optimality conditions we need some concepts from convex optimization. For the reader who is not familiar with this topic, we introduce the subdifferential, which generalizes the differential, and give a short overview of some of its properties, as can be found for example in Bach et al. (2012). For any convex function g : R^p → R and any vector w ∈ R^p we define its subdifferential as

    ∂g(w) := {z ∈ R^p : g(w') ≥ g(w) + z^T(w' − w) for all w' ∈ R^p}.

The elements of ∂g(w) are called the subgradients of g at w. Let us remark that all convex functions have non-empty subdifferentials at every point. Moreover, by the definition of the subdifferential any subgradient defines a tangent w' ↦ g(w) + z^T(w' − w) that goes through g(w) and lies at every point below the function g. If g is differentiable at w, then its subdifferential at w is the usual gradient. Now the
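As a quick numerical aside (a sketch under the assumption that cvxpy is available; not part of the paper), the dual norm Ω*(x) = max_{Ω(z)≤1} z^T x can be evaluated directly as a small convex program; for Ω = ℓ1 this recovers the ℓ∞-norm.

```python
import numpy as np
import cvxpy as cp

def dual_norm_l1(x):
    """Numerically evaluate Omega*(x) = max_{||z||_1 <= 1} z^T x.
    For Omega = l1 this equals the l-infinity norm of x."""
    z = cp.Variable(x.shape[0])
    prob = cp.Problem(cp.Maximize(z @ x), [cp.norm(z, 1) <= 1])
    prob.solve()
    return prob.value

x = np.array([0.3, -2.0, 1.1])
print(dual_norm_l1(x), np.max(np.abs(x)))  # both approximately 2.0
```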

  4. next lemma, which dates back to Pierre Fermat (see Bauschke and Combettes 2011), shows how to find a global minimum of a convex function g.

Lemma 1 (Fermat's Rule) For all convex functions g : R^p → R it holds that v ∈ R^p is a global minimum of g ⇔ 0 ∈ ∂g(v).

For any norm Ω on R^p and ω ∈ R^p, its subdifferential can be written as (see Bach et al. 2012, Proposition 1.2)

    ∂Ω(ω) = {z ∈ R^p : Ω*(z) ≤ 1}                          if ω = 0,
             {z ∈ R^p : Ω*(z) = 1 and z^T ω = Ω(ω)}         if ω ≠ 0.    (2.1)

We can apply these properties to our estimator β̂. Lemma 1 implies that

    β̂ is optimal ⇔ −(1/λ) ∇‖Y − Xβ̂‖_n ∈ ∂Ω(β̂).

This means that, in the case ‖ε̂‖_n > 0, for the square root regularization estimator β̂ it holds that

    β̂ is optimal ⇔ X^T(Y − Xβ̂) / (nλ‖Y − Xβ̂‖_n) ∈ ∂Ω(β̂).    (2.2)

By combining equation (2.1) with (2.2) we can write the KKT conditions as

    β̂ is optimal ⇔ Ω*( ε̂^T X / (n‖ε̂‖_n) ) ≤ λ                                              if β̂ = 0,
                     Ω*( ε̂^T X / (n‖ε̂‖_n) ) = λ  and  ε̂^T Xβ̂ / (n‖ε̂‖_n) = λ Ω(β̂)           if β̂ ≠ 0.    (2.3)

What we might first remark about equation (2.3) is that in the case β̂ ≠ 0 the second part can be written as

    ε̂^T Xβ̂ / n = Ω(β̂) · Ω*( ε̂^T X / n ).

This means that we in fact have equality in the generalized Cauchy-Schwarz inequality for these two p-dimensional vectors. Furthermore, let us remark that the equality ε̂^T Xβ̂ / n = Ω(β̂) λ ‖ε̂‖_n trivially holds true in the case β̂ = 0. It is important to remark here that, in contrast to the KKT conditions for the LASSO, we have an additional ‖ε̂‖_n term in the expression Ω*( ε̂^T X / (n‖ε̂‖_n) ). This scaling leads to the property that the theoretical λ is independent of σ. With the KKT conditions we are able to formulate a generalized type of KKT conditions; the next lemma is needed for the proofs in the next chapter.

Lemma 2 For the square root type estimator β̂ we have, for any β ∈ R^p and when ‖ε̂‖_n ≠ 0,

    (1/‖ε̂‖_n) ε̂^T X(β − β̂)/n + λ Ω(β̂) ≤ λ Ω(β).
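For the ℓ1-penalty the dual norm is the ℓ∞-norm, so condition (2.3) can be checked numerically for a computed solution. The sketch below (assuming the hypothetical sqrt_lasso helper from above and a solver tolerance tol) verifies the KKT conditions of the square root LASSO.

```python
import numpy as np

def check_kkt_sqrt_lasso(X, Y, beta_hat, lam, tol=1e-6):
    """Check the KKT conditions (2.3) for the square root LASSO."""
    n = X.shape[0]
    eps_hat = Y - X @ beta_hat                        # residuals
    eps_norm_n = np.linalg.norm(eps_hat) / np.sqrt(n)
    g = X.T @ eps_hat / (n * eps_norm_n)              # eps_hat^T X / (n ||eps_hat||_n)
    cond1 = np.max(np.abs(g)) <= lam + tol            # Omega*(.) <= lam, with Omega* = l_infinity
    # for beta_hat != 0: equality in the generalized Cauchy-Schwarz inequality
    cond2 = abs(g @ beta_hat - lam * np.sum(np.abs(beta_hat))) <= tol
    return cond1 and cond2
```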

  5. Proof First we need the inequality from the KKT conditions, which holds in any case:

    Ω*( ε̂^T X / (n‖ε̂‖_n) ) ≤ λ.    (2.4)

By the definition of the dual norm as a maximum, we have with (2.4)

    (1/‖ε̂‖_n) ε̂^T Xβ / n ≤ Ω(β) · max_{z: Ω(z) ≤ 1} ε̂^T Xz / (n‖ε̂‖_n) = Ω(β) · Ω*( ε̂^T X / (n‖ε̂‖_n) ) ≤ Ω(β) λ.    (2.5)

The second equation from the KKT conditions, which again holds in any case, is

    (1/‖ε̂‖_n) ε̂^T Xβ̂ / n = λ Ω(β̂).    (2.6)

Putting (2.5) and (2.6) together we get the result.

3. Sharp Oracle Inequalities for the Square Root Regularization Estimators

We provide sharp oracle inequalities for the estimator β̂ with a norm Ω that satisfies a so-called weak decomposability condition. An oracle inequality is a bound on the estimation and prediction errors; it shows how good these estimators are at estimating the parameter vector β^0. This is an extension of the sharp oracle results given in van de Geer (2014) for LASSO-type estimators, which in turn was a generalization of the sharp oracle inequalities for the LASSO and nuclear norm penalization in Koltchinskii (2011) and Koltchinskii et al. (2011).

3.1 Concepts and Assumptions

Let us first introduce all the necessary definitions and concepts. Some normalized quantities are needed:

    f := λ Ω(β^0) / ‖ε‖_n,
    λ_S := Ω*((ε^T X)_S) / (n‖ε‖_n),
    λ_{S^c} := Ω_{S^c}*((ε^T X)_{S^c}) / (n‖ε‖_n),
    λ_m := max(λ_S, λ_{S^c}),
    λ_0 := Ω*(ε^T X) / (n‖ε‖_n).

  6. For example, the quantity f gives a measure of the true underlying normalized sparsity. Ω_{S^c} denotes a norm on R^{p−|S|}, which will shortly be defined in Assumption II. Furthermore, λ_m will take the role of the theoretical (unknown) λ. If we compare this to the case of the LASSO, the ℓ∞-norm is generalized to the dual norm of Ω. Also remark that a term ‖ε‖_n appears in λ_m. This scaling is due to the square root regularization, and it is the reason why λ can be chosen independently of the unknown standard deviation σ. Now we give the two main assumptions that need to hold in order to prove the oracle inequalities. Assumption I deals with avoiding overfitting, and the main concern of Assumption II is that the norm has the desired property of promoting a structured sparse solution β̂. We will later see that the structured sparsity norms in Micchelli et al. (2010) and Micchelli et al. (2013) are all of this form; thus, Assumption II is quite general.

3.1.1 Assumption I (overfitting)

If ‖ε̂‖_n = 0, then β̂ does the same thing as the ordinary least squares (OLS) estimator β_OLS, namely it overfits. That is why we need a lower bound on ‖ε̂‖_n. In order to achieve this lower bound we make the following assumptions:

    P( Y ∈ { Ỹ : min_{β s.t. Xβ = Ỹ} Ω(β) ≤ ‖ε‖_n } ) = 0,
    (λ_0/λ)(1 + 2f) < 1.

The λ_0/λ term makes sure that we introduce enough sparsity (no overfitting).

3.1.2 Assumption II (weak decomposability)

Assumption II is fulfilled for a set S ⊂ {1,...,p} and a norm Ω on R^p if this norm is weakly decomposable and S is an allowed set for this norm. This was used by van de Geer (2014) and goes back to Bach et al. (2012). It is an assumption on the structure of the sparsity-inducing norm. By the triangle inequality we have Ω(β_{S^c}) ≥ Ω(β) − Ω(β_S). But we will also need to lower bound Ω(β) − Ω(β_S) by another norm evaluated at β_{S^c}. This is motivated by relaxing the following decomposability property of the ℓ1-norm:

    ‖β‖_1 = ‖β_S‖_1 + ‖β_{S^c}‖_1,   for all sets S ⊂ {1,...,p} and all β ∈ R^p.

This decomposability property is used to obtain oracle inequalities for the LASSO. We can relax this property and introduce weakly decomposable norms.

Definition 3 (Weak decomposability) A norm Ω on R^p is called weakly decomposable for an index set S ⊂ {1,...,p} if there exists a norm Ω_{S^c} on R^{|S^c|} such that

    Ω(β) ≥ Ω(β_S) + Ω_{S^c}(β_{S^c})   for all β ∈ R^p.

  7. Furthermore, we call a set S allowed if Ω is weakly decomposable for this set.

Remark 4 In order to get a good oracle bound, we will choose the norm Ω_{S^c} as large as possible. We will also choose the allowed sets S in such a way that they reflect the active set S_0. Otherwise we could of course take the trivial example S = ∅.

Now that we have introduced the two main assumptions, we can introduce other definitions and concepts also used in van de Geer (2014).

Definition 5 For S an allowed set of a weakly decomposable norm Ω, and L > 0 a constant, the Ω-eigenvalue is defined as

    δ_Ω(L,S) := min{ ‖Xβ_S − Xβ_{S^c}‖_n : Ω(β_S) = 1, Ω_{S^c}(β_{S^c}) ≤ L }.

The Ω-effective sparsity is then defined as

    Γ²_Ω(L,S) := 1 / δ²_Ω(L,S).

The Ω-eigenvalue is the distance between the two sets (van de Geer and Lederer, 2013)

    { Xβ_S : Ω(β_S) = 1 }   and   { Xβ_{S^c} : Ω_{S^c}(β_{S^c}) ≤ L },

see Figure 2. Further discussion of these definitions follows after the main theorem. The Ω-eigenvalue generalizes the compatibility constant (van de Geer, 2007). For the proof of the main theorem we need some small lemmas. For any vector β ∈ R^p the (L,S)-cone condition for a norm Ω is satisfied if

    Ω_{S^c}(β_{S^c}) ≤ L Ω(β_S),

with L > 0 a constant and S an allowed set. The proof of Lemma 6 can be found in van de Geer (2014). It shows the connection between the (L,S)-cone condition and the Ω-eigenvalue: we bound Ω(β_S) by a multiple of ‖Xβ‖_n.

Lemma 6 Let S be an allowed set of a weakly decomposable norm Ω and L > 0 a constant. Then the Ω-eigenvalue has the following form:

    δ_Ω(L,S) = min{ ‖Xβ‖_n / Ω(β_S) : β satisfies the cone condition and β_S ≠ 0 }.

We have Ω(β_S) ≤ Γ_Ω(L,S) ‖Xβ‖_n.

We will also need a lower and an upper bound for ‖ε̂‖_n, as already mentioned in Assumption I. The next lemma gives such bounds.

Lemma 7 Suppose that Assumption I holds true. Then

    1 + f ≥ ‖ε̂‖_n / ‖ε‖_n ≥ (1 − (λ_0/λ)(1 + 2f)) / (f + 2) > 0.

Proof The upper bound is obtained from the definition of the estimator:

    ‖Y − Xβ̂‖_n + λ Ω(β̂) ≤ ‖Y − Xβ^0‖_n + λ Ω(β^0).

  8. Therefore we get ‖ε̂‖_n ≤ ‖ε‖_n + λ Ω(β^0). Dividing by ‖ε‖_n and using the definition of f we get the desired upper bound. The main idea for the lower bound is to use the triangle inequality

    ‖ε̂‖_n = ‖ε − X(β̂ − β^0)‖_n ≥ ‖ε‖_n − ‖X(β̂ − β^0)‖_n,

and then to upper bound ‖X(β̂ − β^0)‖_n. With Lemma 2 we get an upper bound for ‖X(β̂ − β^0)‖_n:

    ‖X(β̂ − β^0)‖²_n ≤ ε^T X(β̂ − β^0)/n + λ ‖ε̂‖_n (Ω(β^0) − Ω(β̂))
                     ≤ λ_0 ‖ε‖_n Ω(β̂ − β^0) + λ ‖ε̂‖_n (Ω(β^0) − Ω(β̂))
                     ≤ λ_0 ‖ε‖_n (Ω(β̂) + Ω(β^0)) + λ ‖ε̂‖_n (Ω(β^0) − Ω(β̂))
                     ≤ λ_0 ‖ε‖_n Ω(β̂) + Ω(β^0)(λ_0 ‖ε‖_n + λ ‖ε̂‖_n).

In the second line we used the definition of the dual norm and the Cauchy-Schwarz inequality. Again by the definition of the estimator we have

    Ω(β̂) ≤ ‖ε‖_n / λ + Ω(β^0).

And we are left with

    ‖X(β̂ − β^0)‖_n ≤ ‖ε‖_n √( (λ_0/λ)(1 + 2λΩ(β^0)/‖ε‖_n) + (λΩ(β^0)/‖ε‖_n)·(‖ε̂‖_n/‖ε‖_n) ).

By the definition of f we get

    ‖X(β̂ − β^0)‖_n ≤ ‖ε‖_n √( (λ_0/λ)(1 + 2f) + f·(‖ε̂‖_n/‖ε‖_n) ) = ‖ε‖_n √( λ_0/λ + 2(λ_0/λ)f + (‖ε̂‖_n/‖ε‖_n)f ).

Now we get

    ‖ε̂‖_n ≥ ‖ε‖_n − ‖X(β̂ − β^0)‖_n ≥ ‖ε‖_n − ‖ε‖_n √( λ_0/λ + 2(λ_0/λ)f + (‖ε̂‖_n/‖ε‖_n)f ).    (3.1)

Let us rearrange equation (3.1) further in the case ‖ε̂‖_n/‖ε‖_n < 1:

    λ_0/λ + 2(λ_0/λ)f + (‖ε̂‖_n/‖ε‖_n)f ≥ (1 − ‖ε̂‖_n/‖ε‖_n)² = 1 − 2‖ε̂‖_n/‖ε‖_n + ‖ε̂‖²_n/‖ε‖²_n ≥ 1 − 2‖ε̂‖_n/‖ε‖_n.

  9. Hence (f + 2)·‖ε̂‖_n/‖ε‖_n ≥ 1 − (λ_0/λ)(1 + 2f), so that

    ‖ε̂‖_n/‖ε‖_n ≥ (1 − (λ_0/λ)(1 + 2f)) / (f + 2),

which is > 0 by Assumption I. On the other hand, if ‖ε̂‖_n/‖ε‖_n > 1, we already have a lower bound which is bigger than (1 − (λ_0/λ)(1 + 2f))/(f + 2).

3.2 Sharp Oracle Inequality

Finally we are able to present the main theorem. It gives sharp oracle inequalities for the prediction error expressed in the ℓ2-norm, and for the estimation error expressed in the Ω- and Ω_{S^c}-norms.

Remark 8 Let us first briefly remark that in Theorem 9 we need to ensure that λ_* − λ_m > 0. The assumption λ_m/λ < 1/a, with a chosen as in Theorem 9, together with the fact that λ_0 ≤ λ_m, leads to the desired inequality

    λ_*/λ = (1 − (λ_0/λ)(1 + 2f)) / (f + 2) ≥ (1 − (λ_m/λ)(1 + 2f)) / (f + 2) > λ_m/λ.

Theorem 9 Assume that 0 ≤ δ < 1, and also that a λ_m < λ, with the constant a = 3(1 + f). We invoke also Assumption I (overfitting) and Assumption II (weak decomposability) for S and Ω. Here the allowed set S is chosen such that the active set S_β := supp(β) is a subset of S. Then it holds true that

    ‖X(β̂ − β^0)‖²_n + 2δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ]
        ≤ ‖X(β − β^0)‖²_n + ‖ε‖²_n [ (1 + δ)(λ̃ + λ_m) ]² Γ²_Ω(L_S, S),    (3.2)

with L_S := ((1 + δ)/(1 − δ)) · (λ̃ + λ_m)/(λ_* − λ_m), and

    λ_* := λ (1 − (λ_0/λ)(1 + 2f)) / (f + 2),   λ̃ := λ (1 + f).

Furthermore we get the two oracle inequalities

    ‖X(β̂ − β^0)‖²_n ≤ ‖X(β_* − β^0)‖²_n + ‖ε‖²_n (1 + δ)² (λ̃ + λ_m)² Γ²_Ω(L_{S_*}, S_*),

    Ω(β̂_{S_*} − β_*) + Ω_{S^c_*}(β̂_{S^c_*}) ≤ (1/(2δ‖ε‖_n)) · ‖X(β_* − β^0)‖²_n / (λ_* − λ_m) + ((1 + δ)² ‖ε‖_n / (2δ)) · (λ̃ + λ_m)² / (λ_* − λ_m) · Γ²_Ω(L_{S_*}, S_*).

  10. For all fixed allowed sets S define

    β_*(S) := argmin_{β: supp(β) ⊆ S} { ‖X(β − β^0)‖²_n + ‖ε‖²_n [ (1 + δ)(λ̃ + λ_m) ]² Γ²_Ω(L_S, S) }.

Then S_* is defined as

    S_* := argmin_{S allowed} { ‖X(β_*(S) − β^0)‖²_n + ‖ε‖²_n [ (1 + δ)(λ̃ + λ_m) ]² Γ²_Ω(L_S, S) },    (3.3)
    β_* := β_*(S_*);    (3.4)

it attains the minimal right-hand side of the oracle inequality (3.2). An important special case of equation (3.2) is to choose β ≡ β^0 with S ⊇ S_0 allowed. The term ‖X(β − β^0)‖²_n vanishes in this case and only the Ω-effective sparsity term remains in the upper bound. But it is not obvious in which cases, and whether, β_* leads to a substantially lower bound than β^0.

Proof Let β ∈ R^p and let S be an allowed set containing the active set of β. We need to distinguish two cases. The second case is the more substantial one.

Case 1: Assume that

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n ≤ −δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ].

Here ⟨u,v⟩_n := v^T u / n for any two vectors u, v ∈ R^n. In this case we can simply use the following calculations to verify the theorem:

    ‖X(β̂ − β^0)‖²_n − ‖X(β − β^0)‖²_n + 2δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ]
      = 2⟨X(β̂ − β^0), X(β̂ − β)⟩_n − ‖X(β − β̂)‖²_n + 2δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ]
      ≤ −‖X(β − β̂)‖²_n
      ≤ 0.

Now we can turn to the more important case.

Case 2: Assume that

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n ≥ −δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ].

We can reformulate Lemma 2 with Y − Xβ̂ = X(β^0 − β̂) + ε; then we get

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n / ‖ε̂‖_n + λ Ω(β̂) ≤ ⟨ε, X(β̂ − β)⟩_n / ‖ε̂‖_n + λ Ω(β).

This is equivalent to

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n + ‖ε̂‖_n λ Ω(β̂) ≤ ⟨ε, X(β̂ − β)⟩_n + ‖ε̂‖_n λ Ω(β).    (3.5)

  11. By the definition of the dual norm and the generalized Cauchy-Schwarz inequality we have

    ⟨ε, X(β̂ − β)⟩_n ≤ ‖ε‖_n [ λ_S Ω(β̂_S − β) + λ_{S^c} Ω_{S^c}(β̂_{S^c}) ] ≤ ‖ε‖_n [ λ_m Ω(β̂_S − β) + λ_m Ω_{S^c}(β̂_{S^c}) ].

Inserting this inequality into (3.5) we get

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n + ‖ε̂‖_n λ Ω(β̂) ≤ ‖ε‖_n [ λ_m Ω(β̂_S − β) + λ_m Ω_{S^c}(β̂_{S^c}) ] + ‖ε̂‖_n λ Ω(β).    (3.6)

Then by the weak decomposability and the triangle inequality in (3.6),

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n + ‖ε̂‖_n λ [ Ω(β̂_S) + Ω_{S^c}(β̂_{S^c}) ]
      ≤ ‖ε‖_n [ λ_m Ω(β̂_S − β) + λ_m Ω_{S^c}(β̂_{S^c}) ] + ‖ε̂‖_n λ [ Ω(β̂_S − β) + Ω(β̂_S) ].    (3.7)

By inserting the assumption of Case 2,

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n ≥ −δ‖ε‖_n [ (λ_* + λ_m) Ω(β̂_S − β) + (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) ],

into (3.7) we get

    ( λ‖ε̂‖_n − λ_m‖ε‖_n − δ‖ε‖_n(λ_* − λ_m) ) Ω_{S^c}(β̂_{S^c}) ≤ ( λ‖ε̂‖_n + λ_m‖ε‖_n + δ‖ε‖_n(λ̃ + λ_m) ) Ω(β̂_S − β).

By the assumption a λ_m < λ we have that λ_* > λ_m (see Remark 8), and therefore

    Ω_{S^c}(β̂_{S^c}) ≤ ((1 + δ)/(1 − δ)) · ((λ̃ + λ_m)/(λ_* − λ_m)) · Ω(β̂_S − β).

We have applied Lemma 7 in the last step, in order to replace the estimate ‖ε̂‖_n by ‖ε‖_n. By the definition of L_S we have

    Ω_{S^c}(β̂_{S^c}) ≤ L_S Ω(β̂_S − β).    (3.8)

Therefore with Lemma 6 we get

    Ω(β̂_S − β) ≤ Γ_Ω(L_S, S) ‖X(β̂ − β)‖_n.    (3.9)

Inserting (3.9) into (3.7), together with Lemma 7 and δ < 1, we get

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n + δ‖ε‖_n(λ_* − λ_m) Ω_{S^c}(β̂_{S^c})
      ≤ (1 + δ − δ) ‖ε‖_n ( λ‖ε̂‖_n/‖ε‖_n + λ_m ) Ω(β̂_S − β)
      ≤ (1 + δ) ‖ε‖_n (λ̃ + λ_m) Γ_Ω(L_S, S) ‖X(β̂ − β)‖_n − δ‖ε‖_n(λ_* + λ_m) Ω(β̂_S − β).

Because 0 ≤ (u − v)² for all u, v ∈ R, it holds true that uv ≤ (u² + v²)/2.

  12. Therefore, with u = (1 + δ)‖ε‖_n(λ̃ + λ_m) Γ_Ω(L_S, S) and v = ‖X(β̂ − β)‖_n, we have

    ⟨X(β̂ − β^0), X(β̂ − β)⟩_n + δ‖ε‖_n(λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) + δ‖ε‖_n(λ_* + λ_m) Ω(β̂_S − β)
      ≤ (1/2)(1 + δ)²‖ε‖²_n(λ̃ + λ_m)² Γ²_Ω(L_S, S) + (1/2)‖X(β̂ − β)‖²_n.

Since 2⟨X(β̂ − β^0), X(β̂ − β)⟩_n = ‖X(β̂ − β^0)‖²_n + ‖X(β̂ − β)‖²_n − ‖X(β − β^0)‖²_n, we get

    ‖X(β̂ − β^0)‖²_n + 2δ‖ε‖_n [ (λ_* − λ_m) Ω_{S^c}(β̂_{S^c}) + (λ_* + λ_m) Ω(β̂_S − β) ]
      ≤ (1 + δ)²‖ε‖²_n(λ̃ + λ_m)² Γ²_Ω(L_S, S) + ‖X(β − β^0)‖²_n.    (3.10)

This gives the sharp oracle inequality. The two oracle inequalities mentioned are just a split-up version of inequality (3.10), where for the second oracle inequality we need to note that λ_* − λ_m ≤ λ_* + λ_m.

Remark that the sharpness in the oracle inequality of Theorem 9 refers to the constant one in front of the term ‖X(β − β^0)‖²_n. Because we measure a vector on S_* by Ω and on the inactive set S^c_* by the norm Ω_{S^c_*}, we take here Ω(β̂_{S_*} − β_*) and Ω_{S^c_*}(β̂_{S^c_*}) as estimation errors. If we choose λ of the same order as λ_m, then we can simplify the oracle inequalities. This is comparable to the oracle inequalities for the LASSO, see for example Bickel et al. (2009), Bunea et al. (2006), Bunea et al. (2007) and van de Geer (2007); further references can be found in Bühlmann and van de Geer (2011).

Corollary 10 Take λ of the order of λ_m (i.e. λ_m = Cλ, with 0 < C < 1/(3(f + 1)) a constant). Invoke the same assumptions as in Theorem 9. Here we also use the same notation of an optimal β_* with S_* as in equations (3.3) and (3.4). Then we have

    ‖X(β̂ − β^0)‖²_n ≤ ‖X(β_* − β^0)‖²_n + C_1 λ² · Γ²_Ω(L_{S_*}, S_*),

    Ω(β̂_{S_*} − β_*) + Ω_{S^c_*}(β̂_{S^c_*}) ≤ C_2 ( ‖X(β_* − β^0)‖²_n / λ + C_1 λ · Γ²_Ω(L_{S_*}, S_*) ).

Here C_1 and C_2 are the constants

    C_1 := (1 + δ)² · ‖ε‖²_n (f + C + 1)²,    (3.11)
    C_2 := (1/(2δ‖ε‖_n)) · 1 / ( (1 − C(1 + 2f))/(f + 2) − C ).    (3.12)

First let us explain some of the parts of Theorem 9 in more detail. We can also study what happens to the bound if we additionally assume Gaussian errors, see Proposition 11.

  13. 3.3 On the two parts of the oracle bound

The oracle bound is a trade-off between two parts, which we discuss now. Let us first recall that if we set β ≡ β^0 in the sharp oracle bound, only the term with the Ω-effective sparsity does not vanish on the right-hand side of the bound. But due to the minimization over β in the definition of β_* we might even do better than that bound. The first part, minimizing ‖X(β − β^0)‖²_n, can be thought of as the error made due to approximation, hence we call it the approximation error. If we fix the support S, which can be thought of as being determined by the second part, then minimizing ‖X(β − β^0)‖²_n is just a projection onto the subspace spanned by S, see Figure 1. So if S has a structure similar to the true unknown support S_0 of β^0, this part will be small.

[Figure 1: approximation error (projection of Xβ^0 onto the subspace spanned by S).]

The second part, containing Γ²_Ω(L_S, S), is due to estimation errors. There, minimizing over β will affect the set S. As already mentioned, Γ²_Ω(L_S, S) is one over the squared distance between the two sets { Xβ_S : Ω(β_S) = 1 } and { Xβ_{S^c} : Ω_{S^c}(β_{S^c}) ≤ L }. Figure 2 shows this distance. This means that if the vectors in X_S and X_{S^c} show a high correlation, the distance will shrink and the Ω-effective sparsity will blow up, which we try to avoid. The distance also depends on the two chosen sparsity norms Ω and Ω_{S^c}. It is crucial to choose norms that reflect the true underlying sparsity in order to get a good bound. Also the constant L_S should be small.

3.4 On the randomness of the oracle bound

Until now, the bound still contains some random parts, for example in λ_m. In order to get rid of these random parts we need to introduce the following sets

    T := { max( Ω*((ε^T X)_W) / (n‖ε‖_n), Ω_{W^c}*((ε^T X)_{W^c}) / (n‖ε‖_n) ) ≤ d },

where d ∈ R and W is any allowed set. We need to choose the constant d in such a way that this set has high probability. In other words we try to bound the random part by a non-random constant with a

  14. very high probability. In order to do this we need some assumptions on the errors; here we assume Gaussian errors. Let us also remark that Ω*((ε^T X)_W)/(n‖ε‖_n) is normalized by ‖ε‖_n. This normalization occurs due to the special form of the Karush-Kuhn-Tucker conditions: the square root of the residual sum of squared errors is responsible for it. In fact, this normalization is the main reason why λ does not contain the unknown variance; the square root part of the estimator makes the estimator pivotal.

[Figure 2: The Ω-eigenvalue.]

Now, in the case of Gaussian errors, we can use the concentration inequality from Theorem 5.8 in Boucheron et al. (2013) and get the following proposition. Define first:

    Z_1 := Ω*((ε^T X)_W) / (n‖ε‖_n),      Z_2 := Ω_{W^c}*((ε^T X)_{W^c}) / (n‖ε‖_n),      Z := max(Z_1, Z_2),
    V_1 := Z_1 ‖ε‖_n / σ,                 V_2 := Z_2 ‖ε‖_n / σ,                           V := max(V_1, V_2).

Proposition 11 Suppose that we have i.i.d. Gaussian errors ε ∼ N(0, σ² I), and that the normalization (X^T X / n)_{i,i} = 1 for all i ∈ {1,...,p} holds true. Let B := {z ∈ R^p : Ω(z) ≤ 1} be the unit Ω-ball, and B² := sup_{b ∈ B} b^T b an Ω-ball and ℓ2-ball comparison. Then we have, for all d > E[V] and 0 < ∆ ≤ 1,

    P(T) ≥ 1 − 2 exp( −(d − E[V])² ∆² / (2B²/n) ) − 2 exp( −n(1 − ∆²)²/4 ).

  15. Proof Let us define Σ² := sup_{b ∈ B} E[ ( (ε^T X)_W b_W / (nσ) )² ] and calculate it:

    Σ² = sup_{b ∈ B} Var( (ε^T X)_W b_W / (nσ) )
       = sup_{b ∈ B} (1/σ²) Σ_{w ∈ W} b_w² Var( Σ_{i=1}^n ε_i X_{iw} / n )
       = sup_{b ∈ B} Σ_{w ∈ W} b_w² ( Σ_{i=1}^n X_{iw}² / n ) / n
       = sup_{b ∈ B} b_W^T b_W / n ≤ B² / n.    (3.13)

These calculations hold true for W^c instead of W as well. Furthermore, in the subsequent inequalities we can substitute W with W^c and use Z_2, V_2 instead of Z_1, V_1 to get an analogous result. We have (ε^T X)_W b_W / (σn) ∼ N(0, b_W^T b_W / n). This is an almost surely continuous centred Gaussian process, so we can apply Theorem 5.8 from Boucheron et al. (2013):

    P( V_1 − E[V_1] ≥ c ) ≤ exp( −c² / (2B²/n) ).    (3.14)

Now, to get to a probability inequality for Z_1, we use the following calculations:

    P( Z_1 − E[V_1] ≥ d ) ≤ P( V_1 σ/‖ε‖_n − E[V_1] ≥ d  ∧  ‖ε‖_n > σ∆ ) + P( ‖ε‖_n ≤ σ∆ )
                          ≤ P( V_1 − E[V_1]∆ > d∆ ) + P( ‖ε‖_n ≤ σ∆ )
                          ≤ P( V_1 − E[V_1] > d∆ ) + P( ‖ε‖_n ≤ σ∆ )
                          ≤ exp( −d²∆² / (2B²/n) ) + P( ‖ε‖_n ≤ σ∆ ).    (3.15)

The calculations above use the union bound and the fact that a set containing another set has at least as large a probability. Furthermore, we have applied equations (3.13) and (3.14). We are left to give a bound on P(‖ε‖_n/σ ≤ ∆). For this we use the corollary to Lemma 1 from Laurent and Massart (2000), together with the fact that ‖ε‖_n/σ = √(R/n) with R := Σ_{i=1}^n (ε_i/σ)² ∼ χ²(n). We obtain

    P( R ≤ n − 2√(nx) ) ≤ exp(−x),

and therefore

    P( ‖ε‖_n/σ ≤ ∆ ) ≤ P( √(R/n) ≤ √(1 − 2√(x/n)) ) ≤ exp(−x) ≤ e^{−n(1−∆²)²/4}.    (3.16)

  16. Combining equations (3.15) and (3.16) finishes the proof:

    P(T) = P( max(Z_1, Z_2) ≤ d ) = P( Z_1 ≤ d ∩ Z_2 ≤ d )
         ≥ P(Z_1 ≤ d) + P(Z_2 ≤ d) − 1
         ≥ 1 − P(Z_1 ≥ d) − P(Z_2 ≥ d)
         ≥ 1 − 2 e^{−(d − E[V])²∆² / (2B²/n)} − 2 e^{−n(1−∆²)²/4}.

So the probability that the event T does not occur decays exponentially. This is what we mean by having a very high probability. Therefore we can take

    d = (t/∆) √(2B²/n) + E[V],   where t = √(log(4/α)) and ∆² = 1 − 2t/√n,    (3.17)

with n large enough that 2t < √n, ensuring ∆² > 0. With this we get P(T) ≥ 1 − α.

First remark that the term ε/σ is now of the right scaling, because ε_i/σ ∼ N(0,1). This is the whole point of the square root regularization. Here B² can be thought of as comparing the Ω-ball in direction W to the ℓ2-ball in direction W, because if the norm Ω is the ℓ2-norm, then B² = 1. Moreover, for every norm Ω there exists a constant D such that for all β it holds that

    ‖β‖_2 ≤ D Ω(β).

Therefore the B² of Ω satisfies B² ≤ D² sup_{b ∈ B} Ω(b_W)² ≤ D². Thus we can take

    d = (t/∆) · D · √(2/n) + E[V],   with t = √(log(4/α)) and ∆² = 1 − 2t/√n.

What is left to be determined is E[V]. In many cases we can use an adjusted version of the main theorem in Maurer and Pontil (2012) for Gaussian complexities to obtain this expectation. All the examples below can be calculated in this way. So, in the case of Gaussian errors, we have the following new version of Corollary 10.

Corollary 12 Take λ = (t/∆) · D · √(2/n) + E[V], where t, ∆, V and D are defined as above. Invoke the same assumptions as in Theorem 9 and additionally assume Gaussian errors. Use the

  17. notation from Corollary 10. Then with probability 1 − α the following oracle inequalities hold true:

    ‖X(β̂ − β^0)‖²_n ≤ ‖X(β_* − β^0)‖²_n + C_1 λ² · Γ²_Ω(L_{S_*}, S_*),

    Ω(β̂_{S_*} − β_*) + Ω_{S^c_*}(β̂_{S^c_*}) ≤ C_2 ( ‖X(β_* − β^0)‖²_n / λ + C_1 λ · Γ²_Ω(L_{S_*}, S_*) ).

Now we still have a ‖ε‖²_n term in the constants (3.11), (3.12) of the oracle inequality. In order to handle this we need Lemma 1 from Laurent and Massart (2000), which translates in our case to the probability inequality

    P( ‖ε‖²_n ≤ σ² (1 + 2x + 2x²) ) ≥ 1 − exp( −n · x² ).

Here x > 0 is a constant. Therefore ‖ε‖²_n is of the order of σ² with exponentially decaying probability in n. We could also write this in the following form:

    P( ‖ε‖²_n ≤ σ² · C ) ≥ 1 − exp( −n ( C − √(2C − 1) ) / 2 ).

Here we can choose any constant C > 1 big enough and take the bound σ² · C for ‖ε‖²_n in the oracle inequality. Similar bounds can be found in Laurent and Massart (2000) for 1/‖ε‖²_n. This takes care of the random part in the sharp oracle bound with Gaussian errors.
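As a quick sanity check of this last probability bound (an illustrative simulation, not from the paper), one can compare the empirical frequency of the event {‖ε‖²_n ≤ σ²C} with the stated lower bound:

```python
import numpy as np

def check_noise_bound(n=100, sigma=1.0, C=1.5, reps=20000, seed=0):
    """Empirical frequency of ||eps||_n^2 <= sigma^2 * C versus the
    lower bound 1 - exp(-n * (C - sqrt(2C - 1)) / 2)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(reps, n))
    empirical = np.mean(np.sum(eps ** 2, axis=1) / n <= sigma ** 2 * C)
    bound = 1.0 - np.exp(-n * (C - np.sqrt(2 * C - 1)) / 2)
    return empirical, bound

print(check_noise_bound())  # the empirical frequency should not fall below the bound
```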

  18. [Figure 3: Pictorial description of how the estimator β̂ works, with unit balls of different sparsity-inducing norms: (a) the ℓ1-norm; (b) the group LASSO norm with groups {x}, {y,z}; (c) the sorted ℓ1-norm with λ-sequence 1 > 0.5 > 0.3; (d) the wedge norm.]

4. Examples

Here we give some examples of estimators for which our sharp oracle inequalities hold. Figure 3 shows the unit balls of some sparsity-inducing norms that we will use as examples. In order to give the theoretical λ for these examples we will again assume Gaussian errors; Theorem 9 itself still holds for all the examples even for non-Gaussian errors. Some of the examples introduce new estimators inspired by methods similar to the square root LASSO.

  19. 4.1 Square Root LASSO

First we examine the square root LASSO,

    β̂_srL := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ ‖β‖_1 }.

Here we use the ℓ1-norm as a sparsity measure. The ℓ1-norm has the nice property of being able to set individual unimportant parameters to zero. As already mentioned, the ℓ1-norm has the following decomposability property for any set S:

    ‖β‖_1 = ‖β_S‖_1 + ‖β_{S^c}‖_1,   for all β ∈ R^p.

Therefore we also have weak decomposability for all subsets S ⊂ {1,...,p}, with Ω_{S^c} being the ℓ1-norm again. Thus Assumption II is fulfilled for all sets S, and we are able to apply Theorem 9. Furthermore, for the square root LASSO we have D = 1, because the ℓ2-norm is bounded by the ℓ1-norm without any constant. So in order to get the value of λ we need to calculate the expectation of the dual norm of ε^T X/(nσ). The dual norm of ℓ1 is the ℓ∞-norm. By Maurer and Pontil (2012) we have

    max( E[ ‖(ε^T X)_S/(nσ)‖_∞ ], E[ ‖(ε^T X)_{S^c}/(nσ)‖_∞ ] ) ≤ √(2/n) ( 2 + √(log p) ).

Therefore the theoretical λ for the square root LASSO can be chosen as

    λ = ( t/∆ + 2 + √(log p) ) √(2/n).

Even though this theoretical λ is very close to being optimal, it is not optimal, see for example van de Geer (2016). In the special case of ℓ1-norm penalization we can simplify Corollary 12.

Corollary 13 (Square Root LASSO) Take λ = ( t/∆ + 2 + √(log p) ) √(2/n), where t and ∆ are chosen as in (3.17). Invoke the same assumptions as in Corollary 12. Then for Ω(·) = ‖·‖_1 we have, with probability 1 − α, that the following oracle inequalities hold true:

    ‖X(β̂_srL − β^0)‖²_n ≤ ‖X(β_* − β^0)‖²_n + C_1 λ² · Γ²_Ω(L_{S_*}, S_*),

    ‖β̂_srL − β_*‖_1 ≤ C_2 ( ‖X(β_* − β^0)‖²_n / λ + C_1 λ · Γ²_Ω(L_{S_*}, S_*) ).

Remark that in Corollary 13 we have an oracle inequality for the estimation error ‖β̂_srL − β_*‖_1 in ℓ1. This is due to the decomposability of the ℓ1-norm. In other examples we will instead have the sum of two norms.
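As an illustration of the order of magnitude of this theoretical λ (a sketch that plugs in t and ∆ as in (3.17); the helper name and the default α are assumptions, not from the paper):

```python
import numpy as np

def sqrt_lasso_lambda(n, p, alpha=0.05):
    """Theoretical penalty level lambda = (t/Delta + 2 + sqrt(log p)) * sqrt(2/n),
    with t = sqrt(log(4/alpha)) and Delta^2 = 1 - 2t/sqrt(n) as in (3.17).
    Note that the level does not involve the noise level sigma."""
    t = np.sqrt(np.log(4.0 / alpha))
    delta = np.sqrt(1.0 - 2.0 * t / np.sqrt(n))   # requires n large enough so this is positive
    return (t / delta + 2.0 + np.sqrt(np.log(p))) * np.sqrt(2.0 / n)

print(sqrt_lasso_lambda(n=100, p=500))  # roughly 1.0 for these dimensions
```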

  20. 4.2 Group Square Root LASSO

In order to set groups of variables simultaneously to zero, and not only individual variables, we look at a different sparsity-inducing norm, namely an ℓ1-type norm for grouped variables called the group LASSO norm. The group square root LASSO was introduced by Bunea et al. (2014) as

    β̂_gsrL := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ Σ_{j=1}^g √|G_j| ‖β_{G_j}‖_2 }.

Here g is the total number of groups, and G_j is the set of variables in the j-th group. Of course the ℓ1-norm is a special case of the group LASSO norm, with G_j = {j} and g = p. The group LASSO penalty is also weakly decomposable, with Ω_{S^c} = Ω, for any S = ∪_{j ∈ J} G_j with any J ⊂ {1,...,g}. So here the sparsity structure of the group LASSO norm requires the sets S to have the same group structure in order to fulfil Assumption II. Therefore Theorem 9 can also be applied in this case. How do we need to choose the theoretical λ? For the group LASSO norm we have B² ≤ 1. One can see this from the fact that √a_1 + ... + √a_g ≥ √(a_1 + ... + a_g) for g positive constants, and that |G_j| ≥ 1 for all groups. Therefore

    Σ_{j=1}^g √|G_j| ‖β_{G_j}‖_2 ≥ √( Σ_{i=1}^p β_i² ).

Remark that the dual norm is Ω*(β) = max_{1≤j≤g} ‖β_{G_j}‖_2 / √|G_j|. With Maurer and Pontil (2012) we have

    max( E[ Ω*((ε^T X)_S/(nσ)) ], E[ Ω*((ε^T X)_{S^c}/(nσ)) ] ) ≤ √(2/n) ( 2 + √(log g) ).

That is why λ can be taken of the following form:

    λ = ( t/∆ + 2 + √(log g) ) √(2/n).

And we get a corollary for the group square root LASSO similar to Corollary 13 for the square root LASSO. In the case of the group LASSO there are better results for the theoretical penalty level available, see for example Theorem 8.1 in Bühlmann and van de Geer (2011), which takes the minimal group size into account.
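A minimal sketch of the group square root LASSO as a convex program (cvxpy assumed; the partition groups is an illustrative input, not notation from the paper):

```python
import numpy as np
import cvxpy as cp

def group_sqrt_lasso(X, Y, groups, lam):
    """Group square root LASSO:
    minimize ||Y - X beta||_n + lam * sum_j sqrt(|G_j|) * ||beta_{G_j}||_2."""
    n, p = X.shape
    beta = cp.Variable(p)
    penalty = sum(np.sqrt(len(G)) * cp.norm(beta[G], 2) for G in groups)
    objective = cp.norm(Y - X @ beta, 2) / np.sqrt(n) + lam * penalty
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value

# example partition: 100 groups of 5 consecutive coefficients for a design with p = 500
groups = [list(range(5 * j, 5 * (j + 1))) for j in range(100)]
```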

  21. 4.3 Square Root SLOPE

Here we introduce a new method called the square root SLOPE estimator, which is also part of the square root regularization family. Let us thus take a look at the sorted ℓ1-norm with some decreasing sequence λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0,

    J_λ(β) := λ_1 |β|_(1) + ... + λ_p |β|_(p).

This was shown to be a norm by Zeng and Figueiredo (2014). Let π be a permutation of {1,...,p}; the identity permutation is denoted by id. In order to show weak decomposability of the norm J_λ we need the following lemmas.

Lemma 14 (Rearrangement Inequality) Let β_1 ≥ ··· ≥ β_p be a decreasing sequence of non-negative numbers. The sum Σ_{i=1}^p λ_i β_{π(i)} is maximized over all permutations π at π = id.

Proof The result is obvious when p = 2. Suppose now that it is true for sequences of length p − 1. We then prove it for sequences of length p as follows. Let π be an arbitrary permutation with j := π(p). Then

    Σ_{i=1}^p λ_i β_{π(i)} = Σ_{i=1}^{p−1} λ_i β_{π(i)} + λ_p β_j.

By induction

    Σ_{i=1}^{p−1} λ_i β_{π(i)} ≤ Σ_{i=1}^{j−1} λ_i β_i + Σ_{i=j+1}^p λ_{i−1} β_i
                              = Σ_{i ≠ j} λ_i β_i + Σ_{i=j+1}^p (λ_{i−1} − λ_i) β_i
                              = Σ_{i=1}^p λ_i β_i + Σ_{i=j+1}^p (λ_{i−1} − λ_i) β_i − λ_j β_j.

Hence we have

    Σ_{i=1}^p λ_i β_{π(i)} ≤ Σ_{i=1}^p λ_i β_i + Σ_{i=j+1}^p (λ_{i−1} − λ_i) β_i − (λ_j − λ_p) β_j
                           = Σ_{i=1}^p λ_i β_i + Σ_{i=j+1}^p (λ_{i−1} − λ_i) β_i − Σ_{i=j+1}^p (λ_{i−1} − λ_i) β_j
                           = Σ_{i=1}^p λ_i β_i + Σ_{i=j+1}^p (λ_{i−1} − λ_i)(β_i − β_j).

Since λ_{i−1} ≥ λ_i for all 2 ≤ i ≤ p and β_i ≤ β_j for all i > j, we know that

    Σ_{i=j+1}^p (λ_{i−1} − λ_i)(β_i − β_j) ≤ 0.

Lemma 15 Let

    Ω(β) = Σ_{i=1}^p λ_i |β|_(i),

  22. and

    Ω_{S^c}(β_{S^c}) = Σ_{l=1}^r λ_{p−r+l} |β|_{(l,S^c)},

where r = p − |S| and |β|_{(1,S^c)} ≥ ··· ≥ |β|_{(r,S^c)} is the ordered sequence in β_{S^c}. Then

    Ω(β) ≥ Ω(β_S) + Ω_{S^c}(β_{S^c}).

Moreover, Ω_{S^c} is the strongest norm among all norms Ω̃_{S^c} for which Ω(β) ≥ Ω(β_S) + Ω̃_{S^c}(β_{S^c}).

Proof Without loss of generality assume β_1 ≥ ··· ≥ β_p ≥ 0. We have

    Ω(β_S) + Ω_{S^c}(β_{S^c}) = Σ_{i=1}^p λ_i β_{π(i)}

for a suitable permutation π. It follows from Lemma 14 that Ω(β_S) + Ω_{S^c}(β_{S^c}) ≤ Ω(β). To show that Ω_{S^c} is the strongest such norm, it is clear that we need only search among candidates of the form

    Ω̃_{S^c}(β_{S^c}) = Σ_{l=1}^r λ̃_{p−r+l} β_{π_{S^c}(l)},

where {λ̃_{p−r+l}} is a decreasing positive sequence and where π_{S^c}(1),...,π_{S^c}(r) is a permutation of the indices in S^c. This is then maximized by ordering the indices in S^c in decreasing order. But then it follows that the largest norm is obtained by taking λ̃_{p−r+l} = λ_{p−r+l} for all l = 1,...,r.

The SLOPE was introduced by Bogdan et al. (2015) in order to better control the false discovery rate, and is defined as

    β̂_SLOPE := argmin_{β ∈ R^p} { ‖Y − Xβ‖²_n + λ J_λ(β) }.

Now we are able to look at the square root SLOPE, which is the estimator of the form

    β̂_srSLOPE := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ J_λ(β) }.

The square root SLOPE replaces the squared ℓ2-norm by an ℓ2-norm. With Theorem 9 we have provided a sharp oracle inequality for this new estimator, the square root SLOPE. For the SLOPE penalty we have B² ≤ 1/λ_p², if λ_p > 0. This is because

    J_λ(β)/λ_p = (λ_1/λ_p)|β|_(1) + ... + (λ_p/λ_p)|β|_(p) ≥ Σ_{i=1}^p |β_i| = ‖β‖_1 ≥ ‖β‖_2.
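As a small numerical illustration of Lemma 15 (a sketch, not from the paper), one can evaluate the sorted ℓ1-norm and check the weak decomposability inequality on a random vector:

```python
import numpy as np

def sorted_l1(beta, lam_seq):
    """J_lambda(beta) = sum_i lam_seq[i] * |beta|_(i), with lam_seq decreasing."""
    return np.sum(np.sort(np.abs(beta))[::-1] * lam_seq)

rng = np.random.default_rng(0)
p, s = 10, 4
lam_seq = np.linspace(1.0, 0.1, p)           # lambda_1 >= ... >= lambda_p > 0
beta = rng.standard_normal(p)
S, Sc = np.arange(s), np.arange(s, p)        # index set S and its complement
lhs = sorted_l1(beta, lam_seq)               # Omega(beta)
rhs = (sorted_l1(beta[S], lam_seq[:s])       # Omega(beta_S): the s largest weights
       + sorted_l1(beta[Sc], lam_seq[s:]))   # Omega_{S^c}(beta_{S^c}): the r smallest weights
assert lhs >= rhs - 1e-12                    # weak decomposability of the sorted l1 norm
```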

  23. So the bound gets scaled by the smallest λ_j. The dual norm of the sorted ℓ1-norm is, by Lemma 1 of Zeng and Figueiredo (2015),

    J_λ*(β) = max_{k=1,...,p} ( Σ_{j=1}^k λ_j )^{-1} ‖β^{(k)}‖_1.

Here β^{(k)} := (β_(1),...,β_(k))^T is the vector which contains the k largest elements of β. Again by Maurer and Pontil (2012) we have

    max( E[ J_λ*((ε^T X)_S/(nσ)) ], E[ J_λ^{S^c}*((ε^T X)_{S^c}/(nσ)) ] ) ≤ √(2/n) ( (2√2 + 1)/√2 + √(log R²) ).

Here we denote R² := Σ_{i=1}^p 1/λ_i². Therefore we can choose λ as

    λ = ( t/(λ_p ∆) + (2√2 + 1)/√2 + √(log R²) ) √(2/n).

Let us remark that the asymptotic minimaxity of SLOPE can be found in Su and Candès (2016).

4.4 Sparse Group Square Root LASSO

The sparse group square root LASSO can be defined similarly to the sparse group LASSO, see Simon et al. (2013). This new method is defined as

    β̂_srSGLASSO := argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ ‖β‖_1 + η Σ_{t=1}^T √|G_t| ‖β_{G_t}‖_2 },

where we have a partition ∪_{t=1}^T G_t = {1,...,p} with G_t ⊂ {1,...,p} for all t ∈ {1,...,T} and G_i ∩ G_j = ∅ for all i ≠ j. This penalty is again a norm, and it not only selects sparse groups through the group LASSO penalty but also enforces sparsity inside the groups through the ℓ1-norm. Define R(β) := λ‖β‖_1 + η Σ_{t=1}^T √|G_t| ‖β_{G_t}‖_2 and R_{S^c}(β_{S^c}) := λ‖β_{S^c}‖_1. Then we have weak decomposability for any set S,

    R(β_S) + R_{S^c}(β_{S^c}) ≤ R(β).

This is due to the decomposability property of the ℓ1-norm and the fact that

    ‖β_S‖_2 = √( Σ_{j ∈ S} β_j² ) ≤ √( Σ_{j ∈ S} β_j² + Σ_{j ∈ S^c} β_j² ) = ‖β‖_2.

Now, in order to get the theoretical λ, note that the sum of two norms is again a norm, and the dual of this sum is, because of the supremum taken over its unit ball, smaller than the dual norm of each of the two norms individually. So we can invoke the same theoretical λ as for the square root LASSO,

    λ = ( t/∆ + 2 + √(log p) ) √(2/n).
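A minimal sketch of the sparse group square root LASSO as a convex program (cvxpy assumed; groups, lam and eta are illustrative inputs):

```python
import numpy as np
import cvxpy as cp

def sparse_group_sqrt_lasso(X, Y, groups, lam, eta):
    """Sparse group square root LASSO:
    minimize ||Y - X beta||_n + lam * ||beta||_1 + eta * sum_t sqrt(|G_t|) * ||beta_{G_t}||_2."""
    n, p = X.shape
    beta = cp.Variable(p)
    group_pen = sum(np.sqrt(len(G)) * cp.norm(beta[G], 2) for G in groups)
    objective = (cp.norm(Y - X @ beta, 2) / np.sqrt(n)
                 + lam * cp.norm(beta, 1) + eta * group_pen)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value
```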

  24. And the theoretical η can be chosen as for the group square root LASSO,

    η = ( t/∆ + 2 + √(log g) ) √(2/n).

But of course we will not get the same corollary, because the Ω-effective sparsity will be different.

4.5 Structured Sparsity

Here we look at the very general concept of structured sparsity norms. Let A ⊂ [0,∞)^p be a convex cone such that A ∩ (0,∞)^p ≠ ∅. Then

    Ω(β) = Ω(β; A) := min_{a ∈ A} (1/2) Σ_{j=1}^p ( β_j²/a_j + a_j )

is a norm by Micchelli et al. (2010). Some special cases are for example the ℓ1-norm or the wedge or box norm. Define A_S := {a_S : a ∈ A}. Then van de Geer (2014) also showed that for any A_S ⊂ A the set S is allowed and we have weak decomposability for the norm Ω(β) with Ω_{S^c}(β_{S^c}) := Ω(β_{S^c}; A_{S^c}). Hence the estimator

    β̂_s = argmin_{β ∈ R^p} { ‖Y − Xβ‖_n + λ min_{a ∈ A} (1/2) Σ_{j=1}^p ( β_j²/a_j + a_j ) }

also satisfies the sharp oracle inequality. The dual norm is given by

    Ω*(ω; A) = max_{a ∈ A(1)} √( Σ_{j=1}^p a_j ω_j² ),   ω ∈ R^p,
    Ω_{S^c}*(ω; A_{S^c}) = max_{a ∈ A_{S^c}(1)} √( Σ_j a_j ω_j² ),   ω ∈ R^p.

Here A_{S^c}(1) := {a ∈ A_{S^c} : ‖a‖_1 = 1} and A(1) := {a ∈ A : ‖a‖_1 = 1}. Then once again by Maurer and Pontil (2012) we have

    max( E[ Ω*((ε^T X)_{S_*}/(nσ); A_{S_*}) ], E[ Ω*((ε^T X)_{S^c_*}/(nσ); A_{S^c_*}) ] ) ≤ √(2/n) ( 2 + Ã_{S_*} √(log |E(A)|) ).

Here E(A) are the extreme points of the closure of the set { a/‖a‖_1 : a ∈ A }, and

    Ã_{S_*} := max( √( Σ_{i=1}^n Ω(X_{i,S_*}; A_{S_*}) ), √( Σ_{i=1}^n Ω(X_{i,S^c_*}; A_{S^c_*}) ) ).

That is why λ can be taken of the following form:

    λ = ( tD/∆ + 2 + Ã_{S_*} √(log |E(A)|) ) √(2/n).

Since we do not know S_*, we can either upper bound Ã_{S_*} for a given norm, or use the fact that Ω(β) ≥ ‖β‖_1 and Ω*(β) ≤ ‖β‖_∞ for all β ∈ R^p, and therefore use the same λ as for the square root LASSO. And we get similar corollaries for the structured sparsity norms as Corollary 13 for the square root LASSO.

  25. 5. Simulation: Comparison between srLASSO and srSLOPE

The goal of this simulation is to see how the estimation and prediction errors of the square root LASSO and the square root SLOPE behave under some Gaussian designs. We propose Algorithm 1 to solve the square root SLOPE.

Algorithm 1: srSLOPE
  input: β_0 a starting parameter vector, λ a desired penalty level together with a decreasing sequence λ_1 ≥ ... ≥ λ_p, Y the response vector, X the design matrix.
  output: β̂_srSLOPE = argmin_{β ∈ R^p} ( ‖Y − Xβ‖_n + λ J_λ(β) )
  1  for i ← 0 to i_stop do
  2      σ_{i+1} ← ‖Y − Xβ_i‖_n;
  3      β_{i+1} ← argmin_{β ∈ R^p} { ‖Y − Xβ‖²_n + σ_{i+1} λ J_λ(β) };
  4  end

Note that in line 3 of Algorithm 1 we need to solve the usual SLOPE. To solve the SLOPE we have used the algorithm provided in Bogdan et al. (2015). For the square root LASSO we have used the R package flare by Li et al. (2014). We consider a high-dimensional linear regression model Y = Xβ^0 + ε, with n = 100 response variables and p = 500 unknown parameters. The design matrix X is chosen with its rows being fixed i.i.d. realizations from N(0,Σ), where the covariance matrix Σ has the Toeplitz structure Σ_{i,j} = 0.9^{|i−j|}. We choose i.i.d. Gaussian errors ε with variance σ² = 1. For the underlying unknown parameter vector β^0 we choose different settings. For each setting we calculate the square root LASSO and the square root SLOPE, both with the theoretical λ given in this paper and with the λ from an 8-fold cross-validation on the mean squared prediction error. We use r = 100 repetitions to calculate the ℓ1-estimation error, the sorted ℓ1-estimation error and the ℓ2-prediction error. As for the definition of the sorted ℓ1-norm, we chose a regularly decreasing sequence from 1 to 0.1 of length 500. The results can be found in Tables 1, 2, 3 and 4.
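The following is a minimal sketch of Algorithm 1 in Python (not the implementation used for the simulations, which relied on the SLOPE solver of Bogdan et al. (2015) and the R package flare). Here the inner SLOPE step is solved with cvxpy by writing the sorted ℓ1-norm as a nonnegative combination of sum-of-k-largest terms; this reformulation and all names are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def sorted_l1_expr(beta, lam_seq):
    """Sorted l1 norm J_lambda(beta) as a cvxpy expression, using
    J_lambda(b) = sum_k (lam_k - lam_{k+1}) * (sum of the k largest |b_i|), lam_{p+1} := 0.
    Fine for illustration; for large p a dedicated SLOPE solver is preferable."""
    diffs = np.append(-np.diff(lam_seq), lam_seq[-1])
    return sum(d * cp.sum_largest(cp.abs(beta), k + 1)
               for k, d in enumerate(diffs) if d > 0)

def sqrt_slope(X, Y, lam, lam_seq, n_iter=10):
    """Algorithm 1: alternate between sigma_{i+1} = ||Y - X beta_i||_n and a scaled SLOPE fit."""
    n, p = X.shape
    beta_val = np.zeros(p)
    for _ in range(n_iter):
        sigma = np.linalg.norm(Y - X @ beta_val) / np.sqrt(n)   # line 2 of Algorithm 1
        beta = cp.Variable(p)
        obj = (cp.sum_squares(Y - X @ beta) / n                 # ||Y - X beta||_n^2
               + sigma * lam * sorted_l1_expr(beta, lam_seq))   # line 3 of Algorithm 1
        cp.Problem(cp.Minimize(obj)).solve()
        beta_val = beta.value
    return beta_val
```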

  26. Decreasing case: the active set is chosen as S_0 = {1,2,...,7}, and β^0_{S_0} = (4, 3.6, 3.3, 3, 2.6, 2.3, 2)^T is a decreasing sequence.

Table 1: Decreasing β
            | theoretical λ                              | cross-validated λ
            | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2       | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2
  srSLOPE   | 0.21        2.06         4.12              | 0.26        2.37         3.88
  srLASSO   | 0.19        1.85         5.51              | 0.19        1.78         5.05

Decreasing random case: the active set was randomly chosen to be S_0 = {154, 129, 276, 29, 233, 240, 402}, and again β^0_{S_0} = (4, 3.6, 3.3, 3, 2.6, 2.3, 2)^T.

Table 2: Decreasing random β
            | theoretical λ                              | cross-validated λ
            | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2       | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2
  srSLOPE   | 0.49        4.50         7.74              | 1.09        7.87         7.68
  srLASSO   | 0.89        8.48         29.47             | 0.85        7.81         9.19

Grouped case: in order to see whether the square root SLOPE can catch grouped variables better than the square root LASSO, we look at the active set S_0 = {1,2,...,7} together with β^0_{S_0} = (4, 4, 4, 3, 3, 2, 2)^T.

Table 3: Grouped β
            | theoretical λ                              | cross-validated λ
            | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2       | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2
  srSLOPE   | 0.29        2.81         6.43              | 0.18        1.71         3.65
  srLASSO   | 0.31        3.02         8.37              | 0.19        1.83         4.25

Grouped random case: again we take the same randomly chosen set S_0 = {154, 129, 276, 29, 233, 240, 402}, now with β^0_{S_0} = (4, 4, 4, 3, 3, 2, 2)^T.

Table 4: Grouped random β
            | theoretical λ                              | cross-validated λ
            | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2       | J_λ(β^0−β̂)  ‖β^0−β̂‖_ℓ1  ‖X(β^0−β̂)‖_ℓ2
  srSLOPE   | 0.66        6.05         12.84             | 0.66        5.80         5.78
  srLASSO   | 1.77        16.90        66.68             | 0.67        6.14         6.67

The random cases usually lead to larger errors for both estimators. This is due to the correlation structure of the design matrix. The square root SLOPE seems to outperform

  27. the square root LASSO in the cases where β^0 is somewhat grouped (grouped in the sense that amplitudes of the same magnitude appear). This is due to the structure of the sorted ℓ1-norm, which has some of the sparsity properties of ℓ1 as well as some of the grouping properties of ℓ∞, see Zeng and Figueiredo (2014). Therefore the square root SLOPE reflects the underlying sparsity structure in the grouped cases. What is also remarkable is that the square root SLOPE always has a better mean squared prediction error than the square root LASSO, even in cases where the square root LASSO has better estimation errors. The estimation errors seem to be better for the square root LASSO in the decreasing cases.

6. Discussion

Sparsity-inducing norms different from ℓ1 may be used to facilitate the interpretation of the results. Depending on the sparsity structure, we have provided sharp oracle inequalities for square root regularization. Due to the square root regularization we do not need to estimate the variance; the estimators are all pivotal. Moreover, because the penalty is a norm, the optimization problems are all convex, which is a practical advantage when implementing the estimation procedures. For these sharp oracle inequalities we only needed the weak decomposability, and not the decomposability property of the ℓ1-norm. Weak decomposability generalizes the desired property of promoting an estimated parameter vector with a sparse structure. The structure of the Ω- and Ω_{S^c}-norms influences the oracle bound. Therefore it is useful to use norms that reflect the true underlying sparsity structure.

Acknowledgments

We would like to acknowledge financial support from the SNSF (Swiss National Science Foundation), grant number 200021 149145. We thank the editor and the referees for their very helpful suggestions and improvements.

  28. References

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011. With a foreword by Hédy Attouch.

A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

A. Belloni, V. Chernozhukov, and L. Wang. Pivotal estimation via square-root Lasso in nonparametric regression. The Annals of Statistics, 42(2):757–788, 2014.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103–1140, 2015. URL http://statweb.stanford.edu/~candes/SortedL1/.

S. Boucheron, M. Ledoux, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5.

F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation and sparsity via l1 penalized least squares. In Learning Theory, volume 4005 of Lecture Notes in Computer Science, pages 379–391. Springer, Berlin, 2006.

F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.

F. Bunea, J. Lederer, and Y. She. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory, 60(2):1313–1325, 2014.

P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, Incorporated, 1st edition, 2011. ISBN 3642201911, 9783642201912.

O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.

V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d'Été de Probabilités de Saint-Flour.

V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.

  29. B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

X. Li, T. Zhao, L. Wang, X. Yuan, and H. Liu. flare: Family of Lasso Regression, 2014. URL https://CRAN.R-project.org/package=flare. R package version 1.5.0.

A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13(1):671–690, 2012.

C. A. Micchelli, J. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1612–1623. Curran Associates, Inc., 2010.

C. A. Micchelli, J. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.

A. B. Owen. A robust hybrid of lasso and ridge regression. In Prediction and Discovery, volume 443 of Contemporary Mathematics, pages 59–71. American Mathematical Society, Providence, RI, 2007.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.

W. Su and E. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3):1038–1068, 2016.

T. Sun and C. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

S. van de Geer. The deterministic lasso. JSM Proceedings, 2007.

S. van de Geer. Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1):72–86, 2014.

S. van de Geer. Estimation and Testing under Sparsity. École d'Été de Saint-Flour XLV. Springer (to appear), 2016.

S. van de Geer and J. Lederer. The Lasso, correlated design, and improved oracle inequalities. 9:303–316, 2013.

X. Zeng and M. A. T. Figueiredo. Decreasing weighted sorted l1 regularization. IEEE Signal Processing Letters, 21(10):1240–1244, 2014.

X. Zeng and M. A. T. Figueiredo. The ordered weighted l1 norm: Atomic formulation and conditional gradient algorithm. In Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), 2015.