Introduction

Group SLOPE is a penalized linear regression method for adaptive selection of groups of significant predictors in a high-dimensional linear model. It was introduced in Brzyski et al. (2015), "Group SLOPE - adaptive selection of groups of predictors", and Gossmann et al. (2015), "Identification of Significant Genetic Variants via SLOPE, and Its Extension to Group SLOPE". A distinctive property of Group SLOPE is that it offers (group) false discovery rate (gFDR) control, i.e., control of the expected proportion of irrelevant groups among all groups selected by the method.
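
The property described above can be written compactly. As a sketch, let \(R\) denote the number of groups selected by the method, \(V\) the number of selected groups that are in fact irrelevant, and \(q\) the target level (these symbols are notation introduced here for illustration, not taken from the cited papers). Control of the group false discovery rate (gFDR) then means

\[ \mathrm{gFDR} = \mathbb{E}\left[ \frac{V}{\max(R, 1)} \right] \le q . \]

The \(\max(R, 1)\) in the denominator is the usual convention that sets the proportion to zero when nothing is selected. The same convention appears in the gFDP computation later in this example.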

Data generation

We simulate a SNP-data-like model matrix.

Note: The Group SLOPE method is designed for a data matrix in which columns belonging to different groups of predictors are nearly uncorrelated. For brevity, we neither check for nor enforce low between-group correlations in this example.
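
One way such a check might look is sketched below. The helper max_between_group_cor is hypothetical (it is not part of grpSLOPE), the toy data are made up for illustration, and no particular threshold for "nearly uncorrelated" is implied.

```r
# Hypothetical helper (not part of grpSLOPE): largest absolute sample
# correlation between two columns that belong to different groups.
max_between_group_cor <- function(X, group) {
  C.abs <- abs(cor(X))                       # matrix of absolute correlations
  same.group <- outer(group, group, "==")    # TRUE for pairs in the same group
  C.abs[same.group] <- 0                     # ignore within-group pairs and the diagonal
  max(C.abs)
}

# Toy data: 200 observations, 6 independent predictors in 3 groups of 2
set.seed(42)
X.toy     <- matrix(rnorm(200 * 6), 200, 6)
group.toy <- rep(1:3, each = 2)
r.indep   <- max_between_group_cor(X.toy, group.toy)

# Turning column 3 into a noisy copy of column 1 creates a strong
# correlation between groups 1 and 2, which the helper picks up.
X.toy[, 3] <- X.toy[, 1] + 0.1 * rnorm(200)
r.dep      <- max_between_group_cor(X.toy, group.toy)

print(c(independent = r.indep, dependent = r.dep))
```

The helper assumes that no column of the matrix is constant, since cor() returns NA for such columns.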

set.seed(1)

p     <- 500
probs <- runif(p, 0.1, 0.5)
probs <- t(probs) %x% matrix(1,p,2)
X0    <- matrix(rbinom(2*p*p, 1, probs), p, 2*p)
X     <- X0 %*% (diag(p) %x% matrix(1,2,1))
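
The two Kronecker products (%x%) above are compact but easy to misread. The following toy version, with made-up probabilities and a fixed 0/1 matrix in place of the random draws, sketches what each step does: every probability is repeated over two adjacent columns, and the final multiplication adds up each pair of adjacent columns, so every entry of the resulting matrix is 0, 1, or 2.

```r
# Toy version with 3 predictors and 2 observations (hypothetical values)
probs.toy <- c(0.1, 0.3, 0.5)

# Each probability is repeated for two adjacent columns, in every row:
# both rows of P.toy are 0.1 0.1 0.3 0.3 0.5 0.5
P.toy <- t(probs.toy) %x% matrix(1, 2, 2)
print(P.toy)

# A fixed 0/1 matrix standing in for the Bernoulli draws from rbinom()
X0.toy <- matrix(c(0, 1, 1, 1, 0, 0,
                   1, 1, 0, 1, 0, 1), 2, 6, byrow = TRUE)

# Right-multiplying by diag(3) %x% matrix(1, 2, 1) sums columns (1,2),
# (3,4), and (5,6): the rows of X.small are 1 2 0 and 2 1 1
S.toy   <- diag(3) %x% matrix(1, 2, 1)
X.small <- X0.toy %*% S.toy
print(X.small)
```

In the full example, each entry of X is therefore a sum of two independent Bernoulli draws with the same column-specific probability.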

Upper left \(10 \times 10\) corner of \(X\):

pander::pandoc.table(X[1:10, 1:10])
0 1 1 1 1 2 2 0 2 1
0 2 0 1 0 2 0 1 1 1
0 1 1 0 0 2 2 1 0 0
1 0 0 1 1 1 1 0 0 0
0 0 2 1 0 1 0 2 1 0
1 0 1 1 0 0 2 1 2 0
0 0 1 1 1 1 1 1 1 2
1 0 0 1 0 0 1 0 2 1
1 0 1 1 0 1 2 0 0 0
0 0 1 1 1 2 1 0 0 0

We divide the predictors into 100 groups of sizes ranging from 3 to 7.

group <- c(rep(1:20, each=3),
           rep(21:40, each=4),
           rep(41:60, each=5),
           rep(61:80, each=6),
           rep(81:100, each=7))
group.id <- grpSLOPE::getGroupID(group)
n.group <- length(group.id)
group.length <- sapply(group.id, FUN=length)
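
To make the structure of group.id concrete, here is a base R sketch on a small grouping vector. It assumes that getGroupID returns a list, named by group label, holding the column indices of each group; that is how group.id is used in the rest of this example, but the split() call below is only a stand-in and not the package's implementation.

```r
# Toy grouping: two groups of size 3 and one group of size 4
group.small <- c(rep(1:2, each = 3), rep(3, 4))

# Named list: group label -> column indices of that group
ids.small <- split(seq_along(group.small), group.small)
print(ids.small[["3"]])     # indices 7 8 9 10

# Group sizes, analogous to group.length above: 3, 3, 4
sizes.small <- lengths(ids.small)
print(sizes.small)

# The grouping used in this example covers all 500 columns:
# 20 groups each of sizes 3, 4, 5, 6, and 7
n.covered <- sum(20 * (3:7))
print(n.covered)
```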

We randomly select 10 groups to be truly significant.

ind.relevant <- sample(1:n.group, 10)
print(sort(ind.relevant))
##  [1]  4 19 24 26 28 42 50 55 67 74

Then we generate the regression coefficient vector and the response vector according to a linear model.

b <- rep(0, p)
for (j in ind.relevant) {
  # generate effect sizes from the Uniform(0,1) distribution
  b[group.id[[j]]] <- runif(group.length[j])
}

# generate the response vector (X is square, so there are p observations)
y <- X %*% b + rnorm(p)
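
Written out, the code above draws the response from the following linear model. The symbols \(n\), \(z\), and \(I_n\) are notation introduced here for illustration; \(n = p = 500\) because \(X\) is square in this example.

\[ y = X b + z, \qquad z \sim N(0, \sigma^2 I_n), \qquad \sigma = 1 . \]

Only the entries of \(b\) that belong to the 10 relevant groups are nonzero, and each of them is drawn from the Uniform(0,1) distribution.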

Fitting the Group SLOPE model

We fit the Group SLOPE model to the simulated data. The fdr argument specifies the target (group-wise) false discovery rate of the variable selection procedure.

library(grpSLOPE)

result <- grpSLOPE::grpSLOPE(X=X, y=y, group=group, fdr=0.1)

Model fit results

We can display which groups were selected as significant by the Group SLOPE method.

result$selected
##  [1] "4"  "10" "19" "24" "26" "28" "42" "50" "55" "67" "74"

Similarly, we can look at the estimates of the noise level and the regression coefficients.

# estimated sigma (true sigma is equal to one)
result$sigma
## [1] 0.9751941
# first 14 entries of the estimate of b:
result$beta[1:14]
##  [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
##  [8] 0.000000 0.000000 7.374246 2.030662 1.071715 0.000000 0.000000
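
The nonzero estimates can be traced back to their groups. The sketch below hard-codes the 14 values printed above together with the corresponding group labels (groups 1 to 20 have size 3), so it runs without the fitted model; the variable names are introduced here for illustration.

```r
# First 14 coefficient estimates as printed above
beta.head  <- c(rep(0, 9), 7.374246, 2.030662, 1.071715, 0, 0)

# Group labels of the first 14 columns: 1 1 1 2 2 2 3 3 3 4 4 4 5 5
group.head <- c(rep(1:4, each = 3), 5, 5)

# Groups that contain at least one nonzero estimate
nonzero.groups <- unique(group.head[beta.head != 0])
print(nonzero.groups)
## [1] 4
```

This agrees with the selection shown earlier: entries 10 to 12 form group 4, which is the only selected group among the first five.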

It might also be interesting to plot the first few elements of the regularizing sequence \(\lambda\) used by the Group SLOPE method for the given inputs.

plot(result$lambda[1:10], xlab = "Index", ylab = "Lambda", type="l")

We check the performance of the method by computing the resulting (group) false discovery proportion (gFDP) and power.

n.selected    <- length(result$selected)
true.relevant <- names(group.id)[ind.relevant]
truepos       <- intersect(result$selected, true.relevant)

n.truepos  <- length(truepos)
n.falsepos <- n.selected - n.truepos

gFDP <- n.falsepos / max(1, n.selected)
pow <- n.truepos / length(true.relevant)

print(paste("gFDP =", gFDP))
## [1] "gFDP = 0.0909090909090909"
print(paste("Power =", pow))
## [1] "Power = 1"

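In this run, the realized gFDP (about 0.091) stays below the target gFDR of 0.1, and all 10 truly relevant groups are selected (power equal to 1). Note that gFDR control is a statement about the expected gFDP, so an individual run may fall above or below the target.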
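
For repeated simulations it can be convenient to wrap this computation in a function. The following is a minimal sketch; gfdp_power is a hypothetical helper, not part of grpSLOPE, and it is applied here to the selected and truly relevant groups printed earlier in this example.

```r
# Hypothetical helper: realized group FDP and power for one model fit
gfdp_power <- function(selected, relevant) {
  n.selected <- length(selected)
  n.truepos  <- length(intersect(selected, relevant))
  c(gFDP  = (n.selected - n.truepos) / max(1, n.selected),
    power = n.truepos / length(relevant))
}

# Selected and truly relevant groups as printed earlier in this example
selected.ex <- c("4", "10", "19", "24", "26", "28", "42", "50", "55", "67", "74")
relevant.ex <- as.character(c(4, 19, 24, 26, 28, 42, 50, 55, 67, 74))

# One false positive (group 10) out of 11 selections: gFDP = 1/11, power = 1
res.ex <- gfdp_power(selected.ex, relevant.ex)
print(res.ex)

# With an empty selection, max(1, n.selected) avoids a division by zero
res.empty <- gfdp_power(character(0), relevant.ex)
print(res.empty)
```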

Lambda sequence

Multiple ways to select the regularizing sequence \(\lambda\) are available.

If a group structure with little correlation between groups can be assumed (i.e., groups in the standardized model matrix are nearly orthogonal), then we suggest using the sequence chiMean, which is the default.

When groups of predictors cannot be assumed to be nearly orthogonal, the Monte Carlo-based sequences chiMC and gaussianMC might be advantageous. However, they increase the computation time significantly.

When the columns of the model matrix are exactly orthogonal to each other, the sequences chiOrthoMean and chiOrthoMax can be used together with the options orthogonalize=FALSE and normalize=FALSE.

Alternatively, any non-increasing sequence of appropriate length can be used. However, we do not recommend using any other sequence unless you are an expert on the (Group) SLOPE method.
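
Before passing a custom sequence, a few basic sanity checks may help. The sketch below uses a hypothetical helper, is_valid_lambda, which is not part of grpSLOPE. It assumes that "appropriate length" means one value per group and that the values should be nonnegative; both assumptions should be confirmed against the package documentation.

```r
# Hypothetical helper (not part of grpSLOPE): basic sanity checks for a
# user-supplied regularizing sequence
is_valid_lambda <- function(lambda, n.group) {
  is.numeric(lambda) &&
    length(lambda) == n.group &&   # assumption: one value per group
    !anyNA(lambda) &&
    all(lambda >= 0) &&            # assumption: penalties are nonnegative
    all(diff(lambda) <= 0)         # non-increasing (ties are allowed)
}

ok.seq     <- is_valid_lambda(c(3, 2, 2, 1), n.group = 4)   # TRUE
bad.order  <- is_valid_lambda(c(1, 2, 3, 4), n.group = 4)   # FALSE: increasing
bad.length <- is_valid_lambda(c(3, 2, 1),    n.group = 4)   # FALSE: wrong length
print(c(ok.seq, bad.order, bad.length))
```

Passing these checks only rules out malformed input; it says nothing about whether the sequence yields gFDR control.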