314 Handbook of Big Data
Clearly, the different p-values corresponding to the same predictor are not independent.
Nevertheless, we can aggregate them using an (arbitrary) prespecified γ-quantile, 0 < γ < 1,
leading to
\[
Q_j(\gamma) = \min\Bigl( \text{emp. } \gamma\text{-quantile}\bigl\{ P^{[b]}_{\mathrm{corr},j} / \gamma;\ b = 1, \ldots, B \bigr\},\ 1 \Bigr), \tag{17.6}
\]
the so-called quantile aggregated p-values, see [21] for details. The price that we have
to pay for using a (potentially small) quantile is the factor 1/γ. For example, if we
choose the median (γ = 0.5), we have to multiply all p-values by a factor of 2. The resulting
procedure is called the multisample splitting algorithm [21]. It is loosely related to stability
selection: for example, for γ = 0.5, we require a predictor to be selected in at least 50% of the sample splits with
a small enough p-value. Moreover, quantile aggregation as defined in Equation 17.6 is a
general (conservative) p-value aggregation procedure that works under arbitrary dependency
structures.
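To make Equation 17.6 concrete, the following is a minimal sketch in base R of the quantile aggregation for a single predictor. The function name aggregate_quantile and the example p-values are hypothetical, and the use of the inverse empirical distribution function (type = 1 in quantile) as the empirical quantile is an assumption; it is not necessarily the definition used in [21] or in any particular package.

```r
# Quantile aggregation of the corrected p-values of one predictor (Equation 17.6).
# pvals: numeric vector of length B holding the corrected p-values from the B splits
# gamma: prespecified quantile level with 0 < gamma < 1
# Assumption: the empirical quantile is the inverse empirical CDF (type = 1).
aggregate_quantile <- function(pvals, gamma) {
  stopifnot(gamma > 0, gamma < 1)
  # empirical gamma-quantile of the p-values rescaled by 1 / gamma
  q <- quantile(pvals / gamma, probs = gamma, type = 1, names = FALSE)
  # cap at 1 so that the result is a valid p-value
  min(q, 1)
}

# Hypothetical corrected p-values from B = 5 sample splits
pvals <- c(0.01, 0.02, 0.03, 0.20, 1.00)

# With gamma = 0.5 the aggregate is twice the median p-value
q_median <- aggregate_quantile(pvals, gamma = 0.5)
```

In this example, the single split with p-value 1 does not dominate the result, which illustrates why aggregating over splits is less sensitive to one unlucky split than a single-split analysis.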
A priori, it is not clear how to select the parameter γ. We can even search for the
best γ-quantile in a range $(\gamma_{\min}, 1)$, for example, $\gamma_{\min} = 0.05$, leading to the aggregated
p-value,
\[
P_j = \min\Bigl( (1 - \log \gamma_{\min}) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma),\ 1 \Bigr), \qquad j = 1, \ldots, p.
\]
The price for this additional search is the factor $1 - \log(\gamma_{\min})$. Under suitable assumptions,
the p-values $P_j$ control the familywise error rate [21]. The smaller we choose $\gamma_{\min}$,
the more susceptible we are again to a specific realization of the B sample splits. Therefore,
we should choose a large value of B in situations where γ is small.
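The search over γ and its price can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation of [21]: the function name aggregate_adaptive is hypothetical, the infimum over the open interval is approximated on a finite grid of γ values (step 0.01, an arbitrary choice), and the empirical quantile is again taken as the inverse empirical distribution function.

```r
# Aggregated p-value with a search over gamma, for a single predictor.
# pvals:     corrected p-values of the predictor from the B sample splits
# gamma_min: lower end of the search range for gamma
# Assumptions: grid approximation of the infimum; quantile type = 1.
aggregate_adaptive <- function(pvals, gamma_min = 0.05) {
  stopifnot(gamma_min > 0, gamma_min < 1)
  # finite grid of candidate gamma values between gamma_min and 0.99
  gammas <- seq(gamma_min, 0.99, by = 0.01)
  # quantile-aggregated p-value Q_j(gamma) for every gamma on the grid
  q <- vapply(gammas, function(g) {
    min(quantile(pvals / g, probs = g, type = 1, names = FALSE), 1)
  }, numeric(1))
  # price for searching over gamma: the factor 1 - log(gamma_min)
  penalty <- 1 - log(gamma_min)
  min(penalty * min(q), 1)
}

# For gamma_min = 0.05 the penalty factor is roughly 4
penalty_005 <- 1 - log(0.05)

# Hypothetical example: a predictor with p-value 0.001 in each of B = 100 splits
p_strong <- aggregate_adaptive(rep(0.001, 100), gamma_min = 0.05)
```

For $\gamma_{\min} = 0.05$ the search therefore costs a factor of about 4, compared with the factor 2 paid for fixing the median in advance.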
Multisample Splitting in R
We use the function multi.split in the R-package hdi.
> library(hdi)
> fit.multi <- multi.split(x, y)
We can use any model for which there is a model selection function (defined in argument
model.selector) and a classical p-value function (argument classical.fit). The
default uses lasso (with cross-validation) and a linear model fit. Extensions to GLMs
and many more models are (from a technical point of view) straightforward.
The p-values are stored in pval.corr. Note that by construction the multisample
splitting algorithm (only) provides p-values for familywise error control.
Confidence intervals can also be obtained through the function confint; however,
the level must already be set in the call to multi.split (argument
ci.level).
17.5 Hierarchical Approaches
In most applications we are faced with (strongly) correlated design matrices. Already in
the low-dimensional case, two strongly correlated predictors might have large (individual)
p-values, while the joint null hypothesis can be clearly rejected. Moreover, too strong
correlation in the design matrix might also violate assumptions of the previously discussed