
where Z is the sum of the first four components of W. Finally, Y is drawn from a Bernoulli distribution with success probability
\[
\frac{1}{1 + \exp(1 + 0.5Z - 0.3A)}.
\]

The first four components of W are related to both A and Y so they are confounders. In

general, failing to adjust for confounding variables will result in a biased estimate of the

ATE. The other components of W are not confounders and just add noise to the data. The true parameter ψ0 is approximately 0.0602, whereas a naive estimate of ψ0 that fails to adjust for W is approximately −0.026; in this simulation, the confounding is strong enough that the unadjusted estimate is badly biased.

23.5.2 SGD for Logistic Regression

We estimate both G0 and Q¯0 with main-term logistic regressions whose parameters are computed with minibatch gradient descent.

We estimate G0 with a logistic regression model with main terms for each component of W, and Q¯0 with a logistic regression model with main terms for A and each component of W. To investigate the double robustness of the estimator, we misspecify the models for G0 and Q¯0 by leaving out W.

For concreteness, we describe the SGD algorithm for logistic regression to estimate
Q¯0; estimating G0 is analogous. Let Xi be a d-dimensional vector of predictor variables including a constant, Ai, and, when the model is correctly specified, Wi. Let θ be a vector of the same dimension as Xi. For a particular θ, Q¯(Ai, Wi) is computed as

\[
\frac{1}{1 + \exp(-\theta^\top X_i)}.
\]

Initializing θ0 to a vector of 0s, θj at minibatch j is computed as

\[
\theta_j = \theta_{j-1} - \eta_j \frac{1}{m} \sum_{i=n_{j-1}+1}^{n_j} -\left( Y_i - \frac{1}{1 + \exp(-\theta_{j-1}^\top X_i)} \right) X_i
\]

where ηj is the step size.
We use minibatches of size 100 and step size

\[
\eta_j = \frac{0.05}{1 + 0.05\,\alpha\, j}
\]

for minibatch j where α is 0.01 for Q¯0 and 0.1 for G0.
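The update and step-size schedule above are straightforward to implement. The following Python sketch (NumPy only) is our own minimal illustration of one minibatch step; the function names and the toy data are illustrative and not taken from the chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_update(theta, X_batch, y_batch, eta):
    """One minibatch gradient-descent step for logistic regression.

    theta   : (d,) current coefficient vector
    X_batch : (m, d) predictors for the minibatch (first column all 1s for the intercept)
    y_batch : (m,) binary outcomes
    eta     : step size for this minibatch
    """
    p = sigmoid(X_batch @ theta)                         # current predicted probabilities
    grad = -(X_batch.T @ (y_batch - p)) / len(y_batch)   # minibatch gradient of the negative log-likelihood
    return theta - eta * grad

def step_size(j, eta0=0.05, alpha=0.01):
    """Decaying step size eta_j = eta0 / (1 + eta0 * alpha * j), as in the schedule above."""
    return eta0 / (1.0 + eta0 * alpha * j)

# Toy usage: stream 10 minibatches of size m = 100 through the update.
rng = np.random.default_rng(0)
d, m = 5, 100
true_theta = np.array([0.5, 1.0, -1.0, 0.3, 0.0])
theta = np.zeros(d)
for j in range(1, 11):
    X = np.column_stack([np.ones(m), rng.normal(size=(m, d - 1))])
    y = rng.binomial(1, sigmoid(X @ true_theta))
    theta = sgd_logistic_update(theta, X, y, step_size(j))
```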

23.5.3 Results

We constructed datasets of size n = 10^6. Using double precision floating point numbers for
storage, a single dataset is more than 15 GB uncompressed, so computing a batch estimate
of ψ0 on a typical personal computer is non-trivial. For each simulated dataset, the OLOS
estimator is run using either correctly or incorrectly specified initial estimators of Q¯0 and
G0. The current estimate ψnj is recorded at each minibatch j.


To investigate the performance of the estimator as a function of sample size, we

compute the observed bias and variance at each minibatch across 1000 simulated datasets.

Figure 23.1 shows the bias as a function of sample size scaled by the square root of the

accumulated sample size.

For an estimator to be asymptotically linear, the bias should be oP(1/√n); this means we want to see the scaled bias converge to zero as n increases. We see that in all

cases, bias is quite high when sample size is small. This is because estimates of Q¯0 and G0
are far from the truth for the first few hundred minibatches. When only one of Q¯0 or G0

is correctly specified, bias moves toward zero slowly, because we are relying on the double
robustness of the OLOS. When both Q¯0 and G0 are estimated by a correctly specified

model, the scaled bias approaches 0 reasonably quickly.

Bias is high in early minibatches and it takes many steps to recover. To avoid some of

the initial bias, it is sometimes helpful to discard estimates of ψ0 in early minibatches and
compute the final estimate of ψ0 as

\[
\frac{1}{K - K_0} \sum_{j=K_0+1}^{K} \psi_{n_{j-1}:n_j}
\]

where K0 is the number of discarded minibatches. This allows estimates of Q¯0 and G0 to
get warmed up with the first K0 minibatches, and in experiments we see this can reduce
bias substantially. (Results not shown.)
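As a minimal sketch of this burn-in rule (the variable names are ours), one simply averages the per-minibatch estimates recorded after the first K0 minibatches:

```python
import numpy as np

def burned_in_estimate(psi_by_minibatch, K0):
    """Average the recorded per-minibatch estimates of psi_0, discarding the first K0."""
    psi = np.asarray(psi_by_minibatch, dtype=float)
    return psi[K0:].mean()

# e.g., psi_final = burned_in_estimate(psi_estimates, K0=500)
```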

Figure 23.2 plots the variance of the OLOS estimates scaled by accumulated sample

size. The red line denotes the variance of the efficient influence curve at P0, which is
approximately 0.95 in this simulation. If an estimator is efficient, its variance scaled by the
sample size will approach the variance of the efficient influence curve as n → ∞. As expected,
we see that when both Q¯0 and G0 are consistently estimated, the OLOS estimator is efficient.
When only one of Q¯0 or G0 is consistently estimated, the OLOS estimator is not efficient
in general, but in this simulation we see that the variance is close to the efficiency bound.

FIGURE 23.1
Bias scaled by √n by sample size (sample sizes up to 1.0 × 10^6); the three curves correspond to Q correct, G correct; Q misspec, G correct; and Q correct, G misspec.


FIGURE 23.2
Variance scaled by n by sample size, with a horizontal line at the optimal asymptotic variance; the three curves correspond to Q correct, G correct; Q misspec, G correct; and Q correct, G misspec.

23.6 Discussion

The OLOS estimator gives us a way to estimate the ATE in a single pass through a dataset,
getting running estimates along the way. Although bias can be high when a relatively small
amount of data is processed, asymptotically we can do as well as we would with a batch
estimator that typically requires many passes through the dataset.

In the example in Section 23.5, we used main term logistic regression to estimate both
Q¯0 and G0. This can easily be extended to any user-specified basis function of W . However,
generalized linear models will not include the true Q¯0 and G0 in general, and more flexible
online estimators are needed.

SGD-based optimization is a natural way to turn batch algorithms into online algorithms
and is applicable to a wide range of estimation methods. One such class of estimators
is multilayer neural networks, which are incredibly flexible. Typically, because of the
computational complexity of neural networks, they are fit with SGD in the batch setting
with multiple passes through the dataset. In principle, one could fit a neural network as an
online estimator, though a single pass through even very large datasets may not be enough
for the neural network to reach the asymptotic regime.

There are other examples in the literature of estimators that have been extended to
the online setting that are not based on SGD. Examples include bagging and boosting [9],
random forests [1,13], and generalized additive models [19].

It is not clear a priori when one estimator should be chosen over another. There are also typically choices of tuning parameters, for example, the level of regularization. For online methods, the fitting procedure may bring its own set of choices. For example, there are a number of
variants of SGD for optimization, each of which has some tuning parameters. In the batch


setting, cross-validation can be used to select between many estimators, or a model stacking
approach such as the super learner algorithm [17] can be used to choose a combination of
estimators. A similar approach can be taken in an online setting by using each new minibatch
as an independent validation set to estimate out-of-sample performance before updating an
estimator with that minibatch. In this way, many estimators with different choices of tuning
parameters can be fit concurrently, and the best or a combination can be selected based on
the estimated out-of-sample performance.
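A minimal sketch of this idea follows. Each candidate is scored on a new minibatch before being updated with it, so the cumulative loss is an honest estimate of out-of-sample performance; the candidate interface (predict/update methods) and the loss function are illustrative assumptions, not a prescription from this chapter.

```python
import numpy as np

def online_select(candidates, minibatch_stream, loss):
    """Fit several online candidate estimators concurrently and pick the best.

    candidates       : list of objects with .predict(X) and .update(X, y) methods
    minibatch_stream : iterable of (X, y) minibatches
    loss             : function(y, y_hat) -> average loss over a minibatch
    Returns the index of the candidate with the smallest cumulative out-of-sample loss.
    """
    cum_loss = np.zeros(len(candidates))
    for X, y in minibatch_stream:
        # Score every candidate on the new minibatch *before* it is used for fitting.
        for k, cand in enumerate(candidates):
            cum_loss[k] += loss(y, cand.predict(X))
        # Only then update each candidate with the minibatch.
        for cand in candidates:
            cand.update(X, y)
    return int(np.argmin(cum_loss))
```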

References

1. Hanady Abdulsalam, David B. Skillicorn, and Patrick Martin. Streaming random
forests. In 11th International Database Engineering and Applications Symposium (IDEAS 2007), pp. 225–232. IEEE, Alberta, Canada, 2007.

2. Peter J. Bickel, Chris A.J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. Efficient
and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,
Baltimore, MD, 1993.

3. Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton
stochastic gradient descent. The Journal of Machine Learning Research, 10:1737–1754,
2009.

4. Léon Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT, pp. 177–186. Springer, Berlin, Germany, 2010.

5. Léon Bottou. Stochastic gradient descent tricks. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, pp. 421–436.
Springer, Berlin, Germany, 2012.

6. Léon Bottou and Yann Le Cun. On-line learning for very large data sets. Applied
Stochastic Models in Business and Industry, 21(2):137–151, 2005.

7. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. The Journal of Machine Learning Research,
12:2121–2159, 2011.

8. Noboru Murata. A Statistical Study of On-Line Learning: Online Learning and Neural
Networks. Cambridge University Press, Cambridge, 1998.

9. Nikunj C. Oza. Online bagging and boosting. In IEEE International Conference on
Systems, Man and Cybernetics, volume 3, pp. 2340–2345. IEEE, 2005.

10. Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

11. James M. Robins, Andrea Rotnitzky, and Lue P. Zhao. Estimation of regression
coefficients when some regressors are not always observed. Journal of the American
Statistical Association, 89(427):846–866, 1994.

12. Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1):41–55, 1983.


13. Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof.
On-line random forests. In IEEE 12th International Conference on Computer Vision
Workshops, pp. 1393–1400. IEEE, Kyoto, Japan, 2009.

14. Janet M. Box-Steffensmeier, Henry E. Brady, and David Collier. The Oxford Handbook
of Political Methodology, Oxford University Press, Oxford, New York, 2008.

15. Mark J. van der Laan. Targeted maximum likelihood based causal inference: Part I.
The International Journal of Biostatistics, 6(2), 2010.

16. Mark J. van der Laan and Samuel D. Lendle. Online Targeted Learning. Working Paper
330, U.C. Berkeley Division of Biostatistics, University of California, Berkeley, CA,
September 2014.

17. Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical
Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.

18. Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for
Observational and Experimental Data. Springer-Verlag, New York, 2011.

19. Simon N. Wood, Yannig Goude, and Simon Shaw. Generalized additive models for
large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics),
139–155, 2014.

20. Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient
descent. CoRR, abs/1107.2490, 2011.

21. Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.

24

Mining with Inference: Data-Adaptive Target
Parameters

Alan Hubbard and Mark van der Laan

CONTENTS
24.1 Introduction
    24.1.1 Motivating Example
24.2 General Methodology
24.3 Examples
    24.3.1 Inference for Sample-Split Cluster-Specific Target Parameters, Where the Data are Used to Define Clusters
24.4 Data Analysis: ACIT Variable Importance Trauma Study
    24.4.1 Results
24.5 Concluding Remarks
Acknowledgments
References

24.1 Introduction

There has been an explosion of empirical approaches to scientific questions driven by the
growing systematic collection of data. Though there has been great progress in developing
quantitative statistical methods well suited to exploratory, data-driven analysis, there still
remains much to be done for deriving estimators and robust inference of relevant parameters
in such a context. The growth of so-called big data has inspired pattern-finding procedures,
for instance, its use in evidence-based methods for optimizing health care [1].

However, less emphasis has been given to formally defining the parameters such
procedures discover. Thus, an obvious first step necessary for driving theoretical results
is to explicitly define such data-adaptive parameters. The goal of this chapter is to address
this gap, by providing formal definitions of data-adaptive parameters, as well as general
methodology for estimation and statistical inference (see also [2]).

24.1.1 Motivating Example

Though the methodology described in this chapter is quite general, we illustrate the method
by a particularly challenging causal inference/variable importance estimation problem.
Consider data from the Activation of Coagulation and Inflammation in Trauma (ACIT)
study [3,4]. The ACIT study is designed to identify clinically important mechanisms by
which the inflammatory and coagulation pathways are altered following major trauma.



Detailed and somewhat messy data were collected from as near the original injury as
possible. These data are used to characterize coagulation and inflammation after injury,
and to identify variables important for impacting outcomes, like death. Beginning in the
emergency department (generally, soon after injury), blood was prospectively sampled
from severely injured trauma patients. In addition, comprehensive demographic and injury
variables were also collected. The result is a wide dataset, with detailed, though irregularly
collected longitudinal measurements of many patient-level factors that could contribute to
clinical outcomes, such as multiple organ failure and death. In our case, we examine early
factors and their impact on death within the first 24 hours after injury.

This chapter is about using the data to both define a target parameter, and yet estimate
and draw inferences about a well-defined parameter. As a concrete example, consider that
we have a large number of predictors, R (e.g., injury severity, measures of platelets, presence of alcohol in blood, etc.), all measured from the time of injury, and an outcome, death between 6 and 24 hours: Y = 1 for death, Y = 0 for alive at 24 hours. We want to know which of these variables is most important for explaining (predicting) Y among those alive at 6 hours. This is therefore a variable importance measure (VIM) estimation problem, as opposed to deriving VIMs as a byproduct of some parametric model or machine learning procedure [5].

We [6] have advocated for targeted estimation of parameters motivated by causal
inference, for example, see [7]. We have previously applied such techniques to look at variable
importance for related trauma data [8]. Define the data, for a particular variable of interest
(A ∈ R), as O = (W, Δ, Δ ∗ (A, Y )), where W is the original covariate vector with the
current variable of interest, A, removed (this definition is repeated for each variable that
comprises W), and Δ is the missingness indicator for (A, Y) (Δ = 1 if neither is missing, 0 otherwise). Assume for now that A is discrete and there is an a priori highest risk level (aH)
and a lowest risk level (aL). Then, a candidate for variable importance that would allow
comparisons across different candidate predictors (each with their own (aL, aH )) is:

EW {E(Y | A = aH , W ) − E(Y | A = aL, W )} (24.1)

or a weighted average of the stratified mean differences comparing, within strata of W, subjects with aH versus those with aL. Though we do not emphasize identifiability conditions for this application, under suitable assumptions Equation 24.1 identifies E{Y(aH) − Y(aL)}, where Y(a) is the counterfactual outcome for patients if, possibly contrary to fact, their A was set to a. In any case, it is a VIM that can be compared across different covariates and is
a pathwise differentiable parameter, and thus one can derive semiparametrically efficient,
asymptotically normal estimators.

However, the reality is that candidates for A are not always discrete, and even when they are, there are no objectively defined high- and low-risk levels for many if not most of the potential predictors. Thus, we need to use the data (data adaptively) to define the levels aH and aL. Let Pn denote the empirical distribution of our sample; one could apply an algorithm of some kind to Pn to define these high- and low-risk levels in A, or in shorthand (aL(Pn), aH(Pn)). For instance, if A were discrete with arbitrary levels A = (a1, a2, ..., ai, ..., aK), we could define aL(Pn) and aH(Pn) as

\[
a_L(P_n) = \operatorname*{argmin}_{a \in \mathcal{A}} \frac{1}{n}\sum_{i=1}^{n} Q_n(a, W_i), \qquad
a_H(P_n) = \operatorname*{argmax}_{a \in \mathcal{A}} \frac{1}{n}\sum_{i=1}^{n} Q_n(a, W_i)
\]

where Qn(A, W ) is an estimate (based on Pn) of Q(A, W ) ≡ E(Y | A, W ). Simply, one
finds the levels (aL(Pn), aH (Pn)) that maximize the substitution estimate of Equation 24.1
according to some regression estimate Qn(A, W ) (we discuss below how to optimize Qn).
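To make this concrete, the short sketch below (our own illustration; the regression object Q_n with a predict method is an assumed interface) evaluates the empirical mean of Qn(a, W) at each candidate level and returns the minimizing and maximizing levels:

```python
import numpy as np

def data_adaptive_levels(Q_n, W, levels):
    """Return (a_L, a_H): the levels minimizing/maximizing (1/n) * sum_i Q_n(a, W_i).

    Q_n    : fitted regression with Q_n.predict(X), where X stacks [A, W] columns
    W      : (n, q) covariate matrix
    levels : list of candidate levels of the discretized A
    """
    means = []
    for a in levels:
        A_col = np.full((W.shape[0], 1), float(a))
        means.append(Q_n.predict(np.hstack([A_col, W])).mean())  # empirical E_W of Q_n(a, W)
    means = np.asarray(means)
    return levels[int(np.argmin(means))], levels[int(np.argmax(means))]
```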


Thus, we have a data-adaptive target parameter: EW {E(Y | A = aH (Pn), W ) − E(Y | A =
aL(Pn), W )}, for which the substitution estimator

\[
\frac{1}{n}\sum_{i=1}^{n} \left\{ Q_n(a_H(P_n), W_i) - Q_n(a_L(P_n), W_i) \right\} \qquad (24.2)
\]

can suffer from serious bias due to overfitting. For instance, under the statistical null that
EW {E(Y | A = aH (Pn), W ) − E(Y | A = aL(Pn), W )} = 0, the estimate (Equation 24.2)
will always be positively biased (it is always ≥ 0). Others have noted dangers of false
positives from the combination of high-dimensional data and flexible methodologies [9,10].
However, if one had done sample splitting such that (1) a training sample was used to
define (aL(Pn,tr), aH(Pn,tr)), and (2) an estimation sample was used to estimate EW{E(Y | A = aH(Pn,tr), W) − E(Y | A = aL(Pn,tr), W)}, then consistent estimation is possible.
However, estimation has lower power as the sample size is split into two. What one would
like is a data-adaptive procedure that can use the data to define the target parameter, but
still estimate it consistently and derive robust inference. Thus, we discuss methods that
use sample splitting to avoid bias from overfitting, but still can use the entire dataset to
estimate a data-adaptive parameter. These methods apply in circumstances where there is
little constraint on how the data are explored to generate potential parameters of interest.
Such methods can capitalize on the very large sample sizes and/or very high dimension
associated with big data.

We present the material in the following order: in Section 24.2, we will define our general
methodology, using sample splitting or not, present theorems establishing the asymptotic
statistical performance, and discuss its implications, as well as present influence curve-
based inference. In Section 24.3, we discuss an example of an interesting target parameter
generated via the data-adaptive methodology presented in Section 24.2. In Section 24.4, we
revisit the example of VIMs in the acute trauma study, and end with concluding remarks.

24.2 General Methodology

Let O1, . . . , On be i.i.d. with probability distribution P0, an element of the statistical model
M; let Pn represent the empirical distribution of a sample. We discuss our estimator in
the context of V-fold cross-validation. Let us assume that the sample has been divided at random into V subsamples of size n/V by randomly assigning to each observation an integer in (1, 2, . . . , V); we call the resulting vector of assignments Z. In this way, we can define random splits of the Oi into an estimation sample (Z = v) and a parameter-generating (training) sample (Z ≠ v). Thus, this corresponds with a V-fold cross-validation scheme: (1) {1, . . . , n} is divided into V equal-size subgroups, (2) an estimation sample is defined by one of the subgroups, and (3) the parameter-generating sample is its complement, resulting in V such splits of the sample.

For a given random split v, let Pn,vc be the empirical distribution of the parameter-
generating sample, and Pn,v be the empirical distribution of the estimation sample. Thus, for
a particular sample split, ΨPn,vc : M → R is the target parameter mapping indexed by the
parameter-generating sample Pn,vc , and Ψˆ Pn,vc : MNP → R the corresponding estimator
of this target parameter. Here MNP is the nonparametric model and an estimator is defined
as a mapping/algorithm from a nonparametric model, including the empirical distributions,

to the parameter space. Assume that the parameter is real valued.


The choice of target parameter mapping and corresponding estimator can be informed

by the data Pn,vc , but not by the corresponding estimation sample Pn,v. Define the sample-
split data-adaptive statistical target parameter as Ψn : M → R with

\[
\Psi_n(P) = \mathrm{Ave}\{\Psi_{P_{n,v^c}}(P)\} \equiv \frac{1}{V}\sum_{v=1}^{V} \Psi_{P_{n,v^c}}(P)
\]

and the statistical estimand of interest is thus

ψn,0 = Ψn(P0) = Ave{ΨPn,vc (P0)}. (24.3)

This target parameter mapping depends on the data, and thus we call it a data-adaptive
target parameter. A corresponding estimator of the data-adaptive estimand ψn,0 is given by

ψn = Ψˆ (Pn) = Ave{Ψˆ Pn,vc (Pn,v)}. (24.4)


We show below that √n(ψn − ψn,0) converges to a mean zero normal distribution with variance σ2 that can be consistently estimated, allowing the construction of confidence intervals for ψn,0 and also allowing testing of a null hypothesis such as H0 : ψn,0 ≤ 0. In particular, this would hold if ψn = Ψˆ(Pn) is an asymptotically linear estimator of ψn,0 with the influence curve IC(P0):
\[
\psi_n - \psi_{n,0} = (P_n - P_0)\,IC(P_0) + o_P(1/\sqrt{n}),
\]
where we use the notation Pf ≡ ∫ f(o) dP(o) for the expectation of f(O) with respect to P. Because (Pn − P0)IC(P0) = (1/n) Σi IC(P0)(Oi) is a sum of mean-zero independent random variables, this asymptotic linearity implies that √n(ψn − ψn,0) converges to a mean zero normal distribution with variance σ2 = P0IC(P0)2.

Theorem 24.1 Suppose that, given (Pn,vc), ΨˆPn,vc is an asymptotically linear estimator of ΨPn,vc(P0) at P0 with the influence curve ICv,Pn,vc(P0) indexed by (v, Pn,vc):
\[
\hat{\Psi}_{P_{n,v^c}}(P_{n,v}) - \Psi_{P_{n,v^c}}(P_0) = (P_{n,v} - P_0)\,IC_{v,P_{n,v^c}}(P_0) + R_{n,v},
\]
where (unconditionally) Rn,v = oP(1/√n). Assume V-fold cross-validation, and for a given split v, assume that P0ICv,Pn,vc(P0)2 − P0ICv(P0)2 → 0 in probability, where ICv(P0) is a limit influence curve that can still be indexed by the split v.

Then,
\[
\sqrt{n}(\psi_n - \psi_{n,0}) = \frac{1}{V}\sum_{v=1}^{V} \sqrt{V}\,\sqrt{n/V}\,(P_{n,v} - P_0)\,IC_{v,P_{n,v^c}}(P_0) + o_P(1)
\]
converges to a mean zero normal distribution with variance
\[
\sigma^2 = \frac{1}{V}\sum_{v=1}^{V} \sigma_v^2,
\]
where σv2 = P0ICv(P0)2. A consistent estimator of σ2 is given by
\[
\sigma_n^2 = \frac{1}{V}\sum_{v=1}^{V} P_n\, IC_{v,n}^2,
\]
where ICv,n is an L2(P0)-consistent estimator of ICv(P0). Alternatively, one can use
\[
\sigma_n^2 = \frac{1}{V}\sum_{v=1}^{V} P_{n,v}\, IC_{v,P_{n,v^c}}(P_{n,v})^2 \qquad (24.5)
\]


where ICv,Pn,vc (Pn,v) is an L2(P0)-consistent estimator of ICv,Pn,vc (P0) based on the
sample Pn,v.

See [2] for proof.
Note that these results generalize to cases where the mapping also relies on the split

itself. For instance, the general theory covers the case in which each training sample is sent to a separate research group that is asked to derive a mapping for estimating some parameter, and each of these mappings is then applied with the corresponding training sample; to estimate the resulting data-adaptive parameter, one does not need to know how the mapping was derived, nor whether a common algorithm was applied across all training samples. However, for the rest of the chapter we emphasize, as implied by the notation above (by not indexing Equation 24.3 by Z in addition to Pn,vc), those cases where the mapping is a fixed algorithm and so is only a function of the empirical distribution of the training sample.

We note that in some circumstances, one can improve the efficiency of the data-adaptive
parameter estimates by using a cross-validated, targeted maximum likelihood approach
(CV-TMLE; see Chapter 27 in [11]).

For advice on the number of splits, V , see [2]. In addition, we also derived results for the
classic data-mining procedure, where the entire sample is used both to define the parameter
and then subsequently to estimate it (Theorem 3 in [2]). The conditions of this theorem
imply greater restrictions (detailed in [2]) on the use of algorithms for defining data-adaptive
target parameters.
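As a schematic illustration of the estimator (Equation 24.4) and the variance estimator (Equation 24.5), the sketch below carries out the V-fold loop. The three user-supplied callables stand in for whatever data-adaptive parameter-generating algorithm, estimator, and influence-curve estimator are being used; they are assumptions for illustration, not part of the chapter's specification.

```python
import numpy as np
from scipy import stats

def data_adaptive_estimate(data, V, fit_parameter, estimate_on, ic_on, seed=0):
    """Sample-split data-adaptive estimate with influence-curve-based inference.

    data          : numpy array of n observations (rows)
    V             : number of folds
    fit_parameter : training rows -> fitted object defining the fold-specific target parameter
    estimate_on   : (fitted, estimation rows) -> point estimate on the estimation sample
    ic_on         : (fitted, estimation rows) -> estimated (centered) influence curve values
    """
    n = len(data)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), V)
    psi_v, var_v = [], []
    for v in range(V):
        est_idx = folds[v]
        train_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        fitted = fit_parameter(data[train_idx])           # parameter-generating sample
        psi_v.append(estimate_on(fitted, data[est_idx]))  # estimate on the estimation sample
        ic = ic_on(fitted, data[est_idx])
        var_v.append(np.mean(ic ** 2))                    # P_{n,v} IC^2, as in Equation 24.5
    psi_n, sigma2_n = np.mean(psi_v), np.mean(var_v)
    se = np.sqrt(sigma2_n / n)
    z = stats.norm.ppf(0.975)
    return psi_n, (psi_n - z * se, psi_n + z * se)
```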

24.3 Examples

24.3.1 Inference for Sample-Split Cluster-Specific Target
Parameters, Where the Data are Used to Define Clusters

Suppose one has a very high dimensional set of variables, potentially correlated in relatively
distinct groups, but for which the definition of such groups is not known a priori, and thus
must be determined empirically. Furthermore, the summaries of the values of the variables in
these blocks represent meaningful summaries of their joint relationship with an explanatory
variable. One such situation might be genomic experiments, where the expression of highly
correlated genes represents distinct pathways that might be simultaneously triggered by
the same intervention (e.g., drug). Thus, one might derive significant variable reduction
by creating summaries of the highly correlated gene expression in these blocks, as well
as exploratory summaries of potential pathways to more efficiently measure the impact
of the intervention of interest. Because this involves an exploratory part (forming the
blocks/clusters), but still with the need for formal statistical inference to determine the
significance of the relationship of these clusters with the intervention, it is an ideal
application for the methodologies in this chapter.

Suppose that one observes on each subject a p-dimensional gene-expression profile Y ∈
IRp, a binary treatment/exposure A, and a vector of baseline characteristics W . Thus,
O = (W, A, Y ) and we observe n i.i.d. copies O1, . . . , On. Consider an algorithm that maps
a dataset O1, . . . , On into a cluster C ⊂ {1, . . . , p} of genes. Denote this cluster estimator
with Cˆ : MNP → C, where C is the space of possible cluster values. Given a realized cluster
C, let ΨC : M → IR be a desired parameter of interest such as the effect of treatment A on
Y (M (C)), controlling for the baseline covariates W , where M (C) is the medoid/center of
the cluster C, defined as

ΨC(P0) = E0{E0(Y (M (C)) | A = 1, W ) − E0(Y (M (C)) | A = 0, W )}


Alternatively, one might define ΨC(P0) as the average over all genes j in cluster C of the effect of treatment A on Y(j), controlling for W, defined as
\[
\Psi_C(P_0) = \frac{1}{|C|} \sum_{j=1}^{|C|} E_0\{E_0(Y(j) \mid A = 1, W) - E_0(Y(j) \mid A = 0, W)\}
\]

Let Ψˆ C : MNP → IR be an estimator of ΨC (P0), for instance, inverse probability
of treatment weighted (IPTW [12]), or targeted maximum likelihood estimator (TMLE

[11,13]). If one assumes that the regularity conditions hold so that, for instance, the TMLE
estimator, Ψˆ C(Pn), is asymptotically linear with the influence curve ICC (P0), then

Ψˆ C (Pn) − ΨC (P0) = (Pn − P0)ICC (P0) + RC,n


where RC,n = oP(1/√n). We define ΨPn,vc : M → IR as ΨPn,vc = ΨCˆ(Pn,vc), that is, the
causal effect of treatment on the data-adaptively determined cluster Cˆ(Pn,vc ). Similarly, we
define Ψˆ Pn,vc : MNP → IR as Ψˆ Pn,vc = Ψˆ Cˆ(Pn,vc ), that is, the TMLE of the W -controlled
effect of treatment of this data-adaptively determined cluster, treating the latter as given.
The estimand of interest is thus defined as ψn,0 = Ave{ΨPn,vc (P0)} and its estimator is
ψn = Ave{Ψˆ Pn,vc (Pn,v)}. Thus, for a given split, we use the parameter-generating sample
Pn,vc to generate a cluster Cˆ(Pn,vc ) and the corresponding TMLE of Ψˆ Cˆ(Pn,vc )(P0) applied
to the estimation sample Pn,v, and these sample-split specific estimators are averaged across
the V sample splits. By assumption we have for each split v:

Ψˆ Cˆ(Pn,vc )(Pn,v ) − ΨCˆ(Pn,vc )(P0) = (Pn,v − P0)I CCˆ(Pn,vc )(P0) + RCˆ(Pn,vc ),n


where we now assume that (unconditionally) RCˆ(Pn,vc),n = oP(1/√n). In addition, we
assume that P0{ICCˆ(Pn,vc )(P0)}2 converges to P0{ICCˆ(P0)(P0)}2 for a limit cluster Cˆ(P0).
Application of Theorem 24.1 now proves that ψn − ψn,0 is asymptotically linear with the
influence curve ICCˆ(P0)(P0) so that it is asymptotically normally distributed with mean
zero and variance σ2 = P0ICCˆ(P0)(P0)2.
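One fold of this procedure might look like the sketch below. The clustering (scipy hierarchical clustering on correlation distance), the medoid choice, and the simple inverse-probability-of-treatment-weighted (IPTW) estimate are simplified stand-ins for the clustering and TMLE/IPTW machinery referenced in the text; the function and its inputs are our own illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression

def fold_cluster_effect(Y_train, Y_est, A_est, W_est, n_clusters=5):
    """Learn a gene cluster on the parameter-generating half; estimate an IPTW effect on the other half.

    Y_train : (n_tr, p) gene expression, parameter-generating sample
    Y_est   : (n_est, p) gene expression, estimation sample
    A_est   : (n_est,) binary treatment, estimation sample
    W_est   : (n_est, q) baseline covariates, estimation sample
    """
    # Cluster the p genes using only the parameter-generating half.
    Z = linkage(Y_train.T, method="average", metric="correlation")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    cluster = np.where(labels == labels[0])[0]   # one realized cluster C (the one containing gene 0)

    # Medoid gene: the member with the smallest total correlation distance to the others.
    if len(cluster) == 1:
        medoid = cluster[0]
    else:
        D = 1.0 - np.corrcoef(Y_train[:, cluster].T)
        medoid = cluster[int(np.argmin(D.sum(axis=0)))]

    # IPTW estimate of E{Y(medoid) | A = 1} - E{Y(medoid) | A = 0}, adjusting for W, on the estimation half.
    g = LogisticRegression(max_iter=1000).fit(W_est, A_est).predict_proba(W_est)[:, 1]
    y = Y_est[:, medoid]
    return np.mean(A_est * y / g) - np.mean((1 - A_est) * y / (1 - g))
```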

There are many potential applications of this approach, too numerous to list here (see [2] for more examples and extensive simulations). One particularly important application is the estimation of the causal effect of treatment rules that are learned from the data. This has obvious practical implications in many fields, including medicine, political science, and any discipline where the interest is in the impact of an approach that tailors an intervention to the characteristics of the statistical units and where the data must be used to learn the best such rule.

24.4 Data Analysis: ACIT Variable Importance Trauma Study

We revisit estimating VIMs of the form of Equation 24.2 for the ACIT trauma data. The target population is those patients who survive up to 6 hours after their injury, and the outcome Y is the indicator of death from 6 to 24 hours; among the n = 1277 observations (subjects) alive at 6 hours, the proportion of deaths in this next interval is 3.5%. We estimate the variable importance from a combination of ordered continuous variables, ordered discrete variables, and factors, for a total of 108 potential predictors. Each of these 108 variables is treated in turn as the variable of interest, A, and the remainder as covariates. This


is an extremely messy dataset (missing values, sparsity, etc.), and thus, for each variable importance, we apply a number of automated data-processing steps to make estimation viable. These include:

1. Drop variables missing for more than 50% of observations.
2. Automatically determine which variables are ordered variables and which are unordered factors.
3. Drop variables whose distribution is too uneven, so that there is not enough experimentation across different levels to estimate a VIM.

Then, for each variable (when current variable of interest), we performed the following steps.

1. For a continuous variable, we constructed new ordered discrete variables (integers)
that map the original value into intervals defined by the empirical deciles of the
variable. This results in new ordered variables with values 1, ..., max(A), where
max(A) = 10 unless the original variable has fewer unique values. We then further
lump these values into groups by histogram density estimation using the penalized
likelihood approach [14,15] to bin values, avoiding very small cell sizes in the
distribution of A.

2. For each variable, we generated basis functions that indicate which observations
have missing values, so that for each original variable, say Wj , there is a new basis
(Δj , Δj ∗ Wj), where Δj = 1 if Wj is not missing for an observation, 0 otherwise
(see below for how this is used in the algorithm).

3. Use hierarchical clustering routine HOPACH [16] to cluster adjustment variables
(the matrices of all other predictors besides the current A as well as the matrix of
associated missingness indicators) to reduce the dimension of the adjustment set.

Each of these processing steps is tunable depending on sample size to the point where the
data are very close to the original form. These steps allow a very general messy dataset
of mixed variable types that can be automatically processed for VIM analysis. Thus, after
processing, for each VIM analysis we can represent the data as i.i.d., O = (W, A, Y ) ∼ P0 ∈
M, with W being all other covariates besides the current A, together with the basis functions related to covariate missingness.

We estimate our data-adaptive VIM based on algorithms motivated by the theorems discussed above. Specifically, we use the approach of Theorem 24.1 by performing sample splitting using two-fold cross-validation (splitting randomly into equal samples). The parameter is first defined by using the training sample to define (aL(Pn,vc), aH(Pn,vc)). To do so, we estimate, for each value a of the discretized A ∈ A, an estimand motivated by the causal parameter E(Ya).
Under identifiability assumptions (e.g., randomization, positivity; see Chapter 4 in [11]), we
can identify E(Ya) as

E0{E0(Y | A = a, W )} = E0{Q0(a, W )} (24.6)

where the subscript 0 indicates quantities under P0, and Q0(a, W) is the true regression (conditional mean) function
of Y on (A, W ). If we knew Q0, we would know the true levels of A to compare, that is,

aL ≡ argmina∈AE0{Q0(a, W)}
aH ≡ argmaxa∈AE0{Q0(a, W)}

so we could then estimate Equation 24.1 without having to discover values of (aL, aH ) using
the data. However, we do not know Q0 and must estimate it to estimate Equation 24.6 to
empirically define (aL, aH ). In the context of Theorem 24.1, we then use the training sample
to define these levels. We do so using the following algorithm:


1. Estimate θ0(a) ≡ E0,W {Q0(a, W )} for all levels a ∈ A using a semi-parametric,
locally efficient data-adaptive method TMLE [11].

a. To do so requires an initial estimate of the regression Q0(A, W ) and we use
the SuperLearner algorithm [17], an ensemble learner that itself uses cross-
validation to derive an optimal weighted combination of selected learners. For
this we used (1) LASSO via the glmnet package in R [18], (2) ADD OTHERS.

b. This also requires an estimate of the so-called treatment mechanism
(g0(a; W ) ≡ P0(A = a | W )) and the glmnet package was also used for this.

We can define these estimates for a specific training sample vc as θˆPn,vc (a).
2. Select the levels of A to compare in the corresponding training sample as

aL(Pn,vc ) = argmina∈AθˆPn,vc (a)
aH (Pn,vc ) = argmaxa∈AθˆPn,vc (a)

3. On the corresponding estimation sample, estimate the parameter

ΨPn,vc (P0) ≡ θ0(aH (Pn,vc )) − θ0(aL(Pn,vc )) (24.7)

We estimated ΨPn,vc (P0) on the corresponding estimation sample (say
Ψˆ Pn,vc (Pn,v)) using the same combination of SuperLearner and TMLE described
above as used for defining (aL(Pn,vc ), aH (Pn,vc )).

4. Derive the influence curve of these estimators on the validation sample. In this
case, the estimated influence curve is

\[
IC_{P_{n,v^c}}(P_{n,v}) = \left( \frac{I\{A = a_H(P_{n,v^c})\}}{g_{P_{n,v}}(a_H(P_{n,v^c}); W)} - \frac{I\{A = a_L(P_{n,v^c})\}}{g_{P_{n,v}}(a_L(P_{n,v^c}); W)} \right) \{Y - Q_{P_{n,v}}(A, W)\}
+ Q_{P_{n,v}}(a_H(P_{n,v^c}), W) - Q_{P_{n,v}}(a_L(P_{n,v^c}), W) - \Psi_{P_{n,v^c}}(P_{n,v})
\]

where QPn,v and gPn,v indicate that these functions were estimated on the
estimation sample Pn,v.

5. Derive the estimated sample variance of the influence curve: σ2n,v ≡ Pn,v ICPn,vc(Pn,v)2. (A code sketch of steps 4 and 5 appears after this list.)

6. Repeat steps 1–5 above for every split of the sample into parameter-generating and estimation samples to get the entire set of parameter and variance estimates, (ΨˆPn,vc(Pn,v), σ2n,v), v = 1, . . . , V.

7. Average the estimates to get the estimate (Equation 24.4) of the average
parameter: ψn = Ave{Ψˆ Pn,vc (Pn,v)}, as well as the average estimated variances,
σn2 = Ave{σ2n,v}.

8. Derive confidence intervals and p-values using the quantities calculated in step 7.

9. Repeat steps 1–8 for every variable to be considered as A, switching the roles of the current A and one of the components of W.

10. Based on the p-values of the tests of the null H0 : ψn,0 ≤ 0, adjust for multiple comparisons by controlling the false discovery rate (FDR) [19].
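The following sketch illustrates steps 4 and 5 for a single estimation sample. The fitted objects Q_hat and g_hat (callables returning the predicted outcome regression and treatment probabilities) are placeholders for the SuperLearner/TMLE and glmnet fits described above, and the variable names are ours.

```python
import numpy as np

def vim_ic_and_variance(Q_hat, g_hat, A, W, Y, a_L, a_H, psi_hat):
    """Estimated influence curve of the VIM comparing A = a_H versus A = a_L on an
    estimation sample (step 4), and its sample variance (step 5).

    Q_hat(a, W) -> predicted E(Y | A = a, W);  g_hat(a, W) -> predicted P(A = a | W).
    A, W, Y     : treatment (integer-coded levels), covariates, outcome on the estimation sample.
    psi_hat     : the point estimate from step 3 on this estimation sample.
    """
    Q_A = Q_hat(A, W)                                    # predictions at the observed A
    Q_H = Q_hat(np.full_like(A, a_H), W)
    Q_L = Q_hat(np.full_like(A, a_L), W)
    weight = (A == a_H) / g_hat(a_H, W) - (A == a_L) / g_hat(a_L, W)
    ic = weight * (Y - Q_A) + (Q_H - Q_L) - psi_hat
    return ic, np.mean(ic ** 2)
```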

The result is a list of the VIMs (the ψn), ordered by statistical significance, with inference adjusted for multiple comparisons, along with information about the estimates and the levels (aL(Pn,vc), aH(Pn,vc)) chosen for each combination of training and estimation samples.


24.4.1 Results

The names of the potential predictors along with the type of variable (ordered or unordered
factor) are listed in Table 24.1. The table is ordered by the p-value of the test of H0 : ψn,0 ≤ 0, so the most significant estimates are at the top. There were 108 original variables examined, but only
72 with sufficient data to estimate a VIM. Table 24.2 has the estimation sample-specific
results for the same ordered list. The definition of the (aL(Pn,vc ), aH (Pn,vc )) for the two
training samples is related to the original discretization, so, for instance, if it states that
aL(Pn,vc ) is (x, 1], then aL(Pn,vc ) is the indicator of being in the lowest decile. One can
see some shortcomings of the data from these results. The outcome is rare enough, that

TABLE 24.1
ACIT variables and description—note variables are ordered by statistical significance.

Name Var. type Description

tbi Factor Traumatic brain injury
hr6 basedefexc Ordered Hour 6 base deficit/excess
ortho Factor Orthopedic injury
alcabuse Factor Alcohol use
race Factor Race
hr0 factorviii Ordered Hour 0 factor VIII
hr0 ptt Ordered Hour 0 partial thromboplastin time
heightcm Ordered Height
edworsttemp Ordered ED worst temp
aisface2 Ordered Abbreviated injury scale: face
hr0 factorv Ordered Hour 0 factor V
hr0 atiii Ordered Hour 0 antithrombin III
aisextremity5 Ordered Abbreviated injury scale: extremity
male Factor Gender male?
edlowestsbp Ordered ED lowest SBP
latino Factor Latino
pbw Ordered Predicted body weight
edlowesthr Ordered ED lowest HR
hr0 temp Ordered Hour 0 temperature
blunt Factor Mechanism of injury—blunt?
hr0 map Ordered Hour 0 mean arterial pressure
aisabdomen4 Ordered Abbreviated injury scale: abdomen
iss Ordered Injury severity score
hr0 resprate Ordered Hour 0 respiratory rate
numribfxs Ordered Number of rib fractures
hr0 factorx Ordered Hour 0 factor X
patientbloodtype Factor Patient blood type
edadmittemp Ordered ED admit temp
edhighestsbp Ordered ED highest SBP
age Ordered AGE at time of injury
insurancesource Factor Insurance source
mechtype Factor Mechanism type
hr0 basedefexc Ordered Hour 0 base deficit/excess
hr0 ph Ordered Hour 0 pH
ali Factor Acute lung injury
hr0 pc Ordered Hour 0 protein C
hr0 factorix Ordered Hour 0 factor IX
edworstrr Ordered ED worst RR
aischest3 Ordered Abbreviated injury scale: chest

TABLE 24.2
ACIT variable importance results by the estimation sample.

Predictor Ψˆ Pn,(v=1)c Ψˆ Pn,(v=2)c aL(Pn,(v=1)c ) aH (Pn,(v=1)c ) aL(Pn,(v=2)c ) aH (Pn,(v=2)c )

tbi 0.0636 0.1130 No Yes No Yes
hr6 basedefexc 0.1283 0.2141 (1,10.1] (0.95,1] (1,10] (0.9,1]
ortho 0.0300 0.0554 Yes No Yes No
alcabuse 0.0565 0.0588 No Unknown No Unknown
race 0.0961 0.0407 White Asian White Asian
hr0 factorviii 0.1753 0.0947 (1,10] (0.9,1] (1,10.1] (0.9,1]
hr0 ptt 0.0895 0.0431 (1,9] (9,10] (0.9,6] (6,10]
heightcm 0.0800 0.0156 (1,10] (0.9,1] (1,10.1] (0.9,1]
edworsttemp 0.0506 0.0582 (1,10.1] (0.95,1] (3,10.1] (0.9,1]
aisface2 0.0085 0.0266 (1,4] (0.97,1] (1,4] (0.97,1]
hr0 factorv 0.0166 0.059 (1,10] (0.9,1] (1,10.1] (0.9,1]
hr0 atiii 0.0159 0.0322 (1,10] (0.9,1] (1,10] (0.9,1]
aisextremity5 0.0164 0.0189 (1,5] (0.96,1] (1,5] (0.96,1]
male 0.0648 −0.0159 Male Female Male Female
edlowestsbp 0.0334 0.0311 (1,10] (0.9,1] (1,10] (0.9,1]
latino 0.019 0.0154 Yes No Yes No
pbw 0.0577 0.0107 (1,10] (0.9,1] (1,10.1] (0.9,1]
edlowesthr 0.0317 0.0557 (1,10.1] (0.9,1] (1,10] (0.9,1]
hr0 temp 0.029 0.0401 (1,10.1] (0.9,1] (1,10.1] (0.9,1]
blunt 0.0293 −0.0018 Penetrating Blunt Penetrating Blunt
hr0 map 0.0506 0.0053 (0.9,1] (9,10] (0.9,1] (1,10]
aisabdomen4 0.0159 0.0059 (1,4] (0.97,1] (1,4] (0.97,1]
iss 0.0329 0.0634 (0.9,5] (5,9] (1,7] (7,9]
hr0 resprate −0.0059 0.0268 (1,10] (0.9,1] (2,10] (1,2]
numribfxs 0.0179 0.0103 (1,4] (0.97,1] (1,4] (0.97,1]
hr0 factorx 0.0461 −0.0087 (1,10] (0.9,1] (6,10] (0.9,1]
patientbloodtype 0.0219 0.0298 A+ A− A+ A−
edadmittemp −0.0008 0.0220 (1,10.1] (0.9,1] (3,10.1] (1,3]
edhighestsbp 0.0120 0.0150 (1,9] (9,10] (1,10] (0.9,1]
age −0.0095 0.0280 (1,10] (0.9,1] (1,9] (9,10]
insurancesource 0.0235 −0.0124 No insurance Medical No insurance Medicare
mechtype 0.0801 −0.0328 PVA Found down Found down PVA
hr0 basedefexc 0.0083 0.0095 (1,10.1] (0.9,1] (1,10] (0.9,1]
hr0 ph 0.0509 −0.0316 (1,10] (0.9,1] (0.9,1] (1,10]
ali −0.0275 0.0116 No Yes Yes No
hr0 pc 0.0058 −0.0477 (1,10] (0.9,1] (0.9,1] (1,10]
hr0 factorix −0.0293 −0.0192 (1,10] (0.9,1] (0.9,1] (1,10]
edworstrr −0.0106 −0.0644 (1,9.1] (0.9,1] (0.9,1] (1,9]
aischest3 −0.0270 −0.0326 (1,5.1] (0.97,1] (0.96,1] (1,5]


given the modest sample size, for some variables there is inconsistency in how (aL, aH) are
defined from one training sample to the next. This partially explains why the estimates of the VIM also vary, as they are comparing very different groups. We use this to subset the results further below.

The final estimates, ψn, with confidence intervals and both raw and adjusted (for
multiple testing by controlling FDR) p-values are shown in Table 24.3. We subset this table
by statistical significance using the FDR cutoff of 0.1, as well as exclude those variables for
which (aL, aH) were inconsistent across training samples (either not the same for the factors
or in opposite directions for ordered variables), resulting in Table 24.4. This list indicates

TABLE 24.3
ACIT variable importance results for combined estimates.

Name ψn 95% CI p-value FDR (Q-value)
tbi 0.0883 (0.0577 − 0.119) 0.0000 0.0000
hr6 basedefexc 0.1712 (0.102 − 0.241) 0.0000 0.0000
ortho 0.0427 (0.0254 − 0.06) 0.0000 0.0000
alcabuse 0.0576 (0.0331 − 0.0822) 0.0000 0.0000
race 0.0684 (0.0257 − 0.111) 0.0008 0.0065
hr0 factorviii 0.1350 (0.0379 − 0.232) 0.0032 0.0210
hr0 ptt 0.0663 (0.0154 − 0.117) 0.0053 0.0297
heightcm 0.0478 (0.00509 − 0.0905) 0.0141 0.0689
edworsttemp 0.0544 (0.000437 − 0.108) 0.0241 0.1044
aisface2 0.0176 (−0.000284 − 0.0354) 0.0269 0.1048
hr0 factorv 0.0382 (−0.00414 − 0.0806) 0.0385 0.1146
hr0 atiii 0.0240 (−0.00264 − 0.0507) 0.0387 0.1146
aisextremity5 0.0177 (−0.0022 − 0.0376) 0.0406 0.1146
male 0.0244 (−0.00355 − 0.0524) 0.0435 0.1146
edlowestsbp 0.0323 (−0.00502 − 0.0695) 0.0450 0.1146
latino 0.0176 (−0.00351 − 0.0388) 0.0510 0.1146
pbw 0.0342 (−0.00729 − 0.0757) 0.0531 0.1146
edlowesthr 0.0437 (−0.0101 − 0.0976) 0.0557 0.1146
hr0 temp 0.0350 (−0.00812 − 0.0781) 0.0558 0.1146
blunt 0.0137 (−0.00728 − 0.0347) 0.1001 0.1952
hr0 map 0.0279 (−0.0177 − 0.0736) 0.1154 0.2143
aisabdomen4 0.0109 (−0.0112 − 0.033) 0.1664 0.2950
iss 0.0481 (−0.0578 − 0.154) 0.1866 0.3165
hr0 resprate 0.0105 (−0.0163 − 0.0373) 0.2218 0.3604
numribfxs 0.0141 (−0.0266 − 0.0548) 0.2485 0.3745
hr0 factorx 0.0187 (−0.0368 − 0.0741) 0.2546 0.3745
patientbloodtype 0.0259 (−0.0527 − 0.104) 0.2593 0.3745
edadmittemp 0.0106 (−0.0249 − 0.0461) 0.2793 0.3890
edhighestsbp 0.0135 (−0.0351 − 0.0621) 0.2931 0.3941
age 0.0093 (−0.0281 − 0.0466) 0.3134 0.4074
insurancesource 0.0056 (−0.0246 − 0.0358) 0.3583 0.4455
mechtype 0.0236 (−0.111 − 0.158) 0.3655 0.4455
hr0 basedefexc 0.0089 (−0.0521 − 0.0699) 0.3875 0.4579
hr0 ph 0.0096 (−0.0772 − 0.0965) 0.4138 0.4747
ali −0.0080 (−0.0353 − 0.0193) 0.7161 0.7980
hr0 pc −0.0209 (−0.0571 − 0.0153) 0.8713 0.9405
hr0 factorix −0.0243 (−0.0627 − 0.0141) 0.8922 0.9405
edworstrr −0.0375 (−0.0767 − 0.00172) 0.9695 0.9951
aischest3 −0.0298 (−0.0469 − −0.0127) 0.9997 0.9997


TABLE 24.4
Subset of significant and consistent results.

Name ψn 95% CI
tbi 0.0883 (0.0577 − 0.119)
hr6 basedefexc 0.1712 (0.102 − 0.241)
ortho 0.0427 (0.0254 − 0.06)
alcabuse 0.0576 (0.0331 − 0.0822)
race 0.0684 (0.0257 − 0.111)
hr0 factorviii 0.1350 (0.0379 − 0.232)
hr0 ptt 0.0663 (0.0154 − 0.117)

both variables that are thought to be important for their association with coagulopathy
(e.g., Base Deficit, Factor VIII, and partial thromboplastin time), and injury characteristics
that either are relatively more (traumatic brain injury) or less (orthopedic injury) likely to
lead to death. One variable (alcohol use) appears to be chosen because a label of Unknown
is potentially a proxy for the seriousness of the condition of the patient (i.e., the inability to
collect the data at baseline could indicate a more serious patient at greater risk of death). It
is also noted that, for the known potential indicators of coagulopathy, which our procedure generally estimates to have high importance, the algorithm also chooses the relative levels (i.e., the high- versus low-risk levels) in a way consistent with the existing understanding of their associations with the potential for uncontrolled bleeding. Specifically, current understanding suggests that smaller values of base deficit (hr6 basedefexc) and factor VIII (hr0 factorviii) result in higher risk of poor outcomes, whereas the association is expected to be in the opposite direction (higher value is higher risk) for partial thromboplastin time (hr0 ptt). Finally, the magnitude of the estimates ψn indicates that these variables are potentially very important for predicting death; for instance, the estimate for hr6 basedefexc, comparing the high-risk (lower value) to the low-risk (higher value) level, suggests a nearly 20% increase in the probability of death. This is particularly large given that the proportion of subjects who died in the data was only 3.5%. In any case, the results are consistent with the current understanding of survival and the dangers of coagulopathy [20].

24.5 Concluding Remarks

Much scientific progress has been achieved by generating target parameters based on past
studies and evaluating them in future studies. However, such costly splitting of a stream
of data is by no means necessary, and the proposed data-adaptive target parameter and the
corresponding statistical procedure studied in this chapter allow for general sample splits,
and averaging the results across such splits. Our formal results show that statistical inference
is preserved under minimal conditions, even though the estimators are now based on all the
data. To obtain valid finite sample inference, it is important to utilize our corresponding
variance estimator (Equation 24.5), and that the sample size for the estimation sample is
chosen large enough so that the second-order terms of a possible nonlinear estimator are
controlled. We also note that one can often gain greater efficiency by using the related
CV-TMLE procedure [21].

We have demonstrated that the data-adaptive target parameter framework provides a
formalized approach for estimating target parameters that are either very hard or impossible
to pre-specify. There are many examples of interest that have not been highlighted
in this chapter, where the motivation can come from dimension reduction, or complex


causal parameters. In our example, we derived comparable and interpretable VIMs, where
comparison groups defined by the variable of interest are data-adaptively determined by
a search algorithm using a combination of machine learning and causal inference. This
situation is particularly prone to overfitting, and thus the estimator that uses the same data to both define the parameter and estimate it probably does not satisfy the conditions of Theorem 3 in [2]. It is easy to propose other situations in a big data context where one would like to use similar types of aggressive data-mining procedures to implicitly define the parameter while still deriving a consistent estimator and accurate inference.
methodology presented here puts few constraints on how one uses the data to define
interesting parameters, and we expect that there are many applications in big data situations
for which this approach is particularly well suited.

Acknowledgments

This work was partially supported through a Patient-Centered Outcomes Research Institute
(PCORI) Pilot Project Program Award (ME-1306-02735).

DISCLAIMER: All statements in this report, including its findings and conclusions,
are solely those of the authors and do not necessarily represent the views of the Patient-
Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology
Committee.

References

1. Sebastian Schneeweiss. Learning from big health care data. New England Journal of
Medicine, 370(23):2161–2163, 2014.

2. Mark J. van der Laan, Alan E. Hubbard, and Sara K. Pajouh. Statistical inference
for data adaptive target parameters. Technical Report 314, Division of Biostatistics
Working Paper Series, University of California, Berkeley, CA, 2013.

3. Mitchell J. Cohen, Karim Brohi, Carolyn S. Calfee, Pamela Rahn, Brian B. Chesebro,
Sarah C. Christiaans, Michel Carles et al. Early release of high mobility group box
nuclear protein 1 after severe trauma in humans: Role of injury severity and tissue
hypoperfusion. Critical Care, 13(6):R174, 2009.

4. Mitchell J. Cohen, Natasha Bir, Pamela Rahn, Rachel Dotson, Karim Brohi, Brian B.
Chesebro, Robert Mackersie et al. Protein c depletion early after trauma increases the
risk of ventilator-associated pneumonia. Journal of Trauma, 67(6):1176–1181, 2009.

5. Ulrike Grömping. Variable importance assessment in regression: Linear regression versus
random forest. The American Statistician, 63(4):308–319, 2009.

6. Mark J. van der Laan. Statistical inference for variable importance. The International
Journal of Biostatistics, 2(1), 2006, doi:10.2202/1557-4679.1008.

7. Stephan J. Ritter, Nicholas P. Jewell, and Alan E. Hubbard. R Package multiPIM:
A causal inference approach to variable importance analysis. Journal of Statistical
Software, 57(8):1–29, 2014.


8. Alan Hubbard, Ivan D. Munoz, Anna Decker, John B. Holcomb, Martin A. Schreiber,
Eileen M. Bulger, Karen J. Brasel et al. Time-dependent prediction and evaluation of
variable importance using superlearning in high-dimensional clinical data. Journal of
Trauma-Injury, Infection, and Critical Care, 75(1):S53–S60, 2013.

9. John P.A. Ioannidis. Why most discovered true associations are inflated. Epidemiology,
19(5):640–648, 2008.

10. David I. Broadhurst and Douglas B. Kell. Statistical strategies for avoiding false
discoveries in metabolomics and related experiments. Metabolomics, 2(4):171–196, 2006.

11. Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for
Observational and Experimental Data. Springer, New York, 2011.

12. James M. Robins, Miguel A. Hernan, and Babette Brumback. Marginal structural
models and causal inference in epidemiology. Epidemiology, 11:550–560, 2000.

13. Mark J. van der Laan and Daniel Rubin. Targeted maximum likelihood learning.
International Journal of Biostatistics, 2(1):11, 2006.

14. Yves Rozenholc, Thoralf Mildenberger, and Ursula Gather. Combining regular and
irregular histograms by penalized likelihood. Computational Statistics & Data Analysis,
54(12):3313–3323, 2010.

15. Thoralf Mildenberger, Yves Rozenholc, and David Zasada. Histogram: Construction of
regular and irregular histograms with different options for automatic choice of bins.
R package version 0.0-23, 2009.

16. Mark J. van der Laan and Katherine S. Pollard. Hybrid clustering of gene expression
data with visualization and the bootstrap. Journal of Statistical Planning and Inference,
117:275–303, 2003.

17. Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical
Applications in Genetics and Molecular Biology, 6:25, 2007.

18. Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for gener-
alized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22,
2010.

19. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society,
57:289–300, 1995.

20. Jana B.A. MacLeod, Mauricio Lynn, Mark G. McKenney, Stephen M. Cohn, and Mary
Murtha. Early coagulopathy predicts mortality in trauma. Journal of Trauma-Injury,
Infection, and Critical Care, 55(1):39–44, 2003.

21. Wenjing Zheng and Mark J. van der Laan. Asymptotic theory for cross-validated
targeted maximum likelihood estimation. Technical report 273, Division of Biostatis-
tics, University of California, Berkeley, CA, http://biostats.bepress.com/ucbbiostat/
paper273, 2010.
