Chap T. Le, Lynn E. Eberly - Introductory Biostatistics-Wiley (2016)


CONDITIONAL LOGISTIC REGRESSION 479

After $\hat{\beta}_j$ and its standard error $\mathrm{SE}(\hat{\beta}_j)$ have been obtained, a 95% confidence interval for the odds ratio above is given by:

$$\exp\left[\hat{\beta}_j \pm 1.96\,\mathrm{SE}(\hat{\beta}_j)\right].$$
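As a quick numerical check, the interval can be computed directly from a fitted coefficient and its standard error. The sketch below (Python, for illustration only; the chapter's own examples use SAS and R) plugs in the uterine irritability estimates from Table 13.17 ($\hat{\beta} = 2.1376$, SE = 1.1985); the helper name `or_ci` is ours, not from any package.

```python
import math

def or_ci(beta_hat, se, z=1.96):
    """95% confidence interval for an odds ratio, given the estimated
    log-odds coefficient and its standard error."""
    return (math.exp(beta_hat - z * se), math.exp(beta_hat + z * se))

# Uterine irritability in Table 13.17: coefficient 2.1376, SE 1.1985
lo, hi = or_ci(2.1376, 1.1985)
# The interval is very wide because the standard error is large;
# it also covers 1, consistent with the marginal p-value for this factor.
```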

These results are necessary in the effort to identify important risk factors in matched
designs. Of course, before such analyses are done, the problem and the data have to
be examined carefully. If some of the variables are highly correlated, one or a few of
the correlated factors are likely to be as good at prediction as all of them; information
from similar studies also has to be incorporated to inform the decision of whether to
drop some of these correlated explanatory variables. The use of products such as $X_1X_2$ and higher‐power terms such as $X_1^2$ may be necessary and can improve the goodness of fit (unfortunately, it is very difficult to tell). It is important to note that
we are assuming a linear regression model in which, for example, the odds ratio due to a 1‐unit higher value of a continuous $X_j$ ($X_j = x + 1$ versus $X_j = x$) is independent of $x$. Therefore, if this linearity seems to be violated (again, it is very difficult to tell; the only easy way is fitting a polynomial model as seen in a later example), the incorporation of powers of $X_j$ should be considered seriously. The use of products will
help in the investigation of possible effect modifications. Finally, there is the messy
problem of missing data; most software programs will delete a subject if one or more
covariate values are missing.

Testing Hypotheses in Multiple Regression  Once we have fit a multiple conditional
logistic regression model and obtained estimates for the various parameters of
interest, we want to answer questions about the contributions of various factors to the
prediction of the binary response variable using matched designs. There are three
types of such questions:

1. Overall test. Taken collectively, does the entire set of explanatory or independent
variables contribute significantly to the prediction of response?

2. Test for the value of a single factor. Does the addition of one particular variable
of interest add significantly to the prediction of response over and above that
achieved by other independent variables?

3. Test for contribution of a group of variables. Does the addition of a group of
variables add significantly to the prediction of response over and above that
achieved by other independent variables?

Overall Regression Test  We now consider the first question stated above concerning
an overall test for a model containing J factors. The null hypothesis for this test may
be stated as: “All J independent variables considered together do not explain the var-
iation in response any more than the size alone.” In other words,

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_J = 0.$$

Three statistics can be used to test this global null hypothesis; each has an asymptotic chi‐square distribution with J degrees of freedom under $H_0$: the likelihood ratio test,


480 ANALYSIS OF SURVIVAL DATA AND DATA FROM MATCHED STUDIES

Wald’s test, and the score test. All three statistics are provided by most standard com-
puter programs such as SAS and R, and they are asymptotically equivalent (i.e., for
very large sample sizes), yielding identical statistical decisions most of the time.
However, Wald’s test is used much less often than the other two.

Example 13.13
Refer to the data for low birth weight babies in Example 13.11 (Table 13.14). With all
four covariates, we have the following test statistics for the global null hypothesis:

1. Likelihood ratio test:

   $\chi^2_{LR} = 9.530$ with 4 df; $p = 0.0491$.

2. Wald's test:

   $\chi^2_W = 6.001$ with 4 df; $p = 0.1991$.

3. Score test:

   $\chi^2_S = 8.491$ with 4 df; $p = 0.0752$.

The results indicate a weak combined explanatory power; Wald’s test is not
even significant. Very often, this means implicitly that perhaps only one or two
covariates are associated significantly with the response of interest (a weak overall
correlation).
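For an even number of degrees of freedom, the chi‐square upper‐tail probability has a closed form, so the three p‐values above can be verified without statistical tables. A small Python sketch (our own helper, standard library only):

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with an
    even number of degrees of freedom df = 2k:
    P = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!"""
    if df % 2 != 0 or df <= 0:
        raise ValueError("closed form requires positive even df")
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

p_lr = chi2_sf_even_df(9.530, 4)   # likelihood ratio test
p_w = chi2_sf_even_df(6.001, 4)    # Wald's test
p_s = chi2_sf_even_df(8.491, 4)    # score test
```

All three reproduce the reported p-values (0.0491, 0.1991, 0.0752) to the printed precision.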

Test for a Single Variable  Let us assume that we now wish to test whether the
addition of one particular independent variable of interest adds significantly to the
prediction of the response over and above that achieved by factors already present in
the model (usually after seeing a significant result for the global hypothesis above).
The null hypothesis for this single‐variable test may be stated as: “Factor Xj does not
have any value added to the prediction of the response given that other factors are
already included in the model.” In other words,

$$H_0: \beta_j = 0.$$

To test such a null hypothesis, one can use

$$z_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)},$$

where $\hat{\beta}_j$ is the corresponding estimated regression coefficient and $\mathrm{SE}(\hat{\beta}_j)$ is the estimate of its standard error, both of which are printed by standard computer programs such as SAS and R. In performing this test, we refer the value of the z score to percentiles of the standard normal distribution; for example, we compare the absolute value of z to 1.96 for a two‐sided test at the 5% level.
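The two‐sided p‐value for an observed z can be computed from the standard normal distribution via the complementary error function, since $P(|Z| > |z|) = \mathrm{erfc}(|z|/\sqrt{2})$. A stdlib‐only Python sketch, using the mother's‐weight z statistic from Table 13.17 as the example:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z score: P(|Z| > |z|) under N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2))

p_weight = two_sided_p(-1.673)   # mother's weight, Table 13.17
p_cutoff = two_sided_p(1.96)     # the familiar 5% threshold
```

This gives about 0.0943 for the first value, matching the table's 0.0942 up to rounding of the coefficient and standard error.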



Table 13.17

Variable               Coefficient   Standard error   z Statistic   p Value
Mother’s weight          −0.0191        0.0114          −1.673       0.0942
Smoking                  −0.0885        0.8618          −0.103       0.9182
Hypertension              0.6325        1.1979           0.528       0.5975
Uterine irritability      2.1376        1.1985           1.784       0.0745

Example 13.14
Refer to the data for low birth weight babies in Example 13.11 (Table 13.14). With
all four covariates, we have the results shown in Table  13.17. Only the mother’s
weight (p = 0.0942) and uterine irritability (p = 0.0745) are marginally significant. In
fact, these two variables are highly correlated: that is, if one is deleted from the
model, the other would become more significant. SAS and R code are provided after
Example 13.18.

The overall tests and the tests for single variables are implemented simultaneously
using the same computer program, and here is another example.

Example 13.15
Refer to the data for vaginal carcinoma in Example 13.10 (Table 13.13). An applica-
tion of a conditional logistic regression analysis yields the following results:

1. Likelihood ratio test for the global hypothesis:

   $\chi^2_{LR} = 9.624$ with 2 df; $p = 0.0081$.

2. Wald's test for the global hypothesis:

   $\chi^2_W = 6.336$ with 2 df; $p = 0.0421$.

3. Score test for the global hypothesis:

   $\chi^2_S = 11.860$ with 2 df; $p = 0.0027$.

For individual covariates, we have the results shown in Table 13.18.

In addition to a priori interest in the effects of individual covariates, given a continuous
variable of interest, one can fit a polynomial model and use this type of test to check
for linearity. It can also be used to check for a single product representing an effect
modification.

Example 13.16
Refer to the data for low birth weight babies in Example 13.11 (Table 13.14), but this
time we investigate only one covariate, the mother’s weight. After fitting the second‐
degree polynomial model, we obtained a result which indicates that the curvature
effect is negligible (p = 0.9131).



Table 13.18

Variable          Coefficient   Standard error   z Statistic   p Value
Bleeding            1.6198         1.3689           1.183       0.2367
Pregnancy loss      1.7319         0.8934           1.938       0.0526

Contribution of a Group of Variables  This testing procedure addresses the more
general problem of assessing the additional contribution of two or more factors to the
prediction of the response over and above that made by other variables already in the
regression model. In other words, the null hypothesis is of the form

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0.$$

To test such a null hypothesis, one can perform a likelihood ratio chi‐square test with m df:

$$\chi^2_{LR} = 2\left[\ln L(\hat{\beta};\ \text{all } X\text{s}) - \ln L(\hat{\beta};\ \text{all other } X\text{s, with the } m\ X\text{s under investigation omitted})\right].$$

As with the tests above for individual covariates, this multiple contribution
procedure is very useful for assessing the importance of potential explanatory vari-
ables. In particular, it is often used to test whether a similar group of variables, such
as demographic characteristics, is important for prediction of the response; these var-
iables have some trait in common. Another application would be a collection of
powers and/or product terms (referred to as interaction variables). It is often of
interest to assess the interaction effects collectively before trying to consider individual
interaction terms in a model, as suggested previously. In fact, such use reduces the
total number of tests to be performed, and this, in turn, helps to provide better control
of overall type I error rates, which may be inflated due to multiple testing.

Example 13.17
Refer to the data for low birth weight babies in Example 13.11 (Table 13.14). With
all four covariates, we consider collectively three interaction terms: mother’s weight
× smoking, mother’s weight × hypertension, mother’s weight × uterine irritability.
The basic idea is to see if any of the other variables would modify the effect of the
mother’s weight on the response (having a low birth weight baby).

1. With the original four variables, we obtained ln L = −16.030.

2. With all seven variables, four original plus three products, we obtained ln L =
−14.199.

Therefore, we have

$$\chi^2_{LR} = 2\left[\ln L(\hat{\beta};\ \text{seven variables}) - \ln L(\hat{\beta};\ \text{four original variables})\right] = 2[(-14.199) - (-16.030)] = 3.662; \quad 3\ \text{df},\ p\ \text{value} > 0.10,$$

indicating a rather weak level of interactions.
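The arithmetic of this likelihood ratio test is easy to reproduce. The sketch below (Python; `chi2_sf_3df` is our own closed‐form helper for 3 degrees of freedom, not a library routine) recomputes the statistic from the two log‐likelihoods and confirms that p is well above 0.10:

```python
import math

lnL_reduced = -16.030  # four original variables
lnL_full = -14.199     # four variables plus three products

lr = 2 * (lnL_full - lnL_reduced)  # likelihood ratio chi-square, 3 df

def chi2_sf_3df(x):
    """Upper-tail P(X > x) for a chi-square with 3 df (closed form):
    P = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = chi2_sf_3df(lr)  # about 0.30, a clearly nonsignificant interaction set
```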



Stepwise Regression  In many applications our major interest is to identify important
risk factors. In other words, we wish to identify from many available factors a small
subset of factors that relate significantly to the outcome (e.g., the disease under inves-
tigation). In that identification process, of course, we wish to avoid a large type I
(false positive) error. In a regression analysis, a type I error corresponds to including
a predictor that has no real relationship to the outcome; such an inclusion can greatly
confuse interpretation of the regression results. In a standard multiple regression
analysis, this goal can be achieved by using a strategy that adds to or removes from a
regression model one factor at a time according to a certain order of relative impor-
tance. Therefore, the two important steps are as follows:

1. Specify a criterion or criteria for selecting a model.
2. Specify a strategy for applying the criterion or criteria chosen.

The process follows the outline of Chapter 10 for logistic regression, combining
the forward selection and backward elimination in the stepwise process, with selection at each step based on the likelihood ratio chi‐square test. SAS’s PROC PHREG does have an automatic stepwise option to implement these features, but R’s coxph function does not; pass the coxph results to the function stepAIC instead to do stepwise selection based on AIC values.

Example 13.18
Refer to the data for low birth weight babies in Example 13.11 (Table 13.14) with all
four covariates: mother’s weight, smoking, hypertension, and uterine irritability. This
time we perform a stepwise regression analysis in which we specify that a variable
has to be significant at the 0.1 level before it can enter into the model and that a var-
iable in the model has to be significant at 0.15 for it to remain in the model (most
standard computer programs allow users to make these selections; default values are
available). First, we get the individual test results for all variables (Table  13.19).
These indicate that uterine irritability is the most significant variable.

•• Step 1: Variable uterine irritability is entered. Analysis of variables not in the
model yields the results shown in Table 13.20.

•• Step 2: Variable mother’s weight is entered. Analysis of variables in the model
yields Table 13.21. Neither variable is removed. Analysis of variables not in the
model yields Table 13.22. No (additional) variables meet the 0.1 level for entry
into the model.

Table 13.19

Variable               Score χ²   p Value
Mother’s weight         3.9754    0.0462
Smoking                 0.0       1.0
Hypertension            0.2857    0.5930
Uterine irritability    5.5556    0.0184



Table 13.20

Variable          Score χ²   p Value
Mother’s weight    2.9401    0.0864
Smoking            0.0027    0.9584
Hypertension       0.2857    0.5930

Table 13.21

Factor                 Coefficient   Standard error   z Statistic   p Value
Mother’s weight          −0.0192        0.0116          −1.655       0.0978
Uterine irritability      2.1410        1.1983           1.787       0.0740

Table 13.22

Variable       Score χ²   p Value
Smoking         0.0840    0.7720
Hypertension    0.3596    0.5487

Note: An SAS program would include these instructions:

PROC PHREG DATA=LOWWEIGHT;
  MODEL DUMMYTIME*CASE(0) = MWEIGHT SMOKING HYPERT UIRRIT
    / SELECTION=STEPWISE SLENTRY=0.10 SLSTAY=0.15;
  STRATA SET;
RUN;

The default values for SLENTRY (p value to enter) and SLSTAY (p value to stay) are
0.05 and 0.1, respectively. The R program of Example 13.12 would be modified to
include the following instructions:

# library(survival) provides coxph(), Surv(), and strata()
library(survival)

all.cox = coxph(Surv(Dummytime, Case) ~ MotherWeight +
                  Hypertension + Smoking + Uirritability +
                  strata(MatchedSet))
summary(all.cox)

# stepAIC() from MASS performs the stepwise selection by AIC
library(MASS)
stepAIC(all.cox)

EXERCISES

Electronic copies of some data files are available at www.wiley.com/go/Le/Biostatistics.
13.1 Given the small data set
9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+

calculate and graph the Kaplan–Meier curve.
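The Kaplan–Meier estimate for a data set like this can be computed by hand or with a short program. Below is a minimal Python sketch (the function name `kaplan_meier` is ours) that encodes each observation as (time, event) with event = 1 for a death and 0 for a censored value (the “+” entries), applying the usual convention that deaths at a time t precede censorings at t:

```python
from collections import Counter

def kaplan_meier(obs):
    """Product-limit estimate. obs: list of (time, event) pairs,
    event = 1 for an observed death, 0 for a censored time.
    Returns [(t, S(t))] at each death time."""
    deaths = Counter(t for t, e in obs if e == 1)
    at_risk = len(obs)
    surv, curve = 1.0, []
    for t in sorted({t for t, _ in obs}):
        d = deaths.get(t, 0)
        if d:
            surv *= (at_risk - d) / at_risk
            curve.append((t, surv))
        at_risk -= sum(1 for tt, _ in obs if tt == t)  # deaths and censorings leave
    return curve

# Exercise 13.1 data: 9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
data = [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0),
        (31, 1), (34, 1), (45, 0), (48, 1), (161, 0)]
curve = kaplan_meier(data)
```

The estimate steps down at each death time; for these data the curve ends at S(48) ≈ 0.184, and graphing is just a step plot of `curve`.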



13.2 A group of 12 hemophiliacs, all under 41 years of age at the time of HIV
seroconversion, were followed from primary AIDS diagnosis until death (ideally,
we should take as a starting point the time at which a person contracts
HIV rather than the time at which the patient is diagnosed, but this information
is not available). Survival times (in months) from diagnosis until death of
these hemophiliacs were: 2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, and 32. Calculate
and graph the Kaplan–Meier curve.

13.3 Suppose that we are interested in studying patients with systemic cancer who
subsequently develop a brain metastasis; our ultimate goal is to prolong their
lives by controlling the disease. A sample of 23 such patients, all of whom
were treated with radiotherapy, were followed from the first day of their
treatment until recurrence of the original tumor. Recurrence is defined as the
reappearance of a metastasis in exactly the same site, or in the case of patients
whose tumor never completely disappeared, enlargement of the original
lesion. Times to recurrence (in weeks) for the 23 patients were: 2, 2, 2, 3, 4, 5,
5, 6, 7, 8, 9, 10, 14, 14, 18, 19, 20, 22, 22, 31, 33, 39, and 195. Calculate and
graph the Kaplan–Meier curve.

13.4 A laboratory investigator interested in the relationship between diet and the
development of tumors divided 90 rats into three groups and fed them low‐fat,
saturated‐fat, and unsaturated‐fat diets, respectively. The rats were of the same
age and species and were in similar physical condition. An identical amount
of tumor cells was injected into a foot pad of each rat. The tumor‐free time is
the time from injection of tumor cells to the time that a tumor develops; all 30
rats in the unsaturated‐fat diet group developed tumors; tumor‐free times (in
days) were: 112, 68, 84, 109, 153, 143, 60, 70, 98, 164, 63, 63, 77, 91, 91, 66,
70, 77, 63, 66, 66, 94, 101, 105, 108, 112, 115, 126, 161, and 178. Calculate
and graph the Kaplan–Meier curve.

13.5 Data are shown in Table 13.3 for two groups of patients who died of acute
myelogenous leukemia (see Example 13.3). Patients were classified into the
two groups according to the presence or absence of a morphologic characteristic
of white cells. Patients termed AG positive were identified by the presence of
Auer rods and/or significant granulature of the leukemic cells in the bone
marrow at diagnosis. For AG‐negative patients these factors were absent.
Leukemia is a cancer characterized by an overproliferation of white blood
cells; the higher the white blood count (WBC), the more severe the disease.
Calculate and graph in the same figure the two Kaplan–Meier curves (one for
AG‐positive patients and one for AG‐negative patients). How do they
compare?

13.6 In Exercise 13.4 we described a diet study, and tumor‐free times were given
for the 30 rats fed an unsaturated‐fat diet. Tumor‐free times (days) for the
other two groups are as follows:

•• Low‐fat: 140, 177, 50, 65, 86, 153, 181, 191, 77, 84, 87, 56, 66, 73, 119,
140+, and 14 rats at 200+



•• Saturated‐fat: 124, 58, 56, 68, 79, 89, 107, 86, 142, 110, 96, 142, 86, 75,
117, 98, 105, 126, 43, 46, 81, 133, 165, 170+, and 6 rats at 200+

(140+ and 170+ were due to accidental deaths without evidence of tumor).
Calculate and graph the two Kaplan–Meier curves, one for rats fed a low‐fat
diet and one for rats fed a saturated‐fat diet. Put these two curves and the one
from Exercise 13.4 in the same figure and draw conclusions.

13.7 Consider the data shown in Table  E13.7 (analysis date, 01/90; A, alive; D,
dead). For each subject, determine the time (in months) to death (D) or to the
ending date (for survivors whose status was marked as A); then calculate and
graph the Kaplan–Meier curve.

Table E13.7

Subject   Starting   Ending   Status (A/D)
1          01/80      01/90        A
2          06/80      07/88        D
3          11/80      10/84        D
4          08/81      02/88        D
5          04/82      01/90        A
6          06/83      11/85        D
7          10/85      01/90        A
8          02/86      06/88        D
9          04/86      12/88        D
10         11/86      07/89        D

13.8 Given the small data set:

Sample 1: 24, 30, 42, 15+, 40+, 42+

Sample 2: 10, 26, 28, 30, 41, 12+

compare them using both the log‐rank and generalized Wilcoxon tests.
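A bare‐bones version of the log‐rank test can be written in a few lines. The Python sketch below (our own helper, standard library only) encodes the censored observations (15+, 40+, 42+ in sample 1 and 12+ in sample 2) with event = 0, then accumulates, at each death time, the observed deaths in sample 1, their expectation under the null, and the hypergeometric variance, forming the usual chi‐square statistic with 1 df:

```python
def logrank_chi2(sample1, sample2):
    """Log-rank chi-square (1 df) for two samples of (time, event) pairs,
    event = 1 for a death, 0 for a censored observation."""
    data = [(t, e, 0) for t, e in sample1] + [(t, e, 1) for t, e in sample2]
    obs1 = exp1 = var = 0.0
    for t in sorted({t for t, e, _ in data if e == 1}):
        d = [sum(1 for tt, e, g in data if tt == t and e == 1 and g == i)
             for i in (0, 1)]
        r = [sum(1 for tt, _, g in data if tt >= t and g == i)  # still at risk
             for i in (0, 1)]
        n, dtot = r[0] + r[1], d[0] + d[1]
        obs1 += d[0]
        exp1 += dtot * r[0] / n
        if n > 1:
            var += dtot * (r[0] / n) * (r[1] / n) * (n - dtot) / (n - 1)
    return (obs1 - exp1) ** 2 / var

s1 = [(24, 1), (30, 1), (42, 1), (15, 0), (40, 0), (42, 0)]
s2 = [(10, 1), (26, 1), (28, 1), (30, 1), (41, 1), (12, 0)]
chi2 = logrank_chi2(s1, s2)  # refer to a chi-square table with 1 df
```

For these data the statistic is about 2.88 (p ≈ 0.09 with 1 df), so the samples do not differ significantly at the 5% level; the generalized Wilcoxon test weights each death time by the number at risk but otherwise follows the same accumulation.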

13.9 Pneumocystis carinii pneumonia (PCP) is the most common opportunistic
infection in HIV‐infected patients and a life‐threatening disease. Many North
Americans with AIDS have one or two episodes of PCP during the course of
their HIV infection. PCP is a considerable factor in mortality, morbidity, and
expense, and recurrences are common. As shown in the partial data set given
in Table E13.9, we have:
•• Treatments, coded as A and B;
•• Patient characteristics: baseline CD4 count, gender (1, male; 0, female),
race (1, white; 2, black; 3, other), weight (lb), homosexuality (1, yes; 0, no);
•• PCP recurrence indicator (1, yes; 0, no), PDATE or time to recurrence
(months);
•• DIE or death indicator (1, yes; 0, no), DDATE or time to death (or to date
last seen for survivors; months).



Table E13.9

TRT   CD4   GENDER   RACE   WT    HOMO   PCP   PDATE   DIE   DDATE
B       2     1       1     142    1      1     11.9    0     14.6
B     139     1       2     117    0      0     11.6    1     11.6
A      68     1       2     149    0      0     12.8    0     12.8
A      12     1       1     160    1      0      7.3    1      7.3
B      36     1       2     157    0      1      4.5    0      8.5
B      77     1       1      12    1      0     18.1    1     18.1
A      56     1       1     158    0      0     14.7    1     14.7
B     208     1       2     157    1      0     24.0    1     24.0
A      40     1       1     122    1      0     16.2    0     16.2
A      53     1       2     125    1      0     26.6    1     26.6
A      28     1       2     130    0      1     14.5    1     19.3
A     162     1       1     124    0      0     25.8    1     25.8

Note: The full dataset is available at www.wiley.com/go/Le/Biostatistics.

Consider each of these endpoints: relapse (treating death as censoring),
death (treating relapse as censoring), and death or relapse (whichever comes
first). For each endpoint:

(a) Estimate the survival function for homosexual white men.

(b) Estimate the survival functions for each treatment.

(c) Compare the two treatments; do they differ in the short and long terms?

(d) Compare men and women.

(e) Taken collectively, do the covariates contribute significantly to predic-
tion of survival?

(f) Fit the multiple regression model to obtain estimates of individual
regression coefficients and their standard errors. Draw conclusions
concerning the conditional contribution of each factor.

(g) Within the context of the multiple regression model in part (f), does
treatment alter the effect of CD4?

(h) Focus on treatment as the primary factor, taken collectively; was this
main effect altered by any other covariates?

(i) Within the context of the multiple regression model in part (f), is the
effect of CD4 linear?

(j) Do treatment and CD4, individually, fit the proportional hazards model?

13.10 It has been noted that metal workers have an increased risk for cancer of the
internal nose and paranasal sinuses, perhaps as a result of exposure to cutting
oils. A study was conducted to see whether this particular exposure also
increases the risk for squamous cell carcinoma of the scrotum (Rousch et al.,
1982). Cases included all 45 squamous cell carcinomas of the scrotum diag-
nosed in Connecticut residents from 1955 to 1973, as obtained from the
Connecticut Tumor Registry. Matched controls were selected for each case
based on the age at death (within eight years), year of death (within three



years), and number of jobs as obtained from combined death certificate and
directory sources. An occupational indicator of metal worker (yes/no) was
evaluated as the possible risk factor in this study; results are shown in
Table E13.10.

Table E13.10

                  Controls
Cases        Exposed   Unexposed
Exposed         2          26
Unexposed       5          12

(a) Find a 95% confidence interval for the odds ratio measuring the strength
of the relationship between the disease and the exposure.

(b) Test for the independence between the disease and the exposure.
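For a pair‐matched table like Table E13.10, both quantities come from the discordant pairs only: the odds ratio is estimated by their ratio, its confidence interval is formed on the log scale, and independence is tested with McNemar's chi‐square. A Python sketch (helper names and layout are ours) using the counts b = 26 (case exposed, control unexposed) and c = 5 (the reverse):

```python
import math

b, c = 26, 5  # discordant pairs: (case exposed, control not) and the reverse

or_hat = b / c                      # matched-pairs odds ratio estimate
se_log = math.sqrt(1 / b + 1 / c)   # SE of log(OR) for matched pairs
ci = (math.exp(math.log(or_hat) - 1.96 * se_log),
      math.exp(math.log(or_hat) + 1.96 * se_log))

mcnemar = (b - c) ** 2 / (b + c)    # chi-square with 1 df
```

Here the estimated odds ratio is 5.2, the 95% interval stays above 1, and McNemar's statistic is about 14.2 (p < 0.001), pointing to a strong association between the disease and the exposure.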

13.11 Ninety‐eight heterosexual couples, at least one of whom was HIV infected,
were enrolled in an HIV transmission study and interviewed about sexual
behavior (Padian, 1990). Table E13.11 provides a summary of condom use
reported by heterosexual partners. Test to compare the reporting results bet-
ween men and women.

Table E13.11

                  Man
Woman       Ever   Never   Total
Ever         45      6      51
Never         7     40      47
Total        52     46      98

13.12 A matched case–control study was conducted to evaluate the cumulative
effects of acrylate and methacrylate vapors on olfactory function (Schwartz
et al., 1989). Cases were defined as scoring at or below the 10th percentile
on the University of Pennsylvania Smell Identification Test (UPSIT;
Table E13.12).

Table E13.12

                  Cases
Controls     Exposed   Unexposed
Exposed        25          9
Unexposed      22         21



(a) Find a 95% confidence interval for the odds ratio measuring the strength
of the relationship between the disease and the exposure.

(b) Test for the independence between the disease and the exposure.

13.13 A study in Maryland identified 4032 white persons, enumerated in a nonof-
ficial 1963 census, who became widowed between 1963 and 1974 (Helsing
and Szklo, 1981). These people were matched, one to one, to married per-
sons on the basis of race, gender, year of birth, and geography of residence.
The matched pairs were followed in a second census in 1975.

(a) We have the overall male mortality shown in Table E13.13, part a. Test
to compare the mortality of widowed men versus married men.

(b) The data for 2828 matched pairs of women are shown in Table E13.13,
part b. Test to compare the mortality of widowed women versus married
women.

Table E13.13

Part a. Data for men

                   Married men
Widowed men     Dead   Alive
Dead              2      292
Alive           210      700

Part b. Data for women

                   Married women
Widowed women   Dead   Alive
Dead              1      264
Alive           249     2314

13.14 Table E13.14 at the end of this chapter provides some data from a matched
case–control study to investigate the association between the use of x‐ray
and risk of childhood acute myeloid leukemia. In each matched set or pair,
the case and control(s) were matched by age, race, and county of residence.
The variables are:

•• Matched set (or pair);

•• Disease (1, case; 2, control);

•• Some characteristics of the child: sex (1, male; 2, female), Down’s syndrome
(a known risk factor for leukemia; 1, no; 2, yes), age;

•• Risk factors related to the use of x‐ray: MXray (mother ever had x‐ray dur-
ing pregnancy; 1, no; 2, yes), UMXray (mother ever had upper‐body x‐ray
during pregnancy; 0, no; 1, yes), LMXray (mother ever had lower‐body



Table E13.14

Matched set   Disease   Sex   Downs   Age   MXray   UMXray   LMXray   FXray   CXray   CNXray
 1               1       2      1       0     1        0        0       1       1        1
 1               2       2      1       0     1        0        0       1       1        1
 2               1       1      1       6     1        0        0       1       2        3
 2               2       1      1       6     1        0        0       1       2        2
 3               1       2      1       8     1        0        0       1       1        1
 3               2       2      1       8     1        0        0       1       1        1
 4               1       1      2       1     1        0        0       1       1        1
 4               2       1      1       1     1        0        0       1       1        1
 5               1       1      1       4     2        0        1       1       1        1
 5               2       1      1       4     1        0        0       1       2        2
 6               1       2      1       9     2        1        0       1       1        1
 6               2       1      1       9     1        0        0       1       1        1
 7               1       2      1      17     1        0        0       1       2        2
 7               2       2      1      17     1        0        0       1       2        2
 8               1       2      1       5     1        0        0       1       1        1
 8               2       1      1       5     1        0        0       1       1        1
 9               1       2      2       0     1        0        0       1       1        1
 9               2       2      1       0     2        1        0       2       1        1
 9               2       2      1       0     1        0        0       1       1        1
10               1       2      1       7     1        0        0       2       1        1
10               2       1      1       7     1        0        0       1       1        1
11               1       1      1      15     1        0        0       1       1        1
11               2       1      1      15     1        0        0       1       2        2
12               1       1      1      12     1        0        0       1       2        2
12               2       1      1      12     1        0        0       1       1        1
13               1       1      1       4     1        0        0       1       1        1
13               2       2      1       4     1        0        0       1       1        1
14               1       1      1      14     1        0        0       1       2        2
14               2       2      1      14     1        0        0       1       1        1
14               2       1      1      14     1        0        0       1       1        1
15               1       1      1       7     1        0        0       2       1        1
15               2       1      1       7     1        0        0       2       1        1
15               2       1      1       7     1        0        0       1       2        2
16             (and so on)

Note: The full dataset is available at www.wiley.com/go/Le/Biostatistics.

x‐ray during pregnancy; 0, no; 1, yes), FXray (father ever had x‐ray; 1, no;
2, yes), CXray (child ever had x‐ray; 1, no; 2, yes), CNXray (child’s total
number of x‐rays; 1, none; 2, 1–2; 3, 3–4; 4, 5 or more).

(a) Taken collectively, do the covariates contribute significantly to the
separation of cases and controls?



(b) Fit the multiple regression model to obtain estimates of individual
regression coefficients and their standard errors. Draw conclusions
concerning the conditional contribution of each factor.

(c) Within the context of the multiple regression model in part (b), does
gender alter the effect of Down’s syndrome?

(d) Within the context of the multiple regression model in part (b), taken
collectively, does the exposure to x‐ray (by the father, or mother, or
child) relate significantly to the disease of the child?

(e) Within the context of the multiple regression model in part (b), is the
effect of age linear?

(f) Focus on Down’s syndrome as the primary factor, taken collectively;
was this main effect altered by any other covariates?


14

STUDY DESIGNS

Statistics is more than just a collection of long columns of numbers and sets of
formulas. Statistics is a way of thinking – thinking about ways to gather and analyze
data. The gathering part comes before the analyzing part; the first thing a statistician
or a learner of statistics does when faced with data is to find out how the data were
collected. Not only does how we should analyze data depend on how data were collected, but formulas and techniques may be misused by a well‐intentioned researcher simply because data were not collected properly. In other cases, studies were inconclusive because they were poorly planned and not enough data were collected to accomplish the goals and support the hypotheses.

Study data may be collected in many different ways. When we want information,
the most common approach is to conduct a survey in which subjects in a sample are
asked to express opinions on a variety of issues. For example, an investigator surveyed several hundred students in grades 7 through 12 with a set of questions asking the
date of their last physical checkup and how often they smoke cigarettes or drink
alcohol.

The format of a survey is such that one can assume there is an identifiable, existing target population of subjects. We act as if the sample is obtained from the target population according to a carefully defined technical procedure called random sampling. The basic steps and characteristics of such a process were described in detail in Section 3.1.2. However, in biomedical research, a sample survey is not a
common form of study; it may not be used at all. The laboratory investigator uses
animals in projects, but the animals are not selected randomly from a large population
of animals. The clinician, who is attempting to describe the results obtained with a
particular therapy, cannot say that he or she has obtained patients as a random sample
from a target population of patients.

Introductory Biostatistics, Second Edition. Chap T. Le and Lynn E. Eberly.
© 2016 John Wiley & Sons, Inc. Published 2016 by John Wiley & Sons, Inc.
Companion website: www.wiley.com/go/Le/Biostatistics



14.1  TYPES OF STUDY DESIGNS

In addition to surveys that are cross‐sectional, as seen in many examples in earlier
chapters, study data may be collected in many different ways. For example, investiga-
tors are faced more and more frequently with the problem of determining whether a
specific factor or exposure is related to a certain aspect of health. Does air pollution
cause lung cancer? Do birth control pills cause thromboembolic death? There are reasons for believing that the answer to each of these and other questions is yes, but all are controversial; otherwise, no studies are needed. Generally, biomedical research data may come from different sources, the two fundamental designs being retrospective
and prospective. But strategies can be divided further into four different types:

1. Retrospective studies (of past events);

2. Prospective studies (of past events);

3. Cohort studies (of ongoing or future events);

4. Clinical trials.

Retrospective studies of past events gather past data from selected cases, per-
sons who have experienced the event in question, and controls, persons who have
not experienced the event in question, to determine differences, if any, in exposure
to a suspected risk factor under investigation. They are commonly referred to as case–control studies; each case–control study is focused on a particular disease. In
a typical case–control study, cases of a specific disease are ascertained as they
arise from population‐based registers or lists of hospital admissions, and controls
are sampled either as disease‐free persons from the population at risk or as hospi-
talized patients having a diagnosis other than the one under study. An example is
the study of thromboembolic death and birth control drugs. Thromboembolic
deaths were identified from death certificates, and exposure to the pill was traced
by interview with each woman’s physician and a check of her various medical
records. Control women were women in the same age range under the care of the
same physicians.

Prospective studies of past events are less popular because they depend on the
existence of records of high quality. In these, samples of exposed subjects and unexposed subjects are identified in the records. Then the records of the persons
selected are traced to determine if they have ever experienced the event to the present
time. Events in question are past events, but the method is called prospective because
it proceeds from exposure forward to the event.

Cohort studies are epidemiological designs in which one enrolls a group of persons and follows them over certain periods of time; examples include occupational
mortality studies and clinical trials. The cohort study design focuses on a particular
exposure rather than a particular disease as in case–control studies. There have been
several major cohort studies that made significant contributions to our understanding
of important public health issues, but this form of study design is not very popular
because cohort studies are time‐ and cost‐consuming.



In this chapter we focus on study designs. However, since in biomedical research
the sample survey is not a common form of study, and prospective studies of past
events and cohort studies are not often conducted, we put more emphasis on the
designs of clinical trials, which are important because they are experiments on human
beings, and of case–control studies, which are the most popular of all study designs.

14.2  CLASSIFICATION OF CLINICAL TRIALS

Clinical studies form a class of all scientific approaches to evaluating medical disease prevention, diagnostic techniques, and treatments. Among this class, trials, often called clinical trials, form a subset of those clinical studies that evaluate investigational drugs or devices.

Trials, especially cancer trials, are classified into phases:

•• Phase I trials focus on safety of a new investigational medicine or device. These
are the first human trials after successful animal trials.

•• Phase II trials are small trials to evaluate efficacy and focus more on a safety
profile.

•• Phase III trials are well‐controlled trials, the most rigorous demonstration of a
drug’s or a device’s efficacy prior to federal regulatory approval.

•• Phase IV trials are often conducted after a medicine or device is marketed to
provide additional details about the medicine’s efficacy and a more complete
safety profile.

In the context of cancer trials, phase I trials enroll patients for whom standard
treatments have failed and who are at high risk of death in the short term. As for the
new medicine or drug to be tested, there is no efficacy at low doses; at high doses, there
will be unavoidable toxicity, which may be severe and may even be fatal. Little is known
about the dose range; animal studies may not be helpful enough. The goal in a phase
I trial is to identify a maximum tolerated dose (MTD): a dose that has reasonable
efficacy (i.e., is toxic enough, say, to kill cancer cells) but with tolerable toxicity
(i.e., not toxic enough to kill the patient).

Phase II trials, the next step, are often the simplest: The drug, at the optimal dose
(MTD) found in a phase I trial, is given to a small group of patients who meet
predetermined inclusion criteria. The most common form is the single-arm study,
in which investigators seek to establish the antitumor activity of a drug, usually
measured by a response rate. A patient responds when his or her cancer condition
improves (e.g., the tumor disappears or shrinks substantially). The response rate
is the proportion or percentage of patients who respond. A phase II trial may be
conducted in two stages (as will be seen in Section 14.6) when investigators
are concerned about severe side effects.

A second type of phase II trial consists of small comparative trials where we want
to establish the efficacy of a new drug against a control or standard regimen. In these



phase II trials, with or without randomization, investigators often protect validity
by paying careful attention to inclusion and exclusion criteria. Inclusion criteria
define the patient characteristics required for entry into a clinical trial;
they describe the population of patients that the drug is intended to serve. There are
exclusion criteria as well, to keep out patients whom the drug is not intended to serve.

Phase III and IV trials are designed similarly. Phase III trials are conducted before
regulatory approval, and phase IV trials, which are often optional, are conducted
after regulatory approval. These are larger, controlled trials, whose control is achieved
by randomization. Patients enter the study sequentially and upon enrollment,
each patient is randomized to receive either the investigational drug (or device) or
a placebo (or standard therapy). As medication, the placebo is “blank,” that is,
without any active medicine. The use of a placebo, whose size and shape are similar
to those of the drug, is to control psychological and emotional effects (e.g., possible
prejudices on the part of the patient and/or investigator). Randomization is a tech-
nique to ensure that the two groups, the one receiving the real drug or device and the
one receiving the placebo, are more comparable, more similar with respect to known
as well as unknown factors (so that the conclusion is more valid). For example, the
new patient is assigned to receive the drug or the placebo by a process similar to that
of flipping a coin. Trials in phases III and IV are often conducted as double blind, that
is, blind to the patient (he or she does not know if a real drug is given so as to prevent
psychological effects; of course, the patient’s consent is required) and blind to the
investigator (so as to prevent bias in measuring/evaluating outcomes). Some member
of the investigation team, often designated a priori, keeps the code (the list of which
patients received drug and which patients received placebo) which is broken only at
the time of study completion and data analysis. The term triple blind may be used in
some trials to indicate the blinding of regulatory officers.

A phase III or IV trial usually consists of two periods: an enrollment period,
when patients enter the study and are randomized, and a follow‐up period. The latter
is very desirable if a long-term outcome is needed. As an example, a study may consist
of three years of enrollment and two years of follow-up; no patients are enrolled
during the last two years. Figure 14.1 depicts a typical phase III or
IV clinical trial.

(Figure: a timeline running from study initiation through an enrollment period
(e.g., 3 years), after which no new subjects are enrolled, to a follow-up period
(e.g., 2 years) ending at study termination.)

Figure 14.1  Phase III or IV clinical trial.



14.3  DESIGNING PHASE I CANCER TRIALS

Different from other phase I clinical trials, phase I clinical trials in cancer have
several main features. First, the efficacy of chemotherapy or any cancer treatment is,
indeed, frequently associated with a nonnegligible risk of severe toxic effect, often
fatal, so that ethically, the initial administration of such drugs cannot be investigated
in healthy volunteers but only in cancer patients. Usually, only a small number of
patients are available to be entered in phase I cancer trials. Second, these patients are
at very high risk of death in the short term under all standard therapies, some of
which may already have failed for those patients. At low doses, little or no efficacy is
expected from the new therapy, and a slow intrapatient dose escalation is not pos-
sible. Third, there is not enough information about the drug’s activity profile. In
addition, clinicians often want to proceed as rapidly as possible to phase II trials with
more emphasis on efficacy. The lack of information about the relationship between
dose and probability of toxicity causes a fundamental dilemma inherent in phase I
cancer trials: the conflict between scientific and ethical intent. We need to reconcile
the risks of toxicity to patients with the potential benefit to these patients, with an
efficient design that uses no more patients than necessary. Thus, a phase I cancer trial
may be viewed as a problem in optimization: maximizing the dose–toxicity evalua-
tion while minimizing the number of patients treated.

Although recently their ad hoc nature and imprecise determination of maximum
tolerated dose (MTD) have been called into question, cohort‐escalation trial designs,
called standard designs, have been used widely for years. In the last several years a
competing design called fast track is getting more popular. These two cohort‐­
escalation trial designs can be described as follows.

Since a slow intrapatient dose escalation is either not possible or not practical,
investigators often use five to seven doses selected from “safe enough” to “effective
enough.” The starting dose selection for a phase I trial depends heavily on pharmacology
and toxicology from preclinical studies. Although the translation from animal to
human is not always perfect, toxicology studies offer an estimate of a drug's
dose–toxicity profile and of the organ sites that are most likely to be
affected in humans. Once the starting dose is selected, a reasonable dose escalation
scheme needs to be defined. There is no single optimal or efficient escalation scheme
for all drugs. Generally, dose levels are selected such that the percentage increments
between successive doses diminish as the dose is increased. A modified Fibonacci
sequence, with increases of 100, 67, 50, 40, and 33%, is often employed, because it
follows a diminishing pattern but with modest increases.
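As an illustration, the modified Fibonacci scheme can be sketched in a few lines of Python (the starting dose of 100 and the number of levels below are hypothetical, chosen only to show the diminishing increments):

```python
def modified_fibonacci_doses(start_dose, n_levels):
    """Dose levels under the modified Fibonacci scheme: successive
    increments of 100, 67, 50, and 40%, then 33% thereafter."""
    increments = [1.00, 0.67, 0.50, 0.40]  # later steps use 33%
    doses = [start_dose]
    for i in range(n_levels - 1):
        step = increments[i] if i < len(increments) else 0.33
        doses.append(doses[-1] * (1 + step))
    return doses

# Starting at a hypothetical dose of 100 with six levels gives,
# approximately: 100, 200, 334, 501, 701.4, 932.9
print(modified_fibonacci_doses(100, 6))
```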

The standard design uses three‐patient cohorts and begins with one cohort at the
lowest possible dose level. It observes the number of patients in the cohort who
experience toxicity serious enough to be considered a dose-limiting toxicity (DLT).
The trial escalates through the sequence of doses until enough patients experience
DLTs to stop the trial and declare an MTD. The dose at which the toxicity threshold
is exceeded is designated the MTD. In a standard design, if no patients in a cohort
experience a DLT, the trial continues with a new cohort at the next higher dose; if two
or three patients experience a DLT, the trial is stopped as the toxicity threshold is



exceeded and an MTD is identified; if exactly one patient experiences a DLT, a new
cohort of three patients is employed at the same dose. In this second cohort, evaluated
at the same dose, if no severe toxicity is observed, the dose is escalated to the next‐
highest level; otherwise, the trial is terminated and the dose in use at trial termination
is recommended as the MTD. Note that intrapatient escalation is not used to evaluate
the doses, to avoid the confounding effect of carryover from one dose to the next.
We can refer to the standard design as a three and three design because at each new
dose it enrolls a cohort of three patients with the option of enrolling an additional
three patients evaluated at the same dose. Some slight variations of the standard
design are also used in various trials.
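The decision logic just described can be sketched as follows. Here `count_dlts` is a stand-in for clinical observation (in a real trial, the DLT counts come from treating patients, not from a function), and, following the text's convention, the dose at which the toxicity threshold is exceeded is designated the MTD:

```python
def three_plus_three(doses, count_dlts):
    """Standard 3+3 escalation. `doses` is the planned dose sequence;
    `count_dlts(dose, n)` returns the number of DLTs among n patients
    treated at `dose`. Returns the dose declared MTD, or None if every
    dose is cleared without exceeding the toxicity threshold."""
    for dose in doses:
        dlts = count_dlts(dose, 3)   # first cohort of three
        if dlts == 0:
            continue                 # no DLTs: escalate to the next dose
        if dlts >= 2:
            return dose              # threshold exceeded: declare the MTD
        # exactly one DLT: treat a second cohort of three at the same dose
        if count_dlts(dose, 3) == 0:
            continue                 # second cohort clean: escalate
        return dose                  # otherwise stop; this dose is the MTD
    return None
```

For example, with hypothetical observed DLT counts of 0, 0, and 2 at three successive dose levels, the function returns the third dose as the MTD.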

The fast-track design is a variation of the standard design. It was created by
modifying the standard design to move through low-toxicity-rate doses using
fewer patients. The design uses a predefined set of doses and cohorts of one or
three patients, escalating through the sequence of doses using a one‐patient cohort
until the first DLT is observed. After that, only three‐patient cohorts are used.
When no DLT is observed, the trial continues at the next‐higher dose with a cohort
of one new patient. When a DLT is observed in a one-patient evaluation of a dose,
the same dose is evaluated a second time with a cohort of three new patients. If no
patient in this cohort experiences a DLT, the design moves to the next-higher dose
with a new cohort of three patients, and from this point, the design progresses as
a standard design. When one or more patients in a three-patient cohort experiences
a DLT, the current dose is considered the MTD. If a one-patient cohort is
used at each dose level throughout, six patients are often tested at the very last
dose. Similar to a standard design, no intrapatient escalation is allowed in a fast‐
track design.

There seems to be no perfect solution. The standard design is more popular and
more conservative (i.e., safer); very few patients are likely to be overtreated by doses
with undesirable levels of toxicity. However, in a standard design, many patients who
enter early in the trial are likely to be treated suboptimally, and only a few patients
may be left after an MTD is reached, especially if there were many doses below
MTD. Generally, the use of a fast‐track design seems very attractive because some
clinicians want to proceed to a phase II trial as fast as they can, to have a first look at
efficacy. The fast-track design quickly escalates through early doses that have a low
expected risk of dose-limiting toxicity, thereby reducing the number of patients
treated at the lowest doses, which are evaluated in single-patient cohorts. On the other hand, the
fast‐track design may allow a higher percentage of patients to be treated at very high
toxic doses; and the fact that it uses a single‐patient cohort until the first DLT is
observed seems too risky for some investigators. For more experienced investigators,
the fast‐track design presents an improved use of patient resources with a moderate
compromise of patient safety; but safety could be a problem with inexperienced
investigators who might select high doses to start with. The common problem for
both designs is a lack of robustness: the toxicity rate at the selected MTD is
strongly influenced by the doses used, and these doses may be selected arbitrarily by
investigators, which makes their experience a crucial factor.



14.4  SAMPLE SIZE DETERMINATION FOR PHASE II TRIALS
AND SURVEYS

The determination of the size of a sample is a crucial element in the design of a survey
or a clinical trial. In designing any study, one of the first questions that must be answered
is: How large must the sample be to accomplish the goals of the study? Depending on
the study goals, the planning of sample size can be approached accordingly.

Phase II trials are the simplest. The drug, at the optimal dose (MTD) found from
a previous phase I trial, is given to a small group of patients who meet predetermined
inclusion criteria. The focus is often on the response rate. Because of this focus, the
planning of sample size can be approached in terms of controlling the width of a
desired confidence interval for the parameter of interest, the response rate.

Suppose that the goal of a study is to estimate an unknown response rate π. For the
confidence interval to be useful, it must be short enough to pinpoint the value of the
parameter reasonably well with a high degree of confidence. If a study is unplanned
or poorly planned, there is a real possibility that the resulting confidence interval will
be too long to be of any use to the researcher. In this case, we may decide to have an
estimation error not exceeding d, an upper bound for the margin of error, since the 95%
confidence interval for the response rate π, a population proportion, is

p ± 1.96 √(p(1 − p)/n)

where p is the sample proportion. Therefore, our goal is expressed as

1.96 √(p(1 − p)/n) ≤ d

leading to the required minimum sample size:

n = (1.96)² p(1 − p)/d²

(rounded up to the next integer). This required sample size is affected by three factors:

1. The degree of confidence (i.e., 95% which yields the coefficient 1.96);

2. The maximum tolerated error or upper bound for the margin of error, d,
determined by the investigator(s) (a confidence interval's half-width);

3. The proportion p itself.

This third factor is unsettling. To find n so as to obtain an accurate value of the
proportion, we need the proportion itself. There is no perfect, exact solution for this.
Usually, we can use information from similar studies, past studies, or studies on



­similar populations. If no good prior knowledge about the proportion is available, we
can replace p(1 − p) by 0.25 and use a conservative sample size estimate:

n_max = (1.96)²(0.25)/d²

because n_max ≥ n regardless of the value of π. Most phase II trials are small; investigators
often set the maximum tolerated error or upper bound for the margin of error, d, at
10% (0.10) or 15%; some even set it at 20%.

Example 14.1
If we set the maximum tolerated error d at 10%, the required minimum sample size is

n_max = (1.96)²(0.25)/(0.10)²
      = 96.04

or 97 patients, which is usually too high for a small phase II trial, especially in the
field of cancer research, where very few patients are available. If we set the maximum
tolerated error d at 15%, the required minimum sample size is

n_max = (1.96)²(0.25)/(0.15)²
      = 42.7


or 43 patients.
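The calculation above is easy to script. Here is a minimal sketch in Python (the function name is ours; the coefficient 1.96 corresponds to 95% confidence):

```python
import math

def sample_size_proportion(d, p=0.5):
    """Minimum n so that the 95% confidence interval for a proportion has
    half-width at most d. With no prior estimate of p, the default p = 0.5
    maximizes p(1 - p) and gives the conservative n_max."""
    return math.ceil(1.96**2 * p * (1 - p) / d**2)

print(sample_size_proportion(0.10))  # 97, as in Example 14.1
print(sample_size_proportion(0.15))  # 43
```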
The same method for sample size determination as above applies to surveys as
well, except that for surveys we can afford to use much larger sample sizes. We can
set the maximum tolerated error at a very low level, resulting in very short confidence
intervals.

Example 14.2
Suppose that a study is to be conducted to estimate the smoking rate among National
Organization for Women (N.O.W.) members. Suppose also that we want to estimate
this proportion to within 3% (i.e., d = 0.03) with 95% confidence.

a)  Since the current smoking rate among women in general is about 27%
(0.27), we can use this figure in calculating the required sample size. This
results in

n = (1.96)²(0.27)(0.73)/(0.03)²
  = 841.3

or a sample of size 842 is needed.



b)  If we do not want or have the figure of 27%, we still can conservatively take

n_max = (1.96)²(0.25)/(0.03)²
      = 1067.1

(i.e., we can sample 1068 members of N.O.W.). Note that this conservative
sample size is adequate regardless of the true value π of the unknown population
proportion; values of n and nmax are closer when π is near 0.5.

14.5  SAMPLE SIZES FOR OTHER PHASE II TRIALS

As pointed out in previous sections, most phase II trials are single-arm studies in which
we are seeking the antitumor activity of a drug measured by its response rate. But there
are also a variety of other phase II trials.

Some phase II trials are randomized comparative studies. These are most likely to
be cases where we have established activity for a given drug (from a previous
one-arm nonrandomized trial) and wish to add another drug to that regimen. In these
randomized phase II trials, the goal is to select the better treatment (the sample sizes
for these phase II trials are covered in Section 14.7).

Some phase II trials deal with assessing the activity of a biologic agent where
tumor response is not the main endpoint of interest. We may be attempting to
determine the effect of a new agent: for example, on the prevention of a toxicity.
The primary endpoint may be measured on a continuous scale. In other trials, a
pharmacologic- or biologic-to-outcome correlative objective may be the target.

14.5.1  Continuous Endpoints

When the primary outcome of a trial is measured on a continuous scale, the focus
is on the mean. Because of this focus, the planning of sample size can be approached
in terms of controlling the width of a desired confidence interval for the parameter
of interest, the (population) mean. The sample size determination is similar to the
case when the focus is the response rate. That is, for the confidence interval to be
useful, it must be short enough to pinpoint the value of the parameter, the mean,
reasonably well with a high degree of confidence, say 95%. If a study is unplanned
or poorly planned, there is a real possibility that the resulting confidence interval
will be too long to be of any use to the researcher. In this case, we may decide to
have an estimate error not exceeding d, an upper bound for the margin of error.
With a given level of the maximum tolerated error d, the minimum required sample
size is given by

n = (1.96)² s²/d²



(rounded up to the next integer). This required sample size is also affected by three
factors:

1. The coefficient 1.96. As mentioned previously, a different coefficient is used
for a different degree of confidence, which is set arbitrarily by the investigator;
95% is a conventional choice.

2. The maximum tolerated error d, which is also set arbitrarily by the investigator.
3. The variability of the population measurements, the variance. This seems like a
circular problem. We want to find the size of a sample so as to estimate the mean
accurately, and to do that, we need to know the variance before we have the data!
Of course, the exact value of the variance is also unknown. However, we can use
information from similar studies, past studies, or some reasonable upper bound.
If nothing else is available, we may need to run a preliminary or pilot study.
One‐fourth of the range may serve as a rough estimate for the standard deviation.

Example 14.3
Perhaps it is simpler to see the sample size determination concerning a continuous
endpoint in the context of a survey. Suppose that a study is to be conducted to estimate
the average birth weight of babies born to mothers addicted to cocaine. Suppose also
that we want to estimate this average to within 0.5 lb with 95% confidence. This goal
specifies two quantities:

d = 0.5

coefficient = 1.96.

What value should be used for the variance? Information from normal babies may be
used to estimate s. The rationale here is that the addiction affects every baby almost
uniformly; this may result in a smaller average, but the variance is unchanged.
Suppose that the estimate from normal babies is σ = 2.5 lb; then the required sample
size is approximately

n = (1.96)²(2.5)²/(0.5)²
  ≈ 97.
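A sketch of the corresponding calculation (the function name is ours; 1.96 again corresponds to 95% confidence):

```python
import math

def sample_size_mean(d, sigma):
    """Minimum n so that the 95% confidence interval for a mean has
    half-width at most d, given an estimate sigma of the standard
    deviation (e.g., borrowed from a similar population)."""
    return math.ceil(1.96**2 * sigma**2 / d**2)

print(sample_size_mean(0.5, 2.5))  # 97, as in Example 14.3
```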

14.5.2  Correlation Endpoints

When the parameter is a coefficient of correlation, the planning of sample size is
approached differently because it is very difficult to come up with a meaningful
maximum tolerated error for estimation of the coefficient of correlation. Instead of
controlling the width of a desired confidence interval, the sample size determination
is approached in terms of controlling the risk of making a type II error. The decision
is concerned with testing a null hypothesis,

H0: ρ = 0



against an alternative hypothesis,

HA: ρ = ρA

in which ρA is the investigator’s hypothesized value for the coefficient of correlation
ρ of interest. With a given level of significance α (usually, 0.05) and a desired
statistical power (1 − β; β is the size of type II error associated with the alternative
HA), the required sample size is given by

n = 3 + [(z_{1−α} + z_{1−β})/F(ρA)]²   where F(ρ) = (1/2) ln[(1 + ρ)/(1 − ρ)].

The quantity z_{1−α} (or z_{1−β}) is the percentile of the standard normal distribution
associated with the choice of α (or β); for example, z_{1−α} = 1.96 for α = 0.05. The
transformation from ρ to F(ρ) is often referred to as Fisher's transformation, the same
transformation used in forming confidence intervals in Chapter 4. Obviously, to detect
a true correlation ρA greater than 0.5, a small sample size would suffice, which is
suitable in the context of phase II trials.

Example 14.4
Suppose that we decide to preset α = 0.05. To design a study such that its power to
detect a true correlation ρA = 0.6 is 90% (or β = 0.10), we would need only

F(ρA) = (1/2) ln[(1 + ρA)/(1 − ρA)]
      = (1/2) ln(1.6/0.4)
      = 0.693

n = 3 + [(z_{1−α} + z_{1−β})/F(ρA)]²
  = 3 + [(1.96 + 1.28)/0.693]²
  ≈ 24.9

or n = 25 subjects.
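The calculation in Example 14.4 can be scripted as follows (a sketch; the default percentiles 1.96 and 1.28 match the α = 0.05 and 90% power used in the example):

```python
import math

def sample_size_correlation(rho_a, z_alpha=1.96, z_beta=1.28):
    """Sample size for testing H0: rho = 0 against a true correlation
    rho_a, based on Fisher's transformation F(rho)."""
    f = 0.5 * math.log((1 + rho_a) / (1 - rho_a))  # Fisher's transformation
    return math.ceil(3 + ((z_alpha + z_beta) / f) ** 2)

print(sample_size_correlation(0.6))  # 25, as in Example 14.4
```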

14.6  ABOUT SIMON’S TWO‐STAGE PHASE II DESIGN

Phase I trials treat only three to six patients per dose level according to the standard
design. In addition, those patients may be diverse with regard to their cancer
diagnosis; consequently, phase I trials provide little or no information about efficacy.
A phase II trial is the first step in the study of antitumor effects of an investigational
drug. The aim of a phase II trial of a new anticancer drug is to determine whether the
drug has sufficient antitumor activity to warrant further development. Further
development may mean combining the drug with other drugs, or initiation of a phase



III trial. However, these patients are often at high risk of dying from cancer if not
treated effectively. Therefore, it is desirable to use as few patients as possible in a
phase II trial if the regimen under investigation in fact has low antitumor activity.
When such ethical concerns are of high priority, investigators often choose Simon's
two-stage design:

1. A group of n1 patients is enrolled in the first stage. If r1 or fewer of these n1
patients respond to the drug, the drug is rejected and the trial is terminated; if
more than r1 responses are observed, investigators proceed to stage II and
enroll n2 more patients.

2. After stage II, if r or fewer responses are observed, including those in stage I, the
drug is rejected; if more than r responses are observed, the drug is recommended
for further evaluation.

Simon’s design is based on testing a null hypothesis, H0: π ≤ π0, that the true
response rate π is less than some low and uninteresting level π0 against an
alternative hypothesis HA: π ≥ πA, that the true response rate π exceeds a certain
desirable target level πA, which, if true, would allow us to consider the drug to have
sufficient antitumor activity to warrant further development. The design parameters
n1, r1, n2, and r are determined so as to minimize the number of patients n = n1 + n2
if H0 is true: The drug, in fact, has low antitumor activity. The option that allows
early termination of the trial satisfies high-priority ethical concerns. The derivation
is more advanced and there are no closed-form formulas for the design parameters
n1, r1, n2, and r; beginning users can seek help in applying Simon's two-stage design,
if appropriate.
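The two-stage decision rule itself is simple to express. Here is a sketch in Python, where the cutoffs r1 and r (and the stage sizes they go with) are assumed to have been obtained beforehand from Simon's published tables or software; the values in the usage lines are purely illustrative:

```python
def simon_two_stage(x1, r1, x_total=None, r=None):
    """Decision rule for Simon's two-stage design.
    x1: responses among the first n1 patients (stage I);
    x_total: total responses among all n = n1 + n2 patients, or None
    if stage II has not yet been run. r1 and r are the design cutoffs."""
    if x1 <= r1:
        return "reject drug: stop at stage I"
    if x_total is None:
        return "proceed to stage II"
    if x_total <= r:
        return "reject drug"
    return "recommend for further evaluation"

# With hypothetical cutoffs r1 = 1 and r = 5:
print(simon_two_stage(1, r1=1))                  # reject drug: stop at stage I
print(simon_two_stage(3, r1=1))                  # proceed to stage II
print(simon_two_stage(3, r1=1, x_total=7, r=5))  # recommend for further evaluation
```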

14.7  PHASE II DESIGNS FOR SELECTION

Some randomized phase II trials do not fit the framework of tests of significance.
In performing statistical tests or tests of significance, we have the option to declare
a trial not significant when data do not provide enough support for a treatment
difference. In those cases we decide not to pursue the new treatment, and we do
not choose it, because it does not prove any better than the placebo
effect or that of a standard therapy. In some cancer areas we may not have a
standard therapy, or if we do, some subgroups of patients may have failed using
standard therapies. Suppose further that we have established activity for a given
drug from a previous one‐arm nonrandomized trial, and the only remaining
question is scheduling: for example, a daily versus an every-other-day schedule.
Or we may wish to add another drug to that regimen to improve it. In these cases
we do not have the option to declare the trial not significant because: (i) one of the
treatments or schedules has to be chosen (because patients have to be treated), and
(ii) it is inconsequential to choose one of the two treatments/schedules even if they
are equally efficacious. The aim of these randomized trials is to choose the better
treatment.



14.7.1  Continuous Endpoints

When the primary outcome of a trial is measured on a continuous scale, the focus is
on the mean. At the end of the study, we select the treatment or schedule with the
larger sample mean. But first we have to define what we mean by better treatment.
Suppose that treatment 2 is said to be better than treatment 1 if

μ2 − μ1 ≥ d

where d is the magnitude of the difference between μ2 and μ1 that is deemed to be
important; the quantity d is often called the minimum clinical significant difference.
Then we want to make the correct selection by making sure that at the end of the
study, the better treatment will be the one with the larger sample mean. This goal is
achieved by imposing a condition,

Pr(x̄2 > x̄1 | μ2 − μ1 = d) ≥ 1 − α.

For example, if we want to be 99% sure that the better treatment will be the one with
the larger sample mean, we can preset α = 0.01. To do that, the total sample size must
be at least

N = 4(z_{1−α})² σ²/d²

assuming that we conduct a balanced study with each group consisting of n = N/2
subjects. To calculate this minimum required total sample size, we need the variance
σ2. The exact value of σ2 is unknown; we may depend on prior knowledge about one
of the two arms from a previous study or use some upper bound.

Example 14.5
Suppose that, for a certain problem, d = 5 and it is estimated that σ² = 36. Then if we
want to be 95% sure that the better treatment will be the one with the larger sample
mean, we would need

N = 4(1.96)²(36)/(5)²
  ≈ 24

with 12 subjects in each of the two groups. It will be seen later that with similar
specifications, a phase III design would require a larger sample to detect a treatment
difference of d = 5 using a statistical test of significance.
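A sketch of the computation in Example 14.5 (rounding the total up to an even number, so that the two arms are equal, is our convention):

```python
import math

def selection_total_n_means(d, sigma_sq, z=1.96):
    """Total N = 4 z^2 sigma^2 / d^2 for selecting the better of two
    treatments by the larger sample mean, rounded up to an even number
    so that both arms receive n = N/2 subjects."""
    n = math.ceil(4 * z**2 * sigma_sq / d**2)
    return n + (n % 2)

print(selection_total_n_means(5, 36))  # 24, as in Example 14.5
```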

14.7.2  Binary Endpoints

When the primary outcome of a trial is measured on a binary scale, the focus is on
a proportion, the response rate. At the end of the study, we select the treatment or
schedule with the larger sample proportion. But first we have to define what we



mean by better treatment. Suppose that treatment 2 is said to be better than
treatment 1 if

π2 − π1 ≥ d

where d is the magnitude of the difference between π2 and π1 that is deemed to be
important; the quantity d is often called the minimum clinical significant difference.
Then we want to make the correct selection by making sure that at the end of the
study, the better treatment will be the one with the larger sample proportion. This
goal is achieved by imposing a condition,

Pr(p2 > p1 | π2 − π1 = d) ≥ 1 − α

where the ps are sample proportions. For example, if we want to be 99% sure that
the better treatment will be the one with the larger sample proportion, we can preset
α = 0.01. To do that, the total sample size must be at least

N = 4(z_{1−α})² π̄(1 − π̄)/(π2 − π1)²

assuming that we conduct a balanced study with each group consisting of n = N/2
subjects. In this formula, π̄ is the average proportion:

π̄ = (π1 + π2)/2.

It is obvious that the problem of planning sample size is more difficult, and a good
solution requires a deeper knowledge of the scientific problem: a good idea of the
magnitude of the proportions π1 and π2 themselves. In many cases, that may be
impractical at this stage.
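A companion sketch for the binary case (the response rates π1 = 0.2 and π2 = 0.4 in the usage line are purely illustrative):

```python
import math

def selection_total_n_props(pi1, pi2, z=1.96):
    """Total N = 4 z^2 pibar (1 - pibar) / (pi2 - pi1)^2 for selecting the
    treatment with the larger sample proportion; pibar is the average of
    the two hypothesized rates. Rounded up to an even total."""
    pibar = (pi1 + pi2) / 2
    n = math.ceil(4 * z**2 * pibar * (1 - pibar) / (pi2 - pi1) ** 2)
    return n + (n % 2)

print(selection_total_n_props(0.2, 0.4))  # 82 subjects in total
```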

14.8  TOXICITY MONITORING IN PHASE II TRIALS

In a clinical trial of a new treatment, severe side effects may be a problem, and the
trial may have to be stopped if the incidence is too high. For example, bone marrow
transplantation is a complex procedure that exposes patients to a high risk of a
variety of complications, many of them fatal. Investigators are willing to take these
risks because they are exchanged for the much higher risks associated with the
leukemia or other disease for which the patient is being treated; for many of these
patients, standard and safer treatments have failed. Investigators often have to
face this problem of severe side effects and contemplate stopping phase II trials.
Phase I trials focus on safety of a new investigational medicine, and phase II trials
are small trials to evaluate efficacy. However, phase I trials are conducted with a
small number of patients; therefore, safety is still a major concern in a phase II



trial. If either the accrual or the treatment occurs over an extended period of time,
we can anticipate the need for a decision to halt the trial if an excess of severe side
effects occurs.

The following monitoring rule was derived using a more advanced statistical
method called the sequential probability ratio test. Basically, patients are enrolled,
randomized if needed, and treated; and the trial proceeds continuously until
the number of patients with severe side effects meets the criterion judged as
excessive and the trial is stopped. The primary parameter is the incidence rate π of
severe side effects as defined specifically for the trial: for example, toxicity grade
III or IV. As with any other statistical test of significance, the decision is concerned
with testing a null hypothesis

H0: π = π0
against an alternative hypothesis

HA: π = πA

in which π0 is the investigator’s hypothesized value for the incidence rate π of severe
side effects, often formulated based on knowledge from previous phase I trial
results. The other figure, πA, is the maximum tolerated level for the incidence rate π
of severe side effects. The trial has to be stopped if the incidence rate π of severe
side effects exceeds πA. In addition to the null and alternative parameters, π0 and πA,
a stopping rule also depends on the chosen level of significance α (usually, 0.05)
and the desired statistical power (1 − β); β is the size of type II error associated with
the alternative HA: π = πA. Power is usually preset at 80 or 90%. With these
specifications, we monitor for the side effects by sequentially counting the number
of events e (i.e., number of patients with severe side effects) and the number of
evaluable patients n(e) at which the eth event is observed. The trial is stopped when
this condition is first met:

n(e) ≤ [ln(1 − β) − ln α + e(ln(1 − πA) − ln(1 − π0) − ln πA + ln π0)] / [ln(1 − πA) − ln(1 − π0)].

In other words, the formula above gives us the maximum number of evaluable
patients n(e) at which the trial has to be stopped if e events have been observed.

Some phase II trials may be randomized; however, even in these randomized
trials, toxicity monitoring should be done separately for each study arm. That is, if
the side effect can reasonably occur in only one of the arms of the study, probably the
arm treated by the new therapy, the incidence in that group alone is considered.
Otherwise, the sensitivity of the process to stop the trial would be diluted by inclusion
of the other group. Sometimes the goal is to compare two treatments according to
some composite hypothesis that the new treatment is equally effective but has less
toxicity. In those cases, both efficacy and toxicity are endpoints, and the analysis
should be planned accordingly, but the situations are not that of monitoring in order
to stop the trial as intended in the rule above.


508 STUDY DESIGNS

Example 14.6
Suppose that in the planning for a phase II trial, an investigator (or clinicians in a
study committee) decided that π0 = 3% (0.03) based on some prior knowledge and that
πA = 15% (0.15) should be the upper limit that can be tolerated (as related to the risks
of the disease itself). For this illustrative example, we find that n(1) = −7, n(2) = 5,
n(3) = 18, n(4) = 31, and so on, when we preset the level of significance at 0.05 and
statistical power at 80%. In other words, we stop the trial if there are two events
among the first five evaluable patients, three events among the first 18 patients, four
events among the first 31 patients, and so on. Here we use only positive solutions and
the integer part of each solution from the equation above. The negative solution
n(1) = −7 indicates that the first event will not result in stopping the trial (because it
is judged as not excessive yet).

Example 14.7
The stopping rule would be more stringent if we want higher (statistical) power or if
incidence rates are higher. For example:

a)  With π0 = 3% and πA = 15% as set previously, but if we preset the level of
significance at 0.05 and statistical power at 90%, the results become n(1) = −8,
n(2) = 4, n(3) = 17, n(4) = 30, and so on. That is, we would stop the trial if there
are two events among the first four evaluable patients, three events among the
first 17 patients, four events among the first 30 patients, and so on.

b)  On the other hand, if we keep the level of significance at 0.05 and statistical
power at 80%, but if we decide on π0 = 5% and πA = 20%, the results become
n(1) = −7, n(2) = 2, n(3) = 11, n(4) = 20, and so on. That is, we would stop the
trial if there are two events among the first two evaluable patients, three events
among the first 11 patients, four events among the first 20 patients, and so on.

It can be seen that the rule accelerates faster with higher rates than with higher
power.
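Both examples are easy to reproduce. The sketch below (function name mine) implements the stopping inequality above, truncating each solution to its integer part as described in Example 14.6:

```python
from math import log

def n_stop(e, pi0, piA, alpha, beta):
    """Largest number of evaluable patients n(e) at which the trial is
    stopped, given e observed severe side effects; the solution is
    truncated toward zero, matching the integer-part convention."""
    num = log(1 - beta) - log(alpha) + e * (
        log(1 - piA) - log(1 - pi0) - (log(piA) - log(pi0)))
    den = log(1 - piA) - log(1 - pi0)
    return int(num / den)

# Example 14.6: pi0 = 3%, piA = 15%, alpha = 0.05, power 80% (beta = 0.20)
print([n_stop(e, 0.03, 0.15, 0.05, 0.20) for e in (1, 2, 3, 4)])
# -> [-7, 5, 18, 31]
```

Running the same function with the settings of Example 14.7 reproduces the boundaries given there.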

14.9  SAMPLE SIZE DETERMINATION FOR PHASE III TRIALS

The determination of the size of a sample is a crucial element in the design of a study,
whether it is a survey or a clinical trial. In designing any study, one of the first
questions that must be answered is: How large must a sample be to accomplish the
goals of the study? Depending on the study goals, the planning of sample size can be
approached in two different ways: either in terms of controlling the width of a desired
confidence interval for the parameter of interest, or in terms of controlling the risk of
making type II errors. In Section 14.4, the planning of sample size was approached
in terms of controlling the width of a desired confidence interval for the parameter of
interest, the response rate in a phase II trial. However, phase III and IV clinical trials
are conducted not for parameter estimation but for the comparison of two treatments
(e.g., a new therapy versus a placebo or a standard therapy). Therefore, it is more



suitable to approach the planning of sample size in terms of controlling the risk of
making type II errors. Since phase I and II trials, especially phase I trials, are specific
for cancer research but phase III and IV trials are applicable in any field, we will
cover this part in more general terms and include examples in fields other than
cancers.

Recall that in testing a null hypothesis, two types of errors are possible. We might
reject H0 when in fact H0 is true, thus committing a type I error. However, this type of
error can be controlled in the decision‐making process; conventionally, the proba-
bility of making this mistake is set at α = 0.05 or 0.01. A type II error occurs when
we fail to reject H0 even though it is false. In the drug‐testing example above, a type
II error leads to our inability to recognize the effectiveness of the new drug being
studied. The probability of committing a type II error is denoted by β, and 1 − β is
called the power of a statistical test. Since the power is the probability that we will be
able to support our research claim (i.e., the alternative hypothesis) when it is correct,
studies should be designed to have high power. This is achieved through the planning
of sample size. The method for sample size determination is not unique; it depends
on the endpoint and its measurement scale.

14.9.1  Comparison of Two Means

In many studies, the endpoint is on a continuous scale. For example, a researcher is
studying a drug that is to be used to reduce the cholesterol level in adult males aged
30 and over. Subjects are to be randomized into two groups, one receiving the new
drug (group 1), and one a look‐alike placebo (group 2). The response variable con-
sidered is the change in cholesterol level before compared to after the intervention.
The null hypothesis to be tested is

H0: μ1 = μ2
versus

HA: μ2 < μ1 or HA: μ2 > μ1.

Data would be analyzed using, for example, the two‐sample t test of Chapter 7.
However, before any data are collected, the crucial question is: How large a total
sample should be used to conduct the study?

In the comparison of two population means, μ1 versus μ2, the required minimum
total sample size is calculated from

N = 4(z1−α + z1−β)² σ² / d²

assuming that we conduct a balanced study with each group consisting of n = N/2
subjects. This required total sample size is affected by four factors:

1. The size α of the test. As mentioned previously, this is set arbitrarily by the
investigator; conventionally, α = 0.05 is often used. The quantity z1−α in the
formula above is the percentile of the standard normal distribution associated
with the choice of α; for example, z1−α = 1.96 when α = 0.05 is chosen. In the
process of sample size determination, statistical tests, such as the two‐sample
t test, are usually planned as two‐sided. However, if a one‐sided test is planned,
this step is changed slightly; for example, we use z = 1.65 when α = 0.05 is chosen.

2. The desired power 1 − β, or equivalently the probability β of committing a
type II error. This value is also selected by the investigator; a power of 80 or
90% is often used.

3. The quantity

d = |μ2 − μ1|

which is the magnitude of the difference between μ1 and μ2 that is deemed to
be important. The quantity d is often called the minimum clinically significant
difference, and its determination is a clinical decision, not a statistical decision.

4. The variance σ² of the population. This variance is the only quantity that is
difficult to determine. The exact value of σ² is unknown; we may use information
from similar studies or past studies, or use an upper bound. Some investigators
may even run a preliminary or pilot study to estimate σ²; but an estimate from
a small pilot study may be only as good as any guess.
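Putting the four ingredients together, the formula can be sketched as follows (using the rounded z multipliers quoted in this chapter rather than exact normal percentiles; the function name is mine):

```python
import math

def n_two_means(z_alpha, z_beta, sigma2, d):
    """Minimum total sample size N for comparing two means in a
    balanced study (N/2 subjects per group), rounded up."""
    return math.ceil(4 * (z_alpha + z_beta) ** 2 * sigma2 / d ** 2)

# e.g. alpha = 0.05 two-sided (z = 1.96), power 95% (z = 1.65),
# sigma^2 = 36, minimum clinically significant difference d = 5
print(n_two_means(1.96, 1.65, 36, 5))   # -> 76
```

Halving d quadruples the required N, which is why the choice of the minimum clinically significant difference dominates the calculation.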

Example 14.8
Suppose that a researcher is studying a drug which is used to reduce the cholesterol
level in adult males aged 30 or over and wants to test it against a placebo in a balanced
randomized study. Suppose also that it is important that a reduction difference of 5 be
detected (d = 5). We decide to preset α = 0.05 and want to design a study such that its
power to detect a difference between means of 5 is 95% (or β = 0.05). Also, the
variance of cholesterol reduction (with placebo) is known to be about σ² = 36. Here

α = 0.05 → z1−α = 1.96 (two‐sided test)
β = 0.05 → z1−β = 1.65

leading to the required total sample size:

N = 4(1.96 + 1.65)²(36) / 5²
  ≈ 76.

Each group will have 38 subjects.

Example 14.9
Suppose that in Example 14.8, the researcher wanted to design a study such that its
power to detect a difference between means of 3 is 90% (or β = 0.10). In addition,
the variance of cholesterol reduction (with placebo) is not known precisely, but it



is reasonable to assume that it does not exceed 50. As in Example 14.8, let us set
α = 0.05, leading to

α = 0.05 → z1−α = 1.96 (two‐sided test)
β = 0.10 → z1−β = 1.28.

Then, using the upper bound for the variance (i.e., 50), the required total sample size is

N = 4(1.96 + 1.28)²(50) / 3²
  ≈ 234.

Each group will have 117 subjects.
Suppose, however, that the study was actually conducted with only 180 subjects,

90 randomized to each group (it is a common situation that studies are underenrolled).
From the formula for sample size, we can solve for z1−β and obtain

z1−β = √(N d² / (4σ²)) − z1−α
     = √((180)(3)² / [(4)(50)]) − 1.96
     = 0.886

corresponding to a power 1 − β of approximately 81%.
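This power calculation can be sketched as follows (`NormalDist` from the Python standard library supplies the normal cdf Φ; the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(n_total, d, sigma2, z_alpha=1.96):
    """Power of the balanced two-sample comparison of means when only
    n_total subjects are enrolled; z_alpha = 1.96 assumes a two-sided
    test at alpha = 0.05."""
    z_beta = sqrt(n_total * d ** 2 / (4 * sigma2)) - z_alpha
    return NormalDist().cdf(z_beta)

# Example 14.9 follow-up: N = 180, d = 3, sigma^2 bounded by 50
print(round(power_two_means(180, 3, 50), 2))   # -> 0.81
```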

14.9.2  Comparison of Two Proportions

In many other studies, the endpoint may be on a binary scale; so let us consider a
similar problem where we want to design a study to compare two proportions.
For example, a new vaccine will be tested in which subjects are to be randomized
into two groups of equal size: a control (nonimmunized) group (group 1), and an
experimental (immunized) group (group 2). Subjects in both control and experimental
groups will be challenged by a certain type of bacteria and we wish to compare the
infection rates. The null hypothesis to be tested is:

H0: π1 = π2
versus HA: π1 < π2 or HA: π1 > π2.



How large a total sample should be used to conduct this vaccine study?
Suppose that it is important to detect a reduction of infection rate

d = π2 − π1.



If we decide to preset the size of the study at α = 0.05 and want the power (1 − β) to
detect the difference d, the required sample size is given by the formula

N = 4(z1−α + z1−β)² π̄(1 − π̄) / (π2 − π1)²

or the power for a given sample size is determined from

z1−β = √(N(π2 − π1)² / [4π̄(1 − π̄)]) − z1−α.

In these formulas the quantities z1−α and z1−β are defined as in Section 14.9.1, and
π̄ is the average proportion:

π̄ = (π1 + π2) / 2.

It is obvious that the problem of planning sample size is more difficult and that a
good solution requires a deeper knowledge of the scientific problem: a good idea of
the magnitude of the proportions π1 and π2 themselves.
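A sketch of the sample size formula (function name mine; z values are the chapter's rounded multipliers):

```python
import math

def n_two_props(z_alpha, z_beta, p1, p2):
    """Minimum total N for comparing two proportions in a balanced
    study; N/2 subjects per group, rounded up."""
    pbar = (p1 + p2) / 2                      # average proportion
    return math.ceil(4 * (z_alpha + z_beta) ** 2 * pbar * (1 - pbar)
                     / (p2 - p1) ** 2)

# e.g. control rate 5% vs new-therapy rate 15%, alpha = 0.05, power 90%
print(n_two_props(1.96, 1.28, 0.05, 0.15))   # -> 378
```

Note that, unlike the two-means case, the required N depends on the magnitudes of π1 and π2 themselves through π̄(1 − π̄), not just on their difference.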

Example 14.10
Suppose that we wish to conduct a clinical trial of a new therapy where the success
rate in the control group was known to be about 5%. Further, we would consider
the new therapy to be superior – cost, risks, and other factors considered – if its
success rate is about 15%. Suppose also that we decide to preset α = 0.05 and want
the power to be about 90% (i.e., β = 0.10). In other words, we use

z1−α = 1.96
z1−β = 1.28.

From this information, the total sample size required is

N = 4(1.96 + 1.28)²(0.10)(0.90) / (0.15 − 0.05)²
  ≈ 378

or 189 patients in each group.

Example 14.11
A new vaccine will be tested in which subjects are to be randomized into two groups
of equal size: a control (unimmunized) group and an experimental (immunized)
group. Based on prior knowledge about the vaccine through small pilot studies, the
following assumptions are made:

1. The infection rate of the control group (when challenged by a certain type of
bacteria) is expected to be about 50%:

π2 = 0.50.



2. About 80% of the experimental group is expected to develop adequate antibodies
(i.e., at least a twofold increase). If antibodies are inadequate, the infection rate
is about the same as for a control subject. But if an experimental subject has
adequate antibodies, the vaccine is expected to be about 85% effective (which
corresponds to a 15% infection rate against the challenged bacteria).

Putting these assumptions together, we obtain an expected value of π1:

π1 = (0.80)(0.15) + (0.20)(0.50)
   = 0.22.

Suppose also that we decide to preset α = 0.05 and want the power to be about
95% (i.e., β = 0.05). In other words, we use

z1−α = 1.96
z1−β = 1.65.

From this information, the total sample size required is

N = 4(1.96 + 1.65)²(0.36)(0.64) / (0.50 − 0.22)²
  ≈ 154

so that each group will have 77 subjects. In this solution we use

π̄ = 0.36

the average of 22% and 50%.

14.9.3  Survival Time as the Endpoint

When patient survivorship is considered as the endpoint of a trial, the problem may look
similar to that of comparing two proportions. For example, one can focus on a conven-
tional time span, say five years, and compare the two survival rates. The comparison of
the five‐year survival rate from an experimental treatment versus the five‐year survival
rate from a standard regimen, say in the analysis of results from a trial of cancer
treatments, fits the framework of a comparison of two population proportions as seen in
Section 14.9.2. The problem is that, for survival data, studies have staggered entry and
subjects are followed for varying lengths of time; they do not have the same probability
for the event to occur. Therefore, just as the data analysis differs, the process of
sample size determination should be handled differently for trials where the endpoint is
binary than for trials where the endpoint is survival time.

As seen in Chapter 13, the log‐rank test has become commonly used in the
analysis of clinical trials where the event or outcome becomes manifest only after a
prolonged time interval. The method for sample size determination, where the
difference in survival experience of the two groups in a clinical trial is tested using
the log‐rank test, proceeds as follows. Suppose that the two treatments give rise to



survival rates P1 (for the experimental treatment) and P2 (for the standard regimen),
respectively, at some conventional time point, say five years. If the ratio of the
hazards (or risk functions) in the two groups is assumed not to change with time, as
in a proportional hazards model (PHM), and to equal θ:1, then the quantities P1, P2,
and θ are related to each other by

θ = ln P1 / ln P2.

1. The total number of events d from both treatment arms needed to be observed
in the trial is given by

d = (z1−α + z1−β)² [(1 + θ) / (1 − θ)]².

In this formula the quantities z1−α and z1−β are defined as in Section 14.9.1, and
θ is the hazards ratio.

2. Once the total number of events d from both treatment arms has been estimated,
the total number of patients N required in the trial can be calculated from

N = 2d / (2 − P1 − P2)

assuming equal numbers of patients randomized into the two treatment arms,
with N/2 in each group. In this formula the quantities P1 and P2 are five‐year
(or two‐ or three‐year) survival rates.
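The two steps can be combined in a short sketch (function name mine):

```python
import math

def survival_trial_size(p1, p2, z_alpha, z_beta):
    """Total events d and total patients N for a log-rank comparison of
    survival rates p1 and p2 at a fixed time point, assuming
    proportional hazards and a balanced (N/2 per arm) design."""
    theta = math.log(p1) / math.log(p2)                  # hazards ratio
    d = math.ceil((z_alpha + z_beta) ** 2
                  * ((1 + theta) / (1 - theta)) ** 2)    # total events
    n_total = math.ceil(2 * d / (2 - p1 - p2))           # total patients
    return d, n_total

# e.g. 50% vs 70% two-year recurrence-free rates, alpha = 0.05, power 90%
print(survival_trial_size(0.5, 0.7, 1.96, 1.28))   # -> (103, 258)
```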

Example 14.12
Consider the planning of a clinical trial of superficial bladder cancer. With the current
method of treatment (resection of tumor at cystoscopy), the recurrence‐free rate [i.e.,
the survival rate when the event under investigation is recurrence (the tumor coming
back)] is 50% at two years. Investigators hope to increase this to at least 70% using
intravesical chemotherapy (treatment by drug) immediately after surgery at the time
of cystoscopy. This alternative hypothesis is equivalent to a hypothesized hazards
ratio of

θ = ln 0.5 / ln 0.7
  = 1.94.

Suppose also that we decide to preset α = 0.05 (two‐sided) and want the power to be
about 90% (i.e., β = 0.10). In other words, we use

z1−α = 1.96
z1−β = 1.28.



From this information, the required number of events and total sample size are

d = (z1−α + z1−β)² [(1 + θ) / (1 − θ)]²
  = (1.96 + 1.28)² [(1 + 1.94) / (1 − 1.94)]²
  ≈ 103 events

N = 2d / (2 − P1 − P2)
  = 2(103) / (2 − 0.5 − 0.7)
  ≈ 258 patients, 129 in each group.

Example 14.13
In the example above, suppose that the study was underenrolled and was actually
conducted with only 200 patients, 100 randomized to each of the two treatment arms.
Then we have

d = N(2 − P1 − P2) / 2
  = 80 events

(z1−α + z1−β)² = d [(1 − θ) / (1 + θ)]²
               = 8.178

z1−β = √8.178 − 1.96
     = 0.90

corresponding to a power 1 − β of approximately 82%.
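A sketch of this back-calculation (function name mine; `NormalDist` supplies Φ):

```python
from math import log, sqrt
from statistics import NormalDist

def survival_power(n_total, p1, p2, z_alpha=1.96):
    """Approximate power of the log-rank comparison when only n_total
    patients (balanced arms) can be enrolled; z_alpha = 1.96 assumes a
    two-sided test at alpha = 0.05."""
    theta = log(p1) / log(p2)                 # hazards ratio
    d = n_total * (2 - p1 - p2) / 2           # expected total events
    z_beta = sqrt(d) * abs((1 - theta) / (1 + theta)) - z_alpha
    return NormalDist().cdf(z_beta)

# Example 14.13: 200 patients, 100 per arm
print(round(survival_power(200, 0.5, 0.7), 2))   # -> 0.82
```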

14.10  SAMPLE SIZE DETERMINATION FOR CASE–CONTROL
STUDIES

In a typical case–control study, cases of a specific disease are ascertained as they
arise from population‐based registers or lists of hospital admissions, and controls
are sampled either as disease‐free people from the population at risk, or as
hospitalized patients having a diagnosis other than the one under study. Then in
the analysis, we compare the exposure histories of the two groups. In other
words, a typical case–control study fits the framework of a two‐arm randomized
phase III trial. However, the sample size determination is somewhat more
complicated, for three reasons:

1. Instead of searching for a difference of two means or proportions as in the case
of a phase III trial, the alternative hypothesis of a case–control study is postulated
in the form of a relative risk.

2. It must be decided whether to design a study with equal or unequal sample
sizes because in epidemiologic studies, there are typically a small number of
cases and a large number of potential controls to select from.

3. It must be decided whether to design a matched or an unmatched study.

For example, we may want to design a case–control study to detect a relative risk,
due to a binary exposure, of 2.0, and the size of the control group is twice the number
of the cases. Of course, the solution also depends on the endpoint and its measurement
scale; so let us consider the two usual categories one at a time and some simple
configurations.

14.10.1  Unmatched Designs for a Binary Exposure

As mentioned previously, the data analysis is similar to that of a phase III trial
where we want to compare two proportions. However, in the design stage, the
alternative hypothesis is formulated in the form of a relative risk θ. Since we cannot
estimate or investigate relative risk using a case–control design, we would treat the
given number θ as an odds ratio, the ratio of the odds of being exposed for a case
π1/(1 − π1) divided by the odds of being exposed for a control π0 /(1 − π0 ). In other
words, from the information given, consisting of the exposure rate of the control
group π0 and the approximated odds ratio due to exposure θ, we can obtain the two
proportions π0 and π1. Then the process of sample size determination can proceed
similar to that of a phase III trial. For example, if we want to plan for a study with
equal sample size, N/2 cases and N/2 controls, the total sample size needed should
be at least

N = 4(z1−α + z1−β)² π̄(1 − π̄) / (π1 − π0)²

where

π1 = θπ0 / [1 + (θ − 1)π0]

and π̄ = (π0 + π1) / 2.
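The θ-to-π1 conversion and the equal-allocation formula can be sketched as follows (function names mine). Note that carrying full precision, rather than the rounded intermediate values used in the worked examples, can shift the final N by a few subjects:

```python
import math

def case_rate(theta, p0):
    """Exposure rate of cases implied by odds ratio theta and control
    exposure rate p0."""
    return theta * p0 / (1 + (theta - 1) * p0)

def n_case_control_equal(theta, p0, z_alpha, z_beta):
    """Total N (N/2 cases, N/2 controls) for an unmatched design."""
    p1 = case_rate(theta, p0)
    pbar = (p0 + p1) / 2
    return math.ceil(4 * (z_alpha + z_beta) ** 2 * pbar * (1 - pbar)
                     / (p1 - p0) ** 2)

# e.g. a hypothesized relative risk of 3 and 30% control exposure
print(round(case_rate(3, 0.3), 4))   # -> 0.5625
```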

The problem is more complicated if we plan for groups with unequal sample sizes.
First, we have to specify the allocation of sizes:

n1 = w1N
n0 = w0N
w1 + w0 = 1.0



where N is the total sample size needed. For example, if we want the size of the
control group to be three times the number of the cases, w1 = 0.25 and w0 = 0.75.
Then the total sample size needed, N, can be obtained from the formula

√N |π1 − π0| = z1−α √[(1/w1 + 1/w0) π̄(1 − π̄)] + z1−β √[π1(1 − π1)/w1 + π0(1 − π0)/w0]

where π̄ is the weighted average:

π̄ = w1π1 + w0π0.

Example 14.14
Suppose that an investigator is considering designing a case–control study of a
potential association between congenital heart defect and the use of oral contraceptives.
Suppose also that approximately 30% of women of childbearing age use oral
contraceptives within three months of a conception, and suppose that a relative risk
of θ = 3 is hypothesized. We also decide to preset α = 0.05 and want to design a study
with equal sample sizes so that its power to detect the hypothesized relative risk of
θ = 3 is 90% (or β = 0.10).

First, the exposure rate for the cases, the percentage of women of childbearing age
who use oral contraceptives within three months of a conception, is obtained from

π1 = θπ0 / [1 + (θ − 1)π0]
   = (3)(0.3) / [1 + (3 − 1)(0.3)]
   = 0.5625

and from

π̄ = (0.3 + 0.5625) / 2
  = 0.4313

α = 0.05 → z1−α = 1.96 (two‐sided test)
β = 0.10 → z1−β = 1.28

we obtain a required total sample size of

N = 4(z1−α + z1−β)² π̄(1 − π̄) / (π1 − π0)²
  = 4(1.96 + 1.28)²(0.4313)(0.5687) / (0.2625)²
  ≈ 146

or 73 cases and 73 controls are needed.



Example 14.15
Suppose that all specifications are the same as in Example 14.14, but we design a study
in which the size of the control group is four times the number of the cases. Here we have

π̄ = (0.3)(0.8) + (0.5625)(0.2)
  = 0.3525

leading to a required total sample size N satisfying

√N (0.5625 − 0.3) = 1.96 √[(0.3525)(0.6475)(5 + 1.25)]
                    + 1.28 √[(0.3)(0.7)(1.25) + (0.5625)(0.4375)(5)]

N ≈ 222

or 45 cases and 177 controls. It can be seen that the study requires a larger number of
subjects, 222 as compared to 146 subjects in Example 14.14; however, it may be
easier to implement because it requires fewer cases, 45 as compared to 73.
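The unequal-allocation calculation is tedious by hand; a sketch (function name mine):

```python
import math

def n_case_control_unequal(theta, p0, w1, z_alpha, z_beta):
    """Total N for an unmatched case-control study with a fraction w1
    of cases and w0 = 1 - w1 of controls."""
    w0 = 1 - w1
    p1 = theta * p0 / (1 + (theta - 1) * p0)   # exposure rate of cases
    pbar = w1 * p1 + w0 * p0                   # weighted average
    rhs = (z_alpha * math.sqrt((1 / w1 + 1 / w0) * pbar * (1 - pbar))
           + z_beta * math.sqrt(p1 * (1 - p1) / w1 + p0 * (1 - p0) / w0))
    return math.ceil((rhs / abs(p1 - p0)) ** 2)

# Example 14.15: theta = 3, p0 = 0.3, controls four times the cases (w1 = 0.2)
print(n_case_control_unequal(3, 0.3, 0.2, 1.96, 1.28))   # -> 222
```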

14.10.2  Matched Designs for a Binary Exposure

The design of a matched case–control study of a binary exposure is also specified by
the very same two parameters: the exposure rate of the control group π0 and the
relative risk associated with the exposure θ. The problem, however, becomes more
complicated because the analysis of a matched case–control study of a binary
exposure uses only discordant pairs, pairs of subjects where the exposure histories of
the case and his or her matched control are different. We have to go through two
steps, first to calculate the number of discordant pairs m, then the number of pairs M;
the total number of subjects is N = 2M, M cases and M controls.

The exposure rate of the cases is first calculated using the same formula as before:

π1 = θπ0 / [1 + (θ − 1)π0].

Then, given specified levels of type I and type II errors α and β, the number of
discordant pairs m required to detect a relative risk θ, treated as an approximate odds
ratio, is obtained from

m = [z1−α/2 + z1−β √(P(1 − P))]² / (P − 0.5)²

where

P = θ / (1 + θ).

Finally, the total number of pairs M is given by

M = m / [π0(1 − π1) + π1(1 − π0)].



Example 14.16
Suppose that an investigator is considering designing a case–control study of a
potential association between endometrial cancer and exposure to estrogen (whether
ever taken). Suppose also that the exposure rate of controls is estimated to be about
40% and that a relative risk of θ = 4 is hypothesized. We also decide to preset α = 0.05
and to design a study large enough so that its power regarding the hypothesized
relative risk above is 90% (or β = 0.10). We also plan a 1:1 matched design; matching
criteria are age, race, and county of residence.

First, we obtain the exposure rate of the cases and the z values using the specified
levels of type I and type II errors:

π1 = θπ0 / [1 + (θ − 1)π0]
   = (4)(0.4) / [1 + (3)(0.4)]
   = 0.7273

α = 0.05 → z1−α = 1.96 (two‐sided test)
β = 0.10 → z1−β = 1.28.

The number of discordant pairs is given by

P = θ / (1 + θ)
  = 0.80

m = [z1−α/2 + z1−β √(P(1 − P))]² / (P − 0.5)²
  = [1.96/2 + 1.28 √((0.80)(0.20))]² / (0.30)²
  ≈ 25

and the total number of pairs is

M = m / [π0(1 − π1) + π1(1 − π0)]
  = 25 / [(0.4)(0.2727) + (0.7273)(0.6)]
  ≈ 46

that is, 46 pairs: 46 cases and 46 matching controls.
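Both steps of the matched design can be sketched as follows (function name mine):

```python
import math

def matched_pairs(theta, p0, z_alpha, z_beta):
    """Number of discordant pairs m and total pairs M for a 1:1 matched
    case-control design with control exposure rate p0 and hypothesized
    odds ratio theta."""
    p1 = theta * p0 / (1 + (theta - 1) * p0)   # exposure rate of cases
    big_p = theta / (1 + theta)
    m = math.ceil((z_alpha / 2 + z_beta * math.sqrt(big_p * (1 - big_p))) ** 2
                  / (big_p - 0.5) ** 2)
    big_m = math.ceil(m / (p0 * (1 - p1) + p1 * (1 - p0)))
    return m, big_m

# e.g. theta = 4, p0 = 0.4, alpha = 0.05 two-sided, power 90%
print(matched_pairs(4, 0.4, 1.96, 1.28))   # -> (25, 46)
```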



14.10.3  Unmatched Designs for a Continuous Exposure

When the risk factor under investigation in a case–control study is measured on a
continuous scale, the problem is similar to that of a phase III trial where we want to
compare two population means, as seen in Section 14.9.1. Recall that in a comparison
of two population means, μ1 versus μ2, the required minimum total sample size is
calculated from

N = 4(z1−α + z1−β)² σ² / d²

assuming that we conduct a balanced study with each group consisting of n = N/2
subjects. Besides the level of significance α and the desired power (1 − β ), this
required total sample size is affected by the variance σ2 of the population and the
quantity

d = μ1 − μ0

which is the magnitude of the difference between μ1, the mean for the cases, and μ0,
the mean for the controls, that is deemed to be important. To put it in a different way,
besides the level of significance and the desired power (1 − β), the required total
sample size depends on the ratio d/σ. You will see this similarity in the design of a
case–control study with a continuous risk factor. When the risk factor under investi-
gation in a case–control study is measured on a continuous scale, the data are ana-
lyzed using the method of logistic regression (Chapter 10). However, as pointed out
in Section 10.3, when the cases and controls are assumed to have the same variance
σ2, the logistic regression model can be written as

logit = ln [p(x) / (1 − p(x))]
      = constant + [(μ1 − μ0)/σ²] x.

Under this model, the log of the odds ratio associated with a 1‐unit higher value
of the risk factor is

(μ1 − μ0) / σ².

Therefore, the log of the odds ratio associated with a 1‐standard deviation higher
value of the risk factor is

(μ1 − μ0) / σ

which is the same as the ratio d/σ above.



In the design of case–control studies with a continuous risk factor, the key parameter
is θ, the odds ratio associated with a 1‐standard‐deviation higher value of the
covariate; it enters the sample size formulas through ln θ. Consider a level of
significance α and statistical power 1 − β.

1. If we plan a balanced study with each group consisting of n = N/2 subjects, the
total sample size N is given by

N = 4(z1−α + z1−β)² / (ln θ)².


2. If we allocate different sizes to the cases and the controls,

n1 = w1N
n0 = w0N
w1 + w0 = 1.0

the total sample size N is given by

N = [1/(ln θ)²] (1/w1 + 1/w0)(z1−α + z1−β)².

For example, if we want the size of the control group to be three times the
number of the cases, w1 = 0.25 and w0 = 0.75.
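Both allocation schemes can be folded into one sketch of the formulas as printed above (function name mine; w1 = 0.5 recovers the balanced design, since 1/0.5 + 1/0.5 = 4):

```python
import math

def n_continuous_exposure(theta_per_sd, z_alpha, z_beta, w1=0.5):
    """Total N when the key parameter is the odds ratio per one standard
    deviation of a continuous exposure; w1 is the fraction of cases."""
    w0 = 1 - w1
    return math.ceil((1 / w1 + 1 / w0) * (z_alpha + z_beta) ** 2
                     / math.log(theta_per_sd) ** 2)

# e.g. a balanced design to detect an odds ratio of 1.5 per SD
# with alpha = 0.05 (two-sided) and 90% power
print(n_continuous_exposure(1.5, 1.96, 1.28))   # -> 256
```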

Example 14.17
Suppose that an investigator is considering designing a case–control study of a
potential association between coronary heart disease and serum cholesterol level.
Suppose further that it is desirable to detect an odds ratio of θ = 2.0 for a person with
a cholesterol level 1 standard deviation above the mean for his or her age group using
a two‐sided test with a significance level of 5% and a power of 90%. From

α = 0.05 → z1−α = 1.96 (two‐sided test)
β = 0.10 → z1−β = 1.28

the required total sample size is

N = 4(z1−α + z1−β)² / (ln θ)²
  = 4(1.96 + 1.28)² / (ln 2)²
  ≈ 62

if we plan a balanced study with each group consisting of 31 subjects.



Example 14.18
Suppose that all specifications are the same as in Example 14.17 but we design a
study in which the size of the control group is three times the number of cases. Here
we have

N = [1/(ln θ)²] (1/w1 + 1/w0)(z1−α + z1−β)²
  = [1/(ln 2)²] (1/0.25 + 1/0.75)(1.96 + 1.28)²
  ≈ 84

or 21 cases and 63 controls.

EXERCISES

14.1 Some opponents of the randomized, double‐blinded clinical trial, especially
in the field of psychiatry, have argued that a necessary or at least important
component in the efficacy of a psychoactive drug, for example a tranquilizer,
is the confidence that the physician and the patient have in this efficacy. This
factor is lost if one of the drugs in the trial is a placebo and the active drug and
placebo cannot be identified. Hence an active drug that would be efficacious if
identified appears no better than placebo, and therefore is lost to medical
practice. Do you agree with this position? Explain.

14.2 Suppose that we consider conducting a phase I trial using the standard design
with only three prespecified doses. Suppose further that the toxicity rates of
these three doses are 10%, 20%, and 30%, respectively. What is the probability
that the middle dose will be selected as the MTD? (Hint: Use the binomial
probability calculations of Chapter 3.)

14.3 We can refer to the standard design as a three‐and‐three design because at
each new dose, it enrolls a cohort of three patients with the option of enrolling
an additional three patients evaluated at the same dose. Describe the
dose‐escalation plan for a three‐and‐two design and describe its possible effects on
the toxicity rate of the resulting maximum tolerated dose (MTD).

14.4 Repeat Exercise 14.2 but assuming that the toxicity rates of the three doses
used are 40, 50, and 60%, respectively.

14.5 Refer to Example 14.9, where we found that 234 subjects are needed for a
predetermined power of 90% and that the power would be 81% if the study
enrolls only 180 subjects. What would be the power if the study was
substantially underenrolled and conducted with only 120 patients?

14.6 Refer to Example 14.10, where we found that 378 subjects are needed for a
predetermined power of 90%. What would be the power if the study was
actually conducted with only 300 patients, 150 in each group?



14.7 Refer to Example 14.14, where we found that 73 cases and 73 controls
are needed for a predetermined power of 90%. What would be the power
if the study was actually conducted with only 100 subjects, 50 in each
group?

14.8 Refer to Example 14.18, where we found that 21 cases and 63 controls are
needed for a predetermined power of 90%. What would be the power if the
study was actually conducted with only 15 cases but 105 controls?

14.9 Refer to Example 14.14. Find the total sample size needed if we retain all
the specifications except that the hypothesized relative risk is increased
to 3.0.

14.10 The status of the axillary lymph node basin is the most powerful predictor
of long‐term survival in patients with breast cancer. The pathologic analysis
of the axillary nodes also provides essential information used to determine
the administration of adjuvant therapies. Until recently, an axillary lymph
node dissection (ALND) was the standard surgical procedure to identify
nodal metastases. However, ALND is associated with numerous side effects,
including arm numbness and pain, infection, and lymphedema. A new
procedure, sentinel lymph node (SLN) biopsy, has been proposed as a
substitute, and it has been reported to have a successful identification rate of
about 90%. Suppose that we want to conduct a study to estimate and con-
firm this rate to identify nodal metastases among breast cancer patients
because previous estimates were all based on rather small samples. How
many patients are needed to confirm this 90% success rate with a margin of
error of ±5%? Does the answer change if we do not trust the 90% figure and
calculate a conservative sample size estimate?

14.11 Metastatic melanoma and renal cell carcinoma are incurable malignancies
with a median survival time of less than a year. Although these malignancies
are refractory to most chemotherapy drugs, their growth may be regulated
by immune mechanisms and there are various strategies for development
and administration of tumor vaccines. An investigator considers conducting
a phase II trial for such a vaccine for patients with stage IV melanoma. How
many patients are needed to estimate the response rate with a margin of
error of ±10%?

14.12 Suppose that we consider conducting a study to evaluate the efficacy of
prolonged infusional paclitaxel (96‐hour continuous infusion) in patients
with recurrent or metastatic squamous carcinoma of the head and neck. How
many patients are needed to estimate the response rate with a margin of
error of ±15%?

14.13 Normal red blood cells in humans are shaped like biconcave disks.
Occasionally, hemoglobin, a protein that readily combines with oxygen, is
formed imperfectly in the cell. One type of imperfect hemoglobin causes the
cells to have a caved‐in, or sickle-like appearance. These sickle cells are less
efficient carriers of oxygen than normal cells and result in an oxygen defi-
ciency called sickle cell anemia. This condition has a significant prevalence



among blacks. Suppose that a study is to be conducted to estimate the
prevalence among blacks in a certain large city.

(a) How large a sample should be chosen to estimate this proportion to
within 1 percentage point with 99% confidence? With 95% confidence?
(Use a conservative estimate because no prior estimate of the preva-
lence in this city is assumed available.)

(b) A similar study was recently conducted in another state. Of the 13 573
blacks sampled, 1085 were found to have sickle cell anemia. Using this
information, resolve part (a).

14.14 A researcher wants to estimate the average weight loss obtained by patients
at a residential weight‐loss clinic during the first week of a controlled diet
and exercise regimen. How large a sample is needed to estimate this mean
to within 0.5 lb with 95% confidence? Assume that past data indicate a
standard deviation of about 1 lb.
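For a continuous endpoint, the analogous formula is n = (zσ/d)². A minimal sketch under the exercise's assumptions (σ = 1 lb from past data, margin d = 0.5 lb, z = 1.96 for 95% confidence):

```python
from math import ceil

# n = (z * sigma / d)^2, rounded up
sigma, d, z = 1.0, 0.5, 1.96   # z = 1.96 for 95% confidence
n = ceil((z * sigma / d) ** 2)
print(n)
```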

14.15 Suppose that a study is designed to select the better of two treatments.
The endpoint is measured on a continuous scale and a treatment is said to
be better if the true mean is 10 units larger than the mean associated with the
other treatment. Suppose also that the two groups have the same variance,
which is estimated at about σ2 = 400. Find the total sample size needed if
we want 99% certainty of making the right selection.

14.16 Suppose that in planning for a phase II trial, an investigator believes that the
incidence of severe side effects is about 5% and the trial has to be stopped if
the incidence of severe side effects exceeds 20%. Preset the level of signifi-
cance at 0.05 and design a stopping rule that has a power of 90%.

14.17 A study will be conducted to determine if some literature on smoking will
improve patient comprehension. All subjects will be administered a pretest,
then randomized into two groups: without or with a booklet. After a week,
all subjects will be administered a second test. The data (differences between
prescore and postscore) for a pilot study without a booklet yielded (score is
on a scale from 0 to 5 points)

n = 44; x̄ = 0.25, s = 2.28.

How large should the total sample size be if we decide to preset α = 0.05 and
that it is important to detect a mean difference of 1.0 with a power of 90%?
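A common normal‐approximation formula for a two‐sided two‐sample comparison of means is n = 2σ²(z₁₋α/2 + z₁₋β)²/Δ² per group; exact t‐based tables may give slightly larger numbers. A sketch using the pilot standard deviation:

```python
from math import ceil

def n_per_group(sigma, delta, z_alpha, z_beta):
    # n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2, per group
    return ceil(2.0 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Pilot s = 2.28; detect a mean difference of 1.0
# alpha = 0.05 two-sided -> z_alpha = 1.96; power 90% -> z_beta = 1.282
n = n_per_group(2.28, 1.0, 1.96, 1.282)
total = 2 * n
print(n, total)
```

The same function with sigma and delta taken from the pilot data of Exercises 14.18 and 14.19 addresses those problems as well.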

14.18 A study will be conducted to investigate a claim that oat bran will reduce serum
cholesterol in men with high cholesterol levels. Subjects will be randomized to
diets that include either oat bran or cornflakes cereals. After two weeks, LDL
cholesterol level (in mmol/L) will be measured and the two groups will be
compared via a two‐sample t test. A pilot study with cornflakes yields

n = 14; x̄ = 4.44, s = 0.97.

How large should a total sample size be if we decide to preset α = 0.01
and that it is important to detect an LDL cholesterol level reduction of 1.0
mmol/L with a power of 95%?

14.19 Depression is one of the most commonly diagnosed conditions among hospi-
talized patients in mental institutions. The primary measure of depression is
the CES‐D scale developed by the Center for Epidemiologic Studies, in which
each person is scored on a scale of 0 to 60. The following results were found
for a group of randomly selected women: n = 200, x̄ = 10.4, and s = 10.3.
A study is now considered to investigate the effect of a new drug aimed at
lowering anxiety among hospitalized patients in similar mental institutions.
Subjects would be randomized to receive either the new drug or placebo, then
averages of CES‐D scores will be compared using the two‐sided two‐sample
t test at the 5% level. How large should the total sample size be if it is impor-
tant to detect a CES‐D score reduction of 3.0 with a power of 90%?

14.20 A study will be conducted to compare the proportions of unplanned
pregnancy between condom users and pill users. Preliminary data show that
these proportions are approximately 10 and 5%, respectively. How large
should the total sample size be so that it would be able to detect such a
difference of 5% with a power of 90% using a statistical test at the two‐sided
level of significance of 0.01?
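One common normal approximation for comparing two proportions uses n = (z₁₋α/2 + z₁₋β)²[p₁(1 − p₁) + p₂(1 − p₂)]/(p₁ − p₂)² per group; texts differ slightly, and some pool the proportions under the null hypothesis, so treat the exact numbers as a sketch:

```python
from math import ceil

def n_per_group(p1, p2, z_alpha, z_beta):
    # n = (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2, per group
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# alpha = 0.01 two-sided -> z_alpha = 2.576; power 90% -> z_beta = 1.282
n = n_per_group(0.10, 0.05, 2.576, 1.282)
total = 2 * n
print(n, total)
```

The same call with p₁ = 0.17 and p₂ = 0.07 addresses Exercise 14.21.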

14.21 Suppose that we want to compare the use of medical care by black and white
teenagers. The aim is to compare the proportions of teenagers without
physical checkups within the last two years. A recent survey shows that
these rates for blacks and whites are 17 and 7%, respectively. How large
should the total sample be so that it would be able to detect such a 10%
difference with a power of 90% using a statistical test at the two‐sided level
of significance of 0.01?

14.22 Among ovarian cancer patients treated with cisplatin, it is anticipated that
20% will experience either partial or complete response. If adding paclitaxel
to this regimen can increase the response by 15% without undue toxicity,
that would be considered clinically significant. Calculate the total sample
size needed for a randomized trial that would have an 80% chance of detect-
ing this magnitude of treatment difference while the probability of type I
error for a two‐sided test is preset at 0.05.

14.23 Metastatic breast cancer is a leading cause of cancer‐related mortality and
there has been no major change in the mortality rate over the past few
decades. Therapeutic options are available with active drugs such as pacli-
taxel. However, the promising response rate is also accompanied by a high
incidence of toxicities, especially neurotoxicity. An investigator considers
testing a new agent that may provide significant prevention, reduction, or
mitigation of drug‐related toxicity. This new agent is to be tested against a
placebo in a double‐blind randomized trial among patients with metastatic
breast cancer who receive weekly paclitaxel. The rate of neurotoxicity over
the period of the trial is estimated to be about 40% in the placebo group, and
the hypothesis is that this new agent lowers the toxicity rate by one‐half, to
20%. Find the total sample size needed using a two‐sided level of signifi-
cance of 0.05 and the assumption that the hypothesis would be detectable
with a power of 80%.

14.24 In the study on metastatic breast cancer in Exercise 14.23, the investigator
also focuses on tumor response rate, hoping to show that this rate is
comparable in the two treatment groups. The hypothesis is that addition of
the new agent to a weekly paclitaxel regimen would reduce the incidence of
neurotoxicity without a compromise in its efficacy. At the present time, it is
estimated that the tumor response rate for the placebo group, without the
new agent added, is about 70%. Assuming the same response rate for the
treated patients, find the margin of error of its estimate using the sample size
obtained from Exercise 14.23.
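Exercises 14.23 and 14.24 chain together: first size the trial with a two‐proportion formula, then plug the resulting per‐group size into the usual margin‐of‐error expression z√(p(1 − p)/n). The 40%, 20%, and 70% rates and z values come from the exercises; the specific sample‐size formula variant is an assumption, so a sketch:

```python
from math import ceil, sqrt

# Exercise 14.23 (assumed formula variant):
# n = (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2 per group
p1, p2 = 0.40, 0.20                  # neurotoxicity rates: placebo vs new agent
z_alpha, z_beta = 1.96, 0.84         # alpha = 0.05 two-sided, power 80%
variance = p1 * (1 - p1) + p2 * (1 - p2)
n = ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Exercise 14.24: margin of error for a 70% tumor response rate
# estimated from n patients per arm
p = 0.70
margin = z_alpha * sqrt(p * (1 - p) / n)
print(n, round(margin, 3))
```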

14.25 The primary objective of a phase III trial is to compare disease‐free survival
among women with high‐risk operable breast cancer following surgical
resection of all known disease and randomized to receive as adjuvant
therapy either CEF or a new therapy. CEF has been established success-
fully as a standard adjuvant regimen in Canada; the aim here is to deter-
mine whether the addition of a taxane (to form the new regimen) can
improve survival outcome over CEF alone. The five‐year disease‐free
survival rate for women receiving CEF alone as adjuvant therapy was esti-
mated at about 60%, and it is hypothesized that the newly formed regimen
would improve that rate from 60 to 70%. Find the total sample size needed
using a two‐sided test at the 0.05 level of significance and a statistical
power of 80%.

14.26 Ovarian cancer is the fourth most common cause of cancer deaths in women.
Approximately 75% of patients present with an advanced stage, and because
of this, only a minority of patients will have surgically curable localized
disease, and systemic chemotherapy has become the primary treatment
modality. A randomized phase III trial is considered to compare paclitaxel–
carboplatin versus docetaxel–carboplatin as first‐line chemotherapy in stage
IV epithelial ovarian cancer. Suppose that we plan to use a statistical test at
the two‐sided 5% level of significance and the study is designed to have
80% power to detect the alternative hypothesis that the two‐year survival
rates in the paclitaxel and docetaxel arms are 40 and 50%, respectively. Find
the total sample size required.

14.27 A phase III double‐blind randomized trial is planned to compare a new drug
versus placebo as adjuvant therapy for the treatment of women with
metastatic ovarian cancer who have had complete clinical response to their
primary treatment protocol, consisting of surgical debulking and platinum‐
based chemotherapy. Find the total sample size needed using a two‐sided
test at the 0.05 level of significance and a statistical power of 80% to detect
the alternative hypothesis that the two‐year relapse rates in the placebo and
the new drug arms are 60 and 40%, respectively.

14.28 Suppose that an investigator considers conducting a case–control study to
evaluate the relationship between invasive epithelial ovarian cancer and the
history of infertility (yes/no). It is estimated that the proportion of controls
with a history of infertility is about 10% and the investigator wishes to
detect a relative risk of 2.0 with a power of 80% using a two‐sided level of
significance of 0.05.

(a) Find the total sample size needed if the two groups, cases and controls,
are designed to have the same size.

(b) Find the total sample size needed if the investigator wants to have the
number of controls four times the number of cases.

(c) Find the number of cases needed if the design is 1:1 matched; matching
criteria are various menstrual characteristics, exogenous estrogen use,
and prior pelvic surgeries.
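For the unequal‐allocation parts of case–control sizing, one common variant allows k controls per case: n_cases = (z₁₋α/2 + z₁₋β)²[p₁q₁ + p₀q₀/k]/(p₁ − p₀)², where p₀ is the control exposure rate and p₁ follows from the odds ratio. Treating the stated relative risk of 2.0 as an odds ratio is an assumption here, and the text's own formulas may differ slightly:

```python
from math import ceil

def n_cases(p0, odds_ratio, k, z_alpha, z_beta):
    # Exposure rate among cases implied by the odds ratio:
    # p1 = OR * p0 / (1 + p0 * (OR - 1))
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    # n_cases = (z_alpha + z_beta)^2 * [p1*q1 + p0*q0/k] / (p1 - p0)^2,
    # with k controls per case
    variance = p1 * (1 - p1) + p0 * (1 - p0) / k
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)

# p0 = 0.10 (controls with a history of infertility);
# alpha = 0.05 two-sided -> z_alpha = 1.96; power 80% -> z_beta = 0.84
cases_1to1 = n_cases(0.10, 2.0, 1, 1.96, 0.84)  # part (a): 1 control per case
cases_4to1 = n_cases(0.10, 2.0, 4, 1.96, 0.84)  # part (b): 4 controls per case
print(cases_1to1, 2 * cases_1to1)   # cases and total for part (a)
print(cases_4to1, 5 * cases_4to1)   # cases and total for part (b)
```

Note how using four controls per case lowers the number of cases required at the cost of a larger total sample.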

14.29 Suppose that an investigator considers conducting a case–control study to
evaluate the relationship between cesarean section delivery (C‐section) and
the use of electronic fetal monitoring (EFM, also called ultrasound) during
labor. It is estimated that the proportion of controls (vaginal deliveries) who
were exposed to EFM is about 40% and the investigator wishes to detect a
relative risk of 2.0 with a power of 90% using a two‐sided level of signifi-
cance of 0.05.

(a) Find the total sample size needed if the two groups, cases and controls,
are designed to have the same size.

(b) Find the total sample size needed if the investigator wants to have the
number of controls three times the number of cases.

(c) Find the number of cases needed if the design is 1:1 matched; matching
criteria are age, race, socioeconomic condition, education, and type of
health insurance.

14.30 When a patient is diagnosed as having cancer of the prostate, an important
question in deciding on treatment strategy for the patient is whether or not
the cancer has spread to the neighboring lymph nodes. The question is so
critical in prognosis and treatment that it is customary to operate on the
patient (i.e., perform a laparotomy) for the sole purpose of examining the
nodes and removing tissue samples to examine under the microscope for
evidence of cancer. However, certain variables that can be measured without
surgery may be predictive of nodal involvement; one of these is the level of
serum acid phosphatase. Suppose that an investigator considers conducting
a case–control study to evaluate this possible relationship between nodal
involvement (cases) and level of serum acid phosphatase. Suppose further
that it is desirable to detect an odds ratio of θ = 1.5 for a person with a serum
acid phosphatase level 1 standard deviation above the mean for his age
group using a two‐sided test with a significance level of 5% and a power
of 80%. Find the total sample size needed.

