NHANES Dietary Web Tutorial_397 pages

Published by smlneyman, 2019-01-16 01:35:47

12/19/2018 NHANES Dietary Web Tutorial: Estimate Ratios: Task 1b


Task 1b: How to Estimate a Ratio of Means using SAS

This section describes how to use SAS to estimate a ratio of means for all adults and for males and females separately.
In the example, the weighted mean of calcium from milk is divided by the weighted mean of total calcium within each
population group, which is equivalent to dividing the corresponding weighted sums.

Unlike SUDAAN, SAS does not require the data to be sorted before the survey procedure is run; sorting is needed only
when a BY statement is used, as in the analysis by gender in the sample code. Properly weighted estimated means and
standard errors, using complex survey design factors (e.g., strata and PSU), can be obtained with the single SAS
procedure PROC SURVEYMEANS.

Use SAS to Estimate How Much of the Dietary Calcium Consumed by Adults, Ages 20 Years and Older, Comes from Milk

Sample Code

*-------------------------------------------------------------------------;
* Use the PROC SURVEYMEANS procedure in SAS to compute a properly weighted;
* estimated ratio of means for all persons ages 20+ and by gender.        ;
*-------------------------------------------------------------------------;

* Run analysis for overall subpopulation of interest;

proc surveymeans data=DTTOT;
    where usedat=1;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTDRD1;
    var D1MCALC DR1TCALC;
    ratio D1MCALC / DR1TCALC;
    title " Ratio of Means -- All Persons ages 20+";
run;

*-------------------------------------------------------------------------;
* Use the PROC SORT procedure to sort the data by gender.                 ;
*-------------------------------------------------------------------------;

proc sort data=DTTOT;
    by RIAGENDR;
run;

* Run analysis by gender within subpopulation of interest;

proc surveymeans data=DTTOT;
    where usedat=1;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTDRD1;
    var D1MCALC DR1TCALC;
    ratio D1MCALC / DR1TCALC;
    by RIAGENDR;
    title " Ratios of Means -- by Gender";
run;

Output of Program

Ratio of Means -- All Persons ages 20+

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          4448
Sum of Weights             205284669

Statistics

                                              Std Error     Lower 95%     Upper 95%
Variable  Label              N         Mean     of Mean   CL for Mean   CL for Mean
-----------------------------------------------------------------------------------
d1mcalc   Calcium (mg)    4448   101.162167    7.647887     84.861081    117.463253
DR1TCALC  Calcium (mg)    4448   880.130855   16.722099    844.488545    915.773166
-----------------------------------------------------------------------------------

Ratio Analysis

Numerator  Denominator       N     Ratio   Std Err   95% Confidence Interval
--------------------------------------------------------------------------
d1mcalc    DR1TCALC       4448  0.114940  0.006826    0.100390    0.129490
--------------------------------------------------------------------------

Ratios of Means -- by Gender

Gender - Adjudicated=male

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          2135
Sum of Weights            98664010.2

Statistics

                                              Std Error     Lower 95%     Upper 95%
Variable  Label              N         Mean     of Mean   CL for Mean   CL for Mean
-----------------------------------------------------------------------------------
d1mcalc   Calcium (mg)    2135   122.142347    8.719800    103.556533    140.728162
DR1TCALC  Calcium (mg)    2135   998.359501   21.809584    951.873474   1044.845528
-----------------------------------------------------------------------------------

Ratio Analysis

Numerator  Denominator       N     Ratio   Std Err   95% Confidence Interval
--------------------------------------------------------------------------
d1mcalc    DR1TCALC       2135  0.122343  0.007148    0.107107    0.137579
--------------------------------------------------------------------------

Gender - Adjudicated=female

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          2313
Sum of Weights             106620659

Statistics

                                              Std Error     Lower 95%     Upper 95%
Variable  Label              N         Mean     of Mean   CL for Mean   CL for Mean
-----------------------------------------------------------------------------------
d1mcalc   Calcium (mg)    2313    81.747649    9.880726     60.687380    102.807918
DR1TCALC  Calcium (mg)    2313   770.725113   15.292108    738.130756    803.319469
-----------------------------------------------------------------------------------

Ratio Analysis

Numerator  Denominator       N     Ratio   Std Err   95% Confidence Interval
--------------------------------------------------------------------------
d1mcalc    DR1TCALC       2313  0.106066  0.011329    0.081919    0.130213
--------------------------------------------------------------------------

Highlights from the output include:

The ratio of mean calcium from milk to total calcium, for all persons ages 20 and older, is 0.11 (with a standard error
of 0.01). The corresponding values for males and females, respectively, are 0.12 (0.01) and 0.11 (0.01).
Note that, even though this analysis did not incorporate a domain statement, the results are exactly equal to those
obtained using SUDAAN and its SUBPOPN statement because the subgroup of interest was one for which the
weighted NHANES sample is representative.
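
As a quick arithmetic cross-check of the output above (sketched in Python rather than SAS, purely for illustration), each reported ratio of means should equal the weighted mean of D1MCALC divided by the weighted mean of DR1TCALC. The means and ratios below are copied from the SURVEYMEANS output; the dictionary and variable names are illustrative assumptions. The standard errors are not reproduced here, because they depend on the design-based variance computation that SAS performs.

```python
# Weighted means copied from the PROC SURVEYMEANS output above:
# group -> (mean calcium from milk, mean total calcium), both in mg.
reported_means = {
    "all": (101.162167, 880.130855),
    "male": (122.142347, 998.359501),
    "female": (81.747649, 770.725113),
}

# Ratios copied from the "Ratio Analysis" tables in the same output.
reported_ratios = {"all": 0.114940, "male": 0.122343, "female": 0.106066}

# A ratio of means is the numerator mean divided by the denominator mean.
computed_ratios = {
    group: milk_mean / total_mean
    for group, (milk_mean, total_mean) in reported_means.items()
}

for group, ratio in computed_ratios.items():
    print(f"{group:>6}: computed {ratio:.6f}, reported {reported_ratios[group]:.6f}")
```

The computed and reported values agree to the six decimal places printed by SAS.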


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/Ratios/Task1b.htm 3/3

12/19/2018 NHANES Dietary Web Tutorial: Estimate Ratios: Task 1d


Task 1d: How to Estimate a Mean Ratio using SAS

This section describes how to use SAS to estimate a mean ratio for all adults and for males and females separately. In
the example, each person's calcium from milk is divided by that person's total calcium, and these individual ratios
(DAY1RATIO) are then averaged within each population group.

Use SAS to Estimate Mean Contribution of Milk to Calcium Intake for All Adults, Males, and Females, Ages 20
Years and Older

Sample Code

*-------------------------------------------------------------------------;
* Use the PROC SURVEYMEANS procedure in SAS to compute a properly weighted;
* estimated mean ratio for all persons ages 20+ and by gender.            ;
*-------------------------------------------------------------------------;

* Run analysis for overall subpopulation of interest;

proc surveymeans data=DTTOT;
    where usedat=1;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTDRD1;
    var DAY1RATIO;
    title " Mean Ratio -- All Persons ages 20+";
run;

*-------------------------------------------------------------------------;
* Use the PROC SORT procedure to sort the data by gender.                 ;
*-------------------------------------------------------------------------;

proc sort data=DTTOT;
    by RIAGENDR;
run;

* Run analysis by gender within subpopulation of interest;

proc surveymeans data=DTTOT;
    where usedat=1;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTDRD1;
    var DAY1RATIO;
    by RIAGENDR;
    title " Mean Ratios -- by Gender";
run;

Output of Program

Mean Ratio -- All Persons ages 20+

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          4448
Sum of Weights             205284669

Statistics

                                Std Error     Lower 95%     Upper 95%
Variable        N        Mean     of Mean   CL for Mean   CL for Mean
---------------------------------------------------------------------
day1ratio    4447    0.076355    0.004692      0.066355      0.086356
---------------------------------------------------------------------

Mean Ratios -- by Gender

Gender - Adjudicated=male

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          2135
Sum of Weights            98664010.2

Statistics

                                Std Error     Lower 95%     Upper 95%
Variable        N        Mean     of Mean   CL for Mean   CL for Mean
---------------------------------------------------------------------
day1ratio    2134    0.083405    0.004729      0.073326      0.093484
---------------------------------------------------------------------

Gender - Adjudicated=female

The SURVEYMEANS Procedure

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          2313
Sum of Weights             106620659

Statistics

                                Std Error     Lower 95%     Upper 95%
Variable        N        Mean     of Mean   CL for Mean   CL for Mean
---------------------------------------------------------------------
day1ratio    2313    0.069845    0.006857      0.055231      0.084460
---------------------------------------------------------------------

Highlights from the output include:

The mean ratio for all persons ages 20 and older is 0.076 (with a standard error of 0.005). For males, it is 0.083
(0.005), and for females, 0.070 (0.007).
Note that these results differ from the ratio of means (0.11 overall, 0.12 for males, and 0.11 for females). For most
purposes, the ratio of means is preferred over the mean 1-day ratio.
Also note that this analysis includes only 4,447 persons rather than the 4,448 shown in the ratio of means analysis,
because one man reported no calcium intake and his ratio (0/0) is therefore undefined.
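
The difference between the two statistics, and the reason one person drops out of the mean ratio, can be sketched with a few made-up records. The Python below is only an illustration under stated assumptions: the four records and their weights are hypothetical (not NHANES values), and the function names are invented for this sketch. It computes point estimates only, not design-based standard errors.

```python
def ratio_of_means(numerators, denominators, weights):
    """Weighted sum of numerators divided by weighted sum of denominators."""
    weighted_num = sum(w * n for w, n in zip(weights, numerators))
    weighted_den = sum(w * d for w, d in zip(weights, denominators))
    return weighted_num / weighted_den


def mean_ratio(numerators, denominators, weights):
    """Weighted mean of per-person ratios.

    Persons with a zero denominator are skipped because their ratio is
    undefined. Returns (mean ratio, number of persons used).
    """
    usable = [
        (w, n / d)
        for w, n, d in zip(weights, numerators, denominators)
        if d != 0
    ]
    total_weight = sum(w for w, _ in usable)
    return sum(w * r for w, r in usable) / total_weight, len(usable)


# Hypothetical data for four people (not NHANES values).
milk_calcium = [0.0, 300.0, 50.0, 0.0]         # mg of calcium from milk
total_calcium = [500.0, 1000.0, 1000.0, 0.0]   # mg of calcium from all foods
sample_weight = [1.0, 1.0, 2.0, 1.0]

rom = ratio_of_means(milk_calcium, total_calcium, sample_weight)
mr, n_used = mean_ratio(milk_calcium, total_calcium, sample_weight)
print(f"ratio of means = {rom:.4f} (all 4 persons contribute)")
print(f"mean ratio     = {mr:.4f} ({n_used} persons used)")
```

In this toy data the fourth person contributes to the ratio of means (adding zero to both sums) but is excluded from the mean ratio, which mirrors the 4,448 versus 4,447 counts in the output.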


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/Ratios/Task1d.htm 2/2

12/19/2018 NHANES Dietary Web Tutorial: Estimate Ratios: Identify Important Food Group Sources of Nutrients


Task 2: Key Concepts about Identifying Important Food Group Sources of
Nutrients

There are two different ways to consider food sources of nutrients—as “important” vs. “rich” sources. Rich sources are
those foods with the greatest concentration of a nutrient; important sources are those that contribute the most to a
population’s dietary intake. For example, sardines are a rich source of calcium, but they are not a very important source in
the US diet, because they are consumed relatively infrequently. Fluid milk, on the other hand, is both a rich and important
source of calcium because it contains high levels of calcium (rich) and it is consumed frequently in the population
(important). A food composition table or database can provide information about rich sources of nutrients, whereas
population intake as well as food composition data are needed to identify important sources. This task demonstrates the
latter.
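
The distinction can be made concrete with a small sketch. The Python below uses hypothetical numbers chosen only to illustrate the idea (they are not taken from a food composition table or from NHANES), and the food list and variable names are assumptions of this example.

```python
# Hypothetical values for illustration only.
# name: (mg calcium per 100 g, mean grams consumed per person per day)
foods = {
    "sardines": (380.0, 0.5),
    "fluid milk": (120.0, 150.0),
    "white bread": (140.0, 40.0),
}

# A "rich" source is ranked by concentration alone.
richest = max(foods, key=lambda name: foods[name][0])

# An "important" source is ranked by its contribution to population intake:
# concentration multiplied by the amount consumed.
contribution_mg = {
    name: mg_per_100g * grams / 100.0
    for name, (mg_per_100g, grams) in foods.items()
}
most_important = max(contribution_mg, key=contribution_mg.get)

print("richest source:       ", richest)
print("most important source:", most_important)
```

With these made-up inputs the richest source and the most important source are different foods, which is why intake data are needed in addition to composition data.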

Food sources of nutrients are identified using the 24-hour recall data, because these data provide the necessary detail
regarding all foods consumed and contain nutrient values associated with each food.

Ratio of Means

Identifying food sources of nutrients involves ratios, because it deals with the proportion of a given nutrient that is supplied
by a given food. More specifically, it involves a ratio of means, because the percentage contribution of each food to the
population’s total consumption is of most interest.

A ratio of means provides a single aggregate value for the whole population, so individual values are not available for each
person. However, if subgroup differences are of interest then separate ratios of means can be estimated for each group.

When data from one or two 24-hour recalls are used to estimate a ratio of means, the mean in the numerator and the
mean in the denominator can each be considered an estimate of usual intake (for reasons stated in Task 1 of “Module 14:
Estimate Population Mean Intakes”). Therefore, no specific statistical adjustments are necessary.

Grouping Foods for Analysis

Identifying important food sources of nutrients requires grouping the foods reported in the survey using either pre-existing
schemes or one developed for the purpose. It is important to list all the food groups used in the analysis when reporting
results, because decisions regarding how the foods are grouped can have a major influence on the relative contributions of
one food group vs. another. For information about some pre-defined food grouping schemes, see “Module 4: Resources
for Dietary Data Analysis” of the Survey Orientation Course.

IMPORTANT NOTE
In identifying important food sources of nutrients, use the ratio of means and be clear about how foods are grouped.


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/Ratios/Info2.htm 1/1

12/19/2018 NHANES Dietary Web Tutorial: Estimate Ratios: Task 2b


Task 2b: How to Identify Important Food Group Sources of Nutrients Using
SAS

This section describes how to use SAS to identify food group sources of nutrients along with standard errors. To illustrate
this, food sources of calcium are identified for the whole population, ages 2 and older, for 2001-2004. In this example, a
simplistic food grouping scheme, based on the first digit of the USDA food codes, is used for illustrative purposes.

Step 1: Create Folder

Create a folder to save the dataset, list the contents of each dataset, and create a dataset that combines 4 years of data.
(Program not shown. See the full program in Additional Resources for more information.)

Step 2: Sort and Merge Datasets

Sort and then merge the demographic and individual food intake datasets. Create new variables, as needed. Note that
the food groups are simply characterized by the first digit of the individual food code: milk and milk products; meat,
poultry, fish and mixtures; eggs; legumes, nuts and seeds; grain products; fruits; vegetables; fats, oils and salad
dressings; and sugar, sweeteners and beverages. (Program not shown. See the full program in Additional Resources for
more information.)
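
Because the grouping program is not shown, the following Python sketch only illustrates the idea of assigning a food group from the first digit of a food code. The digit-to-group mapping is an assumption based on the order in which the groups are listed above, the function name is invented, and the two food codes are hypothetical values used only to exercise the function.

```python
# Assumed mapping from the first digit of a USDA food code to the broad food
# groups, following the order in which the tutorial lists the groups.
FOOD_GROUPS = {
    1: "Milk & Milk Products",
    2: "Meat, Poultry, Fish & Mixtures",
    3: "Eggs",
    4: "Legumes, Nuts and Seeds",
    5: "Grain Products",
    6: "Fruits",
    7: "Vegetables",
    8: "Fats, Oils & Salad Dressings",
    9: "Sugar, Sweeteners & Beverages",
}


def broad_food_group(food_code):
    """Return the broad food group for a food code (int or digit string)."""
    first_digit = int(str(food_code).strip()[0])
    return FOOD_GROUPS[first_digit]


# Hypothetical food codes, used only to exercise the function.
print(broad_food_group(11000000))
print(broad_food_group("75000000"))
```

A scheme this coarse is easy to apply but, as noted in the key concepts, the choice of grouping strongly affects the relative contributions that are reported.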

Step 3: Calculate the Weighted Contribution of Calcium from Each Food Group

Calculate the weighted contribution of calcium from each food group using the PROC SURVEYFREQ procedure in SAS.

Identifying Food Group Sources of Calcium


Sample Code

*-------------------------------------------------------------------------;
* The SURVEYFREQ procedure in SAS calculates the weighted contribution of ;
* calcium from each food group.                                           ;
*                                                                         ;
* Note that for this analysis, only the data for INCOH=1 is of interest.  ;
* However, this code will also generate data for INCOH=0.                 ;
*-------------------------------------------------------------------------;

proc surveyfreq data=FDSRC;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTD_CALC;
    tables FOODGRP*INCOH;
    title "Percent calcium by food group, using PROC SURVEYFREQ";
run;

Output of Program

The SURVEYFREQ Procedure

Data Summary

Number of Strata                                  30
Number of Clusters                                60
Number of Observations                        274168
Number of Observations Used                   257658
Number of Obs with Nonpositive Weights         16510
Sum of Weights                            2.49766E11

Broad food grp based on 1st digit of USDA food code

                                                Weighted    Std Dev of             Std Error
FOODGRP                          Frequency     Frequency      Wgt Freq   Percent  of Percent
--------------------------------------------------------------------------------------------
Milk & Milk Products                 40207     1.1581E11    6226455948   46.3673      0.4185
Meat, Poultry, Fish & Mixtures       29350    1.77625E10     994172188    7.1117      0.2178
Eggs                                  4137    4642738173     225930307    1.8588      0.0680
Legumes, Nuts and Seeds               6102    4129696674     251569993    1.6534      0.0707
Grain Products                       63548    7.35244E10    3425294440   29.4373      0.3696
Fruits                               21721    8017836305     405334291    3.2101      0.1207
Vegetables                           41477    1.20417E10     582319600    4.8212      0.1317
Fats, Oils & Salad Dressings          9075     792221049      51116840    0.3172      0.0162
Sugar, Sweeteners & Beverages        42041     1.3045E10     607625180    5.2229      0.1254
Total                               257658    2.49766E11     1.1986E10   100.000
--------------------------------------------------------------------------------------------

Highlights from the output include:

Using the simply defined food groups mentioned above, milk and milk products were the major contributor of
calcium to the diets of Americans during 2001-2004, providing over 46 percent. Grain products were the next highest
contributor, with over 29 percent. All other food groups provided less than 10 percent each.
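
The percentages can be understood as shares of a weighted total. The Python sketch below rests on an assumption that is not confirmed by the code shown here: that WTD_CALC is each food record's sample weight multiplied by its calcium amount, so that the weighted frequency of a food group is its weighted calcium total. The toy records and the function name are hypothetical; the final lines recompute two percentages from the weighted frequencies printed in the output.

```python
from collections import defaultdict


def calcium_share_by_group(records):
    """records: iterable of (food_group, sample_weight, calcium_mg).

    Returns the percent of weighted calcium contributed by each group.
    """
    totals = defaultdict(float)
    for group, weight, calcium_mg in records:
        totals[group] += weight * calcium_mg   # assumed analogue of WTD_CALC
    grand_total = sum(totals.values())
    return {group: 100.0 * total / grand_total for group, total in totals.items()}


# Hypothetical food records (not NHANES values).
toy_records = [
    ("Milk & Milk Products", 1000.0, 300.0),
    ("Grain Products", 1000.0, 100.0),
    ("Grain Products", 2000.0, 50.0),
    ("Vegetables", 2000.0, 50.0),
]
shares = calcium_share_by_group(toy_records)
for group, percent in shares.items():
    print(f"{group:<22} {percent:6.2f}%")

# Cross-check against the printed output: percent = weighted frequency / total.
milk_percent = 100.0 * 1.1581e11 / 2.49766e11
grain_percent = 100.0 * 7.35244e10 / 2.49766e11
print(f"milk {milk_percent:.2f}%, grain {grain_percent:.2f}%")
```

The recomputed values match the Percent column (46.37 and 29.44) to within the rounding of the printed weighted frequencies.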

IMPORTANT NOTE

The frequency counts in this analysis represent the number of reports of foods that contain calcium, by food group. It is
important to note that the frequencies in the SAS output do not match those in the SUDAAN output because of special
procedures required in SAS to conduct this analysis (see Task 3 in “Module 11: Weighting” in the Continuous NHANES
Tutorial for more information). However, the unweighted frequencies are not important to this analysis and they do not
represent an estimate for the U.S. population. Therefore, they can be ignored.


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/Ratios/Task2b.htm 2/2

12/19/2018 NHANES Dietary Web Tutorial: Test Hypotheses: Task 3b


Task 3b: How to Perform Chi-Square Test Using SAS

In this task, you will use the chi-square test to determine whether age group and osteoporosis treatment status are
independent of each other.

Step 1: Examine Relationship Between Two Categorical Variables

The PROC SURVEYFREQ procedure is used in SAS to examine the relationship between two categorical variables and
obtain chi-square statistics. Use the STRATA statement to specify the strata variable to account for the design effects of
stratification. Use the CLUSTER statement to specify PSU to account for design effects of clustering. Use the WEIGHT
statement to account for the unequal probability of sampling and non-response. Use the WHERE statement to specify the
subpopulation of interest.

Use the TABLE statement to create a cross tab of the categorical variables age group (AGEGRP) and osteoporosis
treatment status (TREATOSTEO). The options included after the slash (/) instruct SAS to output the column percent
(COL), row percent (ROW), Wald chi-square (WCHISQ), and Wald log-linear chi-square (WLLCHISQ), and suppress the
standard deviation (NOSTD) and weighted sums (NOWT). The CHISQ option is used to obtain the Rao-Scott chi-square
and the CHISQ1 option is used to obtain the Rao-Scott modified chi-square. Use the FORMAT statement to apply the SAS
formats.

Calculate Chi-square Statistic to Determine whether Age Group and Osteoporosis Treatment Status are Independent
Using SAS Survey Procedures

Sample Code

*-------------------------------------------------------------------------;
* Use the PROC SURVEYFREQ procedure to perform a chi-square test in SAS.  ;
* This test will be used to determine whether age group and treatment for ;
* osteoporosis are independent of each other in respondents aged 20 and   ;
* over.                                                                   ;
*-------------------------------------------------------------------------;

proc surveyfreq data=DEMOOSTS;
    strata SDMVSTRA;
    cluster SDMVPSU;
    weight WTINT2YR;
    where RIDAGEYR >= 20;
    table AGEGRP*TREATOSTEO / col row nostd nowt wchisq wllchisq
                              chisq chisq1;
    format AGEGRP AGEGRP. TREATOSTEO YESNO.;
run;

IMPORTANT NOTE

For complex survey data such as NHANES, using the Rao-Scott F adjusted chi-square statistic is recommended since it
yields a more conservative interpretation than the Wald chi-square.

Output of Program

The SURVEYFREQ Procedure

Data Summary

Number of Strata                  15
Number of Clusters                30
Number of Observations          5041
Sum of Weights             205284669

Table of AGEGRP by treatOSTEO

                                               Row    Column
AGEGRP  treatOSTEO   Frequency   Percent   Percent   Percent
------------------------------------------------------------
20-39   Yes                  2    0.0924    0.2375    2.2097
        No                1738   38.8105   99.7625   40.5042
        Total             1740   38.9029   100.000
------------------------------------------------------------
40-59   Yes                 36    1.0062    2.6126   24.0624
        No                1358   37.5077   97.3874   39.1446
        Total             1394   38.5139   100.000
------------------------------------------------------------
>= 60   Yes                227    3.0831   13.6521   73.7279
        No                1662   19.5001   86.3479   20.3512
        Total             1889   22.5832   100.000
------------------------------------------------------------
Total   Yes                265    4.1817             100.000
        No                4758   95.8183             100.000
        Total             5023   100.000
------------------------------------------------------------

Frequency Missing = 18

Rao-Scott Chi-Square Test

Pearson Chi-Square        341.6678
Design Correction           0.6712

Rao-Scott Chi-Square      509.0778
DF                               2
Pr > ChiSq                  <.0001

F Value                   254.5389
Num DF                           2
Den DF                          30
Pr > F                      <.0001

Sample Size = 5023

Rao-Scott Modified Chi-Square Test

Pearson Chi-Square        341.6678
Design Correction           1.5353

Rao-Scott Chi-Square      222.5434
DF                               2
Pr > ChiSq                  <.0001

F Value                   111.2717
Num DF                           2
Den DF                          30
Pr > F                      <.0001

Sample Size = 5023

Wald Chi-Square Test

Chi-Square                 91.2484

F Value                    45.6242
Num DF                           2
Den DF                          15
Pr > F                      <.0001

Adj F Value                42.5826
Num DF                           2
Den DF                          14
Pr > Adj F                  <.0001

Sample Size = 5023

Wald Log-Linear Chi-Square Test

Chi-Square               1216.9520

F Value                   608.4760
Num DF                           2
Den DF                          15
Pr > F                      <.0001

Adj F Value               567.9109
Num DF                           2
Den DF                          14
Pr > Adj F                  <.0001

Sample Size = 5023

Highlights from the output include:

5,041 respondents aged 20 and older were read into this analysis; 18 had missing values, leaving a sample size of
5,023 for the tests.
The row percentages indicate that persons aged 60 years and older are more likely to have been treated for
osteoporosis (13.7%) than persons aged 40-59 (2.6%) or 20-39 (0.2%).
In this example, the p-values of both the Rao-Scott modified chi-square test and the Wald chi-square test are
<0.0001. Therefore, the null hypothesis is rejected at the 0.05 level and it is concluded that age group and
osteoporosis treatment status are significantly associated.
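
The test statistics in the output are related to one another by simple arithmetic, which can be checked outside SAS. The Python sketch below is an illustration only: all inputs are copied from the output above, the variable and function names are invented for this example, and the formulas are the standard relationships (Rao-Scott chi-square = Pearson chi-square / design correction; F = chi-square / numerator DF; adjusted Wald F = chi-square x (s - k + 1) / (k x s), with k the numerator DF and s the number of clusters minus the number of strata), which reproduce the printed values to rounding.

```python
num_df = 2            # (3 age groups - 1) x (2 treatment levels - 1)
design_df = 30 - 15   # number of clusters minus number of strata

pearson = 341.6678

# Rao-Scott chi-square = Pearson chi-square / design correction.
rao_scott_first = pearson / 0.6712
rao_scott_modified = pearson / 1.5353

# F value = Rao-Scott chi-square / numerator degrees of freedom.
f_first = 509.0778 / num_df
f_modified = 222.5434 / num_df


def adjusted_wald_f(wald_chi_square, k, s):
    """Adjusted Wald F with k numerator DF and s design DF."""
    return wald_chi_square * (s - k + 1) / (k * s)


adj_f_wald = adjusted_wald_f(91.2484, num_df, design_df)
adj_f_loglinear = adjusted_wald_f(1216.9520, num_df, design_df)

print(f"Rao-Scott chi-square (first order): {rao_scott_first:.2f}")
print(f"Rao-Scott chi-square (modified):    {rao_scott_modified:.2f}")
print(f"F values: {f_first:.4f}, {f_modified:.4f}")
print(f"Adjusted Wald F values: {adj_f_wald:.4f}, {adj_f_loglinear:.4f}")
```

The small differences in the recomputed Rao-Scott chi-squares arise because the design corrections are printed to only four decimal places.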


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/TestHypotheses/Task3b.htm 3/3

12/19/2018 NHANES Dietary Web Tutorial: Test Hypotheses: Use the T-Test Statistic


Task 1: Key Concepts about Using the T-Test Statistic

The t-test is used to test the null hypothesis that two population means or proportions, θ1 and θ2, are equal or,
equivalently, that the difference between two population means or proportions is zero. To test this hypothesis, assuming
the covariance is small, as is the case with NHANES data, the following formula is used:

Equation for t-Test Where Covariance is Small

$$ t = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{SE(\hat{\theta}_1)^2 + SE(\hat{\theta}_2)^2}} $$

where,

$\hat{\theta}_1$ is an estimate of the first population mean or proportion based on a probability sample,
$SE(\hat{\theta}_1)$ is an estimate of the standard error of $\hat{\theta}_1$,
$\hat{\theta}_2$ is an estimate of the second population mean or proportion,
and $SE(\hat{\theta}_2)$ is an estimate of the standard error of $\hat{\theta}_2$.

In instances where only a small number (<30) of independent pieces of information are available with which to estimate the
quantity $[\hat{\theta}_1 - \hat{\theta}_2]$, the t-statistic given above follows, under the null hypothesis, a Student's t
distribution centered at zero, with a number of degrees of freedom corresponding to the number of independent pieces of
information. In a simple random sample, the number of independent pieces of information is generally equal to the number of
people in the sample minus one. In NHANES, however, the number of independent pieces of information is substantially lower
due to the multi-stage probability sample design. In NHANES, this number (referred to as degrees of freedom) is equal to the
number of PSUs minus the number of strata (see “Module 12: Sample Design” of the Continuous NHANES Tutorial for more
information).

The equality of means is usually tested at the 0.05 level of significance. However, because of the large sample size, some
differences that are too small to be meaningful may still be statistically significant at the 0.05 level.
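
As a worked illustration of the t-statistic, the Python sketch below compares mean total calcium intake (DR1TCALC) for males and females, using the means and standard errors printed in the Task 1b SURVEYMEANS output elsewhere in this tutorial. It assumes, as the formula does, that the covariance between the two estimates is negligible. The function and variable names are invented for this sketch, and the critical value 2.131 is the two-sided 0.05 value for 15 degrees of freedom from a standard t table rather than something computed here.

```python
import math


def t_statistic(estimate_1, se_1, estimate_2, se_2):
    """Difference of two estimates over the root sum of squared standard errors."""
    return (estimate_1 - estimate_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# Degrees of freedom in NHANES: number of PSUs minus number of strata.
n_psus, n_strata = 30, 15
degrees_of_freedom = n_psus - n_strata

# Mean total calcium (mg) and standard error: males, then females.
t = t_statistic(998.359501, 21.809584, 770.725113, 15.292108)

# Two-sided 0.05 critical value for 15 degrees of freedom (standard t table).
T_CRITICAL_15_DF = 2.131
significant = abs(t) > T_CRITICAL_15_DF

print(f"t = {t:.3f} with {degrees_of_freedom} degrees of freedom")
print("reject equality of means at 0.05:", significant)
```

Here the statistic is far larger than the critical value, so the hypothesis of equal mean calcium intake for males and females would be rejected at the 0.05 level.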


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/TestHypotheses/Info1.htm 1/1

12/19/2018 NHANES Dietary Web Tutorial: Test Hypotheses: Generate Confidence Intervals


Task 2: Key Concepts about Generating Confidence Intervals

Typically, a sufficiently large probability sample will have point estimates that are approximately normally distributed. The
end points of the confidence interval, then, are a function of the estimate ($\hat{\theta}$), its standard error
($SE(\hat{\theta})$), and a percentile of the normal distribution with zero mean and unit variance, referred to as the
standard normal deviate (z score), and are given by:

Equation for Confidence Interval Endpoints

$$ \hat{\theta} \pm z_{1-\alpha/2} \, SE(\hat{\theta}) $$

The continuous NHANES sample is a multistage, area probability sample. The number of independent pieces of
information, or degrees of freedom, depends upon the number of PSUs rather than on the number of sample persons.
Sample persons within a given PSU are not independent. Therefore, a t-statistic with degrees of freedom equal to the
difference between the number of PSUs and the number of strata containing observations is used instead of a z-statistic,
which would otherwise be used in a large sample. The endpoints for a confidence interval for the continuous NHANES are
given by:

Equations for Confidence Interval Endpoints in Continuous NHANES

$$ \hat{\theta} \pm t_{df,\,1-\alpha/2} \, SE(\hat{\theta}), \qquad df = (\text{number of PSUs}) - (\text{number of strata}) $$

Sample weights and other design effects (e.g. strata, PSUs) must be incorporated when calculating an estimate and its
standard error (see “Module 5: Overview of NHANES Survey Design and Weighting” for more information). Taylor Series
Linearization is one example of a design-based method. The design variables needed to obtain estimates of standard
errors through this method are provided on the demographic files for the continuous NHANES (see below for an example
of a program).
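
The t-based endpoints can be checked against SAS output shown elsewhere in this tutorial. The Python sketch below is an illustration under stated assumptions: the mean (880.130855 mg) and standard error (16.722099) for total calcium are copied from the Task 1b SURVEYMEANS output, the degrees of freedom are 30 PSUs minus 15 strata, and the 97.5th percentile of the t distribution with 15 degrees of freedom (2.13145) is taken from a standard table rather than computed. The function name is invented for this example.

```python
def confidence_interval(estimate, standard_error, t_critical):
    """Return (lower, upper) endpoints: estimate -/+ t_critical * SE."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin


# 97.5th percentile of Student's t with 30 - 15 = 15 degrees of freedom,
# taken from a standard t table.
T_975_15_DF = 2.13145

lower, upper = confidence_interval(880.130855, 16.722099, T_975_15_DF)
print(f"95% CI for mean total calcium: {lower:.3f} to {upper:.3f} mg")
```

The result agrees with the limits printed by PROC SURVEYMEANS (844.488545 and 915.773166) to within rounding of the tabled t value. A z value of 1.96 would give a narrower interval, which is why the t percentile is used with so few degrees of freedom.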

Interpretation

Confidence intervals, as constructed above, are based on one possible sample from a finite population. Many possible
samples of the same size can be obtained using the same procedures and measurements. For each of these samples, a
confidence interval can be constructed. For a 95% CI, 95 percent of these intervals would then contain the true value of
the population parameter.


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/TestHypotheses/Info2.htm 1/1

12/19/2018 NHANES Dietary Web Tutorial: Test Hypotheses: Use Chi-Square Test


Task 3: Key Concepts about Using the Chi-Square Test

The chi-square test is used to test the independence of two variables cross classified in a two-way table. (A chi-square
statistic with n degrees of freedom is based on a statistic equal to the sum of the squares of n independent normally
distributed random variables with mean=0 and unit variance.)

For example, suppose we wished to test the hypothesis that osteoporosis treatment status is independent of gender and
that we have the following observed frequencies:

Frequency of Osteoporosis Treatment Status

           Treated = Yes   Treated = No    Total
Males                 30          2,241    2,271
Females              212          2,244    2,456
Total                242          4,485    4,727

In a simple random sample setting (unweighted data), the expected cell frequencies under the null hypothesis that
osteoporosis treatment status and gender are independent could be obtained by multiplying the marginal total for the jth
column by the proportion of individuals in the ith row.

For example, the expected number of males being treated for osteoporosis would be 242*(2,271/4,727)=116.3; the
expected value of not being treated for osteoporosis in females would be 4,485*(2,456/4,727)=2,330.3.

Thus, if $O_{ij}$ = the observed frequency in the ith row and jth column, where i = 1, 2, ..., r and j = 1, 2, ..., c, and

$E_{ij}$ = the expected frequency in the ith row and jth column,

then the formula to test the null hypothesis of independence, using the chi-square statistic, would be:

Equation to Test the Null Hypothesis

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

This statistic has degrees of freedom equal to the number of rows minus 1, multiplied by the number of columns minus 1.
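
The simple random sample calculation can be worked through for the table of observed frequencies above. The Python sketch below is illustrative only: it uses the unweighted counts from that table and ignores the survey design, so it is not the statistic that would be reported for NHANES. The variable names are invented, and the critical value 3.841 (0.05 level, 1 degree of freedom) comes from a standard chi-square table.

```python
# Observed frequencies from the table above (unweighted counts).
observed = {
    "Males": {"Yes": 30, "No": 2241},
    "Females": {"Yes": 212, "No": 2244},
}
columns = ("Yes", "No")

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
column_totals = {col: sum(observed[row][col] for row in observed) for col in columns}
grand_total = sum(row_totals.values())

# Expected count under independence: column total x (row total / grand total).
expected = {
    row: {col: column_totals[col] * row_totals[row] / grand_total for col in columns}
    for row in observed
}

# Pearson chi-square: sum over all cells of (O - E)^2 / E.
chi_square = sum(
    (observed[row][col] - expected[row][col]) ** 2 / expected[row][col]
    for row in observed
    for col in columns
)
degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)

CHI_SQUARE_CRITICAL_1_DF = 3.841  # 0.05 level, from a standard table

print(f"expected males treated:       {expected['Males']['Yes']:.1f}")
print(f"expected females not treated: {expected['Females']['No']:.1f}")
print(f"chi-square = {chi_square:.1f} on {degrees_of_freedom} degree of freedom")
```

The two expected counts reproduce the 116.3 and 2,330.3 quoted in the text, and the resulting statistic is far above the critical value, so independence would be rejected in this unweighted setting.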

In a complex sample setting, you would use a statistic similar to the equation above, modified to account for the survey
design, with degrees of freedom equal to the number of PSUs minus the number of strata containing observations. There are
several different ways to calculate this statistic using SAS and SUDAAN. In SAS, PROC SURVEYFREQ is used, which
provides the Rao-Scott chi-square with an adjusted F statistic. In SUDAAN, PROC CROSSTAB is used, which provides
limited chi-square statistics based on the Wald chi-square and does not provide an F-adjusted p-value.

The Cochran-Mantel-Haenszel test, an extension of the Pearson chi-square, can be applied to stratified two-way tables to
test for homogeneity or independence in a non-survey setting. For a complex sample, its analogue can be obtained in
SUDAAN PROC CROSSTAB.


https://www.cdc.gov/nchs/tutorials/Dietary/Basic/TestHypotheses/Info3.htm 2/2

NHANES Dietary Web Tutorial - Survey Orientation https://www.cdc.gov/nchs/tutorials/dietary/Basic/TestHypotheses/intro.htm

Test Hypotheses

Purpose

The t-test and chi-square statistics are used to test statistical hypotheses about population parameters. For example, you might
wish to determine whether the mean intake of calcium among males in a population is different from that among females. By
using various statistics, it is possible to test whether these differences are statistically significant (i.e., the differences
are not due solely to random error). This module will demonstrate the use of these statistics in NHANES data analysis.

Task 1: Use the T-Test Statistic

The t-test is used to test the null hypothesis that the means or proportions of two population subgroups are equal or,
equivalently, that the difference between two means or proportions equals zero. It is appropriate in cases where a small
number (<30) of degrees of freedom are available, which is the case for the NHANES sample. This tutorial shows how to
calculate a t-statistic for a sub-population only in SUDAAN. Doing the same calculation in SAS requires more advanced
programming (see the Continuous Tutorial for more information).

Key Concepts about Using the T-Test Statistic in NHANES (/nchs/tutorials/Dietary/Basic/TestHypotheses/Info1.htm)
How to Set Up a T-Test in NHANES Using SUDAAN (/nchs/tutorials/Dietary/Basic/TestHypotheses/Task1.htm)

Task 2: Generate Confidence Intervals

A confidence interval (CI) gives a range of plausible values of a population parameter, such as a population mean or percent.
CIs reflect the uncertainty of an estimate for a variable that is computed from a probability sample of the population rather
than a census. NHANES surveys have numerous variables on demographic and health characteristics of the U.S. non-
institutionalized population, such as age, body mass index, and food and nutrient intakes. This tutorial shows how to generate
a confidence interval only in SUDAAN. Doing the same calculation in SAS requires more advanced programming.

Key Concepts about Generating Confidence Intervals (/nchs/tutorials/Dietary/Basic/TestHypotheses/Info2.htm)
How to Generate Confidence Intervals Using SUDAAN (/nchs/tutorials/Dietary/Basic/TestHypotheses/Task2.htm)

Task 3: Perform Chi-Square Test

The chi-square test is used to test the association between two variables cross-classified in a two-way table and the
homogeneity of their association.

Key Concepts about Using the Chi-Square Test in NHANES (/nchs/tutorials/Dietary/Basic/TestHypotheses/Info3.htm)
How to Perform a Chi-Square Test in NHANES Using SUDAAN (/nchs/tutorials/Dietary/Basic/TestHypotheses/Task3a.htm)
How to Perform a Chi-Square Test in NHANES Using SAS (/nchs/tutorials/Dietary/Basic/TestHypotheses/Task3b.htm)

Page last updated: May 3, 2013
Page last reviewed: May 3, 2013
Content source: CDC/National Center for Health Statistics
Page maintained by: NCHS/NHANES

Centers for Disease Control and Prevention 1600 Clifton Road Atlanta, GA 30329-4027, USA
800-CDC-INFO (800-232-4636) TTY: (888) 232-6348 - Contact CDC–INFO

Basic Dietary Analyses

The Basic Dietary Analyses course contains 5 modules:

Module 12. Identify Important Statistical Considerations Regarding Dietary Data Analyses (/nchs/tutorials/Dietary/Basic/StatisticalConsiderations/intro.htm)

Module 13. Estimate Variance, Analyze Subgroups, and Calculate Degrees of Freedom (/nchs/tutorials/Dietary/Basic/EstimateVariance/intro.htm)

Module 14. Estimate Population Mean Intakes (/nchs/tutorials/Dietary/Basic/PopulationMeanIntakes/intro.htm)

Module 15. Estimate Ratios (/nchs/tutorials/Dietary/Basic/Ratios/intro.htm)

Module 16. Test Hypotheses (/nchs/tutorials/Dietary/Basic/TestHypotheses/intro.htm)

These five modules demonstrate how to perform some of the most frequently requested analyses of NHANES dietary data. The
course begins with a description of the particular idiosyncrasies of dietary data, especially measurement error and its relevance
to the interpretation of results. The next module continues the statistical theme by showing how to estimate variances, analyze
subgroups, and calculate degrees of freedom, given the NHANES sampling design. The remaining modules demonstrate how
to perform selected basic analyses with dietary data.

The course is aimed at specific issues surrounding dietary analysis. If you are interested in learning more about the basics of
estimating means and prevalences, conducting t-tests or chi-square tests, generating confidence intervals, or performing age
standardization or regression analyses, please see the Continuous NHANES Web Tutorial, NHANES Analysis Course
(/nchs/tutorials/Nhanes/NHANESAnalyses/NHANES_Analyses_intro.htm).

If you are an experienced analyst who needs only specific information to help you complete an analysis on your own, you can
pick and choose topics of interest from the navigation bar to the left, or from the Tutorial A-Z Index
(/nchs/tutorials/dietary/A-Zindex.htm). You may also go to the Sample Code and Datasets page
(/nchs/tutorials/dietary/Downloads/downloads.htm) to download and modify sample programs and datasets for your own use.

Sample Code

Abbreviated SAS and SUDAAN code is presented throughout the tutorial for the sole purpose of demonstrating and explaining
specific steps in an analysis. The abbreviated code does not comprise a complete SAS or SUDAAN program that can be readily
submitted for a computer run. If you need the complete SAS or SUDAAN program, please consult the Additional Resources
section of this tutorial.

Before you get started

Check out the Dietary Data Tutorial Roadmap (/nchs/tutorials/dietary/roadmap.htm) to orient yourself to the tutorial’s
content.
Read the Introduction (/nchs/tutorials/dietary/introduction.htm) to find answers to frequently asked questions about
NHANES dietary data and this tutorial.
Browse through the Logistics (/nchs/tutorials/dietary/logistics/logistics.htm) section to learn about the web layouts and
templates used in the tutorial and find out the basic knowledge and skills you’ll need to use the tutorial.
Go to Technical & Software Requirements (/nchs/tutorials/dietary/logistics/techsoftwarereqs.htm) for information about
what’s required to view the tutorials correctly and run the sample programs properly. This section also is the place to go if
you need help with technical problems.

Page last updated: May 3, 2013
Page last reviewed: May 3, 2013
Content source: CDC/National Center for Health Statistics
Page maintained by: NCHS/NHANES

UNIT 4

ADVANCED DIETARY ANALYSES

Estimating Prevalence and Examining Relationships Using Supplement
Data

Purpose

National estimates of supplement use may be calculated using NHANES data. This module will introduce how to obtain
estimates of the prevalence of supplement use, and how to examine the relationship between supplement use and a categorical
or continuous outcome.

Task 1: Estimating Prevalence of Supplement Use

Proportions of participants who report supplement use on the NHANES survey may be used as an estimate of the prevalence of
supplement use in the US.

Key Concepts about Estimating Prevalence of Supplement Use Using Proportions (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info1.htm)
How to Estimate Prevalence of Supplement Use Using Proportions Using SUDAAN (/nchs/tutorials/Dietary/Advanced/EstimatePrevalence/Task1a.htm)
How to Estimate Prevalence of Supplement Use Using Proportions Using SAS Survey Procedures (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task1b.htm)
Download Sample Code and Datasets (/nchs/tutorials/dietary/downloads/downloads.htm)

Task 2: Examining the Relationship Between Supplement Use and a Categorical Outcome
using a Chi-Square Test

When interest is in examining the relationship between supplement use and a categorical outcome, a chi-square test may be
used to test the association between supplement use and another categorical variable in a two-way table.

Key Concepts about Examining the Relationship Between Supplement Use and a Categorical Outcome Using a Chi-Square Test (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info2.htm)
How to Calculate a Chi-Square Test Using SUDAAN (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task2a.htm)
How to Calculate a Chi-Square Test Using SAS (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task2b.htm)
Download Sample Code and Datasets (/nchs/tutorials/dietary/downloads/downloads.htm)

Task 3: Examining the Relationship Between Supplement Use and a Dichotomous Outcome
using Logistic Regression

Logistic regression may be used when there is interest in adjusting the relationship between supplement use and a categorical
outcome for the effect of other covariates. Logistic regression is a statistical method used to assess the likelihood of a disease or
health condition as a function of a risk factor (and covariates). There are two kinds of logistic regression, simple and
multiple. Both assess the association between independent variable(s) (Xj), sometimes called exposure or predictor
variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable.

Key Concepts about Using Logistic Regression In NHANES (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info3.htm)
How to Perform Logistic Regression Using SUDAAN (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task3a.htm)
How to Perform Logistic Regression using SAS Survey Procedures (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task3b.htm)
Download Sample Code and Datasets (http://www.cdc.gov/nchs/tutorials/Dietary/SurveyOrientation/DietaryDataStructureContents/Info4.htm)

Task 4: Examining the Relationship between Supplement Use and a Continuous Outcome
Using a T-test

The t-test is used to test the null hypothesis that the means or proportions of two population subgroups are equal or,
equivalently, that the difference between two means or proportions equals zero. It is appropriate in cases where a small number
(<30) of degrees of freedom are available, which is the case for the NHANES sample. This tutorial shows how to calculate a
t-statistic for a sub-population using SUDAAN and SAS version 9.2.

Key Concepts about Using the T-Test Statistic (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info4.htm)
How to Set Up a T-Test in NHANES Using SUDAAN (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task4a.htm)
How to Set Up a T-Test in NHANES Using SAS Survey Procedures (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task4b.htm)
Download Sample Code and Datasets (/nchs/tutorials/downloads/downloads.htm)

Task 5: Examining the Relationship between Supplement Use and a Continuous Outcome
Using Multiple Regression

Linear Regression models, both simple and multiple, assess the association between independent variable(s) (Xi) — sometimes
called exposure or predictor variables — and a continuous dependent variable (Y) — sometimes called the outcome or response
variable. In cross-sectional surveys such as NHANES, linear regression analyses can be used to examine associations between
covariates and health outcomes. In this Module, supplement use is the primary independent variable of interest. This Task will
give examples using SUDAAN and SAS version 9.2.

Key Concepts about Linear Regression (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info5.htm)
How to Perform Linear Regression Using SUDAAN Code (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task5a.htm)
How to Perform Linear Regression Using SAS (/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task5b.htm)
Download Sample Code and Datasets (/nchs/tutorials/downloads/downloads.htm)

Page last updated: July 20, 2011
Page last reviewed: July 20, 2011
Content source: CDC/National Center for Health Statistics
Page maintained by: NCHS/NHANES

Task 1: Key Concepts about Estimating Prevalence of Supplement Use
Using Proportions

Prevalence estimates (the percent of the population who report supplement consumption) are of interest in nutritional
epidemiology and surveillance. Using a national cross-sectional survey such as NHANES, prevalence estimates of
supplement use among the U.S. population and among population subgroups can be generated.

To calculate proportions and standard errors, it is necessary to use software that takes into account the complex survey
design of NHANES when determining variance estimates. If the standard errors are not needed, you could simply use a
standard SAS procedure, e.g., proc freq with the weight statement.
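As a minimal sketch of that shortcut, the step below requests weighted percentages with proc freq. It assumes the demoadv dataset, the anycalsup indicator, and the 2-year interview weight wtint2yr that are described later in this module. Only the point estimates should be used, because proc freq does not account for the strata and PSUs.

```sas
/* Weighted percentages only: no design-based standard errors */
proc freq data=demoadv;
  tables anycalsup;   /* 1 = calcium supplement use, 2 = no use */
  weight wtint2yr;    /* 2-year interview weight */
run;
```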

Supplement data are collected from the household interview. For further information on developing a dataset with
supplement data, see Module 3 (Dietary Data Structure & Contents), Task 4, Key Concepts About Dietary Supplement
Files. It is important to note that beginning in 2007-08, dietary supplement data were collected on the 24-hour recall.
Different statistical techniques may be used with these data, other than those described in this tutorial.

Task 1b: How to Estimate Prevalence of Supplement Use Using Proportions
Using SAS Survey Procedures

In this example, to determine the prevalence rate of calcium supplement use among older adults in the U.S., you will
identify women and men age 50 years and older who report calcium supplement use on the household interview.

Step 1: Determine variables of interest

This example uses the demoadv dataset (download at Sample Code and Datasets). This dataset contains a created
variable called anycalsup that has a value of 1 for those who report calcium supplement use, and a value of 2 for those
who do not. A participant was considered not to have any calcium supplement use if the daily average amount of calcium
supplement use was zero; otherwise, a participant was considered a supplement user (see Supplement Code under
Sample Code and Module 9, Task 4 for more information). You will need to define and create a categorical variable
calcium indicating whether persons report supplement use (100 = calcium supplement use; 0 = no calcium supplement
use).
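The recode described above could be sketched as the following data step. This is an illustration only, not the tutorial's own Supplement Code; it assumes anycalsup is already on demoadv with the 1/2 coding described in this step.

```sas
/* Sketch: recode anycalsup (1 = user, 2 = non-user) to a 100/0 variable */
data demoadv;
  set demoadv;
  if anycalsup = 1 then calcium = 100;
  else if anycalsup = 2 then calcium = 0;
  /* any other value of anycalsup leaves calcium missing */
run;
```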

Step 2: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. In
this example, the sel variable is set to 1 if the sample person is age 50 years or older, and 2 if the sample person is
younger than age 50 years. Then this variable is used in the domain statement to specify the population of interest (those
ages 50 years and older).
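A minimal sketch of creating sel is shown below. The age variable name ridageyr (age in years) is an assumption, since the variable is not named on this page. Persons with a missing age are deliberately left with a missing sel rather than being counted as younger than 50.

```sas
/* Sketch: flag the subpopulation of interest; ridageyr is an assumed name */
data demoadv;
  set demoadv;
  if ridageyr >= 50 then sel = 1;
  else if not missing(ridageyr) then sel = 2;
run;
```

Flagging the subpopulation and naming it on the domain statement, rather than deleting the other records, keeps the full sample design available for variance estimation.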

Step 3: Use proc surveymeans to generate proportions and their standard errors in SAS

In the SAS surveymeans procedure, persons who report calcium supplement use, as defined above, are assigned a value
of 100, and persons who do not report supplement use are assigned a value of 0. The weighted mean of this 100/0
variable, which can be read directly as a percent, is an estimate of the prevalence of calcium supplement use in the U.S.
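To see why the mean of a 100/0 variable is a percent, the weighted mean can be written out as follows. The symbols w_i (the sample weight of person i) and y_i (the value of calcium for person i) are notation introduced here for illustration, and the three-person numbers are hypothetical.

```latex
\hat{P} = \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad y_i \in \{0, 100\}

% Hypothetical example: weights 2000, 3000, 5000 with y = 100, 0, 100
\hat{P} = \frac{2000(100) + 3000(0) + 5000(100)}{2000 + 3000 + 5000}
        = \frac{700000}{10000} = 70
```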

Generate Proportions in SAS Survey Procedures

Statements                              Explanation

proc surveymeans data=demoadv           Use the surveymeans procedure to obtain the number of
  nobs mean stderr;                     observations, mean, and standard error.

stratum sdmvstra;                       Use the stratum statement to define the strata variable
                                        (sdmvstra).

cluster sdmvpsu;                        Use the cluster statement to define the PSU variable
                                        (sdmvpsu).

class riagendr;                         Use the class statement to specify the discrete variables
                                        used to form the subpopulations of interest. In this
                                        example, the subpopulations of interest are specified by
                                        gender (riagendr).

domain sel sel*riagendr;                Use the domain statement to specify the table layout to
                                        form the subpopulations of interest. This example uses
                                        age greater than or equal to 50 years (sel) by gender
                                        (riagendr).

var calcium;                            Use the var statement to name the variable(s) to be
                                        analyzed. In this example, the calcium supplement use
                                        variable (calcium) is used. If the sample person reports
                                        calcium supplement use, then the value equals 100.
                                        Otherwise, the variable equals 0.

                                        IMPORTANT NOTE: The SAS Survey procedure, proc
                                        surveymeans, uses the variable coded as 100 and 0 to
                                        obtain weighted means expressed as percentages.

weight wtint2yr;                        Use the weight statement to account for the unequal
                                        probability of sampling and non-response. In this
                                        example, the interview weight for 2 years of data
                                        (wtint2yr) is used.

ods output                              Use the ods statement to output the datasets of estimates
  domain(match_all)=domain;             for the subdomains listed on the domain statement above.
run;                                    These commands will output one dataset for each
                                        specification on the domain statement (domain for sel;
                                        domain1 for sel*riagendr).

data all;                               Use the data statement to name the temporary SAS dataset
  set domain domain1;                   (all), append the two datasets (created in the previous
  if sel=1;                             step) with the set statement, and subset those
run;                                    participants with age greater than or equal to 50 years
                                        (sel=1).

proc print noobs data=all split='/';    Use the print procedure to print the number of
  var riagendr N mean stderr;           observations, the mean, and the standard error of the
  format n 5.0 mean 7.4 stderr 6.4;     mean in a printer-friendly format.
  label N='Sample/size'
        mean='Percent'
        stderr='Standard/error/of the/percent';
  title1 'Percent of adults 50 years and older who report calcium supplement use';
run;

Step 4: Review output

The percentages in the output are the estimated proportions of persons ages 50 years and older in the target population
who consume calcium supplements.

Reviewing the output, you will see tables for both genders and all ages, both genders by age group (sel), males only
by age group, and females only by age group.

Notice that the prevalence of supplement use is higher in women, and supplement use is much higher in those over
age 50 years. (However, this procedure only produces estimates of the prevalence by group, not a formal test of
whether the estimates differ by age group.)

Task 2: Key Concepts about Examining the Relationship Between
Supplement Use and a Categorical Outcome Using a Chi-Square Test

The chi-square test is used to test the independence of two variables cross classified in a two-way table. For example,
suppose we wished to test the hypothesis that calcium supplement use is independent of osteoporosis treatment status
and that we have the following observed frequencies obtained as a result of the cross-classification of osteoporosis and
supplement use for women.

Osteoporosis Treatment Status and Supplement Use

                        Osteoporosis     Osteoporosis
                        Treatment        Treatment        Total
                        Status - Yes     Status - No

Supplement Use - Yes    155              566               721
Supplement Use - No      47              419               466
Total                   202              985              1187

In a simple random sample setting (unweighted data), the expected cell frequencies under the null hypothesis that
osteoporosis treatment status and calcium supplement use are independent could be obtained by multiplying the marginal
total for the ith row by the proportion of individuals in the jth column.

For example, the expected number of supplement users who received treatment for osteoporosis would be
721*(202/1187)=123; the expected number of supplement users who did not receive treatment for osteoporosis would be
721*(985/1187)=598.

Thus, let Oij be the observed frequency in the ith row and jth column, where i=1,2,...,I and j=1,2,...,J, and let Eij be the
expected frequency in the ith row and jth column. Then the formula to test the null hypothesis of independence, using the
chi-square statistic, would be

Equation 1. Equation to Test the Null Hypothesis

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

This statistic has degrees of freedom equal to the number of rows minus 1, multiplied by the number of columns minus 1.
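As a worked illustration, the block below applies the Pearson form of the statistic, the sum over cells of (Oij - Eij)^2 / Eij, to the unweighted table for women shown above. It treats the counts as a simple random sample, so the result is not the design-adjusted statistic that the survey procedures report. Values are rounded.

```latex
E_{11} = 721 \times \tfrac{202}{1187} \approx 122.70, \quad
E_{12} = 721 \times \tfrac{985}{1187} \approx 598.30

E_{21} = 466 \times \tfrac{202}{1187} \approx 79.30, \quad
E_{22} = 466 \times \tfrac{985}{1187} \approx 386.70

\chi^2 = \frac{(155 - 122.70)^2}{122.70} + \frac{(566 - 598.30)^2}{598.30}
       + \frac{(47 - 79.30)^2}{79.30} + \frac{(419 - 386.70)^2}{386.70}
       \approx 26.1

\text{degrees of freedom} = (2 - 1)(2 - 1) = 1
```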

In a complex sample setting, you would use a statistic similar to equation (1) above, modified to account for the survey design,
with degrees of freedom equal to the number of PSUs minus the number of strata containing observations. This statistic
can be obtained through SAS proc surveyfreq (chisq, based on the Rao-Scott chi-square with an adjusted F statistic). The
analogous procedure in SUDAAN version 10.0 (proc crosstab) provides limited chi-square statistics based on the Wald
chi-square and does not provide an F adjusted p-value. However, SUDAAN regression models do provide F adjusted
chi-square statistics, which are recommended for analyzing NHANES data.

The Cochran-Mantel-Haenszel test, an extension of the Pearson chi-square, can be applied to stratified two-way tables to
test for homogeneity or independence in a non-survey setting. For a complex sample, its analogue can be obtained in
SUDAAN proc crosstab (cmh).

Task 2b: How to Calculate a Chi-Square Test Using SAS

In this task, you will use the chi-square test in SAS to determine whether calcium supplement use and treatment for
osteoporosis are independent of each other for men and women ages 50 years and older.

Step 1: Determine variables of interest

This example uses the demoadv dataset (download at Sample Code and Datasets). This dataset contains a created
variable anycalsup that has a value of 1 for those who report calcium supplement use, and a value of 2 for those who do
not. A participant was considered not to have any calcium supplement use if the daily average amount of calcium
supplement use was zero; otherwise, a participant was considered a supplement user (see Supplement Code under
Sample Code and Module 9, Task 4 for more information). The variable treatosteo indicates treatment for osteoporosis. A
participant was coded as having had treatment for osteoporosis if he or she responded “yes” to OSQ.070 (“{Were you/Was
SP} treated for osteoporosis?”) from the osteoporosis questionnaire, and as not having had treatment if he or she responded
“no” to OSQ.070 or to OSQ.060 (“Has a doctor ever told {you/SP} that {you/s/he} had osteoporosis, sometimes called thin
or brittle bones?”). (The SAS code to create this variable is found in the “Supplement Program” sample SAS code.) The
demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).

Step 2: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. In
this example, the sel variable is set to 1 if the sample person is age 50 years or older, and 2 if the sample person is
younger than 50 years.

Step 3: Set Up SAS to Perform Chi-Square Test

The chi-square statistic is requested from the SAS surveyfreq procedure. The summary table below provides an example
of how to code for a chi-square test in SAS.

Calculating the chi-square test Using SAS surveyfreq Procedure

Statements                              Explanation

proc surveyfreq data=demoadv;           Use the SAS Survey procedure, proc surveyfreq, to examine
                                        the relationship between two categorical variables.

strata sdmvstra;                        Use the strata statement to specify the strata variable
                                        (sdmvstra) and account for design effects of
                                        stratification.

cluster sdmvpsu;                        Use the cluster statement to specify the PSU (sdmvpsu)
                                        to account for design effects of clustering.

weight wtmec2yr;                        Use the weight statement to account for the unequal
                                        probability of sampling and non-response. In this
                                        example, the MEC weight for 2 years of data (wtmec2yr)
                                        is used.

table                                   Use the table statement to specify cross-tabulations for
  sel*riagendr*anycalsup*treatosteo     which estimates are requested. In the example, the
  / col row nostd nowt wchisq           estimates are for age greater than or equal to 50 years
  wllchisq chisq chisq1;                (sel) by gender (riagendr), by calcium supplement use
                                        (anycalsup), and by osteoporosis treatment (treatosteo).
                                        The options after the slash will output the column
                                        percent (col), row percent (row), Wald chi-square
                                        (wchisq), and Wald log linear chi-square (wllchisq), and
                                        suppress the standard deviation (nostd) and weighted
                                        sums (nowt). Use the chisq option to obtain the
                                        Rao-Scott chi-square and the chisq1 option to obtain the
                                        Rao-Scott modified chi-square.

format riagendr gender.                 Use the format statement to apply the SAS formats.
  anycalsup yesnos.
  treatosteo yesno.;
run;

IMPORTANT NOTE

For complex survey data such as NHANES, we recommend using the Rao-Scott F adjusted chi-square statistic because it
yields a more conservative interpretation than the Wald chi-square.

IMPORTANT NOTE

SAS version 9.2 and version 9.1.3 produce different estimates of the Rao-Scott Chi-Square test. This is because in version
9.1.3 SURVEYFREQ uses the total (over all tables) sample size in the Rao-Scott computations. In version 9.2, the
procedure uses the individual two-way sample size. SAS recommends the use of version 9.2.

Step 4: Review output

1,123 men and 1,187 women ages 50 years and older (sel=1) have information on calcium supplement use and
osteoporosis treatment.
The row percentages indicate that supplement users tend to be more likely to be treated for osteoporosis than
non-users for both men and women.
However, because the p-values are greater than 0.05, you would fail to reject the null hypothesis that calcium supplement
use and osteoporosis treatment are independent. Under the null hypothesis, the probability of obtaining a test statistic of
0.4282 or more is approximately 0.51 for men ages 50 years and older. For women ages 50 years and older, the
probability of obtaining a test statistic of 3.21 or more is approximately 0.07.

Task 3: Key Concepts about Using Logistic Regression In NHANES

Logistic regression is used to assess the association between independent variable(s) (Xj), sometimes called exposure
or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response
variable. Logistic regression analysis tells you how much an increment in a given exposure variable affects the odds of the
outcome.

Simple logistic regression is used to explore associations between one (dichotomous) outcome and one (continuous,
ordinal, or categorical) exposure variable. For example, you can answer questions like, "How does calcium supplement use
affect the probability of receiving treatment for osteoporosis?", similar to using a chi-square test.

Multiple logistic regression is used to explore associations between one (dichotomous) outcome variable and two or more
exposure variables (which may be continuous, ordinal or categorical). The purpose of multiple logistic regression is to
isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other
variables (called covariates or confounders). For example, you can answer questions like, "How does calcium
supplement use affect the probability of receiving treatment for osteoporosis, after accounting for (or unconfounded by, or
independent of, or holding constant) age, income, etc.?" This process of accounting for covariates or confounders is called
adjustment. For example, say that the prevalence of osteoporosis treatment tends to be lower in younger people; and
younger people are less likely to take calcium supplements. In this case, inferences about osteoporosis treatment and
calcium supplement use get confused by the effect of age on supplement use and osteoporosis treatment. This kind of
"confusion" is called confounding (and these covariates are sometimes called confounders). Confounders are variables
which are associated with both the exposure and outcome of interest. This relationship is shown in the following figure.

Diagram of the Relationship between Exposure, Outcome, and the Confounder

You can use multiple logistic regression to adjust for confounding and isolate the relationship of interest.

Comparing the results of simple and multiple logistic regression can help to answer the question, "How much did the
covariates in the model alter the relationship between exposure and outcome (i.e., How much confounding was there)?"

Research Question

In this module, you will assess the association between calcium supplement use (the exposure variable) and the likelihood
of receiving treatment for osteoporosis (the outcome). You will look at both simple logistic regression and then multiple
logistic regression. The multiple logistic regression model will include age, race/ethnicity, and body mass index (BMI) as
covariates. This analysis will answer the question, “What is the effect of calcium supplement use on the likelihood of
receiving treatment for osteoporosis – after controlling for age, race/ethnicity, and body mass index (BMI)?”

Dependent Variable and Independent Variables

As noted, the dependent variable Y for a Logistic Regression is dichotomous, which means that it can take on one of two
possible values. NHANES includes many questions where people must answer either “yes” or “no”; these include
questions like “Has the doctor ever told you that you have congestive heart failure?”. Alternatively, you can create
dichotomous variables by setting a threshold (e.g., “diabetes” = 1 if fasting blood sugar > 126 and “diabetes”=0 otherwise);
or by combining information from several variables. In this module, we will use a dichotomous variable called “treatosteo”
indicating osteoporosis treatment. In SUDAAN and SAS Survey Procedures, the dependent variable is coded as 1 (for
having the outcome) and 0 (for not having the outcome). A participant was coded as having had treatment for
osteoporosis if he or she responded “yes” to OSQ.070 (“{Were you/Was SP} treated for osteoporosis?”) from the
osteoporosis questionnaire, and as not having had treatment if he or she responded “no” to OSQ.070 or to OSQ.060
(“Has a doctor ever told {you/SP} that {you/s/he} had osteoporosis, sometimes called thin or brittle bones?”). (The SAS
code to create this variable is found in the “Supplement Program” sample SAS code.) In this example, for people who
receive treatment for osteoporosis, the treatosteo variable would have a value of 1, while the variable would have a
value of 0 for people who are not receiving treatment.
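The coding rule just described could be sketched as the data step below. This is an illustration under stated assumptions, not the tutorial's “Supplement Program” code: the dataset names osteo and osteo_coded are hypothetical, and the variable names osq070 and osq060 with codes 1 = yes and 2 = no are assumed.

```sas
/* Sketch only: osteo, osteo_coded, osq070, osq060 and the 1/2 codes are assumptions */
data osteo_coded;
  set osteo;
  if osq070 = 1 then treatosteo = 1;                     /* treated for osteoporosis */
  else if osq070 = 2 or osq060 = 2 then treatosteo = 0;  /* not treated */
  /* all other response patterns leave treatosteo missing */
run;
```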

In logistic regression, the independent variables Xj can be categorical (e.g., gender, race, supplement use (y/n)), ordinal
(e.g., age groups, BMI categories), or continuous (e.g., continuous BMI). There are different ways that one can define
categorical variables using indicator, or “dummy” variables. One common way is to define a reference category, i.e., the
category to which the other levels of the categorical variable are compared.

Note that getting statistical packages like SUDAAN and SAS Survey to run analyses is the easy part of regression. What
is not easy is knowing which variables to include in your analyses, how to represent them, and when to worry about
confounding; determining if your models are any good; and knowing how to interpret them. These tasks require thought,
training, experience, and respect for the underlying assumptions of regression. Remember, garbage in - garbage out.

Finally, remember that NHANES analyses can only establish associations and not causal relationships. This is because
the data are cross-sectional, so there is no way to establish temporal sequences (i.e., which came first the "exposure" or
the "outcome"?).

Logit Function

Because you are trying to find associations between risk factors and a condition, you need a formula that will allow you to
link these variables. The logit function that is used in logistic regression is also known as the link function because it
connects, or links, the values of the independent variables to the probability of occurrence of the event defined by the
dependent variable.

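Logit Model

$$\log\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$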

In the logit formula above, Pr(Y=1) means the “probability that Y=1”, or the probability that the event occurs. In this
equation ‘log' indicates natural log.
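
As an illustration that goes one step beyond the text, the logit can be solved for the probability itself. The symbol η is shorthand introduced here for the linear combination of covariates and is not part of the tutorial's notation:

$$\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

$$\log\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right) = \eta \quad\Longrightarrow\quad \Pr(Y=1) = \frac{e^{\eta}}{1+e^{\eta}}$$

For instance, η = 0 corresponds to a probability of 0.5, and larger values of η push the probability toward 1 without ever reaching it.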

Optional: Learn more about odds ratios, linear and logistic regression

Output of Logistic Regression

The statistics of primary interest in logistic regression are the beta coefficients (β1, β2, β3, ...), their standard errors, and their
p-values. Like other statistics, the standard errors are used to calculate confidence intervals around the beta coefficients.
It is easy to transform the beta coefficients into a more interpretable format, the odds ratio, as follows:

e^β = odds ratio.
If Xj is a dichotomous variable with values of 1 or 0, then the beta coefficient represents the log of the odds ratio, i.e.,
the difference in the log odds of the event between a person with Xj=1 and a person with Xj=0. In a multivariate model,
this beta coefficient is the independent effect of variable Xj on Yi after adjusting for all other covariates in the model.
The odds ratio e^β is the ratio of the odds that a person with Xj=1 will have the event to the odds for a person with Xj=0.
If Xj is a continuous variable, then e^β is the ratio of the odds of the event for a person with Xj=m+1 to the odds for a
person with Xj=m. In other words, for every one-unit increase in Xj, the odds of having the event Yi are multiplied by e^β,
adjusting for all other covariates in a multivariate model.
A summary table about interpretation of beta coefficients is provided below:

Table: What does the Beta Coefficient Mean?

Independent variable type | Example variables | The beta coefficient in simple logistic regression | The beta coefficient in multiple logistic regression
--- | --- | --- | ---
Continuous | height, weight, LDL | The change in the log odds of the dependent variable per 1 unit change in the independent variable. | The change in the log odds of the dependent variable per 1 unit change in the independent variable after controlling for the confounding effects of the covariates in the model.
Categorical (also known as discrete) | Supplement use (two subgroups, yes and no; this example will use no as the reference group) | The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between supplement users and the reference group, non-users). | The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between supplement users and the reference group, non-users), after controlling for the confounding effects of the covariates in the model.
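
The conversion from a beta coefficient to an odds ratio can be sketched in a short data step. All input values below are hypothetical and are not taken from the tutorial output. Whether a t or a normal critical value is appropriate depends on the software and options used; a t critical value with design-based degrees of freedom is assumed here.

```sas
/* Sketch: convert a beta coefficient and its standard error into an odds */
/* ratio with a 95% confidence interval. All inputs are hypothetical.     */
data or_example;
  beta = 0.30;   /* hypothetical beta coefficient (log odds ratio) */
  se   = 0.20;   /* hypothetical standard error of beta            */
  df   = 15;     /* hypothetical design degrees of freedom         */

  tcrit      = tinv(0.975, df);         /* two-sided 95% critical value */
  odds_ratio = exp(beta);               /* point estimate               */
  ci_lower   = exp(beta - tcrit * se);  /* lower confidence limit       */
  ci_upper   = exp(beta + tcrit * se);  /* upper confidence limit       */

  put odds_ratio= ci_lower= ci_upper=;  /* write results to the log */
run;
```

The interval is computed on the log odds scale and then exponentiated, which is why it is not symmetric around the odds ratio.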

IMPORTANT NOTE
Odds and odds ratios are not the same as risk and relative risks.
In particular, odds and probability are two different ways to express the likelihood of an outcome.
Here are their definitions and some examples.

Table of Differences between Odds and Probability

Measure | Definition | Example: Getting heads in 1 flip of a coin | Example: Getting a 1 in a single roll of a die
--- | --- | --- | ---
Odds | (# of times something happens) / (# of times it does NOT happen) | = 1/1 = 1 (or 1:1) | = 1/5 = 0.2 (or 1:5)
Probability | (# of times something happens) / (# of times it could happen) | = 1/2 = 0.5 (or 50%) | = 1/6 ≈ 0.167 (or about 16.7%)

The above example illustrates the difference between odds and probabilities. In the example, the odds of getting a 1 in a
single roll of a die are 0.2 (1:5), whereas the probability is 1/6, or about 16.7%. The same distinction applies to odds
ratios and relative risks: each is a ratio of the odds or the probability, respectively, of the event for one group
compared to another. For example, we might want to know the odds ratio of rolling a 1 with a blue die compared to a red die.

Few people think in terms of odds. Many people equate odds with probability and thus equate odds ratios with risk ratios.
When the outcome of interest is uncommon (i.e., it occurs less than 10% of the time), such confusion makes little
difference, since odds ratios and risk ratios are approximately equal. When the outcome is more common, however, the
odds ratio increasingly overstates the risk ratio. So, to avoid confusion, when event rates are high, odds ratios should
be converted to risk ratios. (Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex
on physicians’ referrals for cardiac catheterization. N Engl J Med 1999;341:279–83) There are simple methods of
conversion for both crude and adjusted data. (Zhang J, Yu KF. What's the relative risk? A method of correcting the odds
ratio in cohort studies of common outcomes. JAMA 1998;280:1690-1691. Davies HT, Crombie IK, Tavakoli M. When can
odds ratios mislead? BMJ 1998;316:989-991)
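
For orientation, the correction proposed in the Zhang and Yu paper cited above can be written as follows, where P0 denotes the proportion of the unexposed (reference) group that has the outcome. The numerical values in the worked lines are hypothetical and chosen only to show the arithmetic:

$$RR = \frac{OR}{(1 - P_0) + (P_0 \times OR)}$$

$$OR = 2,\; P_0 = 0.01:\quad RR = \frac{2}{0.99 + 0.02} \approx 1.98$$

$$OR = 2,\; P_0 = 0.30:\quad RR = \frac{2}{0.70 + 0.60} \approx 1.54$$

With a rare outcome the odds ratio and the risk ratio nearly coincide, while with a common outcome the same odds ratio corresponds to a noticeably smaller risk ratio.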

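The following formulas demonstrate how to convert between probability and odds.

$$\text{odds} = \frac{p}{1-p}, \qquad p = \frac{\text{odds}}{1+\text{odds}}$$

where p is the probability of the event. For the die example, p = 1/6 gives odds = (1/6)/(5/6) = 1/5.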

12/19/2018 NHANES Dietary Web Tutorial: Estimating Prevalence of Supplement Use: Task 3b

Task 3b: How to Perform Logistic Regression Using SAS Survey Procedures

In this module, you will use simple logistic regression to analyze NHANES data to assess the association between calcium
supplement use (anycalsup) — the exposure or independent variable — and the likelihood of receiving treatment for
osteoporosis (treatosteo) — the outcome or dependent variable, among participants ages 20 years old and older. You will
then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates
include gender (riagendr), age (ridageyr), race/ethnicity (ridreth1), and body mass index (bmxbmi).

Step 1. Determine the appropriate weight for the data used

This example uses the demoadv dataset (download at Sample Code and Datasets). This dataset already contains a
variable anycalsup that has a value of 1 for those who report calcium supplement use, and a value of 2 for those who do
not. A participant was considered not to have any calcium supplement use if the daily average amount of calcium
supplement use was zero; otherwise, a participant was considered a supplement user (see Supplement Code under
Sample Code and Module 9, Task 4 for more information).

It is always important to check all the variables in the model, and use the weight of the smallest common denominator. In
the example of univariate analysis, the 2-year MEC weight is used, because the osteoporosis variable is from the MEC
examination. The demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).

Step 2: Create independent categorical variables

This example also illustrates the creation of two additional independent categorical variables, age and bmigrp, from the
continuous age (ridageyr) and BMI (bmxbmi) variables. These new variables are used in the analysis.

Code to Generate Independent Categorical Variables

Independent variable | Code to generate independent categorical variables
--- | ---
Age | if 20 <= ridageyr < 40 then age = 1;
 | else if 40 <= ridageyr < 60 then age = 2;
 | else if ridageyr >= 60 then age = 3;
BMI category | if 0 <= bmxbmi < 25 then bmigrp = 1;
 | else if 25 <= bmxbmi < 30 then bmigrp = 2;
 | else if bmxbmi >= 30 then bmigrp = 3;

Step 3: Create new weight variable for Domain (Subpopulation) Analysis (prior to SAS 9.2) or add domain statement (SAS 9.2 and higher)

You should not use a where clause or by-group processing in order to analyze a subpopulation with the SAS Survey
Procedures. Prior to SAS 9.2, to get an approximate domain (subpopulation) analysis when using proc surveylogistic, you
would assign a near zero weight to observations that do not belong to your current domain. The reason that you cannot
make the weight zero is that the procedure will exclude any observation with zero weight. In this example, you have a
domain (subpopulation) where age is greater than or equal to 20 years, and if you specify in a data step:

if ridageyr GE 20 then newweight=wtmec2yr;

else newweight=1e-6;

you could then perform the logistic regression using the newweight variable as:

weight newweight;

https://www.cdc.gov/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task3b.htm 1/3

IMPORTANT NOTE

The code above with the newweight variable is no longer necessary in SAS 9.2 and higher. The statement

weight newweight;
may be replaced with the statements

weight wtmec2yr;
domain sel;
where sel is defined as

if ridageyr GE 20 then sel= 1 ;
else sel= 2 ;
(Note that for this particular example, osteoporosis treatment is only collected for those ages 20 and over, so you will not
notice a difference whether wtmec2yr or newweight is used. However, if a different age group or variable was used for the
subpopulation, differences would be noted.)

Reference: SAS Technical Support

Step 4: Fit Multiple Logistic Regression Model in SAS

This step introduces you to the SAS procedure for logistic regression, proc surveylogistic. There is a summary table of the
SAS program below.

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset
the same way to reproduce results presented in the tutorial.

SAS Logistic Regression Procedure

Statements | Explanation
--- | ---
proc surveylogistic data=demoadv; | Use the proc surveylogistic procedure to perform multiple logistic regression to assess the association between the outcome and multiple risk factors, including: age, gender, race/ethnicity, and body mass index.
stratum sdmvstra; | Use the stratum statement to specify strata to account for design effects of stratification.
cluster sdmvpsu; | Use the cluster statement to specify primary sampling unit (PSU) to account for design effects of clustering.
weight newweight; | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, you use the new weight variable created in the data step. See Step 3.
class age (ref='40-59') riagendr (ref='Male') ridreth1 anycalsup (ref='No supp use') bmigrp (ref='25<=BMI<30') / param=ref; | Use the class statement to specify all categorical variables in the model. Use the param and ref options to choose your reference group for the categorical variables.
model treatosteo=anycalsup riagendr age ridreth1 bmigrp; | Use the model statement to specify the dependent variable and all independent variable(s) in your logistic regression model.
format riagendr gender. age agegrp. ridreth1 race. anycalsup yesnos. bmigrp bmifmt.; | Use the format statement to apply the SAS formats to all formatted variables.
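
For readers who want to see the statements from the table in one piece, the sketch below assembles them into a single step. It assumes that demoadv already contains age, bmigrp, and newweight from Steps 2 and 3, and that the formats gender., agegrp., race., yesnos., and bmifmt. have been defined by the sample program; none of these are created here.

```sas
/* Sketch: the tabulated statements assembled into one procedure step.    */
/* Assumes age, bmigrp and newweight exist in demoadv (Steps 2 and 3) and */
/* that the formats named below were defined by the sample program.       */
proc surveylogistic data=demoadv;
  stratum sdmvstra;                 /* design strata */
  cluster sdmvpsu;                  /* design PSUs   */
  weight newweight;                 /* near-zero weight outside the domain */
  /* In SAS 9.2 and higher, the weight statement above may be replaced by: */
  /*   weight wtmec2yr;                                                    */
  /*   domain sel;                                                         */
  class age (ref='40-59') riagendr (ref='Male') ridreth1
        anycalsup (ref='No supp use') bmigrp (ref='25<=BMI<30') / param=ref;
  model treatosteo = anycalsup riagendr age ridreth1 bmigrp;
  format riagendr gender. age agegrp. ridreth1 race.
         anycalsup yesnos. bmigrp bmifmt.;
run;
```

The reference levels in the class statement are matched against formatted values, so the quoted labels must agree exactly with the format definitions. It is also worth checking the response profile in the output to confirm which level of treatosteo is being modeled.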

IMPORTANT NOTE

The SAS Survey Procedure, proc surveylogistic, produces the Wald statistic and its p value. It does not produce the
Satterthwaite χ2 or the Satterthwaite F and the corresponding p values recommended for NHANES analyses. For this
reason, it is recommended that you use proc rlogist in SUDAAN for logistic regression.

Step 5: Review SAS Multiple Logistic Regression Output

In this step, the SAS output is reviewed. The highlighted elements show that:

238 respondents receive osteoporosis treatment and 4,385 do not.
Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. The
adjusted odds ratio for osteoporosis treatment is 1.37 (95% C.I. 0.92-2.04) for supplement users compared to non-
users. Because the confidence interval includes 1, the difference is not statistically significant: there is no evidence
that calcium supplement users are more likely to be receiving osteoporosis treatment than non-users, after adjusting
for the covariates.
All other covariates are statistically significant at p-value<0.05, except for BMI.

If you ran both the SAS Survey and SUDAAN programs (or reviewed the output provided on the Sample Code and
Datasets page), you may have noticed slight differences in the output. These differences can be caused by missing data
in any paired PSU or how each software program handles degrees of freedom.

Task 4: Key Concepts about Using the T-Test Statistic

The t-test is used to test the null hypothesis that two population means or proportions, θ1 and θ2, are equal or,
equivalently, that the difference between two population means or proportions is zero. To test this hypothesis, assuming
the covariance is small, as is the case with NHANES data, the following formula is used:

Equation for t-Test where Covariance is Small

$$t = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{SE(\hat{\theta}_1)^2 + SE(\hat{\theta}_2)^2}}$$

where,
$\hat{\theta}_1$ is an estimate of the first population mean or proportion based on a probability sample,
$SE(\hat{\theta}_1)$ is an estimate of the standard error of $\hat{\theta}_1$,
$\hat{\theta}_2$ is an estimate of the second population mean or proportion,
and $SE(\hat{\theta}_2)$ is an estimate of the standard error of $\hat{\theta}_2$.

In instances where only a small number (<30) of independent pieces of information are available with which to estimate the
quantity [$\hat{\theta}_1 - \hat{\theta}_2$], the t-statistic given above follows a Student's t distribution with zero mean and unit variance, and with a
number of degrees of freedom corresponding to the number of independent pieces of information. In a simple random
sample, the number of independent pieces of information is generally equal to the number of people in the sample minus
one. In NHANES, however, the number of independent pieces of information is substantially lower due to the multi-stage
probability sample design. In NHANES, this number (referred to as degrees of freedom) is equal to the number of PSUs
minus the number of strata (see “Module 12: Sample Design” of the Continuous NHANES Tutorial for more information).

The equality of means is usually tested at the 0.05 level of significance. However, at the 0.05 level of significance, some
differences that are not meaningful (usually very small) are significant because of the large sample size.

https://www.cdc.gov/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info4.htm 1/1

12/19/2018 NHANES Dietary Web Tutorial: Estimating Prevalence of Supplement Use: Task 4b

Task 4b: How to Set Up a T-Test in NHANES Using SAS Survey Procedures

In this task, you will use SAS to calculate a t-statistic and assess whether the mean systolic blood pressure of calcium
supplement users differs statistically from that of non-users among participants ages 20 years and older.

Step 1: Identify variables

This example uses the demoadv dataset (download at Sample Code and Datasets). The t-test is used with one
continuous variable and one dichotomous variable. The dataset contains a created variable called anycalsup that has a
value of 1 for those who report calcium supplement use, and a value of 2 for those who do not. A participant was
considered not to have any calcium supplement use if the daily average amount of calcium supplement use was zero;
otherwise, a participant was considered a supplement user (see Supplement Code under Sample Code and Module 9,
Task 4 for more information). A variable called sel is created that defines those individuals 20 years old and older. Blood
pressure is measured in the MEC; therefore MEC weights are used in the analysis. The demoadv dataset for this example
only includes those with MEC weights (wtmec2yr>0):

data demoadv;
set nh.demoadv;
if wtmec2yr> 0 ;
if ridageyr >= 20 then sel= 1 ;
else sel= 2 ;
run ;

Step 2: Compute Properly Weighted Estimated Means

Compute Properly Weighted Estimated Means

Statements | Explanation
--- | ---
proc surveymeans data=demoadv nobs mean stderr; | Use the proc surveymeans procedure to obtain number of observations, mean, and standard error.
stratum sdmvstra; | Use the stratum statement to define the strata variable (sdmvstra).
cluster sdmvpsu; | Use the cluster statement to define the PSU variable (sdmvpsu).
class anycalsup; | Use the class statement to specify the discrete variables used to form the subpopulations of interest. In this example, the subpopulation of interest is by supplement use (anycalsup).
domain sel sel*anycalsup; | Use the domain statement to specify the table layout to form the subpopulations of interest. This example uses age greater than or equal to 20 (sel) by supplement use (anycalsup).
var mean_sbp; | Use the var statement to name the variable(s) to be analyzed. In this example, the mean systolic blood pressure variable (mean_sbp) is used.
weight wtmec2yr; | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 2 years of data (wtmec2yr) is used.
format anycalsup yesnos.; | Formats the anycalsup variable.
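
Put together, the statements from this table form the single step sketched below. It assumes the demoadv dataset created in Step 1, including the sel variable, and that the yesnos. format has been defined by the sample program.

```sas
/* Sketch: weighted means of systolic blood pressure by supplement use.  */
/* Assumes demoadv from Step 1 (with sel) and the yesnos. format from    */
/* the sample program.                                                   */
proc surveymeans data=demoadv nobs mean stderr;
  stratum sdmvstra;            /* design strata                        */
  cluster sdmvpsu;             /* design PSUs                          */
  class anycalsup;             /* discrete variable                    */
  domain sel sel*anycalsup;    /* ages 20+, then ages 20+ by supp. use */
  var mean_sbp;                /* analysis variable                    */
  weight wtmec2yr;             /* 2-year MEC weight                    */
  format anycalsup yesnos.;
run;
```

Because the full dataset is passed to the procedure and the subpopulation is requested through the domain statement, the variance estimates still reflect the complete sample design.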

Step 3: Interpret Results

Highlights from the output include:

4,392 respondents ages 20 and older were included in this analysis; 2,050 respondents were supplement users and
2,342 were non-users.
The mean systolic blood pressure for calcium supplement users ages 20 years and older was 124.3 mm Hg, and the mean
systolic blood pressure for non-users ages 20 and older was 122.2 mm Hg.

Step 4: Use a t-test to Test for Significance

A t-test is used to test whether the mean systolic blood pressure in calcium supplement users is statistically different from
the mean systolic blood pressure in non-users. In SAS, a simple linear model using proc surveyreg may be used to obtain
the t-test.

Code to Test for Significance

Statements | Explanation
--- | ---
proc surveyreg data=demoadv; | Use the proc surveyreg procedure to fit a simple linear model, which provides the t-test for the difference in means.
stratum sdmvstra; | Use the stratum statement to define the strata variable (sdmvstra).
cluster sdmvpsu; | Use the cluster statement to define the PSU variable (sdmvpsu).
model mean_sbp=anycalsup; | Use the model statement to specify the outcome variable (mean_sbp) and the 2-level variable that defines the groups being compared. In this example, the groups are defined by supplement use (anycalsup).
domain sel; | Use the domain statement to restrict the analysis to the subpopulation of interest. This example uses age greater than or equal to 20 (sel).
weight wtmec2yr; | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 2 years of data (wtmec2yr) is used.
format anycalsup yesnos.; | Formats the anycalsup variable.
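
Assembled into one step, the statements in this table look like the sketch below. It assumes the same demoadv dataset and yesnos. format as the previous step.

```sas
/* Sketch: t-test for a difference in mean systolic blood pressure,     */
/* obtained from a simple linear model. Assumes demoadv from Step 1     */
/* (with sel) and the yesnos. format from the sample program.           */
proc surveyreg data=demoadv;
  stratum sdmvstra;              /* design strata             */
  cluster sdmvpsu;               /* design PSUs               */
  model mean_sbp = anycalsup;    /* outcome = 2-level group   */
  domain sel;                    /* ages 20 and older (sel=1) */
  weight wtmec2yr;               /* 2-year MEC weight         */
  format anycalsup yesnos.;
run;
```

Because anycalsup is not listed in a class statement, it enters the model as a numeric variable coded 1 and 2. The slope is then the difference in means associated with moving from code 1 to code 2, so the sign of the t-statistic depends on that coding while its magnitude and p-value do not.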

Highlights from the output include:

4,392 respondents ages 20 years and older were included in this analysis where sel=1.
The null hypothesis is that there is no relationship between systolic blood pressure and calcium supplement use, or
that the mean systolic blood pressure for supplement users equals the mean systolic blood pressure for non-users.

To test this hypothesis, the t-statistic with 15 degrees of freedom is computed as 3.73. The p-value is 0.0020. The
difference is 2 mm Hg.
Therefore, the null hypothesis is rejected at the 0.05 level and it is concluded that the mean systolic blood pressure
in calcium supplement users does not equal the mean systolic blood pressure in non-users.
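
The p-value for a given t-statistic and number of degrees of freedom can be checked by hand, as in the sketch below. The t-statistic is the one quoted above. The PSU and stratum counts are assumptions chosen to be consistent with the 15 degrees of freedom reported, and the dataset name ttest_check is hypothetical.

```sas
/* Sketch: two-sided p-value from a t-statistic and design degrees of   */
/* freedom. The PSU and stratum counts are assumptions consistent with  */
/* the 15 degrees of freedom reported in the output highlights.         */
data ttest_check;
  n_psu    = 30;                 /* assumed number of PSUs           */
  n_strata = 15;                 /* assumed number of strata         */
  df = n_psu - n_strata;         /* degrees of freedom               */
  t  = 3.73;                     /* t-statistic reported above       */

  p_two_sided = 2 * (1 - probt(abs(t), df));   /* two-sided p-value */
  put df= p_two_sided= 6.4;      /* write results to the log        */
run;
```

The value written to the log can be compared with the p-value reported in the procedure output.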

https://www.cdc.gov/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task4b.htm 3/3

12/19/2018 NHANES Dietary Web Tutorial: Examine the Relationship Between Supplement Use and a Categorical Outcome Using a Chi-Square Test

Task 5: Key Concepts about Linear Regression

In cross-sectional surveys such as NHANES, linear regression analyses can be used to examine the association between
multiple covariates and a health outcome measured on a continuous scale. For example, we will assess the association
between systolic blood pressure (Y) and selected covariates (Xi) in this module. The covariates in this example will
include calcium supplement use, race/ethnicity, age, and body mass index (BMI).

Simple linear regression is used when you have a single independent variable (e.g., supplement use); multiple linear
regression may be used when you have more than one independent variable (e.g., supplement use and one or more
covariates). Multiple regression allows you to examine the effect of the exposure of interest on the outcome after
accounting for the effects of other variables (called covariates or confounders).

Simple linear regression is used to explore associations between one (continuous, ordinal or categorical) exposure and
one (continuous) outcome variable. Simple linear regression lets you answer questions like, "How does systolic blood
pressure vary with supplement use?".

Multiple linear regression is used to explore associations between two or more exposure variables (which may be
continuous, ordinal or categorical) and one (continuous) outcome variable. The purpose of multiple linear regression is to
isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other
variables called covariates. For example, say that systolic blood pressure values tend to be lower in younger people; and
younger people are less likely to take calcium supplements. In this case, inferences about systolic blood pressure and
calcium supplement use get confused by the effect of age on supplement use and blood pressure. This kind of "confusion"
is called confounding (and these covariates are sometimes called confounders). Confounders are variables which are
associated with both the exposure and outcome of interest. This relationship is shown in the following figure.

Diagram of the Relationship between Exposure, Outcome, and the Confounder

You can use multiple linear regression to adjust for confounding and isolate the relationship of interest. In this example, the
relationship is between systolic blood pressure level and calcium supplement use. That is, multiple linear regression lets
you answer the question, "How does systolic blood pressure vary with calcium supplement use, after accounting for — or
unconfounded by — or independent of — age?" As mentioned, you can include many covariates at one time. The
process of accounting for covariates is also called adjustment.

Comparing the results of simple and multiple linear regressions can help to answer the question "How much did the
covariates in the model distort the relationship between exposure and outcome (i.e., how much confounding was there)?"

Note that getting statistical packages like SUDAAN, SAS Survey, and Stata to run analyses is the easy part of regression.
What is not easy is knowing which variables to include in your analyses, how to represent them, when to worry about
confounding, determining if your models are any good, and knowing how to interpret them. These tasks require thought,
training, experience, and respect for the underlying assumptions of regression. Remember, garbage in - garbage out.

Finally, remember that NHANES analyses can only establish associations and not causal relationships. This is because
the data are cross-sectional, so there is no way to establish temporal sequences (i.e., which came first the "exposure" or
the "outcome"?).

https://www.cdc.gov/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Info5.htm 1/4

This module will assess the association between systolic blood pressure (the continuous outcome variable) and selected
covariates to show how to use linear regression with SUDAAN and SAS. The covariates in this example will include
calcium supplement use, race/ethnicity, age, and body mass index (BMI). In other words, what is the effect of each of
these variables, independent of the effect of the other variables?

Simple Linear Regression Model

In the simplest case, you plot the values of a dependent, continuous variable Y against an independent, continuous
variable X1, (i.e. a correlation) and see the best-fit line that can be drawn through the points.

The first thing to do is make sure the relationship of interest is linear (since linear regression draws a straight line through
data points). The best way to do this is to look at a scatterplot. If the relationship between variables is linear, continue
(see panels A and B below). If it is not linear, do not use linear regression. In this case, you can try transforming the
data or using other forms of regression such as polynomial regression.

Example of a Linear Relationship (figure: Panel A, Panel B)

Example of a Non-linear Relationship (figure: Panel C, Panel D)

This relationship between X1 and Y can be expressed as

Equation for Simple Linear Regression

$$Y = b_0 + b_1 X_1 + \varepsilon \qquad (1)$$

b0, also known as the intercept, denotes the point at which the line intersects the vertical axis; b1, or the slope, denotes the
change in dependent variable, Y, per unit change in independent variable, X1; and ε indicates the degree to which the plot
of Y against X1 differs from a straight line. Note that for survey data, ε is always greater than 0.

Multiple Regression Model

You can further extend equation (1) to include any number of independent variables Xi , where i=1,..,n (both continuous
(e.g. 0-100) and discrete (e.g. 0,1 or yes/no)).

Equation for Multiple Regression Model

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + \varepsilon \qquad (2)$$

The choice of variables to include in equation (2) can be based on results of univariate analyses, where Xi and Y have a
demonstrated association. It also can be based on empirical evidence where a definitive association between Y and an
independent variable has been demonstrated in previous studies.

Polynomial Regression

It is possible to have two continuous variables, Y and X1, on sampled individuals such that if the values of Y are plotted
against the values of X1, the resulting plot would resemble a parabola (i.e., the value of Y could increase with increasing
values of X, level off and then decline). A polynomial regression model is used to describe this relationship between X1
and Y and is expressed as

Equation for Polynomial Regression

$$Y = b_0 + b_1 X_1 + b_2 X_1^2 + \varepsilon \qquad (3)$$

Interaction

Consider the situation described in equation (2), where a discrete independent variable, X2, and a continuous independent
variable, X1, affect a continuous dependent variable, Y. This relationship would yield two straight lines, one showing the
relationship between Y and X1 for X2=0, and the other showing the relationship of Y and X1 for X2=1. If these straight lines
were parallel, the rate of change of Y per unit change in X1 would be the same for X2=0 as for X2=1, and therefore, there
would be no interaction between X1 and X2. If the two lines were not parallel, the relationship between Y and X1 would
depend upon the value of X2, and therefore there would be an interaction between X1 and X2.
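
One way to make the parallel-lines argument concrete is to add a product term to the two-variable model. The coefficient b3 below is notation introduced for this sketch and does not appear in the tutorial's equations:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2 + \varepsilon$$

$$X_2 = 0:\quad Y = b_0 + b_1 X_1 + \varepsilon$$

$$X_2 = 1:\quad Y = (b_0 + b_2) + (b_1 + b_3) X_1 + \varepsilon$$

The two lines have slopes b1 and b1 + b3, so they are parallel exactly when b3 = 0. Under this parameterization, testing whether b3 differs from zero is therefore a test for interaction between X1 and X2.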

Interpretation of Coefficients

For continuous independent variables, the beta coefficient indicates the change in the dependent variable per unit change
in the independent variable, controlling for the confounding effects of the other independent variables in the
model. A discrete random variable, X1, can assume 2 or more distinct values corresponding to the number of subgroups in
a given category. One subgroup (usually arbitrarily) is designated as the reference group. The beta coefficient for a
discrete variable indicates the difference in the dependent variable for one value of Xi (e.g., the difference between
supplement users and the reference group, non-users), when all other independent variables in the model are held
constant. A positive value for the beta coefficient indicates a larger value of the dependent variable for the subgroup
(supplement users) than for the reference group (non-users), whereas a negative value for the beta coefficient indicates a
smaller value.

Interpretation of Coefficients Summary Table

Independent variable type | Examples | What does the b coefficient mean in simple linear regression? | What does the b coefficient mean in multiple linear regression?
--- | --- | --- | ---
Continuous | height, weight, LDL | The change in the dependent variable per unit change in the independent variable. | The change in the dependent variable per unit change in the independent variable after controlling for the confounding effects of the covariates in the model.
Categorical | Supplement use (2 subgroups, users and non-users, where one is designated as the reference group (non-users, in this example)) | The difference in the dependent variable for one value of categorical variable (e.g., the difference between supplement users and the reference group, non-users). | The difference in the dependent variable for one value of categorical variable (e.g., between supplement users and the reference group, non-users), after controlling for the confounding effects of the covariates in the model.

SUDAAN (proc regress), SAS Survey (proc surveyreg), and Stata (svy:regress) procedures produce beta coefficients,
standard errors for these coefficients, confidence intervals, a t-statistic for the null hypothesis (i.e., β=0), and a p-value for the
t-statistic (i.e., the probability of obtaining a value at least as extreme as the observed t-statistic if the null hypothesis is true).

12/19/2018 NHANES Dietary Web Tutorial: Estimating Prevalence of Supplement Use: Task 5b

Task 5b: How to Perform Linear Regression Using SAS Survey Procedures

This example uses the demoadv dataset (download at Sample Code and Datasets). In this example, you will assess the
association between systolic blood pressure (mean_sbp) — the outcome variable — and calcium supplement use
(anycalsup) — the exposure variable — after controlling for selected covariates in NHANES 2003-2004. These covariates
include race/ethnicity (ridreth1), age (ridageyr), and body mass index (BMI) (bmxbmi).

Step 1: Specify the variables in the model

The demoadv dataset for this example only includes those with MEC weights (wtmec2yr>0).

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a
categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should
reflect the underlying distribution of the continuous variable and not create categories where there are only a few
observations. It is important to examine the data both ways, since the assumption that an independent variable has a
continuous relationship with the outcome may not be true. Looking at the categorical version of the variable will help you
to know whether this assumption is true. For example, you could model BMI as a continuous variable or convert it into a
categorical variable based on standard BMI definitions of underweight, normal weight, overweight and obese. Here is how
categorical BMI variables and eligibility variables are created:

Table of Code to Generate Categorical BMI and Eligibility Variables

Code to generate categorical BMI variables                           Category

if 0 le bmxbmi lt 18.5 then bmicat=1;                                underweight
else if 18.5 le bmxbmi lt 25 then bmicat=2;                          normal weight
else if 25 le bmxbmi lt 30 then bmicat=3;                            overweight
else if bmxbmi ge 30 then bmicat=4;                                  obese

if (dxdtobmd^=. and ridreth1^=. and ridageyr^=. and bmxbmi^=. and
anycalsup^=.) and wtmec2yr>0 and (ridageyr>=20) then eligible=1;     eligibility
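
The statements in the table above would normally sit inside a data step. The following is a minimal sketch of one way to do that, not the tutorial's own sample program: the output dataset name demoadv_model is hypothetical, and the variable names (including dxdtobmd in the eligibility test) are copied unchanged from the table. The proc freq step is an added, unweighted check that no category holds only a few observations.

```sas
/* Sketch only: demoadv_model is a hypothetical output dataset name. */
data demoadv_model;
    set demoadv;

    /* BMI categories. A missing bmxbmi fails every comparison below, */
    /* so bmicat stays missing for those records.                     */
    if 0 le bmxbmi lt 18.5 then bmicat = 1;          /* underweight   */
    else if 18.5 le bmxbmi lt 25 then bmicat = 2;    /* normal weight */
    else if 25 le bmxbmi lt 30 then bmicat = 3;      /* overweight    */
    else if bmxbmi ge 30 then bmicat = 4;            /* obese         */

    /* Eligibility flag, exactly as listed in the table above. */
    if (dxdtobmd ^= . and ridreth1 ^= . and ridageyr ^= . and bmxbmi ^= .
        and anycalsup ^= .) and wtmec2yr > 0 and (ridageyr >= 20)
        then eligible = 1;
run;

/* Unweighted counts, including missing values, to inspect cell sizes. */
proc freq data=demoadv_model;
    tables bmicat eligible / missing;
run;
```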

IMPORTANT NOTE

These programs use variable formats listed in the sample program. You may need to format the variables in your dataset
the same way to reproduce results presented in the tutorial.

Step 2: Fit a simple linear regression model

The association between the dependent and independent variables is expressed using the model statement in the
proc surveyreg procedure. The dependent variable must be a continuous variable and always appears on the left-hand
side of the equation. The variables on the right-hand side of the equation are the independent variables and may be
discrete or continuous.

Discrete variables are specified using a class statement. In proc surveyreg, the dependent variable is NEVER specified in
a subgroup or a class statement because it must be a continuous variable.

Code to Fit a Simple Linear Regression Model

Statements                      Explanation

proc surveyreg data=demoadv;    Use the proc surveyreg procedure to fit the linear
                                regression model.

stratum sdmvstra;               Use the stratum statement to define the strata variable
                                (sdmvstra).

cluster sdmvpsu;                Use the cluster statement to define the PSU variable
                                (sdmvpsu).

class anycalsup;                Use the class statement to define a dummy variable for
                                the independent variable (anycalsup).

model mean_sbp=anycalsup;       Use the model statement to specify the dependent
                                variable (mean_sbp) and the independent variable
                                (anycalsup).

domain eligible;                Use the domain statement to define the subpopulation
                                of interest. This example uses the eligible
                                participants.

weight wtmec2yr;                Use the weight statement to account for the unequal
                                probability of sampling and non-response. In this
                                example, the MEC weight for 2 years of data
                                (wtmec2yr) is used.

format anycalsup yesnos.;       Formats the anycalsup variable.
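
Assembled into a single step, the statements listed for the simple model might look like the sketch below. Two things are assumptions rather than part of the listed statements: the closing run statement, and the expectation that the yesnos. format has already been defined (the tutorial says its formats come from the sample program).

```sas
/* Simple linear regression of systolic blood pressure on supplement use. */
/* Assumes the yesnos. format was created earlier with proc format.       */
proc surveyreg data=demoadv;
    stratum sdmvstra;              /* strata variable                      */
    cluster sdmvpsu;               /* PSU variable                         */
    class anycalsup;               /* treat the exposure as categorical    */
    model mean_sbp = anycalsup;    /* outcome = exposure                   */
    domain eligible;               /* subpopulation of eligible adults     */
    weight wtmec2yr;               /* 2-year MEC exam weight               */
    format anycalsup yesnos.;
run;
```

Using the domain statement rather than subsetting the dataset keeps all sampled records available for variance estimation. If the parameter estimates do not appear in the output, adding the solution option to the model statement (as the multiple regression model in Step 3 does) requests them when a class statement is present.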

Highlights from the output include:

4,392 respondents ages 20 years and older with complete data for the dependent and independent variables were
included in this analysis.
The results from the first model indicate that, on average, systolic blood pressure is 2.04 mm Hg higher in
calcium supplement users than in non-users.
This value is significantly greater than 0 (p-value = 0.0014).

Step 3: Fit a multiple linear regression model

Code to Fit a Multiple Linear Regression Model

Statements                      Explanation

proc surveyreg data=demoadv;    Use the proc surveyreg procedure to fit the linear
                                regression model.

stratum sdmvstra;               Use the stratum statement to define the strata variable
                                (sdmvstra).

cluster sdmvpsu;                Use the cluster statement to define the PSU variable
                                (sdmvpsu).

class riagendr anycalsup        Use the class statement to define dummy variables.
  ridreth1 bmicat;

model mean_sbp=anycalsup        Use the model statement to specify the dependent variable
  riagendr ridreth1 ridageyr    (mean_sbp) and the independent variables (anycalsup
  bmicat / solution;            riagendr ridreth1 ridageyr bmicat).

domain eligible;                Use the domain statement to define the subpopulation
                                of interest. This example uses the eligible
                                participants.

weight wtmec2yr;                Use the weight statement to account for the unequal
                                probability of sampling and non-response. In this example,
                                the MEC weight for 2 years of data (wtmec2yr) is used.

format anycalsup yesnos.;       Formats the anycalsup variable.
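
Put together, the multiple regression statements might be run as in the following sketch. As before, the run statement and the prior definition of the yesnos. format are assumptions; everything else is taken from the statements listed above.

```sas
/* Multiple linear regression: supplement use adjusted for gender,      */
/* race/ethnicity, age (continuous), and BMI category.                   */
/* Assumes the yesnos. format was created earlier with proc format.      */
proc surveyreg data=demoadv;
    stratum sdmvstra;
    cluster sdmvpsu;
    class riagendr anycalsup ridreth1 bmicat;    /* categorical predictors */
    model mean_sbp = anycalsup riagendr ridreth1 ridageyr bmicat
          / solution;                            /* print the estimates    */
    domain eligible;
    weight wtmec2yr;
    format anycalsup yesnos.;
run;
```

Age (ridageyr) is left out of the class statement, so it enters the model as a continuous term. If confidence limits for the coefficients are wanted, the clparm option can be added after solution on the model statement.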

Step 4: Review Output and Highlights of the Results

In this step, the SAS output is reviewed.

There are 4,327 observations used for the subpopulation.
Systolic blood pressure is 0.42 mm Hg lower in supplement users compared to non-users, after adjusting for the
other variables in the model.
The F test for calcium supplement use indicates that this effect is not significant (p = 0.47).
Therefore, the null hypothesis is not rejected at the 0.05 level and it is concluded that the mean systolic blood
pressure in calcium supplement users is not different from mean systolic blood pressure in non-users, after
adjusting for gender, race/ethnicity, age, and BMI category.

https://www.cdc.gov/nchs/tutorials/dietary/Advanced/EstimatePrevalence/Task5b.htm 3/3

NHANES Dietary Web Data Tutorial - Modeling Usual Intake Using Die... https://www.cdc.gov/nchs/tutorials/Dietary/Advanced/ModelUsualIntake...

Modeling Usual Intake Using Dietary Recall Data

Purpose

For simplicity, dietary recommendations intended to achieve nutrient adequacy and promote health are often expressed in
terms of daily targets. However, because nutrients are stored in the body, it is unnecessary to achieve those targets every day.
Furthermore, because dietary intake varies from day to day, “usual” or long-term average intake is a key concept in dietary
assessment.

Unfortunately, all self-report dietary assessment tools are prone to error. The types of errors vary by instrument, and are
described in Module 12 of the Dietary Tutorial. (We recommend that you review Module 12 before beginning this module.) Due
to the presence of these errors, statistical methods are needed to estimate usual intake of dietary constituents.

This module provides essential background for Modules 19-22. It describes the history of usual intake estimation, statistical
methods that have been used to estimate the distribution of dietary intake, and the development of a unified framework for
estimating usual dietary intakes (“the NCI method”). It also illustrates the use of balanced repeated replication (BRR) to
estimate standard errors; BRR is used in Modules 19-22.

Task 1: Describing Measurement Error

Measurement error may have a large impact on estimating usual intake of dietary constituents. In this task, different types of
measurement error and the impact that this error can have on estimates of usual intake are described.

Key Concepts about Measurement Error (/nchs/tutorials/dietary/Advanced/ModelUsualIntake/Info1.htm)

Task 2: Describing Statistical Methods that Have Been Used to Estimate the Distribution of
Usual Intake with a Few Days of 24-hour Recalls

Statistical methods have been developed to estimate the distribution of usual intake when 24-hour recalls are used to assess
dietary intake. This task describes these methods, including similarities and differences.

Key Concepts about Statistical Methods that have been used to Estimate the Distribution of Usual Intake with a Few Days
of 24-hour Recalls (/nchs/tutorials/dietary/Advanced/ModelUsualIntake/Info2.htm)

Task 3: Using a Unified Framework to Estimate Usual Dietary Intakes

In collaboration with colleagues from numerous institutions, the National Cancer Institute (NCI) developed a unified
framework to predict usual dietary intakes of episodically-consumed or ubiquitously-consumed dietary constituents. This
method requires that two or more 24-hour recalls are available for at least a subset of a sample. It can be used for a variety of
general applications.

Key Concepts about Using a Unified Framework to Estimate Usual Dietary Intakes
(/nchs/tutorials/dietary/Advanced/ModelUsualIntake/Info3.htm)

Task 4: Using Balanced Repeated Replication to Estimate Standard Errors

Balanced repeated replication (BRR) is used to estimate variance in succeeding modules.

Key Concepts about using Balanced Repeated Replication (BRR)
(/nchs/tutorials/dietary/Advanced/ModelUsualIntake/Info4.htm)
How to Estimate Standard Errors with Balanced Repeated Replication (BRR) Using SAS
(/nchs/tutorials/dietary/Advanced/ModelUsualIntake/Task4.htm)
Download Sample Code and Datasets (/nchs/tutorials/dietary/downloads/downloads.htm)

Page last updated: July 20, 2011

Page last reviewed: July 20, 2011
Content source: CDC/National Center for Health Statistics
Page maintained by: NCHS/NHANES

Centers for Disease Control and Prevention 1600 Clifton Road Atlanta, GA 30329-4027, USA
800-CDC-INFO (800-232-4636) TTY: (888) 232-6348 - Contact CDC–INFO
