STATISTICS
Lecture Notes
Suryanti Bt Saadon
Nik Zuraini Bt Nik Mahmood
Suryanti Bt Saadon
Lecturer in Business Studies
Nik Zuraini Bt Nik Mahmood
Lecturer in Marketing
We would like to take this opportunity to thank
those who have contributed directly or indirectly
to the preparation and publication of this ebook,
especially our beloved families, our Head of
Department, our Head of Program and our
colleagues.
Preface
This ebook is a compilation of lecture notes
extracted from various textbooks on statistics from
local and international publishers and also
from the internet. There are 7 topics in this book,
which follow the syllabus of Politeknik
Malaysia. The objective is to give students
quick reference notes for their studies.
Abstract
Contents
Content
Topic 1: Introduction to Statistics
Topic 2: Data Presentation
Topic 3: Measures of Central Tendency
Topic 4: Measures of Dispersion and Skewness
Topic 5: Correlation and Regression
Topic 6: Hypothesis Testing
Topic 7: Elementary of Probability Concept
t-table
Learning Outcomes:
At the end of this topic, you should be able to
1. Explain the meaning of statistics
2. Compare types of statistics
3. Identify sources of data
4. Identify type of data
5. Explain statistical terms
6. Explain data collection methods
Topic 1
INTRODUCTION TO STATISTICS
WHAT IS STATISTICS?
• Statistics is a branch of mathematics that deals with the study of data. It is not merely the counting of
people, animals, trees and so on. Data are raw scores that do not provide any useful
information until they are analyzed using a particular method, such as a statistical method. Data
need to be processed to obtain useful and meaningful information.
• The word statistics comes indirectly from the Latin status and the German Statistik, meaning
‘collection of data involving the State’. The term statistics later came to be used to describe the
collection of any sort of data.
• Statistics can be defined as the science of conducting studies that deal with the
collection, organization, analysis, interpretation and presentation of data.
WHY DO WE NEED TO STUDY STATISTICS?
Statistics is used in almost all fields of human endeavor.
• In business, statistics is widely used in marketing, finance, accounting, auditing, human
resources, operations management and so on. For example, a Human Resources manager
might use statistics to find the percentage of employee attendance or the current number of
employees before hiring new workers.
• Marketers need statistics to predict their future sales. Statistics provide
demographic information such as age, income level and consumer preferences for the
target market or potential customers.
• In a restaurant, the manager uses statistics to
examine the variability of important performance
measures, such as customer orders and food
cost, to plan work schedules and to estimate
material purchases. The chefs use
statistics when working out recipe ingredients.
• Statistics is also applied in sport, health
and our daily life. Can you tell how you apply
statistics in your daily life?
A study of the foods suggested for a balanced diet for children.
Source: Balance diet chart for children, Mdhealth.com
TYPES OF STATISTICS
There are two types of statistics:
1. Descriptive statistics
• It is used to describe the characteristics of a variable and to summarize
numerical data.
• It does not generalize from the sample to the population from which the sample was taken.
• Descriptive statistics use indicators such as the mean, median, mode and standard deviation to
state the characteristics of a variable.
• The data are also presented in a meaningful form such as a histogram, pie chart, frequency polygon
and so on.
2. Inferential statistics
• Inferential statistics is used to make inferences about the characteristics of a
population based on sample data.
• It consists of generalizing from samples to populations, performing estimations and
hypothesis tests, determining relationships among variables, and making predictions.
• Inferential statistics use tools such as the chi-square test, t-test, ANOVA and so on.
SOURCES OF DATA
1. Primary data
• Primary data are data observed or collected from first-hand experience.
• For example, data gathered by observing events, people or objects, or by administering
questionnaires to individuals.
2. Secondary data
• Data taken from existing or published sources, or collected in the past or by other parties,
are called secondary data.
• Certain types of information, such as the background details of a company, can be
obtained from available published records, the company's website, its archives and
other sources.
• Written information such as journals, newspapers, magazines and company policies can be
obtained from the organization’s records and documents.
TYPES OF DATA
There are TWO types of data:
1. Qualitative data
• Information about qualities that cannot be measured numerically.
• Qualitative data are variables that can be placed into distinct categories according to
some characteristic or attribute.
• For example: gender, colours, textures, smells, tastes, etc.
2. Quantitative data
• Quantitative data are numerical and can be ordered or ranked.
• Quantitative variables can be classified into two groups:
a) Discrete quantitative
A discrete variable can assume only certain values, with no other values in
between (countable).
The data count whole, indivisible entities.
Eg: number of children, number of houses, etc.
b) Continuous quantitative
A continuous variable can assume any numerical value within a specified
interval.
Eg: temperature, a person’s weight, income, etc.
STATISTICAL TERMS
The most common terms in statistics are:
Population - all people or items that share one or more characteristics from which data can
be gathered and analyzed.
Sample - a set of individuals or items selected from a population for analysis to yield
estimates of, or to test hypotheses about, parameters of the whole population.
DATA COLLECTION METHOD
Data collection methods are an integral part of research design. Each collection method has
its own advantages and disadvantages. There are a variety of ways to collect data from
respondents. Some of them are:
Personal interview
• It involves direct interaction (face-to-face method), conversation or meeting between
interviewer and interviewee. The purpose of conducting a personal interview survey is to
explore the responses of the people to gather more and deeper information.
Telephone surveys
• An interview conducted over telephone where telephone numbers are used to contact
potential respondents, either from the general population or from a known sample (for
example, bank customers or members of an organization).
Direct/ mailed questionnaire surveys
• A direct questionnaire is a research instrument consisting of a series of questions and
other prompts for the purpose of gathering information from respondents. The
questions are given directly to the respondents, while a mailed questionnaire collects
the data by post.
An experiment
• It is a controlled study in which the researcher attempts to understand cause-and-effect
relationships. The study is "controlled" in the sense that the researcher controls how
subjects are assigned to groups and which treatments each group receives.
Observation
• It is a technique that involves the direct observation of phenomena in their natural
setting. It is the act of recognizing and noting facts or occurrences. No questions are asked
during data collection.
Internet survey
• An online survey is the systematic gathering of data from the target audience
characterized by the invitation of the respondents and the completion of the
questionnaire over the World Wide Web (www).
Learning Outcomes:
At the end of this topic, you should be able to
1. Construct frequency distribution tables
2. Organize quantitative data
• Construct histogram
• Construct frequency polygon
• Construct ogive
Topic 2
DATA PRESENTATION
In statistics, a frequency distribution is a table that displays the frequency of various
outcomes in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval, and in this way, the table
summarizes the distribution of values in the sample.
ELEMENTS OF THE FREQUENCY TABLES
Elements in the frequency distribution tables:
1. Range
2. Number of class
3. class interval
4. frequency
5. cumulative frequency
6. class boundaries
7. mid-point
8. relative frequency
HOW TO CONSTRUCT THE FREQUENCY TABLES
Complete the following three steps before constructing a frequency table.
Step 1: Find the range
= Highest value – lowest value
Step 2: Determine the number of classes, k.
= 1 + 3.3 log n, where n is the number of observations
!!! The number of classes must be a whole number, so round the result to the nearest whole number.
Step 3: Find the size class
= range / number of classes
!!! In case of a fractional result, the next higher whole number is taken as the size of the
class interval. It must be rounded up, not rounded off.
Example:
The data set shown here represents the number of hours that 18 part-time employees
worked at the My Palace Park during a randomly selected week in June.
16 25 18 39 25 17
22 18 12 23 32 35
20 19 25 26 25 20
Step 1: Range
= Highest value – lowest value
= 39 – 12
= 27
Step 2: Number of classes, k.
= 1 + 3.3 log n
= 1 + 3.3 log 18
= 1 + 4.14
= 5.14
=5
Step 3: Size class
= range / k
= 27 / 5
= 5.4
=6
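The three steps above can be sketched in Python as a quick check of the hand calculation. This is only an illustration: the variable names are our own, and the rounding choices (nearest whole number for k, rounding up for the size class) follow the notes above.

```python
import math

# Hours worked by the 18 part-time employees in the example
hours = [16, 25, 18, 39, 25, 17,
         22, 18, 12, 23, 32, 35,
         20, 19, 25, 26, 25, 20]

# Step 1: range = highest value - lowest value
data_range = max(hours) - min(hours)

# Step 2: number of classes, k = 1 + 3.3 log n (log to base 10),
# rounded to the nearest whole number
k = round(1 + 3.3 * math.log10(len(hours)))

# Step 3: size class = range / k, always rounded UP
size_class = math.ceil(data_range / k)

print(data_range, k, size_class)  # prints: 27 5 6
```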
Class interval
When a large amount of data is arranged (in statistics), the values are grouped into classes
to give an idea of the distribution. The range of values covered by each class is called the Class
Interval.
Class interval
12 - 17
18 - 23
24 - 29
30 - 35
36 - 41
Lower limit of the first class = lowest value = 12
Upper limit of the first class = 12 + 6 (size class) - 1 = 17
Lower limit of the second class = 12 + 6 (size class) = 18
!!! To construct class interval:
Discrete number     Minus 1
1 decimal place     Minus 0.1
2 decimal places    Minus 0.01
Tally marks and frequency
Tally or mark each observation in the class it belongs to. One observation falls into only one
class. Each bundle (||||) consists of 5. Count the tally marks and record the count in the frequency
table. Find the total frequency.
Class Interval   Tally marks   Frequency
12 - 17          |||           3
18 - 23          |||| ||       7
24 - 29          ||||          5
30 - 35          ||            2
36 - 41          |             1
                               ∑ = 18
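The tallying can also be sketched in code. The snippet below is a minimal illustration, not part of the original notes: it rebuilds the five class limits from the lowest value (12) and the size class (6), then counts how many observations fall inside each class.

```python
hours = [16, 25, 18, 39, 25, 17,
         22, 18, 12, 23, 32, 35,
         20, 19, 25, 26, 25, 20]

lowest, size_class, k = 12, 6, 5

# Class limits: lower = lowest + i * size, upper = next lower limit - 1
classes = [(lowest + i * size_class, lowest + (i + 1) * size_class - 1)
           for i in range(k)]

# Each observation falls into exactly one class
frequencies = [sum(lo <= h <= hi for h in hours) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
print("Total:", sum(frequencies))
```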
Relative frequency
• Relative frequency is used to look at the number of times a specific event occurs
compared to the total number of events.
• Relative frequency is the number of observations of a given type divided by the total
number of observations.
Class Boundaries
The class boundaries are found by taking the average of the upper limit of one class and
the lower limit of the next class. For example, the boundary between the first and second
classes is (17 + 18) / 2 = 17.5.
Class Interval Class boundaries
12 - 17 11.5 – 17.5
18 - 23 17.5 – 23.5
24 - 29 23.5 – 29.5
30 - 35 29.5 – 35.5
36 - 41 35.5 – 41.5
!!! To construct class boundaries:
                    Lower boundaries   Upper boundaries
Discrete number     Minus 0.5          Plus 0.5
1 decimal place     Minus 0.05         Plus 0.05
2 decimal places    Minus 0.005        Plus 0.005
Cumulative frequencies
The cumulative frequency is calculated by adding each frequency from a frequency
distribution table to the sum of its predecessors.
The last value will always be equal to the total for all observations, since all frequencies will
already have been added to the previous total.
Class Limit   Frequency   Cumulative frequency
12 - 17       3           3
18 - 23       7           10
24 - 29       5           15
30 - 35       2           17
36 - 41       1           18
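As a small illustration of the relative and cumulative frequency definitions, the sketch below recomputes both columns from the frequencies of the example. The variable names are our own.

```python
from itertools import accumulate

class_limits = ["12 - 17", "18 - 23", "24 - 29", "30 - 35", "36 - 41"]
frequencies = [3, 7, 5, 2, 1]
total = sum(frequencies)

# Relative frequency = class frequency / total number of observations
relative = [f / total for f in frequencies]

# Cumulative frequency = running total of the frequencies
cumulative = list(accumulate(frequencies))

for label, f, rf, cf in zip(class_limits, frequencies, relative, cumulative):
    print(f"{label}  f={f}  relative={rf:.3f}  cumulative={cf}")
```

The last cumulative value equals the total number of observations, as stated above.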
Midpoint
The midpoint is the middle value of a class. It is calculated by adding the lower limit and
the upper limit and dividing by 2.
Class Interval   Midpoint
12 - 17          14.5   = (12 + 17) / 2
18 - 23          20.5   = (18 + 23) / 2
24 - 29          26.5
30 - 35          32.5
36 - 41          38.5
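The class boundaries and midpoints can be checked with a short sketch. It assumes whole-number (discrete) data, so half a unit (0.5) is subtracted from each lower limit and added to each upper limit, as in the rule for discrete numbers.

```python
class_limits = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]
unit = 1  # data recorded as whole numbers

# Boundaries: lower limit - half a unit, upper limit + half a unit
boundaries = [(lo - unit / 2, hi + unit / 2) for lo, hi in class_limits]

# Midpoint: (lower limit + upper limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

for limits, b, m in zip(class_limits, boundaries, midpoints):
    print(limits, b, m)
```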
Example 1:
The data below are the marks obtained by 20 students in statistics final examination.
Range (highest value – lowest value)
= 92 – 48
= 44
No. of class (1 + 3.3 log n)
= 1 + 3.3 log 20
= 1 + 3.3 (1.3010)
= 1 + 4.2934
= 5.2934
≈5
Size class (range / no. of class)
= 44 / 5
= 8.8
≈9
Frequency Distribution Table
Example 2:
The following data are the pollution level observed in a river.
2.2 3.4 3.0 2.6 3.8 1.8 2.8 3.2 3.7
2.2 2.4 4.0 3.3 2.3 1.7 2.6 3.5 1.4
2.7 3.0 3.6 2.9 1.9 3.4 3.1
Range
= 4.0 – 1.4
= 2.6
No. of class
= 1 + 3.3 log 25
= 1 + 3.3 (1.3979)
= 1 + 4.613
= 5.613
≈6
Size class
= 2.6 / 6
= 0.433
≈ 0.5
Frequency Distribution Table
HISTOGRAM & POLYGON
Histogram
A histogram is like a column graph without spaces between the columns. It is
used for continuous data, where the bins represent ranges of data.
There is a difference between bar charts and histograms. In a bar chart, each column
represents a group defined by a categorical variable, whereas in a histogram, each
column represents a group defined by a continuous, quantitative variable.
The y-axis represents the frequency, while the x-axis represents the class
boundaries.
Polygon
A frequency polygon is a line graph created by joining the top points of a
histogram.
It is called a polygon because the line in the graph resembles half of
a polygon.
To create a frequency polygon, start just as for a histogram, by finding the midpoint and
frequency of each class.
The y-axis represents the frequency, while the x-axis represents the midpoint values of
the data.
!!! The histogram and the frequency polygon can be drawn separately or combined.
OGIVE
An ogive (oh-jive) is also known as a cumulative frequency polygon.
It is a type of frequency polygon that shows cumulative frequencies. In other words, the
cumulative frequencies (or cumulative percentages) are plotted on the graph.
An ogive plots cumulative frequency on the y-axis and class boundaries along the x-axis.
Before we draw an ogive, we have to construct another table which consists of upper
boundaries and a less-than or greater-than cumulative frequency.
Less-than ogive
greater-than ogive
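The table needed before drawing an ogive can be sketched as follows, using the hours-worked example from earlier in this topic. Pairing the less-than counts with upper boundaries and the greater-than counts with lower boundaries is a common convention that we assume here; the notes themselves only mention upper boundaries.

```python
from itertools import accumulate

lower_boundaries = [11.5, 17.5, 23.5, 29.5, 35.5]
upper_boundaries = [17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [3, 7, 5, 2, 1]
total = sum(frequencies)

# Less-than ogive: number of observations below each upper boundary
less_than = list(accumulate(frequencies))

# Greater-than ogive (assumed convention): observations above each lower boundary
greater_than = [total - c for c in [0] + less_than[:-1]]

for ub, c in zip(upper_boundaries, less_than):
    print(f"less than {ub}: {c}")
for lb, c in zip(lower_boundaries, greater_than):
    print(f"greater than {lb}: {c}")
```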
Learning Outcomes:
At the end of this topic, you should be able to
• Explain the measure of central tendency
• Calculate the measure of central tendency
for ungrouped data
• Calculate the measure of central tendency
for grouped data
• Explain the relationship among mean,
median and mode
Topic 3
MEASURES OF CENTRAL TENDENCY
A measure of central tendency is a way of specifying a central value. It
is an average of a set of measurements, the word average being variously
construed as mean, median, or other measure of location, depending on the context.
In other words, a measure of central tendency tells us where the middle of a set of data
lies.
The most common measures of central tendency are mean, median and mode.
➢ Mean
Mean is the most common measure of central tendency. It is simply the sum of the
numbers divided by the number of numbers in a set of data. This is also known as
average.
➢ Median
Median is the number present in the middle when the numbers in a set of data are
arranged in ascending or descending order. If the number of numbers in a data set is
even, then the median is the mean of the two middle numbers.
➢ Mode
Mode is the value that occurs most frequently in a set of data.
TYPES OF DATA
➢ Ungrouped data - Data that has not been organized into groups. Ungrouped data looks
like a big list of numbers.
➢ Grouped data – Data that has been organized into groups (into a frequency distribution).
MEAN
UNGROUPED DATA
Example:
4 8 9 12 14
Solution:
Mean = (4 + 8 + 9 + 12 + 14) / 5
= 9.4
GROUPED DATA
Example:
Weight        No of students (f)   x       fx
100 – 109     5                    104.5   522.5
110 – 119     9                    114.5   1030.5
120 – 129     19                   124.5   2365.5
130 – 139     22                   134.5   2959
140 - 149     15                   144.5   2167.5
              ∑ = 70                       ∑ = 9045
Solution:
Mean = ∑fx / ∑f
= 9045 / 70
= 129.21
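Both mean calculations can be verified with a few lines of Python. This is a sketch only; the midpoints x are taken as the averages of the class limits, as in the table.

```python
# Ungrouped data: mean = sum of the values / number of values
values = [4, 8, 9, 12, 14]
mean_ungrouped = sum(values) / len(values)

# Grouped data: mean = sum(f * x) / sum(f), where x is the class midpoint
class_limits = [(100, 109), (110, 119), (120, 129), (130, 139), (140, 149)]
frequencies = [5, 9, 19, 22, 15]
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
mean_grouped = sum_fx / sum(frequencies)

print(mean_ungrouped)           # 9.4
print(sum_fx)                   # 9045.0
print(round(mean_grouped, 2))   # 129.21
```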
MEDIAN
UNGROUPED DATA
Steps:
1. Arrange the data in ascending/descending order
2. Find position of median
= (n + 1) / 2
3. Find the value of median
Example 1 (odd data):
2 3 13 5 7 3 12
Solution:
Arrange data
2 3 3 5 7 12 13
Find position of median
= (n + 1) / 2
= (7 + 1) / 2
= 8 / 2
= 4th (look at the 4th place in the above series of numbers)
Value of median
= 5
Example 2 (even data):
2 3 13 5 7 3 12 8
Solution:
Arrange data
2 3 3 5 7 8 12 13
Find position of median
= (n + 1) / 2
= (8 + 1) / 2
= 9 / 2
= 4.5th
Value of median
= (5 + 7) / 2
= 6
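The steps for the ungrouped median can be sketched as a small function. The function name `median_ungrouped` is our own, chosen for illustration; it follows the position rule (n + 1) / 2 used in the examples.

```python
def median_ungrouped(values):
    """Median by the (n + 1) / 2 position rule."""
    ordered = sorted(values)              # step 1: arrange the data
    n = len(ordered)
    position = (n + 1) / 2                # step 2: 1-based position of the median
    if position.is_integer():             # odd n: take the value at that position
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]    # even n: average the two middle values
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median_ungrouped([2, 3, 13, 5, 7, 3, 12]))     # 5
print(median_ungrouped([2, 3, 13, 5, 7, 3, 12, 8]))  # 6.0
```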
MEDIAN
GROUPED DATA
Example:
Grade point No of students
100 – 109 5
110 – 119 9
120 – 129 19
130 – 139 22
140 - 149 15
∑ = 70
Solution:
To find the place of median, divide total frequency by 2 then look at the cumulative
frequency.
Place of median = n / 2
= 70 / 2
= 35th
Grade point No of students Class boundaries C. f
100 – 109 5 99.5 – 109.5 5
110 – 119 9 109.5 – 119.5 14
120 – 129 19 119.5 – 129.5 33
130 – 139 22 129.5 – 139.5 55**
140 - 149 15 139.5 – 149.5 70
∑ = 70
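Median = Lm + [(n/2 – ∑fm-1) / fm] × C
= 129.5 + [(35 – 33) / 22] × 10
= 129.5 + 0.909
= 130.41
Where,
Lm = lower boundary of the median class (the class marked ** above)
∑fm-1 = cumulative frequency of the class before the median class
fm = frequency of the median class
C = size class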
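A minimal sketch of the grouped median calculation is given below. The function name `grouped_median` and its arguments are our own; it locates the first class whose cumulative frequency reaches n / 2 and then interpolates inside that class.

```python
def grouped_median(lower_boundaries, frequencies, size_class):
    """Median of grouped data by interpolation inside the median class."""
    half = sum(frequencies) / 2            # place of the median, n / 2
    cumulative_before = 0                  # cumulative frequency before the class
    for lm, fm in zip(lower_boundaries, frequencies):
        if cumulative_before + fm >= half:  # this is the median class
            return lm + (half - cumulative_before) / fm * size_class
        cumulative_before += fm

lower_boundaries = [99.5, 109.5, 119.5, 129.5, 139.5]
frequencies = [5, 9, 19, 22, 15]
print(round(grouped_median(lower_boundaries, frequencies, 10), 2))  # 130.41
```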
MODE
UNGROUPED DATA
Example 1:
2 2 4 5 6 6 6 8
Solution:
Mode = 6
Example 2:
1 3 3 4 6 6 7 9
Solution:
Mode = 3 and 6
Example 3:
1 3 4 6 9 5 7 2
Solution:
Mode = none
GROUPED DATA
Example:
Grade point   No of students   Class boundaries
100 – 109     5                99.5 – 109.5
110 – 119     9                109.5 – 119.5
120 – 129     19               119.5 – 129.5
130 – 139     22               129.5 – 139.5
140 - 149     15               139.5 – 149.5
              ∑ = 70
∆1 = frequency of the modal class minus the frequency of the class before it (22 – 19)
∆2 = frequency of the modal class minus the frequency of the class after it (22 – 15)
Mode = Lb + [∆1 / (∆1 + ∆2)] × C
= 129.5 + [(22 – 19) / ((22 – 19) + (22 – 15))] × 10
= 129.5 + 3
= 132.5
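Both mode calculations can be sketched in Python. The function names are our own. For grouped data, a missing neighbouring class (when the modal class is the first or last) is treated as having frequency 0, which is an assumption of this sketch and is not stated in the notes.

```python
from collections import Counter

def modes_ungrouped(values):
    """All values with the highest count; an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

def grouped_mode(lower_boundaries, frequencies, size_class):
    """Mode = Lb + d1 / (d1 + d2) * C for the class with the highest frequency."""
    m = frequencies.index(max(frequencies))
    before = frequencies[m - 1] if m > 0 else 0                    # assumption: 0 if absent
    after = frequencies[m + 1] if m < len(frequencies) - 1 else 0  # assumption: 0 if absent
    d1 = frequencies[m] - before
    d2 = frequencies[m] - after
    return lower_boundaries[m] + d1 / (d1 + d2) * size_class

print(modes_ungrouped([2, 2, 4, 5, 6, 6, 6, 8]))   # [6]
print(modes_ungrouped([1, 3, 3, 4, 6, 6, 7, 9]))   # [3, 6]
print(modes_ungrouped([1, 3, 4, 6, 9, 5, 7, 2]))   # []
print(round(grouped_mode([99.5, 109.5, 119.5, 129.5, 139.5],
                         [5, 9, 19, 22, 15], 10), 2))  # 132.5
```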
EMPIRICAL RELATIONSHIP
The empirical relationship is the relationship between the mean, median and mode.
Negative distribution / skewed to the left: mean < median < mode
Positive distribution / skewed to the right: mode < median < mean
Normal distribution / bell-shaped: mode = median = mean
Learning Outcomes:
At the end of this topic, you should be able to
1. Explain the measures of dispersion
2. Calculate the measure of dispersion for
ungrouped data
3. Calculate the measure of dispersion for
grouped data
4. Calculate the coefficient of variation
5. Calculate the measures of skewness
Topic 4
MEASURES OF DISPERSION AND
SKEWNESS
Measures of dispersion can be used to compare the dispersion of various samples. They
indicate how widely the data are scattered. A widely spread distribution signals high
variability, so decisions based on it should be made with caution.
For example, a financial analyst knows that widely dispersed earnings indicate a high risk to
stockholders and creditors, whereas a small dispersion of earnings indicates stable earnings and
therefore a lower risk level.
Several common measures of dispersion and skewness:
a. Range
b. Mean deviation
c. Variance
d. Standard deviation
e. Coefficient of variation
f. Pearson’s Coefficient of Skewness 1 (PCS 1)
g. Pearson’s Coefficient of Skewness 2 (PCS 2)
UNGROUPED DATA
Range = Highest value – lowest value
Mean Deviation, MD = (∑│x – x̅│) / n
Variance, s² = (1 / (n – 1)) [∑x² – (∑x)² / n]
Standard Deviation, s = √s²
Where,
s² = sample variance
x = observation value
n = number of observations
∑x² = sum of the squares of all observations
GROUPED DATA
Range = upper boundary of highest class – lower boundary of lowest class
Mean deviation, MD = (1 / ∑f) (∑f│x – x̄│)
Variance, s² = (1 / (∑f – 1)) [∑fx² – (∑fx)² / ∑f]
Standard deviation, s = √s²
Where,
s² = sample variance
f = frequency
x = midpoint
A larger standard deviation indicates a wider distribution, while a smaller standard
deviation indicates a narrower distribution. A narrower distribution means the data values
lie closer to the mean, with fewer extreme values in the data set.
Coefficient of variation, CV = (Standard deviation / mean) × 100
Rule of thumb, the larger the percentage, the greater is the relative variation. A larger
relative variation implies less consistency, while a smaller relative variation implies more
consistency.
MEASURES OF SKEWNESS
Pearson’s coefficient of skewness is usually used to measure the skewness of the
distribution.
Pearson’s coefficient of skewness 1 (PCS 1)
PCS 1 = (mean – mode) / standard deviation
Pearson’s coefficient of skewness 2 (PCS 2)
PCS 2 = 3 (mean – median) / standard deviation
If the skewness = 0, the distribution is symmetrical.
If it is positive, the distribution is skewed to the right or positively-skewed.
If the skewness is negative, the distribution is skewed to the left or negatively-skewed.
Example 1:
The data below show the distance (in kilometers) travelled by 9 students to class every day.
10 12 15 20 15 21 16 11 13
a) Range = 21 – 10
= 11
b) Mean deviation
MD = (∑│x – x̅ │) / n
x̅ = (10+12+15+20+15+21+16+11+13) / 9
= 133 / 9
= 14.78
MD = (│10-14.78│+│12-14.78│+│15-14.78│+│20-14.78│+│15-14.78│+│21-
14.78│+│16-14.78│+│11-14.78│+│13-14.78│) / 9
= 26.22 / 9
= 2.91
c) Variance
s² = (1 / (n – 1)) [∑x² – (∑x)² / n]
∑x² = 10² + 12² + 15² + 20² + 15² + 21² + 16² + 11² + 13²
= 2081
s² = (1 / (9 – 1)) [2081 – (133)² / 9]
= (1 / 8) (2081 – 1965.44)
= 14.445
d) Standard Deviation
s = √ s2
= √ 14.445
= 3.8
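e) Pearson’s coefficient of skewness 1 & 2 (PCS 1 & 2)
Given,
Mode = 15, Median = 15
PCS 1 = (mean – mode) / standard deviation
= (14.78 – 15) / 3.8
= – 0.22 / 3.8
= – 0.06 (negative skewness)
PCS 2 = 3 (mean – median) / standard deviation
= 3 (14.78 – 15) / 3.8
= – 0.66 / 3.8
= – 0.17 (negative skewness)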
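Example 1 can be reproduced with the sketch below, which uses the ungrouped formulas from this topic and Python's standard `statistics` module for the median and mode. Because the code keeps full precision instead of rounding intermediate values, the last digit can differ slightly from the hand calculation (for instance, PCS 2 comes out near –0.175 rather than –0.17).

```python
import math
import statistics

distances = [10, 12, 15, 20, 15, 21, 16, 11, 13]
n = len(distances)

data_range = max(distances) - min(distances)
mean = sum(distances) / n

# Mean deviation: average absolute distance from the mean
md = sum(abs(x - mean) for x in distances) / n

# Sample variance: (1 / (n - 1)) * [sum(x^2) - (sum(x))^2 / n]
sum_x2 = sum(x * x for x in distances)
variance = (sum_x2 - sum(distances) ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

# Pearson's coefficients of skewness
pcs1 = (mean - statistics.mode(distances)) / sd
pcs2 = 3 * (mean - statistics.median(distances)) / sd

print(data_range, round(mean, 2), round(md, 2))
print(sum_x2, round(variance, 2), round(sd, 2))
print(round(pcs1, 3), round(pcs2, 3))
```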
Example 2:
The age distribution of the employees in MAF Company is as follows.
Age (years) No of workers Cum.frequency
20 – 24 12 12
25 – 29 34 46
30 – 34 20 66
35 – 39 13 79
40 – 44 10 89
45 – 49 7 96
50 - 54 4 100
a) Range = 54.5 – 19.5
= 35
b) Mean Deviation
Age (years)   No of workers (f)   x    fx     │x – x̄│   f│x – x̄│
20 – 24       12                  22   264    10.6      127.2
25 – 29       34                  27   918    5.6       190.4
30 – 34       20                  32   640    0.6       12
35 – 39       13                  37   481    4.4       57.2
40 – 44       10                  42   420    9.4       94
45 – 49       7                   47   329    14.4      100.8
50 - 54       4                   52   208    19.4      77.6
              ∑ = 100                  ∑ = 3260         ∑ = 659.2
Mean = ∑fx / ∑f
= 3260 / 100
= 32.6
MD = (1 / ∑f) (∑f│x – x̄│)
= (1 / 100) (659.2)
= 6.592
c) Variance
Age (years)   No of workers (f)   x    fx     x²     fx²
20 – 24       12                  22   264    484    5808
25 – 29       34                  27   918    729    24786
30 – 34       20                  32   640    1024   20480
35 – 39       13                  37   481    1369   17797
40 – 44       10                  42   420    1764   17640
45 – 49       7                   47   329    2209   15463
50 - 54       4                   52   208    2704   10816
              ∑ = 100                  ∑ = 3260      ∑ = 112790
s² = (1 / (∑f – 1)) [∑fx² – (∑fx)² / ∑f]
s² = (1 / (100 – 1)) [112790 – (3260)² / 100]
s² = 1/99 (112790 – 106276)
= 1/99 (6514)
= 65.80
d) Standard Deviation
s = √ s2
= √ 65.80
= 8.11
e) Coefficient of variation
CV = (Standard deviation / mean) × 100
CV = (8.11 / 32.6) × 100
= 24.88 %
f) PCS 1 = (mean – mode) / standard deviation
Mode = Lb + [∆1 / (∆1 + ∆2)] × C
Mode = 24.5 + [22 / (22 + 14)] × 5
= 24.5 + 3.06
= 27.56
PCS 1 = (32.6 – 27.56) / 8.11
= 0.62 (positive distribution)
** The distribution is skewed to the right (positively skewed).
g) PCS 2 = 3 (mean – median) / standard deviation
Median = Lm + [(n/2 – ∑fm-1) / fm] × C
Place of median = 100 / 2
= 50th
Median = 29.5 + [(50 – 46) / 20] × 5
= 29.5 + 1
= 30.5
PCS 2 = 3 (32.6 – 30.5) / 8.11
= 6.3 / 8.11
= 0.78 (positive distribution)
** The distribution is skewed to the right (positively skewed).
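Parts (b) to (e) of Example 2 can be checked with the sketch below, which applies the grouped-data formulas to the class midpoints and frequencies. The variable names are our own, and small differences in the last digit from the hand calculation are possible because no intermediate rounding is done.

```python
import math

midpoints = [22, 27, 32, 37, 42, 47, 52]      # x, midpoint of each age class
frequencies = [12, 34, 20, 13, 10, 7, 4]      # f, number of workers

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))

mean = sum_fx / n

# Mean deviation: (1 / sum f) * sum(f * |x - mean|)
md = sum(f * abs(x - mean) for f, x in zip(frequencies, midpoints)) / n

# Sample variance: (1 / (sum f - 1)) * [sum(f x^2) - (sum(f x))^2 / sum f]
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

# Coefficient of variation, in percent
cv = sd / mean * 100

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(md, 3))
print(round(variance, 2), round(sd, 2), round(cv, 2))
```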
Learning Outcomes:
At the end of this topic, you should be able to
1. Explain the concept of correlation
2. Construct scatter diagram
3. Calculate linear coefficient of correlation
4. Show concept of regression
Topic 5
CORRELATION AND REGRESSION
Correlation analysis is used to measure strength of the association (linear relationship)
between two variables.
Scatter diagram is used to show the relationship between two variables.
Regression analysis is used to predict the value of a dependent variable based on the value of
at least one independent variable.
TWO QUANTITATIVE VARIABLES
The response variable, also called the dependent variable, is the variable we want to predict,
and is usually denoted by y.
The explanatory variable, also called the independent variable, is the variable that attempts
to explain the response, and is denoted by x.
• Relationship between x and y is described by a linear function.
• Changes in y are assumed to be caused by changes in x.
Example:
A real estate agent wishes to examine the relationship between the selling price of a home
and its size (measured in square feet).
• Dependent variable (y) = house price
• Independent variable (x) = the size (square feet)
TYPES OF CORRELATION COEFFICIENT
Pearson Correlation Coefficient
The Pearson product-moment correlation coefficient, r is a measure of the
linear correlation (dependence) between two variables X and Y. It is a measure of how
well the data fit a straight line.
Spearman Rank Correlation Coefficient
Spearman's rank correlation coefficient or Spearman's rho, named after Charles
Spearman and often denoted by the Greek letter ρ (rho), is a measure of statistical
dependence between two variables. It assesses how well the relationship between two
variables can be described using a monotonic function.
VALUE OF ‘r’ AND ‘ρ’
If r > 0 we have a positive correlation
r = +1 means a perfect positive linear relationship exists.
r = +0.5 to +0.9 means a strong positive linear relationship exists.
r = +0.1 to +0.4 means a weak positive linear relationship exists.
If r < 0 we have a negative correlation
r = -1 means a perfect negative linear relationship exists.
r = -0.5 to -0.9 means a strong negative linear relationship exists.
r = -0.1 to -0.4 means a weak negative linear relationship exists.
If r = 0 we have no correlation
SCATTER DIAGRAM
Figure panels: perfect positive, strong positive, weak positive, perfect negative, strong negative, weak negative, no relationship.
PEARSON CORRELATION COEFFICIENT
(also known as the Pearson product-moment correlation coefficient)
r = [n∑xy – (∑x)(∑y)] / √{[n∑x² – (∑x)²] [n∑y² – (∑y)²]}
Where:
n = total number of sample
x = independent variable
y = dependent variable
Example:
The marketing executive of MAF trading company believes that there is a relationship
between the amount of mileage claims, x, made by salesmen and their monthly sales, y.
The table below shows the sales and mileage claims made by the salesmen.
Salesman                    Amin  Ben  Clara  Dalia  Elmy  Fiona  Gery
Mileage claims, x (RM’00)   9     6    9      12     10    13     8
Sales, y (RM’000)           13    11   15     17     16    20     12
Solution:
Salesman   x    y     x²    y²     xy
Amin       9    13    81    169    117
Ben        6    11    36    121    66
Clara      9    15    81    225    135
Dalia      12   17    144   289    204
Elmy       10   16    100   256    160
Fiona      13   20    169   400    260
Gery       8    12    64    144    96
Total      67   104   675   1604   1038
r = [n∑xy – (∑x)(∑y)] / √{[n∑x² – (∑x)²] [n∑y² – (∑y)²]}
r = [7(1038) – (67)(104)] / √{[7(675) – (67)²] [7(1604) – (104)²]}
r = (7266 – 6968) / √{[4725 – 4489] [11228 – 10816]}
r = 298 / √{[236] [412]}
r = 298 / √97232
r = 298 / 311.82
r = 0.96
Conclusion:
There is a strong positive linear relationship between mileage claims and
sales.
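The Pearson calculation can be verified with the sketch below, which builds the same column totals (∑x, ∑y, ∑x², ∑y², ∑xy) and substitutes them into the formula. The variable names are our own.

```python
import math

mileage = [9, 6, 9, 12, 10, 13, 8]        # x, mileage claims (RM'00)
sales = [13, 11, 15, 17, 16, 20, 12]      # y, sales (RM'000)
n = len(mileage)

sum_x, sum_y = sum(mileage), sum(sales)
sum_x2 = sum(x * x for x in mileage)
sum_y2 = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(mileage, sales))

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 67 104 675 1604 1038
print(round(r, 2))                           # 0.96
```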
SPEARMAN RANK CORRELATION COEFFICIENT
ρ = 1 – 6∑d² / [n(n² – 1)]
Where:
n = total number of sample
d = the difference in ranks (rx – ry)
Example:
The scores obtained by seven students in Mathematics and Statistics are
shown in the table below.
Student Mathematics Statistics
A 67 76
B 81 80
C 43 37
D 59 61
E 66 61
F 91 82
G 79 76
Solution:
Student   Mathematics   Statistics   rx   ry    ry’   d = rx – ry’   d²
A         67            76           4    4**   4.5   -0.5           0.25
B         81            80           6    6     6     0              0
C         43            37           1    1     1     0              0
D         59            61           2    2*    2.5   -0.5           0.25
E         66            61           3    3*    2.5   0.5            0.25
F         91            82           7    7     7     0              0
G         79            76           5    5**   4.5   0.5            0.25
                                                                     ∑ = 1
Note:
rx = ranking for data x from smallest value to the highest value
ry = ranking for data y from smallest value to the highest value
ry’ = ranking for data y after tied values (marked * and **) are given the average of their ranks
ρ = 1 – 6∑d² / [n(n² – 1)]
ρ = 1 – 6(1) / [7(49 – 1)]
ρ = 1 – 6 / [7(48)]
ρ = 1 – 6 / 336
ρ = 1 – 0.0179
ρ = 0.98
Conclusion:
There is a strong positive relationship between the Mathematics
and Statistics scores.
**Note: Mark the data which have the same values, then give each of them the average rank of
those data.
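The ranking and the Spearman formula can be sketched as follows. The helper names `average_ranks` and `spearman_rho` are our own. Tied values receive the average of the ranks they occupy, as described in the note, and the simple formula 1 – 6∑d² / [n(n² – 1)] is applied as in these notes, even though it is only approximate when ties are present.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first 1-based position of v
        count = ordered.count(v)            # how many times v occurs
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def spearman_rho(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

mathematics = [67, 81, 43, 59, 66, 91, 79]
statistics_scores = [76, 80, 37, 61, 61, 82, 76]

print(average_ranks(statistics_scores))  # [4.5, 6.0, 1.0, 2.5, 2.5, 7.0, 4.5]
print(round(spearman_rho(mathematics, statistics_scores), 2))  # 0.98
```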
SCATTER DIAGRAM
A scatter diagram, also called a scatter plot or scatter graph, shows the relationship between
two variables x and y and gives a visual impression of their correlation. For example, a scatter
plot shows the relationship between
The correlations may be positive (rising), negative (falling), or null (uncorrelated).
Example of scatter diagram
HOW TO DRAW SCATTER DIAGRAM
1. Determine the dependent and independent variable.
For example: SALES (y) & PROMOTION (x)
2. Draw an x-axis for independent variable and y-axis for dependent variable.
3. Mark or plot each data point on graph paper.
4. Label each axis on the graph (y-axis and x-axis) and give the graph a title.
Figure callouts on the example scatter diagram: 1. Draw and label x-axis, 2. Draw and label y-axis, 3. Plot each data point, 4. Write a title on graph.