This is a compilation of notes from various sources that follow the polytechnic syllabus.

Published by suryantisaadon, 2022-11-13 23:28:02

DPB30063 Statistics

Keywords: DPB30063,Statistics

STATISTICS

Lecture Notes

Suryanti Bt Saadon
Nik Zuraini Bt Nik Mahmood

Suryanti Bt Saadon
Lecturer in Business Studies

Nik Zuraini Bt Nik Mahmood
Lecturer in Marketing

We would like to take this opportunity to thank
those who have contributed directly or indirectly
to the preparation and publication of this ebook,
especially our beloved family, Head of
Department, Head of Program and also our
colleagues.

Preface

This ebook is a compilation of lecture notes
extracted from various statistics textbooks from
local and international publishers and also
from the internet. There are 7 topics in this book,
which follow the syllabus of Politeknik
Malaysia. The objective is to give students
quick reference notes for their studies.

Abstract

Contents

Topic 1: Introduction to Statistics
Topic 2: Data Presentation
Topic 3: Measures of Central Tendency
Topic 4: Measures of Dispersion and Skewness
Topic 5: Correlation and Regression
Topic 6: Hypothesis Testing
Topic 7: Elementary of Probability Concept
t-table

Learning Outcomes:
At the end of this topic, you should be able to
1. Explain the meaning of statistics
2. Compare types of statistics
3. Identify sources of data
4. Identify type of data
5. Explain statistical terms
6. Explain data collection methods

Topic 1
INTRODUCTION TO STATISTICS

WHAT IS STATISTICS?

• Statistics is a branch of mathematics that deals with the study of data. It is not merely the
counting of people, animals, trees and so on. Data are raw scores that do not provide any useful
information until they are analyzed using a particular method, such as a statistical method. Data
need to be processed to obtain useful and meaningful information.

• The word statistics comes indirectly from the Latin status and the German Statistik, meaning
‘collection of data involving the State’. The term later came to be used to describe the
collection of any sort of data.

• Statistics can be defined as the science of conducting studies that deal with the
collection, organization, analysis, interpretation and presentation of data.

WHY DO WE NEED TO STUDY STATISTICS?
Statistics is used in almost all fields of human endeavor.
• In business, statistics is widely used in marketing, finance, accounting, auditing, human
resources, operations management and so on. For example, a Human Resources manager
might use statistics to find the percentage of employee attendance or the employee
enrollment before hiring new workers.
• Marketers need statistics to predict their future sales. Statistics provides
demographic information, such as the age, income level and preferences of consumers who might be
their target market or potential customers.
• In a restaurant, the manager uses statistics to examine the variability of important performance
measures, such as customer orders and food cost, to plan work schedules and to estimate
material purchases. Chefs use statistics in measuring recipe ingredients.
• Statistics is also applied in sport, health and our daily life. Can you tell how you apply
statistics in your daily life?

A study on the suggested food consumed by children for a balanced diet.
Source: Balance diet chart for children, Mdhealth.com

TYPES OF STATISTICS
There are two types of statistics:
1. Descriptive statistics
• It is used to describe the characteristics of a variable and to summarize numerical data.
• It does not generalize from the sample to the population from which the sample was taken.
• Descriptive statistics uses indicators such as the mean, median, mode and standard deviation to
state the characteristics of a variable.
• The data are also presented in a meaningful form such as a histogram, pie chart, polygon
and so on.

2. Inferential statistics
• Inferential statistics is used to make inferences about the characteristics of a
population based on sample data.
• It consists of generalizing from samples to populations, performing estimations and
hypothesis tests, determining relationships among variables, and making predictions.
• Inferential statistics uses tests such as the chi-square test, t-test, ANOVA and so on.

SOURCES OF DATA
1. Primary data
• Primary data are data observed or collected from first-hand experience.
• For example, data gathered by observing events, people or objects, or by administering
questionnaires to individuals.

2. Secondary data
• Data from existing or published sources, and data collected in the past or by other parties,
are called secondary data.
• Certain types of information, such as the background details of a company, can be
obtained from available published records, the company's website, its archives and
other sources.
• Written information such as journals, newspapers, magazines and company policies can be
obtained from the organization’s records and documents.

TYPES OF DATA
There are TWO types of data
1. Qualitative data
• Information about qualities, which cannot be measured numerically.
• Qualitative data are variables that can be placed into distinct categories according to
some characteristic or attribute.
• For example: gender, colours, textures, smells, tastes etc.

2. Quantitative data

• Quantitative data are numerical and can be ordered or ranked.

• Quantitative variables can be classified into two groups:
a) Discrete quantitative
A discrete variable can assume only certain values, with no other sub-values in
between (countable).
The data count whole, indivisible entities.
Eg: number of children, number of houses, etc.
b) Continuous quantitative
A continuous variable can assume any numerical value within a specified
interval.
Eg: temperature, a person’s weight, income, etc.

STATISTICAL TERMS
The most common terms in statistics are:
Population - all people or items that share one or more characteristics from which data can
be gathered and analyzed.
Sample - a set of individuals or items selected from a population for analysis to yield
estimates of, or to test hypotheses about, parameters of the whole population.

DATA COLLECTION METHOD
Data collection methods are an integral part of research design. Each collection method has
its own advantages and disadvantages. There are a variety of ways to collect data from
respondents. Some of them are:
Personal interview
• It involves direct (face-to-face) interaction, conversation or meeting between the
interviewer and the interviewee. The purpose of conducting a personal interview survey is to
explore people's responses in order to gather more and deeper information.

Telephone surveys
• An interview conducted over the telephone, where telephone numbers are used to contact
potential respondents, either from the general population or from a known sample (for
example, bank customers or members of an organization).

Direct/ mailed questionnaire surveys

• A direct questionnaire is a research instrument consisting of a series of questions and
other prompts for the purpose of gathering information from respondents. The
questions are given directly to the respondents. A mailed questionnaire, in contrast,
collects the data by post.

An experiment

• It is a controlled study in which the researcher attempts to understand cause-and-effect
relationships. The study is "controlled" in the sense that the researcher controls how
subjects are assigned to groups and which treatments each group receives.

Observation

• It is a technique that involves the direct observation of phenomena in their natural
setting. It is the act of recognizing and noting facts or occurrences. No questions are asked
in data collection.

Internet survey

• An online survey is the systematic gathering of data from the target audience,
characterized by the invitation of the respondents and the completion of the
questionnaire over the World Wide Web (www).

Learning Outcomes:
At the end of this topic, you should be able to
1. Construct frequency distribution tables
2. Organize quantitative data

• Construct histogram
• Construct frequency polygon
• Construct ogive

Topic 2
DATA PRESENTATION

In statistics, a frequency distribution is a table that displays the frequency of various
outcomes in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval, and in this way, the table
summarizes the distribution of values in the sample.

ELEMENTS OF THE FREQUENCY TABLES
Elements in the frequency distribution tables:
1. Range
2. Number of classes
3. Class interval
4. Frequency
5. Cumulative frequency
6. Class boundaries
7. Mid-point
8. Relative frequency

HOW TO CONSTRUCT THE FREQUENCY TABLES
The first three steps before constructing a frequency table are:
Step 1: Find the range
= Highest value – lowest value

Step 2: Determine the number of classes, k
= 1 + 3.3 log n (n is the number of observations; log is the base-10 logarithm)
!!! The number of classes must be a whole number.

Step 3: Find the class size
= range / number of classes
!!! In case of a fractional result, the next higher whole number is taken as the size of the
class interval. It must be rounded up, not rounded off.

Example:
The data set shown here represents the number of hours that 18 part-time employees
worked at the My Palace Park during a randomly selected week in June.

16 25 18 39 25 17
22 18 12 23 32 35
20 19 25 26 25 20

Step 1: Range
= Highest value – lowest value
= 39 – 12
= 27

Step 2: Number of classes, k
= 1 + 3.3 log n
= 1 + 3.3 log 18
= 1 + 4.14
= 5.14
≈ 5

Step 3: Class size
= range / k
= 27 / 5
= 5.4
≈ 6 (rounded up)
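
The three steps can be checked with a short Python sketch. It assumes the rule k = 1 + 3.3 log n uses the base-10 logarithm (as the worked figures imply); the variable names are illustrative and not part of the notes.

```python
import math

# Hours worked by the 18 part-time employees in the example.
hours = [16, 25, 18, 39, 25, 17,
         22, 18, 12, 23, 32, 35,
         20, 19, 25, 26, 25, 20]

# Step 1: range = highest value - lowest value
data_range = max(hours) - min(hours)

# Step 2: number of classes, k = 1 + 3.3 log n, rounded to a whole number
k = round(1 + 3.3 * math.log10(len(hours)))

# Step 3: class size = range / k, always rounded up
class_size = math.ceil(data_range / k)

print(data_range, k, class_size)
```

Using `math.ceil` for the class size mirrors the "round up, not off" rule, so that k classes of that size are wide enough to cover the whole range.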

Class interval

When a large amount of data is arranged (in statistics), the values are grouped into different
classes to give an idea of the distribution. The range of each such class of data is called the
class interval.

Class interval
12 - 17
18 - 23
24 - 29
30 - 35
36 - 41

Lower limit of the first class = lowest value = 12
Upper limit of the first class = 12 + 6 (size class) - 1 = 17
Lower limit of the second class = 12 + 6 (size class) = 18

!!! To construct class interval (upper limit = lower limit + size class, then):

Discrete number      Minus 1
1 decimal place      Minus 0.1
2 decimal places     Minus 0.01

Tally marks and frequency

Tally each observation into its class. Each observation falls into only one class. Each
bundle (||||) stands for 5 observations. Count the tally marks and record the count in the
frequency table. Find the total frequency.

Class Interval    Tally marks    Frequency
12 - 17           |||            3
18 - 23           |||| ||        7
24 - 29           ||||           5
30 - 35           ||             2
36 - 41           |              1
                                 ∑ = 18
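
As a cross-check of the tally, the following Python sketch builds the class limits from the lowest value and class size of the example and counts the observations in each class. The names `classes` and `frequency` are illustrative assumptions, not terms from the notes.

```python
hours = [16, 25, 18, 39, 25, 17,
         22, 18, 12, 23, 32, 35,
         20, 19, 25, 26, 25, 20]

lowest, class_size, k = 12, 6, 5   # values from Steps 1-3 of the example

# Each class starts class_size above the previous one and, for whole-number
# data, ends 1 below the next lower limit.
classes = [(lowest + i * class_size, lowest + (i + 1) * class_size - 1)
           for i in range(k)]

# Count the observations that fall inside each class (limits inclusive).
frequency = [sum(lower <= h <= upper for h in hours)
             for lower, upper in classes]

for (lower, upper), f in zip(classes, frequency):
    print(f"{lower} - {upper}: {f}")
print("total =", sum(frequency))
```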

Relative frequency

• Relative frequency is used to look at the number of times a specific event occurs
compared to the total number of events.

• Relative frequency is the number of observations of a given type divided by the total
number of observations.
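
A minimal Python sketch of this definition, assuming the class frequencies 3, 7, 5, 2 and 1 from the hours-worked example:

```python
frequency = [3, 7, 5, 2, 1]   # class frequencies from the hours-worked example
total = sum(frequency)

# Relative frequency = class frequency / total number of observations
relative = [f / total for f in frequency]

for f, r in zip(frequency, relative):
    print(f"{f:2d}  {r:.3f}")
```

The relative frequencies always add up to 1 (or 100 %), which is a quick way to check the table.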

Class Boundaries

The class boundaries are found by taking the average of the upper limit of one class and
the lower limit of the next class. For example, the boundary between the first and second
classes is (17 + 18) / 2 = 17.5.

Class Interval    Class boundaries
12 - 17           11.5 – 17.5
18 - 23           17.5 – 23.5
24 - 29           23.5 – 29.5
30 - 35           29.5 – 35.5
36 - 41           35.5 – 41.5

!!! To construct class boundaries:

                     Lower boundaries    Upper boundaries
Discrete number      Minus 0.5           Plus 0.5
1 decimal place      Minus 0.05          Plus 0.05
2 decimal places     Minus 0.005         Plus 0.005
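
The rule can be sketched in Python as follows. The sketch assumes the class limits of the hours-worked example and derives the adjustment from the gap between neighbouring classes, which is one way (not the only way) to express the minus/plus rule.

```python
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]

# Gap between the upper limit of one class and the lower limit of the next
# (1 for whole numbers, 0.1 for one decimal place, and so on).
gap = classes[1][0] - classes[0][1]

# Each boundary sits half a gap outside the class limits.
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in classes]

for (lower, upper), (lb, ub) in zip(classes, boundaries):
    print(f"{lower} - {upper}: {lb} - {ub}")
```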

Cumulative frequencies

The cumulative frequency is calculated by adding each frequency from a frequency
distribution table to the sum of its predecessors.

The last value will always be equal to the total for all observations, since all frequencies will
already have been added to the previous total.

Class Limit    Frequency    Cumulative frequency
12 - 17        3            3
18 - 23        7            10
24 - 29        5            15
30 - 35        2            17
36 - 41        1            18
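
The running total can be reproduced with Python's standard `itertools.accumulate`; this is only a sketch using the frequencies of the hours-worked example.

```python
from itertools import accumulate

frequency = [3, 7, 5, 2, 1]   # class frequencies from the hours-worked example

# Each entry is the current frequency plus the sum of all earlier ones.
cumulative = list(accumulate(frequency))

print(cumulative)
```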

Midpoint

The midpoint is the middle value of a class. It is calculated by adding the lower limit and
the upper limit of the class and dividing by 2.

Class Interval    Midpoint
12 - 17           14.5    = (12 + 17) / 2
18 - 23           20.5    = (18 + 23) / 2
24 - 29           26.5
30 - 35           32.5
36 - 41           38.5
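
A short Python sketch of the midpoint calculation, assuming the class limits of the hours-worked example:

```python
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]

# Midpoint = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]

print(midpoints)
```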

Example 1:

The data below are the marks obtained by 20 students in statistics final examination.

Range (highest value – lowest value)
= 92 – 48
= 44

No. of class (1 + 3.3 log n)
= 1 + 3.3 log 20
= 1 + 3.3 (1.3010)
= 1 + 4.2934
= 5.2934
≈5

Size class (range / no. of class)
= 44 / 5
= 8.8
≈9

Frequency Distribution Table

Example 2:

The following data are the pollution level observed in a river.

2.2 3.4 3.0 2.6 3.8 1.8 2.8 3.2 3.7
2.2 2.4 4.0 3.3 2.3 1.7 2.6 3.5 1.4
2.7 3.0 3.6 2.9 1.9 3.4 3.1

Range

= 4.0 – 1.4
= 2.6

No. of class
= 1 + 3.3 log 25
= 1 + 3.3 (1.3979)
= 1 + 4.613
= 5.613
≈6

Size class
= 2.6 / 6
= 0.433
≈ 0.5

Frequency Distribution Table

HISTOGRAM & POLYGON

Histogram

The histogram is like a column graph without the spaces between columns. It is used for
continuous data, where the bins represent ranges of data.

There is a difference between bar charts and histograms. In a bar chart, each column
represents a group defined by a categorical variable, whereas in a histogram, each column
represents a group defined by a continuous, quantitative variable.

The y-axis represents the frequency, while the x-axis represents the class boundaries.

Polygon

A frequency polygon is a line graph created by joining all of the top points of a histogram.

It is called a polygon because the line that the graph creates resembles half of a polygon.

To create a frequency polygon, start just as for histograms, by choosing a midpoint and
frequency for each class.

The y-axis represents the frequency, while the x-axis represents the midpoint values of the
classes.

!!! The histogram and the frequency polygon can be drawn separately or combined.

OGIVE

An ogive (oh-jive), also known as a cumulative frequency polygon, is a type of frequency
polygon that shows cumulative frequencies. In other words, the cumulative frequencies (or
cumulative percents) are plotted on the graph.

An ogive graph plots cumulative frequency on the y-axis and class boundaries along the x-axis.

Before we draw an ogive, we have to construct another table which consists of the upper
boundaries and the less-than or greater-than cumulative frequencies.

Less-than ogive
greater-than ogive

Learning Outcomes:
At the end of this topic, you should be able to
• Explain the measure of central tendency
• Calculate the measure of central tendency for ungrouped data
• Calculate the measure of central tendency for grouped data
• Explain the relationship among mean, median and mode

Topic 3
MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a way of specifying a central value. It is an average of a
set of measurements, the word average being variously construed as mean, median, or other
measure of location, depending on the context.

A measure of central tendency is a measure that tells us where the middle of a set of data
lies.

The most common measures of central tendency are mean, median and mode.

➢ Mean
Mean is the most common measure of central tendency. It is simply the sum of the
numbers divided by the number of numbers in a set of data. This is also known as
average.

➢ Median
Median is the number present in the middle when the numbers in a set of data are
arranged in ascending or descending order. If the number of numbers in a data set is
even, then the median is the mean of the two middle numbers.

➢ Mode
Mode is the value that occurs most frequently in a set of data.

TYPES OF DATA

➢ Ungrouped data - Data that has not been organized into groups. Ungrouped data looks
like a big list of numbers.

➢ Grouped data – Data that has been organized into groups (into a frequency distribution).

MEAN

UNGROUPED DATA
Example:
4 8 9 12 14

Solution:
Mean = (4 + 8 + 9 + 12 + 14) / 5
= 47 / 5
= 9.4

GROUPED DATA
Example:

Weight       No of students (f)    Midpoint (x)    fx
100 – 109    5                     104.5           522.5
110 – 119    9                     114.5           1030.5
120 – 129    19                    124.5           2365.5
130 – 139    22                    134.5           2959
140 – 149    15                    144.5           2167.5
             ∑f = 70                               ∑fx = 9045

Solution:
Mean = ∑fx / ∑f
= 9045 / 70
= 129.21
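
Both mean calculations can be reproduced with a short Python sketch. It assumes that x in the grouped table is the class midpoint; the variable names are illustrative.

```python
# Ungrouped data: mean = sum of values / number of values
values = [4, 8, 9, 12, 14]
mean_ungrouped = sum(values) / len(values)

# Grouped data: mean = sum(f * x) / sum(f), where x is the class midpoint
classes = [(100, 109), (110, 119), (120, 129), (130, 139), (140, 149)]
freq = [5, 9, 19, 22, 15]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
sum_fx = sum(f * x for f, x in zip(freq, midpoints))
mean_grouped = sum_fx / sum(freq)

print(mean_ungrouped, sum_fx, round(mean_grouped, 2))
```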

MEDIAN

UNGROUPED DATA
Steps:
1. Arrange the data in ascending/descending order
2. Find the position of the median
= (n + 1) / 2
3. Find the value of the median

Example 1 (odd data):
2 3 13 5 7 3 12

Solution:
Arrange data
2 3 3 5 7 12 13

Find position of median
= (n + 1) / 2
= (7 + 1) / 2
= 8 / 2
= 4th (look at the 4th place in the above series of numbers)

Value of median
= 5

Example 2 (even data):
2 3 13 5 7 3 12 8

Solution:
Arrange data
2 3 3 5 7 8 12 13

Find position of median
= (n + 1) / 2
= (8 + 1) / 2
= 9 / 2
= 4.5th (between the 4th and 5th values)

Value of median
= (5 + 7) / 2
= 6
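
The median steps for ungrouped data can be sketched as a small Python function. The function name `median` is an illustrative choice; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Median of ungrouped data using the (n + 1) / 2 position rule."""
    ordered = sorted(values)            # step 1: arrange the data
    n = len(ordered)
    position = (n + 1) / 2              # step 2: 1-based position of the median
    if position.is_integer():           # odd n: a single middle value
        return ordered[int(position) - 1]
    # even n: the position falls between two values, so average them
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median([2, 3, 13, 5, 7, 3, 12]))      # odd number of values
print(median([2, 3, 13, 5, 7, 3, 12, 8]))   # even number of values
```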

GROUPED DATA

Example:

Grade point No of students
100 – 109 5
110 – 119 9
120 – 129 19
130 – 139 22
140 - 149 15

∑ = 70

Solution:

To find the place of median, divide total frequency by 2 then look at the cumulative
frequency.

Place of median = n / 2

= 70 / 2

= 35th

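Grade point    No of students    Class boundaries    C. f
100 – 109      5                 99.5 – 109.5        5
110 – 119      9                 109.5 – 119.5       14
120 – 129      19                119.5 – 129.5       33
130 – 139      22                129.5 – 139.5       55**
140 – 149      15                139.5 – 149.5       70
               ∑ = 70

** The 35th value falls in this class (median class).

Median = Lm + [(n/2 – ∑fm-1) / fm] × C

Where,
Lm = lower boundary of the median class
n = total frequency
∑fm-1 = cumulative frequency before the median class
fm = frequency of the median class
C = class size

= 129.5 + [(35 – 33) / 22] × 10
= 129.5 + 0.909
= 130.41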
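
The grouped median formula can be sketched in Python as below. The function name and argument layout are assumptions made for illustration; the function simply walks through the cumulative frequencies until it reaches the class that contains the n/2-th value.

```python
def grouped_median(lower_boundaries, freq, class_size):
    """Median = Lm + ((n/2 - cumulative frequency before) / fm) * C."""
    n = sum(freq)
    place = n / 2                     # place of the median
    cumulative = 0                    # cumulative frequency before this class
    for lm, fm in zip(lower_boundaries, freq):
        if cumulative + fm >= place:  # this class contains the median
            return lm + (place - cumulative) / fm * class_size
        cumulative += fm

lower_boundaries = [99.5, 109.5, 119.5, 129.5, 139.5]
freq = [5, 9, 19, 22, 15]

print(round(grouped_median(lower_boundaries, freq, 10), 2))
```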

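MODE

UNGROUPED DATA

Example 1:
2 2 4 5 6 6 6 8
Solution:
Mode = 6

Example 2:
1 3 3 4 6 6 7 9
Solution:
Mode = 3 and 6

Example 3:
1 3 4 6 9 5 7 2
Solution:
Mode = none

GROUPED DATA

Example:

Grade point    No of students    Class boundaries
100 – 109      5                 99.5 – 109.5
110 – 119      9                 109.5 – 119.5
120 – 129      19                119.5 – 129.5
130 – 139      22                129.5 – 139.5    (modal class)
140 – 149      15                139.5 – 149.5
               ∑ = 70

Mode = Lb + [∆1 / (∆1 + ∆2)] × C

Where,
Lb = lower boundary of the modal class
∆1 = frequency of the modal class – frequency of the class before it = 22 – 19
∆2 = frequency of the modal class – frequency of the class after it = 22 – 15
C = class size

= 129.5 + [(22 – 19) / ((22 – 19) + (22 – 15))] × 10
= 129.5 + 3
= 132.5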
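
A minimal Python sketch of the grouped mode formula follows. It assumes a single modal class and treats a missing neighbouring class as having frequency 0; the function name is illustrative.

```python
def grouped_mode(lower_boundaries, freq, class_size):
    """Mode = Lb + (d1 / (d1 + d2)) * C for the class with the highest frequency."""
    i = freq.index(max(freq))                        # modal class
    before = freq[i - 1] if i > 0 else 0             # frequency of the class before
    after = freq[i + 1] if i < len(freq) - 1 else 0  # frequency of the class after
    d1 = freq[i] - before
    d2 = freq[i] - after
    return lower_boundaries[i] + d1 / (d1 + d2) * class_size

lower_boundaries = [99.5, 109.5, 119.5, 129.5, 139.5]
freq = [5, 9, 19, 22, 15]

print(round(grouped_mode(lower_boundaries, freq, 10), 2))
```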

EMPIRICAL RELATIONSHIP

The empirical relationship is the relationship between the mean, median and mode.

Negative distribution / skewed to the left
Value: mean < median < mode

Positive distribution / skewed to the right
Value: mode < median < mean

Normal distribution / bell-shaped
Value: mode = median = mean

Learning Outcomes:
At the end of this topic, you should be able to

1. Explain the measures of dispersion
2. Calculate the measure of dispersion for ungrouped data
3. Calculate the measure of dispersion for grouped data
4. Calculate the coefficient of variation
5. Calculate the measures of skewness

Topic 4
MEASURES OF DISPERSION AND
SKEWNESS

The measures of dispersion can be used to compare the dispersion of various samples. They
indicate the scattering of data. A widely spread distribution should not be used for decision-
making.

For example, a financial analyst knows that widely dispersed earnings indicate a high risk to
stockholders and creditors, whereas a small dispersion of earnings indicates stable earnings and
therefore a lower risk level.

Several common measures of dispersion and skewness:

a. Range
b. Mean deviation
c. Variance
d. Standard deviation
e. Coefficient of variation
f. Pearson’s Coefficient of Skewness 1 (PCS 1)
g. Pearson’s Coefficient of Skewness 2 (PCS 2)

UNGROUPED DATA

Range = Highest value – lowest value

Mean Deviation, MD = (∑│x – x̅│) / n

Variance, s² = [1 / (n – 1)] [∑x² – (∑x)² / n]

Standard Deviation, s = √s²

Where,
s² = sample variance
x = observation value
x̅ = mean
n = number of observations
∑x² = sum of all the squares of the observations
GROUPED DATA

Range = upper boundary of highest class – lower boundary of lowest class

Mean deviation, MD = (1 / ∑f) (∑f│x – x̅│)

Variance, s² = [1 / (∑f – 1)] [∑fx² – (∑fx)² / ∑f]

Standard deviation, s = √s²

Where,
s² = sample variance
f = frequency
x = midpoint

• A larger standard deviation makes a wider distribution, while a smaller standard
deviation makes a narrower distribution. A narrower distribution is better because it
indicates that there are no extreme data values in the data set.

Coefficient of variation, CV = (Standard deviation / mean) × 100

• As a rule of thumb, the larger the percentage, the greater the relative variation. A larger
relative variation implies less consistency, while a smaller relative variation implies more
consistency.

MEASURES OF SKEWNESS
Pearson’s coefficient of skewness is usually used to measure the skewness of the
distribution.

Pearson’s coefficient of skewness 1 (PCS 1)

PCS 1 = (Mean – mode) / Standard deviation

Pearson’s coefficient of skewness 2 (PCS 2)

PCS 2 = 3 (mean – median) / Standard deviation

If the skewness = 0, the distribution is symmetrical.
If it is positive, the distribution is skewed to the right or positively-skewed.
If the skewness is negative, the distribution is skewed to the left or negatively-skewed.

Example 1:
The data below show the distance (in kilometers) travelled by 9 students to class every day.

10 12 15 20 15 21 16 11 13

a) Range = 21 – 10
= 11

b) Mean deviation
MD = (∑│x – x̅│) / n
x̅ = (10+12+15+20+15+21+16+11+13) / 9
= 133 / 9
= 14.78

MD = (│10-14.78│+│12-14.78│+│15-14.78│+│20-14.78│+│15-14.78│+│21-14.78│+│16-14.78│
+│11-14.78│+│13-14.78│) / 9
= 26.22 / 9
= 2.91

c) Variance
s² = [1 / (n – 1)] [∑x² – (∑x)² / n]

∑x² = 10² + 12² + 15² + 20² + 15² + 21² + 16² + 11² + 13²
= 2081

s² = [1 / (9 – 1)] [2081 – (133)² / 9]
= 1 / 8 (2081 – 1965.44)
= 14.445

d) Standard Deviation
s = √s²
= √14.445
= 3.8

e) Pearson’s coefficient of skewness 1 & 2 (PCS 1 & 2)

Given,
Mode = 15, Median = 15

PCS 1 = (mean – mode) / standard deviation
= (14.78 – 15) / 3.8
= - 0.22 / 3.8
= - 0.06 (negative skewness)

PCS 2 = 3 (mean – median) / standard deviation
= 3 (14.78 – 15) / 3.8
= - 0.66 / 3.8
= - 0.17 (negative skewness)
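
The whole of Example 1 can be reproduced with a short Python sketch. It uses the sample variance formula with n – 1 in the denominator, as in the worked solution, and takes the mode and median as given. Because the code carries full precision instead of rounding at every step, the last digit of a result may differ slightly from the hand-worked figures.

```python
import math

distance = [10, 12, 15, 20, 15, 21, 16, 11, 13]   # km travelled by 9 students
n = len(distance)
mean = sum(distance) / n

data_range = max(distance) - min(distance)
mean_deviation = sum(abs(x - mean) for x in distance) / n

# Sample variance: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
variance = (sum(x * x for x in distance) - sum(distance) ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

mode, median = 15, 15                              # given in the example
pcs1 = (mean - mode) / std_dev
pcs2 = 3 * (mean - median) / std_dev

print(data_range, round(mean_deviation, 2), round(variance, 2), round(std_dev, 2))
print("negative skewness" if pcs1 < 0 and pcs2 < 0 else "not negative")
```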

Example 2:
The age distribution of the employees in MAF Company is as follows.

Age (years) No of workers Cum.frequency
20 – 24 12 12
25 – 29 34 46
30 – 34 20 66
35 – 39 13 79
40 – 44 10 89
45 – 49 7 96
50 - 54 4 100

a) Range = 54.5 – 19.5
= 35

b) Mean Deviation

Age (years)    No of workers (f)    x     fx             │x – x̄│    f│x – x̄│
20 – 24        12                   22    264            10.6       127.2
25 – 29        34                   27    918            5.6        190.4
30 – 34        20                   32    640            0.6        12
35 – 39        13                   37    481            4.4        57.2
40 – 44        10                   42    420            9.4        94
45 – 49        7                    47    329            14.4       100.8
50 – 54        4                    52    208            19.4       77.6
               ∑f = 100                   ∑fx = 3260                ∑f│x – x̄│ = 659.2

Mean = ∑fx / ∑f
= 3260 / 100
= 32.6

MD = (1 / ∑f) (∑f│x – x̄│)
= (1 / 100) (659.2)
= 6.592

c) Variance

Age (years)    No of workers (f)    x     fx             x²      fx²
20 – 24        12                   22    264            484     5808
25 – 29        34                   27    918            729     24786
30 – 34        20                   32    640            1024    20480
35 – 39        13                   37    481            1369    17797
40 – 44        10                   42    420            1764    17640
45 – 49        7                    47    329            2209    15463
50 – 54        4                    52    208            2704    10816
               ∑f = 100                   ∑fx = 3260             ∑fx² = 112790

s² = [1 / (∑f – 1)] [∑fx² – (∑fx)² / ∑f]
= [1 / (100 – 1)] [112790 – (3260)² / 100]
= 1/99 (112790 – 106276)
= 1/99 (6514)
= 65.80

d) Standard Deviation

s = √s²
= √65.80
= 8.11

e) Coefficient of variation

CV = (Standard deviation / mean) × 100
= (8.11 / 32.6) × 100
= 24.88 %

f) PCS 1 = (Mean – mode) / Standard deviation

Mode = Lb + [∆1 / (∆1 + ∆2)] × C
= 24.5 + [22 / (22 + 14)] × 5
= 24.5 + 3.06
= 27.56

PCS 1 = (32.6 – 27.56) / 8.11
= 0.62 (positive distribution)
** The distribution is skewed to the right (positively skewed).

g) PCS 2 = 3 (mean – median) / Standard deviation

Place of median = n / 2
= 100 / 2
= 50th

Median = Lm + [(n/2 – ∑fm-1) / fm] × C
= 29.5 + [(50 – 46) / 20] × 5
= 29.5 + 1
= 30.5

PCS 2 = 3 (32.6 – 30.5) / 8.11
= 6.3 / 8.11
= 0.78 (positive distribution)
** The distribution is skewed to the right (positively skewed).
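
The grouped-data calculations of Example 2 can be checked with the following Python sketch. It assumes x is the class midpoint and takes the mode (27.56) and median (30.5) from the worked solution instead of recomputing them; all variable names are illustrative.

```python
import math

classes = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49), (50, 54)]
freq = [12, 34, 20, 13, 10, 7, 4]

x = [(lower + upper) / 2 for lower, upper in classes]   # class midpoints
n = sum(freq)
sum_fx = sum(f * xi for f, xi in zip(freq, x))
sum_fx2 = sum(f * xi ** 2 for f, xi in zip(freq, x))

mean = sum_fx / n
mean_deviation = sum(f * abs(xi - mean) for f, xi in zip(freq, x)) / n
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)        # sample variance
std_dev = math.sqrt(variance)
cv = std_dev / mean * 100                               # coefficient of variation, %

mode, median = 27.56, 30.5                              # from the worked solution
pcs1 = (mean - mode) / std_dev
pcs2 = 3 * (mean - median) / std_dev

print(round(mean_deviation, 3), round(variance, 2), round(std_dev, 2), round(cv, 2))
print(round(pcs1, 2), round(pcs2, 2))
```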

Learning Outcomes:
At the end of this topic, you should be able to
1. Explain the concept of correlation
2. Construct scatter diagram
3. Calculate linear coefficient of correlation
4. Show concept of regression

Topic 5
CORRELATION AND REGRESSION

Correlation analysis is used to measure the strength of the association (linear relationship)
between two variables.

A scatter diagram is used to show the relationship between two variables.

Regression analysis is used to predict the value of a dependent variable based on the value of
at least one independent variable.

TWO QUANTITATIVE VARIABLES

The response variable, also called the dependent variable, is the variable we want to predict,
and is usually denoted by y.

The explanatory variable, also called the independent variable, is the variable that attempts
to explain the response, and is denoted by x.
• Relationship between x and y is described by a linear function.
• Changes in y are assumed to be caused by changes in x.
Example:
A real estate agent wishes to examine the relationship between the selling price of a home
and its size (measured in square feet).
• Dependent variable (y) = house price
• Independent variable (x) = the size (square feet)

TYPES OF CORRELATION COEFFICIENT

Pearson Correlation Coefficient

The Pearson product-moment correlation coefficient, r, is a measure of the
linear correlation (dependence) between two variables X and Y. It is a measure of how
well the data fit a straight line.

Spearman Rank Correlation Coefficient
Spearman's rank correlation coefficient or Spearman's rho, named after Charles
Spearman and often denoted by the Greek letter ρ (rho), is a measure of statistical
dependence between two variables. It assesses how well the relationship between two
variables can be described by a monotonic function.

VALUE OF ‘r’ AND ‘ρ’

If r > 0 we have a positive correlation
r = +1 means a perfect positive linear relationship exists.
r = +0.5 to +0.9 means a strong positive linear relationship exists.
r = +0.1 to +0.4 means a weak positive linear relationship exists.

If r < 0 we have a negative correlation
r = -1 means a perfect negative linear relationship exists.
r = -0.5 to -0.9 means a strong negative linear relationship exists.
r = -0.1 to -0.4 means a weak negative linear relationship exists.

If r = 0 we have no correlation

SCATTER DIAGRAM

Perfect positive Strong positive Weak positive

Perfect negative Strong negative Weak negative

No relationship

PEARSON CORRELATION COEFFICIENT
(also known as the Pearson product-moment correlation coefficient)

r = [n∑xy – (∑x)(∑y)] / √{[n∑x² – (∑x)²] [n∑y² – (∑y)²]}

Where:
n = total number of sample
x = independent variable
y = dependent variable

Example:

The marketing executive of MAF trading company believes that there is a relationship
between the amount of mileage claims, x, made by salesmen and their monthly sales, y. The
table below shows the amount of sales and mileage claims made by the salesmen.

Salesman                     Amin    Ben    Clara    Dalia    Elmy    Fiona    Gery
Mileage claims, x (RM’00)    9       6      9        12       10      13       8
Sales, y (RM’000)            13      11     15       17       16      20       12

Solution:

Salesman    x     y      x²     y²      xy
Amin        9     13     81     169     117
Ben         6     11     36     121     66
Clara       9     15     81     225     135
Dalia       12    17     144    289     204
Elmy        10    16     100    256     160
Fiona       13    20     169    400     260
Gery        8     12     64     144     96
Total       67    104    675    1604    1038

r = [n∑xy – (∑x)(∑y)] / √{[n∑x² – (∑x)²] [n∑y² – (∑y)²]}
= [7(1038) – (67)(104)] / √{[7(675) – (67)²] [7(1604) – (104)²]}
= (7266 – 6968) / √{[4725 – 4489] [11228 – 10816]}
= 298 / √{[236] [412]}
= 298 / √97232
= 298 / 311.82
= 0.96

Conclusion:
A strong positive linear relationship exists between mileage claims and sales.
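
The Pearson calculation can be verified with a short Python sketch that follows the same sums as the solution table. The variable names are illustrative assumptions.

```python
import math

x = [9, 6, 9, 12, 10, 13, 8]       # mileage claims (RM'00)
y = [13, 11, 15, 17, 16, 20, 12]   # sales (RM'000)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# r = [n*sum(xy) - sum(x)*sum(y)] / sqrt([n*sum(x^2) - sum(x)^2][n*sum(y^2) - sum(y)^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 2))
```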

SPEARMAN RANK CORRELATION COEFFICIENT

ρ = 1 – 6∑d² / [n(n² – 1)]

Where:
n = total number of sample
d = the difference in ranks (rx – ry)

Example:
The scores obtained by seven students in Mathematics and Statistics are shown in the
table below.

Student    Mathematics    Statistics
A          67             76
B          81             80
C          43             37
D          59             61
E          66             61
F          91             82
G          79             76

Solution:

Student    Mathematics    Statistics    rx    ry     ry’    d = rx – ry’    d²
A          67             76            4     4**    4.5    -0.5            0.25
B          81             80            6     6      6      0               0
C          43             37            1     1      1      0               0
D          59             61            2     2*     2.5    -0.5            0.25
E          66             61            3     3*     2.5    0.5             0.25
F          91             82            7     7      7      0               0
G          79             76            5     5**    4.5    0.5             0.25
                                                                            ∑d² = 1

Note:
rx = ranking for data x from smallest value to the highest value
ry = ranking for data y from smallest value to the highest value
ry’ = ranking for data y after tied ranks (marked * and **) are replaced by their average

ρ = 1 – 6∑d² / [n(n² – 1)]
= 1 – 6(1) / [7(49 – 1)]
= 1 – 6 / [7(48)]
= 1 – 6 / 336
= 1 – 0.0179
= 0.98

Conclusion:

A strong positive relationship exists between the Mathematics and Statistics scores.

**Note:
Mark the data which have the same values, then find the average rank of those data.
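
The ranking with averaged ties and the value of ρ can be sketched in Python as follows. The helper `average_ranks` is a hypothetical name introduced for illustration; it ranks from the smallest value to the largest and gives tied values the average of their positions, as the note describes.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share their average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first 1-based position of v
        count = ordered.count(v)                # how many times v occurs
        ranks.append(first + (count - 1) / 2)   # average of the tied positions
    return ranks

maths = [67, 81, 43, 59, 66, 91, 79]
stats = [76, 80, 37, 61, 61, 82, 76]

rx = average_ranks(maths)
ry = average_ranks(stats)
n = len(maths)

sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(ry)
print(sum_d2, round(rho, 2))
```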

SCATTER DIAGRAM

A scatter diagram, also called a scatter plot or scatter graph, shows the relationship between
two variables x and y and gives a visual indication of the correlation coefficient. For example, a scatter
plot shows the relationship between

The correlations may be positive (rising), negative (falling), or null (uncorrelated).

Example of scatter diagram

HOW TO DRAW SCATTER DIAGRAM

1. Determine the dependent and independent variable.
For example: SALES (y) & PROMOTION (x)

2. Draw an x-axis for independent variable and y-axis for dependent variable.
3. Mark or plot each data point on graph paper.
4. Label each axis on the graph (y-axis and x-axis) and give the graph a title.

1. Draw and label x-axis
2. Draw and label y-axis
3. Plot each data point
4. Write a title on graph
