
TOPIC 8  RELIABILITY AND VALIDITY OF ASSESSMENT TECHNIQUES

Inter-rater reliability can be low because of the following reasons:

(i) Examiners are subconsciously being influenced by knowledge of the
students whose scripts are being marked;

(ii) Consistency in marking is affected after marking a set of either very
good or very weak scripts;

(iii) When there is an interruption during the marking of a batch of scripts,
different standards may be applied after the break; and

(iv) The marking scheme is poorly developed resulting in examiners
making their own interpretations of the answers.

Inter-rater reliability can be enhanced if the criteria for marking or the
marking scheme:

(i) Contain suggested answers related to the question;

(ii) Have made provision for acceptable alternative answers;

(iii) Allocate appropriate time for the work required;

(iv) Are sufficiently broken down to allow the marking to be as objective as
possible and the totalling of marks is correct; and

(v) Allocate marks according to the degree of difficulty of the question.

(b) Intra-rater Reliability
While inter-rater reliability involves two or more individuals, intra-rater
reliability refers to the consistency of grading by a single rater. Scores on a
test are rated by a single rater at different times. When we grade tests at
different times, we may become inconsistent in our grading for various
reasons. For example, some papers that are graded during the day may get
our full attention, while others that are graded towards the end of the day
may be glossed over very quickly. Similarly, changes in our mood may affect the grading of papers. In these situations, the lack of consistency can affect intra-rater reliability in the grading of student answers.

SELF-CHECK 8.3

List the steps that may be taken to enhance inter-rater reliability in the
grading of essay answer scripts.


ACTIVITY 8.2

In the myINSPIRE online forum, suggest other steps you would take to
enhance intra-rater reliability in the grading of projects.

8.5 TYPES OF VALIDITY

Validity is often defined as the extent to which a test measures what it was
designed to measure (Nuttall, 1987). While reliability relates to the consistency of
the test, validity relates to the relevancy of the test. If it does not measure what it
sets out to measure, then its use is misleading and the interpretation based on the
test is not valid or relevant. For example, if a test that is supposed to measure the
"spelling ability of eight-year-old children" does not measure "spelling ability",
then the test is not a valid test. It would be disastrous if you make claims about
what a student can or cannot do based on a test that is actually measuring
something else. It is for this reason that many educators argue that validity is the
most important aspect of a test.

However, validity will vary from test to test depending on what it is used for. For
example, a test may have high validity in testing the recall of facts in economics
but that same test may be low in validity with regard to testing the application of
concepts in economics.

Messick (1989) was most concerned about the inferences a teacher draws from the
test score, the interpretation the teacher makes about his or her students and the
consequences from such inferences and interpretation. You can imagine the power
an educator holds in his or her hand when designing a test. Your test could
determine the future of many thousands of students. Inferences based on a test of
low validity could give a completely different picture of the actual abilities and
competencies of students.

Three types of validity have been identified: construct validity, content validity
and criterion-related validity which is made up of predictive and concurrent
validity (refer to Figure 8.5).


Figure 8.5: Types of validity

These various types of validity are further explained as follows:

(a) Construct Validity
Construct validity relates to whether the test is an adequate measure of the
underlying construct. A construct could be any phenomenon such as
mathematics achievement, map skills, reading comprehension, attitude
towards school, inductive reasoning, environmental awareness, spelling
ability and so forth. You might think of construct validity as the correct "labelling" of something. For example, when you measure what you term as "critical thinking", is that what you are really measuring?

Thus, to ensure high construct validity, you must be clear about the
definition of the construct you intend to measure. For example, a construct
such as reading comprehension would include vocabulary development,
reading for literal meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-known
reading comprehension test administered to the same sample of students. If
the scores for the two tests are highly correlated, then you may conclude that
your reading comprehension test has high construct validity.
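As a minimal sketch of this correlational check (the score lists below are hypothetical), the following Python code computes the Pearson correlation between scores on a new reading comprehension test and an established test taken by the same students.

```python
# Hypothetical scores for the same ten students on a new reading
# comprehension test and on an established benchmark test.
new_test = [45, 52, 38, 60, 55, 41, 48, 65, 50, 58]
benchmark = [42, 55, 35, 62, 53, 40, 50, 68, 47, 60]

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(f"r = {pearson_r(new_test, benchmark):.2f}")
# A high positive r (say, above 0.8) would support the construct validity
# of the new test; a low r would call it into question.
```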

A construct is defined by referring to theory. For example, if you are interested in measuring the construct "self-esteem", you need to be clear about what self-esteem is. Perhaps you need to refer to the literature in the field describing the attributes of self-esteem. You may find that, theoretically, self-esteem is made up of the following attributes: physical self-esteem, academic self-esteem and social self-esteem. Based on this theoretical perspective, you can build items or questions to measure self-esteem covering these three types of self-esteem. Through such a process, you can be more certain of ensuring high construct validity.


(b) Content Validity
Content validity is more straightforward and likely to be related to construct
validity. It concerns the coverage of appropriate and necessary content, i.e.
does the test cover the skills necessary for good performance or all the aspects
of the subject taught? It is concerned with sample-population
representativeness, i.e. the facts, concepts and principles covered by the test
items should be representative of the larger domain (e.g. syllabus) of facts,
concepts and principles.
For example, the science unit on "energy and forces" may include facts, concepts, principles and skills on light, sound, heat, magnetism and electricity. However, it is difficult, if not impossible, to administer a two- to three-hour paper that tests all aspects of the syllabus on "energy and forces" (refer to Figure 8.6).

Figure 8.6: Sample of content tested for the unit on "energy and forces"
Therefore, only selected facts, concepts, principles and skills from the
syllabus (or domain) are sampled. The content selected will be determined
by content experts who will judge the relevance of the content in the test to
the content in the syllabus or a particular domain.
Content validity will be low if the questions in the test include questions
testing content not included in the domain or syllabus. To ensure content
validity and coverage, most teachers use the table of specifications (as
discussed in Topic 3). Table 8.2 is an example of a table of specifications
which specifies the knowledge and skills to be measured and the topics
covered for the unit on "energy and forces".


Table 8.2: Table of Specifications for the Unit on "Energy and Forces"

Topics         Understanding of Concepts   Application of Concepts   Total
Light                     7                           4              11 (22%)
Sound                     7                           4              11 (22%)
Heat                      7                           4              11 (22%)
Magnetism                 3                           3               6 (12%)
Electricity               8                           3              11 (22%)
TOTAL                 32 (64%)                    18 (36%)           50 (100%)

Since you cannot measure all the content of a topic, you will have to focus on
the key areas and give due weighting to those areas that are important. For
example, the teacher has decided that 64 per cent of questions will emphasise
the understanding of concepts, while the remaining 36 per cent will focus on
the application of concepts for the five topics. A table of specifications
provides the teachers with evidence that a test has high content validity, that
it covers what should be covered.
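A table of specifications can also be checked arithmetically. The short Python sketch below (using the approximate weightings of Table 8.2 as an example) converts the intended percentage weightings into numbers of items for a 50-item test, so the items actually written can be compared against the plan.

```python
# Intended weightings for a 50-item test, following Table 8.2.
total_items = 50
topic_weights = {"Light": 0.22, "Sound": 0.22, "Heat": 0.22,
                 "Magnetism": 0.12, "Electricity": 0.22}
skill_weights = {"Understanding of concepts": 0.64,
                 "Application of concepts": 0.36}

# Number of items each topic and each skill level should receive.
for topic, weight in topic_weights.items():
    print(f"{topic}: {round(total_items * weight)} items")
for skill, weight in skill_weights.items():
    print(f"{skill}: {round(total_items * weight)} items")
```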

Content validity is different from face validity, which refers not to what the test actually measures, but to what it superficially appears to measure. Face validity assesses whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use and other technically untrained observers. Face validity is a weak measure of validity, but that does not mean it is incorrect, only that caution is necessary. Nevertheless, its importance should not be underestimated.

(c) Criterion-related Validity
Criterion-related validity of a test is established by relating the scores
obtained to some other criterion or the scores of other tests. There are two
types of criterion-related validity:

(i) Predictive validity relates to whether the test accurately predicts some future performance or ability. Is the STPM a good predictor of performance at university? One difficulty in calculating the predictive validity of the STPM is that, generally speaking, only those who pass the examination go on to university, so we do not know how well students who did not pass might have done. Also, only a small proportion of the population takes the STPM, and the correlation between STPM grades and performance at the degree level would be quite high.


(ii) Concurrent validity is concerned with whether the test correlates with,
or gives substantially the same results as, another test of the same skill.
For example, does your end-of-year language test correlate with the
Malaysian University English Test (MUET)? In other words, if your
language test correlates highly with MUET, then your language test has
high concurrent validity.

8.6 FACTORS AFFECTING RELIABILITY AND
VALIDITY

To prepare tests which are acceptably valid and reliable, the following factors
should be taken into account:

(a) Construction of Test Items
The quality of test items has a significant effect on the validity and reliability
of a test. If the test items are poorly constructed, ambiguous and open to
different interpretations, the reliability of the test will be affected because the
test results will not reflect the true abilities of the students being assessed. If
the items do not assess the right content and do not match the intended learning outcomes, then the test is not measuring what it is supposed to measure, thus affecting the test's validity.

(b) Length of the Test
Generally, the longer the test, the more reliable and valid it is. A short test would not adequately cover a year's work. The syllabus needs to be sampled, and the test should consist of enough questions to be representative of the knowledge, skills and competencies in the syllabus. However, there is also a problem with tests that are too long. A lengthy test may be valid, but it will take too much time and fatigue may set in, which may affect students' performance and hence the reliability of the test.

(c) Selection of Topics
The topics selected and the test questions prepared should reflect the way
the topics were treated during teaching and learning. It is necessary to be
clear about the learning outcomes and to design items that measure these
learning outcomes. For example, in your teaching, students were not given
an opportunity to think critically and solve problems. However, your test
consists of items requiring students to think critically and solve problems. In
such a situation, the reliability and validity of the test will be affected. The
test is not reliable because it will not produce consistent results. It is also not

valid because the test does not measure the right intended learning
outcomes. There is no constructive alignment between instruction and
assessment.

(d) Choice of Testing Techniques
The testing techniques selected will also affect reliability and validity. For
example, if you choose to use essay questions, validity may be high but
reliability may be low. Essay questions tend to have high validity because they are capable of assessing both simple and complex learning outcomes, but they tend to be less reliable because of the subjective manner in which students' responses are scored. On the other hand, if objective test items such as MCQs, true-false questions, matching questions and short-answer questions are selected, the reliability of the test can be high because the scoring of students' responses is not influenced by the subjective judgement of the assessors. The validity, however, can be low because not all intended learning outcomes can be appropriately assessed by objective test items alone. For instance, multiple-choice questions are not suitable for assessing learning outcomes that require students to organise ideas.

(e) Method of Test Administration
Test administration is also an important step in the measurement process.
This includes the arrangement of items in a test, the monitoring of test taking
and the preparation of data files from the test booklets. Poor test
administration procedures can lead to problems in the data collected and affect the validity of the test results. For instance, if the results of the students taking a test are not accurately recorded, the test scores become invalid.
Adequate time must also be allowed for the majority of students to finish the
test. This would reduce wild guessing and instead encourage students to
think carefully about the answer. Instructions need to be clear to reduce the
effects of confusion on reliability and validity. The physical conditions under
which the test is taken must be favourable for the students. There must be
adequate space and lighting, and the temperature must be conducive.
Students must be able to work independently and the possibility of
distractions in the form of movement and noise must be minimised. If such
measures are not taken, studentsÊ performance may be affected because they
are handicapped in demonstrating their true abilities.

(f) Method of Marking
The marking should be as objective as possible. Marking which depends on the exercise of human judgement – such as in essays, projects and portfolios – is subject to the variations of human fallibility (refer to inter-rater reliability discussed earlier). In addition, poorly designed or inappropriate marking schemes can affect validity. For example, if an essay test is intended to assess

students' ability to discuss an issue but a checklist is used to assess content
(knowledge), the validity of the test is questionable. If unqualified or
incompetent examiners are engaged to mark responses to essay questions,
they will not be consistent in their scoring, thus affecting test reliability. It is
quite easy to mark objective items quickly, but it is also surprisingly easy to
make careless errors. This is especially true where large numbers of scripts
are being marked. A system of checks is strongly advised. One method is
through the comments of the students themselves when their marked papers
are returned to them.

8.7 RELATIONSHIP BETWEEN RELIABILITY
AND VALIDITY

Some people may think of reliability and validity as two separate concepts. In
reality, reliability and validity are related. Figure 8.7 shows the analogy.

Figure 8.7: Graphical representations of the relationship between reliability and validity
The centre or the bull's-eye is the concept that we are trying to measure. Say, for example, in trying to measure the concept of "inductive reasoning", you are likely to hit the centre (or the bull's-eye) if your inductive reasoning test is both reliable and valid, which is what all test developers aim to achieve (refer to Figure 8.7(d)). On the other hand, your inductive reasoning test can be "reliable but not valid". How is that possible? Your test may not measure inductive reasoning, but the scores you obtain each time you administer the test are approximately the same (refer to Figure 8.7(b)). In other words, the test is consistently and systematically measuring the wrong construct (i.e. something other than inductive reasoning). Imagine the consequences of making judgements about the inductive reasoning of students using such a test!


However, in the context of psychological testing, if an instrument does not have satisfactory reliability, one typically cannot claim validity. That is, validity requires that instruments are sufficiently reliable. So, the test represented in Figure 8.7(c) does not have high validity even though the target is hit twice. Its validity is low and it lacks reliability because the hits are not concentrated. In other words, you are not getting a valid estimate of the inductive reasoning ability of your students, and the estimates are inconsistent.

The worst-case scenario is when the test is neither reliable nor valid (refer to Figure 8.7(a)). In this scenario, the hits are spread over the top and left of the target and consistently miss the centre. Your measure in this case is neither reliable nor valid, and the test should be rejected or improved.

• The true score is a hypothetical concept with regard to the actual ability, competency and capacity of an individual.

• The higher the reliability and validity of your test, the greater the likelihood that you will be measuring the true scores of your students.

• Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly.

• Validity requires that instruments are sufficiently reliable.

• Face validity is a weak measure of validity.

• Using the test-retest technique, the same test is administered again to the same group of students.

• For the parallel or equivalent forms technique, two equivalent tests (or forms) are administered to the same group of students.

• Internal consistency is determined using only one test administered once to the students.

• When two or more people mark essay questions, the extent to which there is agreement in the marks allotted is called inter-rater reliability.

• While inter-rater reliability involves two or more individuals, intra-rater reliability is the consistency of grading by a single rater.


• Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted.

• Construct validity relates to whether the test is an adequate measure of the underlying construct.

• Content validity is more straightforward and likely to be related to construct validity; it is related to the coverage of appropriate and necessary content.

• Some people may think of reliability and validity as two separate concepts. In reality, reliability and validity are related.

Construct
Content and face
Criterion-related
Internal consistency
Parallel-form
Predictive
Reliability
Reliability and validity relationship
Reliable and not valid
Test-retest
True score
Valid and reliable
Validity

Deale, R. N. (1975). Assessment and testing in secondary school. Chicago, IL:
Evans Bros.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement
(3rd ed.). New York, NY: Macmillan.

Nunnally, J. (1978). Psychometric methods. New York, NY: McGraw-Hill.

Nuttall, D. L. (1987). The validity of assessments. European Journal of Psychology of Education, 2(2), 109–118.
Wells, C. S., & Wollack, J. A. (2003). An instructor's guide to understanding test reliability. Retrieved from https://testing.wisc.edu/Reliability.pdf


Topic  Item Analysis

9

LEARNING OUTCOMES

By the end of the topic, you should be able to:
1. Describe what item analysis is and the steps in item analysis;
2. Calculate the difficulty index and discrimination index;
3. Apply item analysis to essay-type questions;
4. Discuss the relationship between the difficulty index and discrimination index of an item;
5. Conduct distractor analysis; and
6. Explain the role of an item bank in the development of tests.

INTRODUCTION

When you develop a test, it is important to identify the strengths and weaknesses
of each item. To determine how well items in a test perform, some statistical
procedures need to be used.

In this topic, we will discuss item analysis, which involves the use of three procedures – item difficulty, item discrimination and distractor analysis – to help the test developer decide whether the items in a test should be accepted, modified or rejected. These procedures are quite straightforward and easy to use,
and the educator needs to understand the logic underlying the analyses in order
to use them properly and effectively.


9.1 WHAT IS ITEM ANALYSIS?

After having administered a test and marked it, most teachers would discuss the answers with their students. Discussion would usually focus on the right answers and the common errors made by students. Some teachers may focus on the questions most students performed poorly on and the questions they did very well on.

However, there is much more information available about a test that is often ignored by teachers. This information will only be available if item analysis is done. What is item analysis?

Item analysis is a process which examines the responses to individual test
items or questions in order to assess the quality of those items and the test as
a whole.

Item analysis is especially valuable in improving items or questions that will be
used again in later tests, but it can also be used to eliminate ambiguous or
misleading items in a single test administration.

Specifically, in classical test theory (CTT), the statistics produced from analysing test scores include the difficulty index and the discrimination index. Analysing the effectiveness of distractors also forms part of the process (which we will discuss in detail later in the topic).

The quality of a test is determined by the quality of each item or question in the
test. The teacher who constructs a test can only roughly estimate the quality of a
test. This estimate is based on the fact that the teacher has followed all the rules
and conditions of test construction.

However, it is possible that this estimation may not be accurate and certain
important aspects have been ignored. Hence, it is suggested that to obtain a more
comprehensive understanding of the test, item analysis should be conducted on
the responses of students. Item analysis is done to obtain information about
individual items or questions in a test and how the test can be further improved.
It also facilitates the development of an item or question bank which can be used
in the construction of a test.


9.2 STEPS IN ITEM ANALYSIS

Both CTT and "modern" test theories such as item response theory (IRT) provide useful statistics to help us analyse test data. For most item analyses, CTT is sufficient to provide the information we need, and CTT will be used in this module. Let us take the example of a teacher who has administered a 30-item multiple-choice objective test in geography to 45 students in a secondary school classroom.

Step 1
Upon receiving the answer sheet, the first step would be to mark each of the
answer sheets.

Step 2
Arrange the 45 answer sheets from the highest score obtained to the lowest score
obtained. The paper with the highest score is on top and the paper with the lowest
score is at the bottom.

Step 3
Multiply 45 (the number of answer sheets) by 0.27 (or 27 per cent), which gives 12.15, and round it to 12. The value of 0.27 or 27 per cent is not inflexible; it is possible to use any percentage from 27 to 35 per cent. However, the 27 per cent rule can be ignored if the class size is too small. Instead of taking a 27 per cent sample, divide the number of answer sheets by two.

Step 4
Arrange the pile of 45 answer sheets according to the scores obtained (from the
highest score to the lowest score). Take out 12 answer sheets from the top of the
pile and 12 answer sheets from the bottom of the pile. Call these two piles "high marks" students and "low marks" students respectively. Set aside the middle
group of papers (21 papers). Although these could be included in the analysis,
using only the high and low groups will simplify the procedure.

Step 5
Refer to Question 1 (refer to Figure 9.1), then:

(a) Count the number of students from the "high marks" group who selected each of the options (A, B, C or D); and

(b) Count the number of students from the "low marks" group who selected option A, B, C or D.


Figure 9.1: Item analysis for one item or question
From the analysis, 11 students from the "high marks" group and two students from the "low marks" group selected "B", which is the correct answer. This means that 13 out of the 24 students selected the correct answer. Also, note that all the distractors (A, C and D) were selected by at least one student. However, the information provided in Figure 9.1 is insufficient and further analysis has to be conducted.
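A minimal Python sketch of Steps 2 to 5, assuming each student's record holds a total test score and the option chosen for Question 1 (the data below are randomly generated purely for illustration), might look like this:

```python
import random

# Hypothetical records: (total score out of 30, option chosen for Question 1).
random.seed(7)
students = [(random.randint(5, 30), random.choice("ABCD")) for _ in range(45)]

# Step 2: arrange the answer sheets from the highest to the lowest score.
ranked = sorted(students, key=lambda record: record[0], reverse=True)

# Step 3: take 27 per cent of the papers, i.e. round(45 * 0.27) = 12.
k = round(len(ranked) * 0.27)

# Step 4: the top k papers form the "high marks" group and the bottom k
# papers the "low marks" group; the middle papers are set aside.
high_group, low_group = ranked[:k], ranked[-k:]

# Step 5: count how many students in each group chose each option.
for label, group in (("High marks", high_group), ("Low marks", low_group)):
    counts = {opt: sum(1 for _, choice in group if choice == opt) for opt in "ABCD"}
    print(label, counts)
```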

SELF-CHECK 9.1

1. Define item analysis.
2. Describe the five steps of item analysis.


9.3 DIFFICULTY INDEX

Using the information provided in Figure 9.1, you can compute the difficulty index
which is a quantitative indicator with regard to the difficulty level of an individual
item or question. It can be calculated using the following formula:

Difficulty index  Number of students with the correct answer (R)

Total number of students who attempted the question (T)

 R  13  0.54
T 24

What does a difficulty index (p) of 0.54 mean? The difficulty index is a coefficient
that shows the percentage of students who got the correct answer compared with
the total number of students who attempted the question. In other words,
54 per cent of students selected the right answer. Although our computation is
based on the high and low scoring groups only, it provides a close approximation
of the estimate that would be obtained with the total group. Thus, it is proper to
say that the index of difficulty for this item is 54 per cent (for this particular group).
Note that since „difficulty‰ refers to the percentage getting the item right, the
smaller the percentage figure the more difficult the item. The meaning of the
difficulty index is shown in Figure 9.2.

Figure 9.2: Interpretation of the difficulty index (p)

If a teacher believes that an achievement of 0.54 on the item is too low, he or she can change the way the content is taught to better meet the objective represented by the item. Another interpretation might be that the item was too difficult, confusing or invalid, in which case the teacher can replace or modify the item, perhaps using information from the item's discrimination index or distractor analysis.

Under CTT, the item difficulty measure is simply the proportion correct for an item. For an item with a maximum score of two, there is a slight modification to the computation of the proportion (or percentage) correct.


Consider an item with possible partial-credit scores of 0, 1 and 2. If the total number of students
attempting this item is 100, and 23 students scored 0, 60 students scored 1 and
17 students scored 2, then a simple calculation will show that 23 per cent of the
students scored 0, 60 per cent of the students scored 1, and 17 per cent of the
students scored 2 for this particular item. The average score for this item should
be 0  0.23 + 1  0.6 + 2  0.17 = 0.94.

Thus, the observed average score of this item is 0.94 out of a maximum of 2. So the
average proportion correct is 0.94/2 = 0.47 or 47 per cent.
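The calculations above can be expressed as a short Python sketch; the function below handles both a dichotomous item (maximum score of 1, where the index reduces to R/T) and a partial-credit item, by dividing the average score by the maximum possible score.

```python
def difficulty_index(item_scores, max_score=1):
    """Average proportion of the maximum score earned on a single item."""
    return sum(item_scores) / (len(item_scores) * max_score)

# Dichotomous example from Figure 9.1: 13 of the 24 sampled students
# answered correctly.
dichotomous = [1] * 13 + [0] * 11
print(round(difficulty_index(dichotomous), 2))           # 0.54

# Partial-credit example: 23 students scored 0, 60 scored 1, 17 scored 2.
partial = [0] * 23 + [1] * 60 + [2] * 17
print(round(difficulty_index(partial, max_score=2), 2))  # 0.47
```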

ACTIVITY 9.1

A teacher gave a 20-item Science test to a group of 35 students. The
correct answer for Question #20 is "C" and the results are as follows:

Options                     A   B   C   D   Blank
High marks group (n = 12)   0   2   8   2   0
Low marks group (n = 12)    2   4   3   2   1

(a) Calculate the difficulty index (p) for Question #20.
(b) Is Question #20 an easy or difficult question?
(c) Do you think you need to improve Question #20? Why?

Post your answers on the myINSPIRE online forum.

9.4 DISCRIMINATION INDEX

Discrimination index is a basic measure which shows the extent to which a
question discriminates or differentiates between students in the "high marks" group and the "low marks" group. This index can be interpreted as an indication of
the extent to which overall knowledge of the content area or mastery of the skills
is related to the response on an item. Most crucial for a test item is that whether or not a student answers the question correctly depends on his or her level of knowledge or ability, and not on something else such as chance or test bias.


In our example in subtopic 9.2, 11 students in the high group and two students in
the low group selected the correct answer. This indicates positive discrimination,
since the item differentiates between students in the same way that the total test
score does. That is, students with high scores on the test (high group) got the item
right more frequently than students with low scores on the test (low group).
Although analysis by inspection may be all that is necessary for most purposes, an index of discrimination can be easily computed using the following formula:

Discrimination index  Rh  RL
12 T

where Rh = Number of students in „high marks‰ group (Rh) with the correct
answer

RL = Number of students in „low marks‰ group (RL) with the correct
answer

T = Total number of students

Example 9.1:
A test was given to a group of 43 students. Ten out of the 13 students in the "high marks" group got the correct answer, compared with five out of the 13 students in the "low marks" group. The discrimination index is computed as follows:

D = (Rh − RL) / (½T) = (10 − 5) / (½ × 26) = 5 / 13 ≈ 0.38

What does a discrimination index of 0.38 mean? The discrimination index is a coefficient that shows the extent to which the question discriminates or differentiates between "high marks" students and "low marks" students. Blood and Budd (1972) provide guidelines on the meaning of the discrimination index as follows (refer to Figure 9.3).
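A small Python sketch of this formula, using the counts from Example 9.1 (the function and argument names are illustrative), is shown below.

```python
def discrimination_index(correct_high, correct_low, students_in_groups):
    """(Rh - RL) divided by half the number of students in the two groups."""
    return (correct_high - correct_low) / (students_in_groups / 2)

# Example 9.1: 10 of the 13 high-scoring and 5 of the 13 low-scoring students
# answered correctly, so the two groups hold 26 students altogether.
d = discrimination_index(correct_high=10, correct_low=5, students_in_groups=26)
print(round(d, 2))   # 0.38
```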


Figure 9.3: Interpretation of the discrimination index
Source: Blood and Budd (1972)

A question that has a high discrimination index is able to differentiate between students who know the answer and those who do not. When we say that a question has a low discrimination index, it is not able to differentiate between students who know and students who do not know. A low discrimination index may mean that many "low marks" students also got the correct answer because the question was too simple. It could also indicate that students from both the "high marks" group and the "low marks" group got the answer wrong because the question was too difficult.
The formula for the discrimination index is such that if more students in the "high marks" group than in the "low marks" group chose the correct answer, the value will be positive. At a minimum, one would hope for a positive value, as that would indicate that it is knowledge of the content that resulted in the correct answer. The greater the positive value (the closer it is to 1.0), the stronger the relationship between overall test performance and performance on that item. If the discrimination index is negative, it means that for some reason students who scored low on the test were more likely to get the answer correct. This is a strange situation which suggests poor validity for the item.


9.5 APPLICATION OF ITEM ANALYSIS ON
ESSAY-TYPE QUESTIONS

The previous subtopics explain the use of item analysis on multiple-choice
questions. Item analysis can also be applied on essay-type questions. This subtopic
will illustrate how this can be done. For ease of understanding, the illustration will
use a short-answer essay question as an example.

Let us assume that a group of 20 students have responded to a short-answer essay
question with scores ranging from the minimum of 0 to the maximum of 4.
Table 9.1 provides the scores obtained by the students.

Table 9.1: Scores Obtained by Students for a Short-answer Essay Question

Item Score      No. of Students Earning Each Score   Total Scores Earned
4               5                                    20
3               6                                    18
2               5                                    10
1               3                                     3
0               1                                     0
Total                                                51
Average Score                                        51/20 = 2.55

The difficulty index (p) of the item can be computed using the following formula:

p = Average score / Possible range of scores

Using the information from Table 9.1, the difficulty index of the short-answer essay question can be easily computed. The average score obtained by the group of students is 2.55, while the possible range of scores for the item is (4 − 0) = 4. Thus,

p = 2.55 / 4 ≈ 0.64


The difficulty index (p) of 0.64 means that on average, students have received
64 per cent of the maximum possible score of the item. The difficulty index can
be interpreted the same as that of the multiple-choice question discussed in
subtopic 9.3. The item is of a moderate level of difficulty (refer to Figure 9.2).

Note that in computing the difficulty index in the previous example, the scores of
the whole group are used to obtain the average score. However, for a large group
of students, it is possible to estimate the difficulty index for an item based on only
a sample of students comprising the high marks and low marks groups as in the
case of computing the difficulty index of a multiple-choice question.

To compute the discrimination index (D) of an essay-type question, the following
formula is suggested by Nitko (2004):

D  Difference between upper and lower groups' average score
Possible range of score

Using the information from Table 9.1 but presenting it in the following format as
in Table 9.2, we can compute the discrimination index of the short-answer essay
question.

Table 9.2: Distribution of Scores Obtained by Students

Score                       0   1   2   3   4   Total   Average Score
High marks group (n = 10)   0   0   1   4   5   34      3.4
Low marks group (n = 10)    1   3   4   2   0   17      1.7

Note: n refers to the number of students.

The average score obtained by the upper group of students is 3.4 while that of the
lower group is 1.7. Using the formula as suggested by Nitko (2004), we can
compute the discrimination index of the short-answer essay question as follows:

D  3.4  1.7
4

 0.43


The discrimination index (D) of 0.43 indicates that the short-answer question discriminates between the upper and lower groups of students at a high level (refer to Figure 9.3). As in the computation of the discrimination index of a multiple-choice question for a large group of students, a sample comprising the top 27 per cent and the bottom 27 per cent of students may be used to provide a good estimate.
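Both indices for the short-answer item can be reproduced with a small Python sketch based on the score distributions in Tables 9.1 and 9.2 (the variable names are illustrative).

```python
# Score distributions of the two groups, taken from Table 9.2.
high_scores = [2] * 1 + [3] * 4 + [4] * 5            # average 3.4
low_scores = [0] * 1 + [1] * 3 + [2] * 4 + [3] * 2   # average 1.7
score_range = 4 - 0                                   # possible range of scores

all_scores = high_scores + low_scores
p = (sum(all_scores) / len(all_scores)) / score_range
d = (sum(high_scores) / len(high_scores)
     - sum(low_scores) / len(low_scores)) / score_range

print(round(p, 2))   # difficulty index, about 0.64
print(round(d, 3))   # discrimination index, 0.425 (reported as 0.43 in the text)
```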
The following are two possible reasons for poorly discriminating items:
(a) The item tests something different from what the majority of items in the test measure; or
(b) The item is poorly written and confuses the students.

Thus, when examining a low discriminating item, it is advisable to check whether:
(a) The wording and format of the item are problematic; and
(b) The item is testing something different from what was intended for the test.


ACTIVITY 9.2

1. The following is the performance of students in the high marks and
the low marks groups in a short-answer essay question.

Score                       0   1   2   3   4
High marks group (n = 10)   2   2   3   1   2
Low marks group (n = 10)    3   2   2   3   0

(a) Calculate the difficulty index.

(b) Calculate the discrimination index.

Discuss the findings on the myINSPIRE online forum.

2. A teacher gave a 35-item Economics test to 42 students. For Question 16, eight out of the 11 students from the high marks group got the correct answer, compared with four out of 11 students from the low marks group.

(a) Calculate the discrimination index for Question 16.

(b) Does Question 16 have a high or low discrimination index?

Post your answers on the myINSPIRE online forum.

9.6 RELATIONSHIP BETWEEN DIFFICULTY
INDEX AND DISCRIMINATION INDEX

Theoretically, the more difficult or the easier a question (or item) is, the lower its discrimination index will be. Stanley and Hopkins (1972) provided a theoretical model to explain the relationship between the difficulty index and discrimination index of a particular question or item (refer to Figure 9.4).


Figure 9.4: Theoretical relationship between difficulty index and discrimination index
Source: Stanley and Hopkins (1972)

According to the model, a difficulty index of 0.2 can result in a discrimination index of about 0.3 for a particular item (which may be described as an item of "moderate discrimination"). Note that as the difficulty index increases from 0.1 to 0.5, the discrimination index increases as well. When the difficulty index reaches 0.5 (described as an item of "moderate difficulty"), the discrimination index is +1.00 (very high discrimination). Interestingly, a difficulty index of more than 0.5 leads to a decrease in the discrimination index.

For example, a difficulty index of 0.9 results in a discrimination index of about 0.2, which is described as low to moderate discrimination. What does this mean? Recall that a high difficulty index means a large percentage of students answered the item correctly. The easier a question, the harder it is for that question or item to discriminate between those students who know and those who do not know the answer to the question.


Similarly, when the difficulty index is about 0.1 (meaning that few students answered the item correctly), the discrimination index drops to about 0.2. What does this mean? The more difficult a question, the harder it is for that question or item to discriminate between those students who know and those who do not know the answer to the question.

ACTIVITY 9.3

1. What can you conclude about the relationship between the difficulty
index of an item and its discrimination index?

2. Do you take these factors into consideration when giving an
objective test to students in your school? Justify.

Share your answers with your coursemates in the myINSPIRE online
forum.

9.7 DISTRACTOR ANALYSIS

In addition to examining the performance of an entire test item, teachers are also
interested in examining the performance of individual distractors (incorrect
answer options) on multiple-choice items. By calculating the proportion of
students who chose each answer option, teachers can identify which distractors
are "working" and appear attractive to students who do not know the correct
answer, and which distractors are simply taking up space and not being chosen by
many students. To eliminate blind guessing which results in a correct answer
purely by chance (which hurts the validity of a test item), teachers want as many
plausible distractors as is feasible. Analyses of response options allow teachers to
fine-tune and improve items they may wish to use again with future classes. Let
us examine performance on an item or question (refer to Figure 9.5).

Figure 9.5: Effectiveness of distractors


Generally, a good distractor is able to attract more "low marks" students to select that particular response or distract "high marks" students towards selecting that particular response. What determines the effectiveness of distractors? Figure 9.5 shows how 24 students selected the options A, B, C and D for a particular question. Option B is a less effective distractor because many "high marks" students (n = 5) selected it. Option D is a relatively good distractor because two students from the "high marks" group and five students from the "low marks" group selected this option. The analysis of response options shows that those who missed the item were about equally likely to choose answer B and answer D. No students chose answer C, meaning it does not act as a distractor. Students were not choosing between four answer options on this item; they were really choosing between only three options, as they were not even considering answer C. This makes guessing correctly more likely, which hurts the validity of the item. The discrimination index can be improved by modifying and improving options B and C.
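A small Python sketch of this kind of option-by-option tally is given below; the option counts are hypothetical and only loosely modelled on Figure 9.5.

```python
from collections import Counter

# Hypothetical option choices for one multiple-choice item, recorded
# separately for the high-marks and low-marks groups (12 students each).
high_choices = ["A"] * 5 + ["B"] * 5 + ["D"] * 2
low_choices = ["A"] * 5 + ["B"] * 2 + ["D"] * 5
correct_option = "A"

high_counts = Counter(high_choices)
low_counts = Counter(low_choices)
for option in "ABCD":
    flag = " (key)" if option == correct_option else ""
    print(f"{option}{flag}: high = {high_counts[option]}, low = {low_counts[option]}")

# A distractor chosen by nobody (here, C) adds nothing to the item, while a
# distractor chosen mainly by the low-marks group is doing its job.
```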

ACTIVITY 9.4

Which British Resident was killed by Maharajalela in Pasir Salak?

Options               A (Hugh Low)   B (Birch)   C (Brooke)   D (Gurney)   No Response
High marks (n = 15)        4              7            0            4            0
Low marks (n = 15)         6              3            2            4            0

The answer is B.

Analyse the effectiveness of the distractors. Discuss your answer with
your coursemates on the myINSPIRE online forum.


9.8 PRACTICAL APPROACH IN ITEM ANALYSIS

Some teachers may find the techniques discussed earlier time consuming, and this cannot be denied, especially for a test consisting of 40 items. However, there is a more practical approach which may take less time. Imagine that you have administered a 40-item test to a class of 30 students. It would take a lot of time to analyse the effectiveness of each item, and this may discourage teachers from analysing every item in a test. Here is a quicker method:

Step 1
Arrange the 30 answer sheets from the highest score obtained to the lowest score
obtained.

Step 2
Select the answer sheet that obtained a middle score. Group all answer sheets above this score as the "high marks" group (mark an "H" on these answer sheets). Group all answer sheets below this score as the "low marks" group (mark an "L" on these answer sheets).

Step 3
Divide the class into two groups (high and low) and distribute the "high" answer sheets to the high group and the "low" answer sheets to the low group. Assign one student in each group to be the counter.

Step 4
The teacher then asks the class, "The answer for Question #1 is 'C'. Those who got it correct, raise your hand."

Counter from "H" group: "Fourteen for group H."
Counter from "L" group: "Eight from group L."

Step 5
The teacher records the responses on the whiteboard as follows:

Question        High   Low   Total of Correct Answers
Question #1      14     8    22
Question #2      12     6    18
Question #3      16     7    23
...
Question #n       n     n     n


Step 6
Calculate the difficulty index for Question #1 as follows:

Difficulty index  RH  RL  14  8  0.73
30 30

Step 7
Compute the discrimination index for Question #1 as follows:

Discrimination index  RH  RL  14  8  6  0.40
12 30 15 15

Note that earlier, we took 27 per cent of answer sheets in the "high marks" group and 27 per cent of answer sheets in the "low marks" group from the total answer sheets. However, in this approach we divided the total answer sheets into two groups. There is no middle group. The important thing is to use a large enough fraction of the group to provide useful information. Selecting the top and bottom 27 per cent of the group is recommended for a more refined analysis. This method may be less accurate but it is a "quick and dirty" method.
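The quick method of Steps 6 and 7 can also be written as a short Python sketch; the counts below are those from the worked example.

```python
def quick_item_indices(correct_high, correct_low, class_size):
    """Difficulty and discrimination indices when the class is simply split
    into a high half and a low half, with no middle group."""
    difficulty = (correct_high + correct_low) / class_size
    discrimination = (correct_high - correct_low) / (class_size / 2)
    return difficulty, discrimination

p, d = quick_item_indices(correct_high=14, correct_low=8, class_size=30)
print(round(p, 2), round(d, 2))   # 0.73 and 0.4
```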

ACTIVITY 9.5

Compare the difficulty index and discrimination index obtained using
this rough method with the theoretical model by Stanley and Hopkins
(1972) in Figure 9.4. Are the indexes very far out?

Share your answer with your coursemates in the myINSPIRE online
forum.


9.9 USEFULNESS OF ITEM ANALYSIS TO
TEACHERS

After each test or assessment, it is advisable to carry out item analysis of the test
items because the information from the analysis would be useful to teachers.
Among the benefits they can get from the analysis are as follows:

(a) From the discussion in the earlier subtopics, it is obvious that the results of
item analysis could provide answers to the following questions:

(i) Did the item function as intended?

(ii) Were the items of appropriate difficulty?

(iii) Were the items free from irrelevant clues and other defects?

(iv) Was each of the distractors effective (in multiple-choice questions)?

Answers to these questions can be used to select or revise test items for future use. This would improve the quality of the test items and the test paper to be used in future. It also saves teachers' time in preparing test items because good items can be stored in the item bank.

(b) Item analysis data can provide a basis for efficient class discussion of the test results. Knowing how effectively each test item functions in measuring the achievement of the intended learning outcome and how students perform on each item, teachers can give students feedback that is more objective and informative, and thus have a more fruitful discussion with them.

For example, teachers can highlight the misinformation or misunderstanding reflected in the choice of particular distractors on multiple-choice questions or frequently repeated errors on essay-type questions, thereby enhancing the instructional value of assessment. If, during the discussion, the item analysis reveals that there are technical defects in the items or the marking scheme, students' marks can also be rectified to ensure a fairer test.

(c) Item analysis data can be used for remedial work. The analysis will reveal the specific areas in which the students are weak. Teachers can use the information to focus remedial work directly on the particular areas of weakness. For example, the distractor analysis may show that a specific distractor has low discrimination, with a large number of students from both the high marks and low marks groups choosing that option. This could suggest that there is some misunderstanding of a particular concept. Remedial lessons can thus be planned to address the problem.


(d) Item analysis data can reveal weaknesses in teaching and provide useful
information to improve teaching. For example, an item may have a low difficulty index despite being properly constructed, suggesting that most students fail to answer it satisfactorily. This might indicate that the students have not mastered the particular syllabus content being assessed. This could be due to weaknesses in instruction and thus necessitates the implementation of more effective teaching strategies by the teachers. Furthermore, if the item proves repeatedly difficult for students, there might be a need to revise the curriculum.

(e) Item analysis procedures provide a basis for teachers to improve their skills
in test construction. As teachers analyse students' responses to items, they become aware of the defects of the items and what causes them. When revising the items, they gain experience in rewording the statements so that they are clear, rewriting the distractors so that they are more plausible and
modifying the items so that they are at a more appropriate level of difficulty.
As a consequence, teachers improve their test construction skills.

9.10 CAUTION IN INTERPRETING ITEM
ANALYSIS RESULTS

Despite the usefulness of item analysis, the results from such an analysis are
limited in many ways and must be interpreted cautiously. The following are some
of the major precautions to observe:

(a) Item discriminating power does not indicate item validity. A high
discrimination index merely indicates that students from the high marks
group perform relatively better than the students from the low marks group.
The division of the high and low marks groups is based on the total test score
obtained by each student, which is an internal criterion. By using the internal
criterion of total test score, item analysis offers evidence concerning the
internal consistency of the test rather than its validity. The validity of a test
needs to be judged by an external criterion, that is, to what extent the test
assesses the learning outcomes intended.

(b) The discrimination index is not always an indicator of item quality. For
example, a low index of discriminating power does not necessarily indicate
a defective item. If an item does not discriminate but it has been found to be
free from ambiguity and other technical defects, the item should be retained,
especially in a criterion-referenced test. In such a test, a non-discriminating
item may suggest that all students have achieved the criterion set by the
teacher. As such, the item does not discriminate between the good and poor
students. Another possible reason why low discrimination occurs for an item
is that the item may be very easy or very difficult. Sometimes, however, it is necessary or desirable to retain such an item in order to measure a representative sample of learning outcomes and course content. Moreover,
an achievement test is usually designed to measure several different types of
learning outcomes (knowledge, comprehension, application and so on). In
such a case, there will be learning outcomes that are assessed by fewer test
items and these items will have low discrimination because they have less
representation in the total test score. Removing these items from the test is
not advisable as it will affect the validity of the test.

(c) Traditional item analysis data are tentative. They are not fixed but are influenced by the type and number of students being tested and the instructional procedures employed. The data would thus change with every administration of the same test items. So, if repeated use of items is possible, item analysis should be carried out for each administration of each item. The tentative nature of item analysis should therefore be taken seriously and the results interpreted cautiously.

9.11 ITEM BANK

What is an item bank?

An item bank is a large collection of easily accessible questions or items that
have been administered over a period of time.

For achievement tests which assess performance in a body of knowledge such as
Geography, History, Chemistry or Mathematics, the questions that can be asked
are rather limited. Hence, it is not surprising that previous questions are
"recycled" with some minor changes and administered to a different group of
students. Making good test items is not a simple task and can be time consuming
for teachers. Hence, an item or question bank would be of great assistance to
teachers.

An item bank consists of questions that have been analysed and stored because
they are good items. Each stored item will have information on its difficulty index
and discrimination index. Each item is stored according to what it measures,
especially in relation to the topics of the curriculum. These items will be stored in
the form of a table of specifications indicating the content being measured as well
as the cognitive levels measured. For example, you will be able to draw from the
item bank items measuring the application of concepts for the topic on
"electricity". You will also be able to draw items from the bank with different
difficulty levels. Perhaps, you want to arrange easier questions at the beginning of
the test so as to build confidence in students and then gradually introduce
questions of increasing difficulty.

With computerised databases, item banks are easy to access. Teachers will have at
their disposal hundreds of items from which they can draw upon when
developing classroom tests. This would certainly help them with the tedious and
time-consuming task of having to construct items or questions from scratch.
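As a simple illustration (the records and field names below are hypothetical), a computerised item bank can be represented as a collection of item records that a teacher filters by topic, cognitive level and difficulty when assembling a test.

```python
# A miniature, hypothetical item bank; each record stores the statistics
# gathered from earlier administrations of the item.
item_bank = [
    {"id": 1, "topic": "electricity", "level": "application",
     "difficulty": 0.45, "discrimination": 0.52},
    {"id": 2, "topic": "electricity", "level": "understanding",
     "difficulty": 0.80, "discrimination": 0.31},
    {"id": 3, "topic": "magnetism", "level": "application",
     "difficulty": 0.60, "discrimination": 0.40},
]

# Draw application-level items on "electricity" of moderate difficulty.
selected = [item for item in item_bank
            if item["topic"] == "electricity"
            and item["level"] == "application"
            and 0.3 <= item["difficulty"] <= 0.7]
print([item["id"] for item in selected])   # [1]
```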

Unfortunately, not many educational institutions are equipped with such an item
bank. The more common practice is for teachers to select items or questions from
commercially prepared workbooks, past examination papers and sample items
from textbooks. These sources do not have information about the difficulty index
and discrimination index of items, nor information about the cognitive levels of
questions or what they aim to measure. Teachers will have to figure out for
themselves the characteristics of the items based on their experience in teaching
the content.

However, there are certain issues to consider in setting up a question bank. One of the major concerns is how to place the different test items collected over time on a common scale. The scale should indicate the difficulty of the items, with one scale per subject. Retrieval of items from the bank is made easier when all items are placed on the same scale.

The person in charge must also make every effort to add only quality items to the item pool. Developing and maintaining a good item bank requires a great deal of preparation, planning, expertise and organisation. Though the item response theory (IRT) approach is not a panacea for item banking problems, it can solve many of these issues (IRT is explained further in the next subtopic).


9.12 PSYCHOMETRIC SOFTWARE

Software designed for general statistical analysis, such as SPSS, can often be used for certain types of psychometric analysis. However, there are also many software packages on the market designed specifically to analyse test data.

Classical test theory (CTT) is an approach to psychometric analysis that has weaker assumptions than item response theory and is more applicable to smaller sample sizes. Under CTT, the student's raw test score is the sum of the scores received on the items in the test. For example, Iteman is a commercial software program for classical analysis, while TAP is a free one.

Item response theory (IRT) is a psychometric approach which assumes that the probability of a certain response is a direct function of an underlying trait or traits. Under IRT, the concern is whether the student answered each item correctly or not, rather than the raw test score. The basic concern of IRT is the individual test item rather than the total test score. Student trait or ability and item characteristics are referenced to the same scale. For example, ConQuest is a computer program for item response and latent regression models, and TAM is an R package for item response models.

ACTIVITY 9.6

In the myINSPIRE forum, discuss:
(a) To what extent do Malaysian schools have item banks?
(b) Do you think teachers should have access to computerised item banks? Justify.

• Item analysis is a process which examines the responses to individual test items or questions in order to assess the quality of those items and the test as a whole.

• Item analysis is conducted to obtain information about individual items or questions in a test and how the test can be improved.


 The difficulty index is a quantitative indicator with regard to the difficulty
level of an individual item or question.

 The discrimination index is a basic measure which shows the extent to which
a question discriminates or differentiates between students in the "high marks"
group and "low marks" group.

 Theoretically, the more difficult or the easier a question (or item) is, the lower its
discrimination index will be.

 By calculating the proportion of students who chose each answer option,
teachers can identify which distractors are "working" and appear attractive to
students who do not know the correct answer, and which distractors are
simply taking up space and not being chosen by any student.

 Generally, a good distractor attracts more "low marks" students than "high
marks" students to select that particular response.

 An item bank is a collection of questions or items that have been administered
over a period of time.

 There are many psychometric software programs to help expedite the tedious
calculation process.

Computerised data bank
Difficult question
Difficulty index
Discrimination index
Distractor analysis
Easy question
Good distractor
High marks group
Item analysis
Item bank
Low marks group



Topic 10: Analysis of Test Scores

LEARNING OUTCOMES

By the end of the topic, you should be able to:
1. Differentiate between descriptive and inferential statistics;
2. Calculate various central tendency measures;
3. Explain the use of standard scores;
4. Calculate Z-score and T-score;
5. Describe the characteristics of the normal curve; and
6. Explain the role of norms in standardised tests.

 INTRODUCTION

Do you know that all the data you have collected on the performance of students
have to be analysed? In this final topic, we will focus on the analysis and
interpretation of the data you have collected about the knowledge, skills and
attitudes of your students. Information you have collected about your students can
be analysed and interpreted quantitatively and qualitatively. For the quantitative
analysis of data, various statistical tools are used. For example, statistics are used
to show the distribution of scores on a Geography test and the average score
obtained by a group of students.


10.1 WHY USE STATISTICS?

When you give a Geography test to your class of 40 students at the end of the
semester, you get a score for each student, which is a measurement of a sample of
the student's ability. The behaviour tested could be the ability to solve problems
in Geography, such as reading maps and globes and interpreting graphs. For
example, student A gets a score of 64 while student B gets 32. Does this mean that
the ability of student A is better than that of student B? Does it mean that the ability
of student A is twice the ability of student B? Are the scores 64 and 32 percentages?
These scores or marks are difficult to interpret because they are raw scores. Raw
scores can be confusing if there is no reference made to a "unit". So, it is only logical
that you convert the scores to a unit such as percentages. In this example, you get
64 per cent and 32 per cent.

Even the use of percentages may not be meaningful. For example, getting
64 per cent in the test may be considered "good" if the test was a difficult one. On
the other hand, if the test was easy, then 64 per cent may be considered to be only
"average". In other words, to get a more accurate picture of the scores obtained by
students on the test, the teacher should find out:

(a) Which student obtained the highest marks in the class and the number of
questions correctly answered;

(b) Which student obtained the lowest marks in the class and the number of
questions correctly answered; and

(c) The number of questions correctly answered by all students in the class.

This illustrates that the marks obtained by students in a test should be carefully
examined. It is not enough to just report the marks obtained. More information
should be given about the marks obtained and to do this you have to use statistics.
Some teachers may be afraid of statistics while others may regard it as too time
consuming. In fact, many of us often use statistics without being aware of it. For
example, when we talk about average rainfall, per capita income, interest rates and
percentage increase in our daily lives, we are using the language of statistics. What
is statistics?

Statistics is a mathematical science pertaining to the analysis, interpretation
and presentation of data.


It is applicable to a wide variety of academic disciplines from the physical and
social sciences to the humanities. Statistics have been widely used by researchers
in education and by classroom teachers. In applying statistics in education, one
begins with a population to be studied. This could be all Form Two students in
Malaysia which number about 450,000 or all secondary school teachers in the
country.

For practical reasons, rather than compiling data about an entire population, we
usually select or draw a subset of the population called a sample. In other words,
the 40 Form Two students that you teach are a sample of the population of Form
Two students in the country. The data you collect about the students in your class
can be subjected to statistical analysis, which serves two related purposes, namely,
description and inference.

(a) Descriptive Statistics
You use these statistical techniques to describe how your students
performed. For example, you use descriptive statistics techniques to
summarise data in a useful way either numerically or graphically. The aim is
to present the data collected so that it can be understood by teachers, school
administrators, parents, the community and the Ministry of Education. The
common descriptive techniques used are the mean or average and standard
deviation. Data may also be presented graphically using various kinds of
charts and graphs.

(b) Inferential Statistics
You use inferential statistical techniques when you want to infer about the
population based on your sample. You use inferential statistics when you
want to find out the differences between groups of students, the relationship
between variables or when you want to make predictions about student
performance. For example, you want to find out whether the boys did better
than the girls or whether there is a relationship between performance in
coursework and the final examination. The inferential statistics often used
are the t-test, ANOVA and linear regression.
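
As a minimal sketch of the boys-versus-girls comparison mentioned above, the following Python code runs an independent-samples t-test. The marks are made up for illustration and the sketch assumes the SciPy library is available.

    from scipy import stats

    # Hypothetical end-of-semester marks for two groups of students
    boys_marks = [55, 62, 47, 71, 58, 66, 49, 60]
    girls_marks = [64, 70, 58, 75, 61, 68, 66, 72]

    # Independent-samples t-test: is the difference between the group means
    # larger than we would expect from sampling variation alone?
    result = stats.ttest_ind(boys_marks, girls_marks)
    print("t =", round(result.statistic, 2), "p =", round(result.pvalue, 3))
    # A small p-value (commonly below 0.05) suggests a genuine difference
    # between the groups rather than chance variation.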


10.2 DESCRIBING TEST SCORES

Let us assume that you have just given a test on Bahasa Melayu to a class of
35 students in Form One. After marking the scripts, you have a set of scores for
each of the students in the class and you want to find out more about how your
students performed. Figure 10.1 shows the distribution of the scores obtained
by the students in the test.

Figure 10.1: The distribution of Bahasa Melayu marks

The "frequency" column shows how many students obtained each mark and the
corresponding percentage is shown in the "percentage" column. You can describe these
scores using two types of measures, namely, central tendency and dispersion.
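
The frequency and percentage columns in Figure 10.1 can be produced directly from a list of raw marks. The sketch below uses a short, made-up list of marks because the full set of 35 marks is not reproduced here.

    from collections import Counter

    # Hypothetical raw marks for a small class
    marks = [52, 57, 57, 40, 64, 52, 57, 45, 64, 52]

    frequency = Counter(marks)   # how many students obtained each mark
    total = len(marks)

    for mark in sorted(frequency):
        count = frequency[mark]
        percentage = 100 * count / total
        print("mark", mark, "frequency", count, "percentage", round(percentage, 1))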


10.2.1 Central Tendency

The term "central tendency" refers to the "middle" value and is measured using
the mean, median and mode. It is an indication of the location of the scores. Each
of these three measures is calculated differently, and which one to use will depend
on the situation and what you want to show.

(a) Mean
The mean is the most commonly used measure of central tendency. When
we talk about an "average", we usually refer to the mean. The mean is simply
the sum of all the values (marks) divided by the total number of items
(students) in the set. The result is referred to as the arithmetic mean. Using
the data from Figure 10.1 and applying the formula given, you can calculate
the mean.

Mean = Σx / N = (35 + 40 + 41 + ... + 75) / 35 = 1863 / 35 = 53.23

(b) Median
The median is determined by sorting the scores from lowest to highest and
taking the score that is in the middle of the sequence. For the example in
Figure 10.1, the median is 52: there are 17 students with scores below 52 and
17 students with scores above 52. If there is an even number of students,
there will not be a single middle point, so you calculate the median by taking
the mean of the two middle scores, that is, by dividing their sum by 2.

(c) Mode
The mode is the most frequently occurring score in the data set. Which number
appears most often in your data set? In Figure 10.1, the mode is 57 because
seven students obtained that score. However, a data set can have more than
one mode; if it has two modes, the distribution is bimodal.

Distributions of scores may be graphed to demonstrate visually the
relationship among the scores in a group. In such graphs, the horizontal axis
or x-axis is the continuum on which the individuals are measured. The vertical
axis or y-axis is the frequency (or the number) of individuals earning any given
score shown on the x-axis. Figure 10.2 shows you a histogram representing
the scores for the Bahasa Melayu test obtained by a group of 35 students as
shown earlier in Figure 10.1.


Figure 10.2: Graph showing the distribution of Bahasa Melayu test scores
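
All three measures of central tendency can be computed with Python's standard statistics module; the sketch below uses a short, hypothetical list of marks rather than the full Figure 10.1 data.

    import statistics

    marks = [35, 40, 41, 52, 57, 57, 57, 64, 68, 75]   # hypothetical marks

    print("mean  :", statistics.mean(marks))     # sum of marks / number of marks
    print("median:", statistics.median(marks))   # middle value of the sorted marks
    print("mode  :", statistics.mode(marks))     # most frequently occurring mark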

SELF-CHECK 10.1

1. What is the difference between descriptive statistics and inferential
statistics?

2. What is the difference between mean, median and mode?

10.2.2 Dispersion

Although the mean tells us about the group's average performance, it does not tell
us how close to the average or mean the students scored. For example, did every
student score 80 per cent on the test or were the scores spread out from 0 to
100 per cent? Dispersion refers to how spread out the scores are, and among the
measures used to describe this spread are the range and the standard deviation.
(a) Range
The range is the distance between the extremes of a distribution, that is, the
difference between the highest and lowest scores obtained in the test.


(b) Standard Deviation
Standard deviation refers to how much the scores (obtained by students)
deviate or differ from the mean. Table 10.1 shows the scores obtained by
10 students on a Science test.

Table 10.1: Scores on a Science Test Obtained by 10 Students

Marks (x)      x - x̄             (x - x̄)²
35             35 - 39 = -4       (-4)² = 16
39             39 - 39 = 0        (0)² = 0
45             45 - 39 = 6        (6)² = 36
40             40 - 39 = 1        (1)² = 1
32             32 - 39 = -7       (-7)² = 49
42             42 - 39 = 3        (3)² = 9
37             37 - 39 = -2       (-2)² = 4
44             44 - 39 = 5        (5)² = 25
36             36 - 39 = -3       (-3)² = 9
41             41 - 39 = 2        (2)² = 4

Sum = 390
Mean (x̄) = 39
N = 10
Σ(x - x̄)² = 153

Based on the raw scores, you can calculate the standard deviation of a sample
using the formula given.

Standard deviation = √[Σ(x - x̄)² / (N - 1)] = √(153 / 9) = √17 = 4.12

The steps in calculating the standard deviation are as follows:

(i) The first step is to find the mean, which is 390 divided by 10 (the
number of students) = 39;

(ii) Next, subtract the mean from each score to obtain the x - x̄ column, and
square each difference to obtain the (x - x̄)² column; note that all numbers in
the squared column are positive. The squared differences are then summed
to give 153; and


(iii) The standard deviation is the positive square root of 153 divided by 9
(that is, N - 1), which gives 4.12. The sketch below repeats this calculation.
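
This is a minimal Python sketch of the same calculation, using the ten Science marks from Table 10.1 and the sample formula with N - 1 in the denominator.

    import math

    marks = [35, 39, 45, 40, 32, 42, 37, 44, 36, 41]   # marks from Table 10.1

    mean = sum(marks) / len(marks)                          # step (i): the mean
    squared_deviations = [(x - mean) ** 2 for x in marks]   # step (ii): (x - mean) squared
    sd = math.sqrt(sum(squared_deviations) / (len(marks) - 1))   # step (iii)

    print("mean =", round(mean, 2), "standard deviation =", round(sd, 2))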

To better understand what the standard deviation means, refer to
Figure 10.3 which shows the spread of scores with the same mean but
different standard deviations.

Figure 10.3: Distribution of scores with varying standard deviations

Based on Figure 10.3:

(i) For Class A, with a standard deviation of 4.12, approximately 68 per cent of
students (those within 1 standard deviation of the mean) scored between
34.88 and 43.12;

(ii) For Class B, with a standard deviation of 2, approximately 68 per cent of
students scored between 37 and 41; and

(iii) For Class C, with a standard deviation of 1, approximately 68 per cent of
students scored between 38 and 40.
Note that the smaller the standard deviation, the more the scores tend to
"bunch" around the mean, and vice versa. Hence, it is not enough to
examine the mean alone because the standard deviation tells us a lot about
the spread of the scores around the mean. Which class do you think
performed better? In this example, all three classes have the same mean, so
the mean alone cannot tell us. Class C, however, was the most consistent,
with approximately two-thirds of its students scoring between 38 and 40.


Skew refers to the asymmetry of a distribution. A distribution is skewed if one
of its tails is longer than the other. Figure 10.4 shows the distribution of
the scores obtained by 38 students on a History test.

Figure 10.4: Distribution of History test scores

The distribution has a negative skew because its longer tail points in the
negative direction. What does this mean? It means that most students obtained
high scores on the History test, which may indicate either that the test was too
easy or that the teaching methods and materials were successful in bringing
about the desired learning outcomes.
Now, let us look at Figure 10.5 which shows the distribution of the scores
obtained by 38 students on a Biology test.


Figure 10.5: Distribution of Biology test scores

The distribution has a positive skew because its longer tail points in the
positive direction. What does this mean? It means that most students obtained
low scores on the Biology test, which may indicate that the test was too
difficult. Alternatively, it could imply that the questions were not clear or that
the teaching methods and materials did not bring about the desired learning
outcomes.
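
Where a numerical check of the direction of skew is wanted, the sample skewness can be computed directly. The sketch below assumes the SciPy library is available and uses two small, made-up sets of marks to illustrate the sign convention described above.

    from scipy.stats import skew

    # Made-up marks where most students scored high (longer tail to the left)
    history_marks = [45, 70, 75, 78, 80, 82, 84, 85, 88, 90]
    # Made-up marks where most students scored low (longer tail to the right)
    biology_marks = [20, 22, 25, 27, 28, 30, 32, 35, 60, 85]

    print("History skewness:", round(skew(history_marks), 2))   # negative value
    print("Biology skewness:", round(skew(biology_marks), 2))   # positive value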

SELF-CHECK 10.2

What is the difference between range and standard deviation?


ACTIVITY 10.1

1. What is the difference between a standard deviation of 2 and a
standard deviation of 5?

2. A teacher administered an English test to 10 students in her class.
The students earned the following marks: 14, 28, 48, 52, 77, 63, 84,
87, 90 and 98. For the distribution of marks, find the following:
(a) Mean;
(b) Median;
(c) Range; and
(d) Standard deviation.

Post your answers on the myINSPIRE online forum.

10.3 STANDARD SCORES

After having given a test, most teachers report the raw scores obtained by students.
For example, Zulinda, a Form Five student, earned the following scores in the end-
of-semester examination:
(a) Science: 80;
(b) History: 72; and
(c) English: 40.

With these raw scores alone, what can you say about ZulindaÊs performance on
these tests or her standing in the class? Actually, you cannot say very much.
Without knowing how these raw scores compare to the total distribution of raw
scores for each subject, it is difficult to draw any meaningful conclusion regarding
her relative performance in each of these tests.

How do you make these raw scores meaningful? Let us assume that the scores of
all three tests are approximately normally distributed.


The mean and standard deviation of the three tests are as shown in Table 10.2.

Table 10.2: Mean and Standard Deviation for the Three Tests

Subject Mean Standard Deviation
Science 90 10
History 60 12
English 40 15

Based on this additional information, what statements can you make regarding
ZulindaÊs relative performance on each of these three tests? The following are
some conclusions you can make:

(a) Zulinda did best on the History test and her raw score of 72 falls at a point
one standard deviation above the mean;

(b) Her next best score is English and her raw score of 40 falls exactly on the
mean of the distribution of the scores; and

(c) Finally, even though her raw score for Science was 80, it falls one standard
deviation below the mean.

Converting ZulindaÊs raw scores into Z-scores, we can say that she achieved a:

(i) Z-score of +1 for History;

(ii) Z-score of 0 for English; and

(iii) Z-score of -1 for Science.

10.3.1 Z-score

What is a Z-score? How do you calculate the Z-score? A Z-score is a type of
standard score. The term standard score is the general name for converting a
raw score to another scale using a predetermined mean and a predetermined
standard deviation. Z-scores tell how many standard deviations away from the
mean the score is located. Z-scores can be positive or negative. A positive Z-score
indicates that the value is above the mean, while a negative Z-score indicates that
the value falls below the mean. A Z-score is a raw score that has been transformed
or converted to a scale with a predetermined mean of 0 and a predetermined
standard deviation of 1. For instance, a Z-score of -6 means that the score is 6
standard deviations below the mean.


The formula used for transforming a raw score into a Z-score involves subtracting
the mean from the raw score and then dividing it by the standard deviation.

Z x x
SD

For example, let us use this formula to convert Kumar's mark of 52 obtained in a
Geography test. The mean for the test is 70 and the standard deviation is 7.5.

Z = (x - x̄) / SD = (52 - 70) / 7.5 = -18 / 7.5 = -2.4

The Z-score calculated for the raw score of 52 is -2.4, which means that Kumar's
score for the Geography test is located 2.4 standard deviations below the mean.
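
A minimal Python sketch of this conversion, using Kumar's mark together with the same mean and standard deviation:

    def z_score(raw, mean, sd):
        """Number of standard deviations the raw score lies above (+) or below (-) the mean."""
        return (raw - mean) / sd

    # Kumar's Geography mark, with the class mean and standard deviation
    print(z_score(52, 70, 7.5))   # -2.4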

10.3.2 Example of Using the Z-score to Make Decisions

A teacher administered two Bahasa Melayu tests to students in Form Four A, Form
Four B and Form Four C. The two top students in Form Four C were Seng Huat
and Mei Ling. The teacher was planning to give a prize for the best student in
Bahasa Melayu in Form Four C but was not sure who the better student was.

                       Test 1    Test 2
Seng Huat                30        50
Mei Ling                 45        35
Mean                     42        47
Standard deviation        7         8


The teacher could use the mean to determine who was better. However, both
students have the same mean. How does the teacher decide? Using a Z-score can
tell the teacher how far from the mean are the scores of the two students and thus
who performed better. Using the formula, the teacher calculates the Z-score shown
as follows:

                Test 1                   Test 2                   Total
Seng Huat       (30 - 42)/7 = -1.71      (50 - 47)/8 = 0.375      -1.34
Mei Ling        (45 - 42)/7 = 0.43       (35 - 47)/8 = -1.50      -1.07

Upon examination of the calculation, the teacher finds that both Seng Huat and
Mei Ling have negative Z-scores for the total of both tests. However, Mei Ling has
a higher total Z-score (-1.07) compared with Seng Huat's total Z-score (-1.34). In
other words, Mei Ling's total score was closer to the mean and therefore the
teacher concludes that Mei Ling did better than Seng Huat.
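
The comparison can be checked with the same kind of helper function; this sketch simply assumes the means and standard deviations given in the table above.

    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    tests = [(42, 7), (47, 8)]   # (mean, standard deviation) for Test 1 and Test 2
    students = {"Seng Huat": [30, 50], "Mei Ling": [45, 35]}

    for name, marks in students.items():
        total = sum(z_score(x, m, s) for x, (m, s) in zip(marks, tests))
        print(name, "total Z-score =", round(total, 2))
    # Mei Ling's total (about -1.07) is higher than Seng Huat's (about -1.34),
    # so her overall standing relative to the class is better.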

Z-scores are relatively simple to use but many educators are reluctant to use them,
especially when test scores are reported as negative numbers. How would you like
to have your mathematics score reported as -4? For this reason, alternative
standard scores such as the T-score are used.

10.3.3 T-score

The T-score was developed by W. McCall in the 1920s and is one of the many
standard scores currently used. T-scores are widely used in the fields of
psychology and education, especially when reporting performance in
standardised tests. The T-score is a standardised score with a mean of 50 and a
standard deviation of 10. The formula for calculating the T-score is:

T = 10(z) + 50


For example, a student has a Z-score of -1.0 and after converting it to T-score, you
get the following:

T = 10 (z) + 50
= 10 (-1.0) + 50
= (-10) + 50
= 40

When converting Z-scores to T-scores, you should be careful not to drop the
negatives. Dropping the negatives will result in a completely different score.
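
A small sketch of the conversion, showing why the sign of the Z-score must be kept:

    def t_score(z):
        """Convert a Z-score to a T-score (mean 50, standard deviation 10)."""
        return 10 * z + 50

    print(t_score(-1.0))   # 40.0, keeping the negative sign
    print(t_score(1.0))    # 60.0, what you would get if the sign were dropped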

ACTIVITY 10.2

1. Convert the following Z-scores to T-scores.

Z-score T-score

+1.0

-2.4

+1.8

2. Why would you use T-scores rather than Z-scores when reporting
the performance of students in the classroom?

Share your answers with your coursemates on the myINSPIRE online
forum.

10.4 THE NORMAL CURVE

The normal curve (also called the "bell curve") is a hypothetical curve that describes
the distribution of many naturally occurring phenomena. In a normal distribution,
the mean, median and mode have the same value. For example, if we were to
measure a characteristic such as the height of Malaysian men, we would expect
most heights to cluster around the average of 162.5cm (5 feet 4 inches), with fewer
men at the extremes.
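
One quick way to see that the mean and median coincide in a normal distribution is to simulate one. The sketch below draws a large random sample of heights; the standard deviation of 6cm is an illustrative assumption, not a figure from the text.

    import random
    import statistics

    random.seed(1)
    # Simulate the heights (in cm) of 100,000 men, assumed to be normally
    # distributed around 162.5cm
    heights = [random.gauss(162.5, 6) for _ in range(100000)]

    print("mean  :", round(statistics.mean(heights), 1))    # close to 162.5
    print("median:", round(statistics.median(heights), 1))  # also close to 162.5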
