
Caliber

An Assessment of Cognitive Skills
A Technical Report

A measure of the cognitive processes that drive success
in classrooms, college and the workplace

John W. Young, Ph.D.

January 2018


Contents

Acknowledgements
Abstract
Introduction to Caliber
Why Measure Cognitive Abilities?
Why Measure Cognitive Abilities Explicitly?
Development process for Caliber
Evidence Regarding the Validity of the Caliber Assessment
References
Appendix


Acknowledgements

A sincere note of thanks to the schools that contributed data to make this
research possible: Aditya Birla World Academy, Bombay International School and
Dhirubhai Ambani International School.


Abstract

This technical report summarizes the accumulated evidence regarding the validity
of the Caliber assessment. Caliber is owned by Callido Learning, and is designed
to measure the cognitive abilities of students aged 12-18 in three domains: critical
thinking and problem solving, research and information literacy, and communication.
There is a well-established association between these skills and later academic and
professional success. The evidence regarding the validity of Caliber is derived from
three sources: information regarding the development of the assessment,
psychometric analyses of the items and assessment, and results from a concurrent
validity study. To ensure construct and content validity, the assessment development
process relied heavily upon a scope and sequence of skills developed in conjunction
with expert consultants.
The psychometric analyses showed that the items and forms functioned as designed
and that the results are similar to those observed for other high-quality, large-scale
assessments. Lastly, moderately high correlations (r = .50+) of Caliber scores with
two criteria derived from IGCSE exam results for a sample of 97 students demonstrate
the concurrent validity of the assessment.


Introduction to Caliber

Caliber is an assessment which is designed to measure students’ cognitive abilities
in three domains: critical thinking and problem solving, research and information
literacy, and communication. Each of these skills has been divided into further sub-
skill strands for granularity.

The assessment is a general test of developed cognitive abilities for students in the age
range of 12 to 18. It requires no prior knowledge and is completely curriculum independent.

Why Measure Cognitive Abilities?

Cognitive skills are at the core of student success

The above-mentioned cognitive skills (critical thinking and problem solving, research
and information literacy, and communication) are core to a person's success
throughout their academic and professional life. In school, leading bodies such
as Cambridge International Examinations and the International Baccalaureate
Organization design their assessments such that much of a student’s scores depend
on their critical thinking skills (compiled from Cambridge International AS and A Level
subjects, 2018; IBO Curriculum, 2018).

Large-scale standardized tests such as the SAT, ACT and LSAT are designed to gauge
a student's underlying cognitive abilities and academic reasoning skills rather than
their content knowledge. A core pillar of college readiness is a student's ability to
carry out independent research, to think critically, and to problem-solve competently
and independently. Research suggests that a major reason for low student graduation
rates at universities is the gap between the typical high school curriculum and the
demands of university learning (Conley, 2007), with students finding tasks at
university significantly more challenging than those they undertook in high school
and being ill-equipped to perform these tasks.

In the workplace, research shows that these cognitive skills are
the most important determinant of a candidate’s prospects.
In 2013, for example, Hart Research Associates surveyed CEOs and other C-suite
executives at more than 300 private sector for-profit and nonprofit organizations
(Hart Research Associates, 2013). 93% agreed that, “A candidate’s demonstrated
capacity to think critically, communicate clearly, and solve complex problems is more
important than their undergraduate major.” In Robert Wood Johnson Foundation’s
July 2009 Jobs to Careers initiative, Randall Wilson wrote: “Early assessment of critical
thinking maximizes workforce efficiency and increases the potential for learning and
educational effectiveness at all levels.” (Wilson, 2009).


In today’s knowledge economy, it is therefore more imperative than ever that there
is accurate and consistent measurement of cognitive abilities over time.

Why Measure Cognitive Abilities Explicitly?

The dominant approach to the development of critical thinking skills has been to
embed these skills within curricular content (Marina & Halpern, 2010). Whilst the
embedding of these skills within content is undeniably an important aspect of
developing them, research shows that students who undergo explicit, direct
instruction in critical thinking skills are more proficient at forming the habits of
mind required to think critically. The reasons for this are two-fold: (i) when these
skills are embedded within curriculum content, there is limited potential for transfer
outside of that discipline; and (ii) it is rare for the delivery of content to be
consistently focused on the development of these abilities rather than on the content
at hand.

Given that explicit instruction in critical thinking skills delivers greater gains in
cognitive abilities (Bangert-Drowns & Bankert, 1990; Cotton, 1991; Dweck, 2002;
Halpern, 1998, 2003), it is important to have a tool which enables measuring and
monitoring these skills over time.

The measurement of cognitive skills using Caliber allows institutions to:

I. Benchmark performance against peers;
II. Identify instructional needs;
III. Measure the impact of instructional intervention;
IV. Inform the admissions process; and
V. Foster a culture which focuses on developing critical thinking skills by design.

Development process for Caliber

This assessment was developed after extensive research into existing tools. That
research revealed a gap: no single tool comprehensively tracked a student's cognitive
abilities over time in a curriculum-independent manner.

The objective was to develop an assessment which would present a true measure of
the core cognitive abilities which drive student attainment and professional success.
In order to achieve this, it was necessary to ensure that:

I. The test items are independent of the student's prior knowledge or background
in any form;

II. The tool's coverage of abilities is comprehensive across the skills;

III. The tool allows users to track progress over a period of time; and

IV. The data are made actionable through close alignment with key academic standards.

The skills assessed by Caliber were determined by reviewing numerous well-accepted
standards. In the K-12 segment, the standards referred to include the Common
Core State Standards, P21 framework for 21st century skills, the International
Baccalaureate’s Approaches to Teaching and Learning, and 21st century skills as
defined by Cambridge International Examinations (Common Core State Standards
Initiative, 2018; International Baccalaureate Organization, 2018; P21, 2018;
Suto, 2013).

In the higher education realm, the standards referred to are David Conley's college-
readiness standards in the United States (Conley, Redefining College Readiness, 2007)
as well as the ACER standards from Australia (Graduate Skill Assessment, 2018).

In order to make the data actionable, it was necessary to provide visibility on student
performance not just at the broader skill level but also at the sub-skill level.
Broader skills were thus divided into finer skill strands, with data being available on
student performance for each of these skill strands. This granularity enables the delivery
of tailored intervention for each student.

The broad skills – critical thinking and problem-solving, research and information
literacy, and communication – were sub-divided into skill strands which can be
measured independently. These strands drew from a number of established
taxonomies, the most influential being Simister’s Thinking Skills INSET (Simister,
2007). A fleshed-out scope and sequence, with descriptors and benchmarks for
each skill strand, enables the tracking of student progress over time.


Evidence Regarding the Validity of the Caliber Assessment

This part of the report documents the accumulated evidence regarding the validity
and reliability of Caliber. Caliber is an assessment which is designed to measure
students’ cognitive abilities in three domains: critical thinking and problem solving,
research and information literacy, and communication. These abilities have been
further sub-divided into sub-skill strands, and there are 13 such strands in total:

Critical Thinking and Problem Solving

• Ability to define a problem
• Interpretation and evaluation of evidence
• Awareness of context and assumptions when evaluating evidence
• Drawing accurate conclusions
• Identifying suitable strategies for problem solving
• Evaluation of strategies for problem solving

Research and Information Literacy

• Determining the extent of information required
• Evaluating information and its sources critically
• Using information appropriately and synthesizing information from different sources
• Ethical use of information

Communication

• Organization of thoughts and ideas
• Awareness of context and purpose
• Choice of language

The assessment is a general test of developed cognitive abilities for students in the
age range of 12 to 18. It requires no prior knowledge and is completely curriculum
independent. Questions are delivered via computer in audio-visual formats and consist
of selected-response items. Each question is tagged with the appropriate sub-skills,
and a student's performance on a given question contributes to their score on all
associated sub-skills.

Each form of the Caliber assessment has 40 questions in total, with each of the
13 sub-skills tagged to at least 4 questions. Scoring is based upon the number of
questions answered correctly and there is no negative scoring. The assessment is
timed, with the maximum time allowed being 60 minutes, but it is not designed to
be time-pressured.
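To illustrate how this tagging scheme can turn item responses into sub-skill scores, the following Python sketch shows one possible implementation. The item tags, strand names, and responses in it are hypothetical and do not represent Callido's actual scoring code.

from collections import defaultdict

# Hypothetical tagging: each item maps to one or more of the 13 strands.
item_tags = {
    1: ["defining_a_problem", "drawing_accurate_conclusions"],
    2: ["interpretation_and_evaluation_of_evidence"],
    3: ["ethical_use_of_information"],
}

def subskill_scores(responses, item_tags):
    """responses maps item_id -> True if answered correctly, else False."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, strands in item_tags.items():
        for strand in strands:
            total[strand] += 1
            if responses.get(item, False):  # incorrect answers earn nothing; no negative scoring
                correct[strand] += 1
    # Percent correct per strand.
    return {s: 100.0 * correct[s] / total[s] for s in total}

print(subskill_scores({1: True, 2: False, 3: True}, item_tags))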


Validity Research on the Caliber Assessment

Test validity research is a well-established field and we utilized Kane’s validity
framework to construct an argument-based approach regarding the validity of the
Caliber assessment for its intended purpose (Kane, 2001, 2013). In Kane’s validity
framework, interpretations and uses of test scores that are clearly stated and are
well-supported by appropriate evidence are considered to be valid (Kane, 2013). In
contrast, interpretations or uses of test scores that are ill-defined or that involve
questionable inferences or assumptions are not considered valid. The goal of
validation research is to accumulate sufficient evidence so that an objective judgment
can be made regarding the completeness and coherence of the validity argument, the
plausibility of its inferences and assumptions, and the quality and accuracy of the
empirical evidence.

When any new assessment is first used operationally, it is critical to conduct validity
studies in order to evaluate and ensure that the test items, forms, and tasks
associated with the assessment are functioning as intended (see e.g., Young, Morgan,
Rybinski, Steinberg & Wang, 2013). The main goal of the validity activities and studies
described in this report is to evaluate whether the Caliber assessment is functioning
as intended and to provide evidence that can be used to judge whether the claims
and conclusions that Callido Learning has made about Caliber are substantiated.
In this manner, objective evaluations about Caliber can be rendered and, where
needed, improvements can be made to the assessment.

In this report, the empirical evidence regarding the validity of Caliber is derived from
several different sources, which can be categorized as follows:

• Evidence derived from the development of the assessment;

• Evidence based on psychometric analyses of the items and assessment; and

• Evidence developed from a concurrent validity study.

Each of these forms of validity evidence is described below in greater detail. In addition
to the evidence described in this report, the validity of the Caliber assessment is
further supported by the creation and execution of a research agenda by Callido
Learning for the assessment. This research agenda provides guidance and rigor to
the research activities and ensures that a coherent plan for research on the Caliber
assessment exists and is maintained over time.

Validity Evidence Based on the Development of the Assessment

The development process of the Caliber assessment followed guidelines which
were suggested by classroom practitioners with expertise in developing test items
for critical thinking in the relevant age groups. The benchmark for demonstrating
mastery was based on a scope and sequence of skills developed in conjunction
with expert consultants from the American Institute of Enrichment, an American
organization based in Atlanta, Georgia, that develops rubrics for K-12 use.


The test development process explicitly avoids content-specificity to ensure the
assessment does not inadvertently measure a student’s prior knowledge but
is a true measure of the student's skills in response to previously unseen
information. The assessment development guidelines suggested
multiple opportunities to test a sub-skill to minimize the likelihood of the final score
being impacted by guessing. Consequently, the 40 test items on each Caliber form
contribute to a composite measure of 13 identified sub-skills.

Selected-response items were deliberately chosen for use in Caliber. While the
benefits of open-ended tasks in the form of constructed response items were fully
considered, the development process avoided the use of this format due to a few
drawbacks in the context of this assessment:

I. The writing ability required by constructed response items introduces a
confounding element to the test (i.e., it is no longer possible to accurately
gauge whether an item is a true measure of the assessed skills or the ability to
demonstrate these skills in written form).

II. The evaluation of constructed response items is inherently more subjective
and there are consequently extensive administrative measures that would be
required in standardizing and moderating the evaluation process.

The reading ability required for Caliber is benchmarked to grade-appropriate
Lexile levels to ensure that a lack of reading skill is not a barrier to the demonstration of
other skills. In addition, to cater to test-takers of lower reading ability, key information
and prompts are presented in multiple audio-visual formats in addition to text.

Caliber is designed to provide criterion-referenced interpretations rather than
norm-referenced interpretations. This means a test-taker’s position is based solely
on their demonstration of ability and is independent of their ranking in any cohort.
The rationale is that this allows the assessment to be used more broadly, across a
wider range of applications.

Validity Evidence Based on Psychometric Analyses of the Items
and Assessment

Given that Caliber is a new assessment, it is important to conduct psychometric
analyses of the test items and test forms in order to assess whether the items and
forms are functioning correctly as designed. The analysis of Caliber was conducted
using Item Response Theory (IRT), which is presently the most commonly used set
of psychometric models for large-scale assessments. IRT encompasses several
psychometric models for the design, analysis, and scoring of tests, questionnaires,
and other similar instruments for measuring abilities, attitudes, or other variables.
IRT is a theory and model of psychological and educational testing that is based on
the relationship between an individual's performance on a test item and his or her
level of the overall ability or proficiency that the item was designed to measure.


Compared with older models of psychological and educational testing, generally
known collectively as classical test theory, IRT has a number of important advantages:
(i) IRT explicitly models a test-taker’s performance at the individual level rather
than through group statistics, such as correlations. (ii) IRT specifies that individual
performance is based on only a few variables: a test-taker's ability level and certain
item characteristics. (iii) IRT models are probabilistic, in that they specify
the likelihood that a test-taker will answer an item correctly, thus providing a more
realistic representation of the test-taking process.

The psychometric analyses of the Caliber assessment were carried out using the
IRT 2-parameter logistic model (2-PL). The 2-PL model specifies that an individual’s
performance on a test item is based on the individual’s overall ability on the trait
being assessed and two parameters for each item: difficulty and discrimination. The
item's difficulty represents the level of ability at which a test-taker has an even
chance of answering the item correctly, while the item's discrimination represents
how rapidly the probability of a correct answer changes with ability. The
logistic model is a type of regression model that is most commonly applicable when
the outcome variable (i.e., the probability of answering the item correctly) is bounded
between zero and one.
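For reference, the 2-PL model expresses the probability that a test-taker of ability \theta answers item i correctly as a logistic function of the item's discrimination a_i and difficulty b_i:

P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

A larger a_i yields a steeper curve, and a larger b_i shifts the curve toward higher ability levels.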

The 2-PL model is one of the most widely used IRT models and is currently used
to model student performance on large-scale assessments such as TOEFL iBT and
TOEFL Junior Standard (Young, Morgan, Rybinski, Steinberg & Wang, 2013). Sinharay,
Haberman & Jia (2011) compared the 2-PL and 3-PL models using TOEFL iBT data
and found the 2-PL preferable in terms of its performance in modeling students’
language proficiency. They also found that the 2-PL model had better performance
characteristics with smaller sample sizes. Note that in the 3-PL model, the third item
parameter models guessing as a factor in answering an item correctly.
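For comparison, the 3-PL model adds a lower asymptote, the pseudo-guessing parameter c_i, so that even test-takers of very low ability retain some probability of answering correctly:

P_i(\theta) = c_i + (1 - c_i) \cdot \frac{1}{1 + e^{-a_i(\theta - b_i)}}

Setting c_i = 0 recovers the 2-PL model.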

Descriptive Information about the Items and Forms

Caliber is a computer-delivered assessment that is composed of 4-option, selected-
response test items. At present, there are two test forms, Form A and Form B, each
of which contains 40 operational test items, with a time limit of 60 minutes. For each
form, the test items are randomly drawn from a single item bank. In addition, the
forms contain a small number of pre-test items, which are not included in the
scoring of the assessment. Caliber uses the IRT 2-PL model for scoring, and test
scores are reported on a percent-correct scale ranging from 0 to 100.

Psychometric Analyses of the Items and Forms

To date, the two existing forms of the Caliber assessment have been administered
to more than 1,700 test-takers. The test-takers were students enrolled in schools in
six cities in India (Bangalore, Chennai, Delhi, Hyderabad, Mumbai, and Trivandrum)
as well as students in Jakarta, Indonesia. As these are international schools, students
represented a number of different nationalities. All of the test-takers were students


who were enrolled in grades 8 through 12 and ranged in ages from 12 to 18, with
a total of 966 students taking Form A and 740 students taking Form B. The average
age of the students who took Form A was 15.08 with a standard deviation of 1.17
while the average age of the students who took Form B was 15.14 with a standard
deviation of 1.33. Across both forms, the average age of the students was 15.11 with
a standard deviation of 1.24.

Analyses of the Caliber assessment indicated a very high degree of similarity in the
performance of students on the two test forms. On Form A, the average raw score
(mean number correct) was 20.14 with a standard deviation of 5.38 while the average
raw score for Form B was 20.14 with a standard deviation of 5.42. Across both forms,
the average raw score of the test-takers was 20.14 with a standard deviation of 5.40.
The distribution of scores on both forms was nearly symmetrical with a skewness
value of -0.06 for Form A, -0.13 for Form B, and -0.15 across both forms. For both
forms, the distribution of average score by grade level showed an expected pattern
of generally higher scores for students in the higher grades.
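These descriptive statistics can be reproduced with standard tools. The following is a minimal Python sketch; the simulated scores are placeholders, as the actual response data are not public.

import numpy as np
from scipy.stats import skew

# Placeholder scores simulated for illustration only.
rng = np.random.default_rng(0)
raw_scores = rng.binomial(n=40, p=0.50, size=966)

print(f"mean:     {raw_scores.mean():.2f}")
print(f"std dev:  {raw_scores.std(ddof=1):.2f}")
print(f"skewness: {skew(raw_scores):.2f}")  # near zero indicates a symmetric distribution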

Psychometric analysis of the Caliber assessment showed that the items and forms
functioned as designed and intended. The appendix of this report presents graphs
for the following analyses:

• Test characteristic curve for each form

• Item characteristic curves for all items per form

• Item information functions for all items per form

• Test information function for each form

In addition, the individual item characteristic curve for every item on both forms
of the assessment was generated and examined, but these curves are not included
in this report.

The test characteristic curves (TCCs) for Form A and Form B are shown as the first two
graphs in the report’s appendix. The general form of a TCC is that of a monotonically
increasing function. The primary role of the TCC in IRT is to provide a means of
transforming latent ability scores to true scores. In doing so, the score user is given
a result that is related to the number of items answered correctly on the test. If
an assessment is composed of items that are relatively difficult for a population
of test-takers, then the TCC is shifted to the right and test-takers will have lower
expected scores on the assessment than if easier items were used. Conversely, if
the assessment is composed of relatively easy items for the population, the TCC will
be shifted to the left and expected scores on the assessment will be higher.

For both Caliber forms, the TCC shows the relationship between a student's
underlying ability and his or her expected score. Because the TCCs show a smooth,
monotonically increasing relationship between a student's ability and expected score,
the assessment is functioning as designed, since one would expect


higher assessment scores from those with greater ability. Additionally, note that the
TCCs for the two forms are nearly identical, which enables the interchangeability of
the two forms.
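Formally, the TCC maps latent ability to an expected raw score by summing the item response functions over the n = 40 items on a form:

\mathrm{TCC}(\theta) = \sum_{i=1}^{n} P_i(\theta)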

The next two graphs in the appendix include the item characteristic curves (ICCs) for
all of the items on each form. The ICCs indicate, for a specific test item, the relation
between a student’s ability and his or her probability of answering that question
correctly. The slope of each ICC indicates the discrimination of the item while the
item’s difficulty is defined (in the 2-PL model) as the point on the ability scale where
a test-taker has a 50% chance of answering that item correctly. For almost all of
the items on both forms of the assessment, the ICC is monotonically increasing and
has asymptotes on the probability scale of 0 and 1. A few items have an upper
asymptote below 1, which indicates that these are relatively difficult items for this
population of test-takers.

Almost all of the items have ICCs that are typical of high-quality test items used
in large-scale assessments. The possible exceptions are two items on each form that
appear to have negative discrimination (identifiable as the items whose ICCs have
a negative slope). The cause
of the negative discrimination for these items appears to be the use of language in
the stem of the item that was unfamiliar to some test-takers. These items will not be
used in future forms of the Caliber assessment.

The item information functions (IIFs) demonstrate that, for both Forms A and B, the
assessment is entirely composed of items that capably measure students’ cognitive
abilities across the entire range of the score scale. As would be expected, the IIFs
show that most items provide their maximum information around the center of
the score scale. In addition, about half of the items in each form have noticeable
peaks for their IIFs, which indicates a high degree of item discrimination in the
region of the ability scale where the peak is located (the location of the peak is
related to the item’s difficulty). Highly discriminating items are desirable as they
provide a high degree of information about a student’s ability, particularly in the
range of the score scale where the IIF is at its maximum. The IIFs are also useful
because the sum of the IIFs for the items on a form constitutes the test information
function for that form.
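The following Python sketch computes these 2-PL quantities directly; the item parameters shown are hypothetical, as the operational parameters are not published.

import numpy as np

def icc(theta, a, b):
    """2-PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2-PL item information: a^2 * P * (1 - P)."""
    p = icc(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-4, 4, 81)
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # hypothetical (a, b) pairs

# The test information function is the sum of the item information
# functions across all items on the form.
tif = sum(item_information(theta, a, b) for a, b in items)
print(round(float(tif.max()), 2))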

The test information functions (TIFs) are particularly important to review as they
demonstrate how well each Caliber form measures students’ cognitive abilities
across the score scale. Each TIF is computed as the sum of the IIFs across the score
scale. As a criterion for evaluating the Caliber test forms from an IRT perspective, we
chose two values for the TIF, 5 and 10, as standards by which to judge the adequacy
of test information and, by extension, the reliability of the assessment. A TIF value
of 5 is equivalent to a classical test theory reliability estimate of .80, while a TIF value
of 10 is equivalent to a classical test theory reliability estimate of .90 (Hambleton &
Lam, 2009). Both forms have TIF values of 5 or greater in the middle range of the


ability scale, where a high degree of information about a test-taker is most desirable.
As with the ICCs, the TIFs for these forms are typical of high-quality test forms
used in large-scale assessments.
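The stated equivalences follow from a standard conversion between test information and classical reliability:

\rho(\theta) = 1 - \frac{1}{I(\theta)}

so that an information value of 5 gives 1 - 1/5 = .80 and a value of 10 gives 1 - 1/10 = .90.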

Summary of the Psychometric Analyses

The totality of the evidence from the psychometric analyses, in terms of the test
characteristic curve, the item characteristic curves, the item information functions,
and the test information function, indicates that both Caliber forms are functioning
as intended and that their psychometric characteristics are similar to those found
for other high-quality, large-scale assessments. Additionally, the information gained
from these psychometric analyses is useful because it provides a basis for
improvements to future forms of the Caliber assessment.

Validity Evidence Based on a Concurrent Validity Study

Because Caliber is a relatively new assessment, it is useful to benchmark its results
against established assessments that measure the same or similar student traits.
Caliber can be validated concurrently against other assessments if students are
given both assessments at approximately the same time and correlation coefficients
are then calculated for the two sets of scores. This form of evidence is known as
concurrent validity and an approach based on the correlations between two sets of
assessment scores is a standard technique for determining the concurrent validity
of an assessment.

A sample of 97 students who had all taken the same Caliber assessment form
also provided their results from the International General Certificate of Secondary
Education (IGCSE) exams. The IGCSE is developed by Cambridge International
Examinations, part of the University of Cambridge, and is the world's most popular
international academic qualification for students ages 14 to 16. These 97 students
are from three private schools in Mumbai, India and are currently enrolled in
Grade 11. They took their IGCSE exams at the end of Grade 10, in May 2017. The
largest cohort (57 students) is from Dhirubhai Ambani
International School with Bombay International School (15 students) and Aditya Birla
World Academy (23 students) making up the rest of the sample. All of the students
from the first two schools are now enrolled in the International Baccalaureate Diploma
Program (IBDP), while some of the students from Aditya Birla World Academy have
continued with the Cambridge A-level courses.

Two analyses were conducted: the first used as its criterion the total number of
A* and A exam grades (the two highest grade levels) that a student earned on the
IGCSE exams; the second used only the total number of A* grades. For the first
analysis (A* and A grades), the correlation with Caliber assessment scores was .50,
equivalent to an R² of .25 (R² represents the proportion of variance in one variable
explained by another). For the second analysis (A* grades only), the correlation
with Caliber assessment scores was an even higher .57, equivalent to an R² of .324.
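For readers who wish to replicate this style of analysis, the following is a minimal Python sketch; the score pairs are placeholders, not the study data.

import numpy as np
from scipy.stats import pearsonr

# Placeholder score pairs, for illustration only.
caliber = np.array([55, 62, 48, 71, 66, 59, 44, 68])  # Caliber percent-correct scores
igcse   = np.array([ 3,  5,  2,  7,  6,  4,  1,  6])  # count of A*/A grades per student

r, p_value = pearsonr(caliber, igcse)
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")  # R^2: proportion of shared variance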


To place these results into a larger context, it is useful to know that these
correlations are in the same range as the correlations that have been reported for
SAT scores in predicting first-year grades at American colleges and universities,
as well as for IBDP exam scores (using total exam score) in forecasting first-year
university grades for a global sample of highly selective universities. For the SAT,
Beard & Marini (2015) reported that, for a sample of more than 220,000 students
who first enrolled in an American college or university in 2012, the corrected
multiple correlation of the three SAT section scores with first-year grade-point
average was .54. As this correlation was corrected for restriction of range, it is
important to note that the raw multiple correlation was a much smaller value of .35.
Similarly, unpublished research on IBDP exam scores reported a correlation of .52
for predicting first-year university grades in a sample of five universities that
enrolled large numbers of IBDP graduates.

Thus, based on the findings from this one sample of students who took both the
Caliber assessment and IGCSE exams, we can state that there is strong, initial
evidence for the concurrent validity of the Caliber assessment. Additional concurrent
and predictive validity studies with further samples of students are being planned,
with some already in progress.

Summary

In this report, we have summarized the accumulated evidence to date regarding
the validity of the Caliber assessment. The evidence presented regarding the
validity of Caliber is derived from three different sources: information regarding
the development of the assessment; psychometric analyses of the items and
assessment; and findings from a concurrent validity study. First, in the assessment
development process, steps were taken to ensure the construct and content validity
of Caliber by utilizing a scope and sequence of skills developed in conjunction with
expert consultants. Second, the psychometric analyses showed that the items and
forms functioned as designed and that the results are similar to those observed for
other high-quality, large-scale assessments. Third, as evidence for the concurrent
validity of the assessment, moderately high correlations of Caliber scores with
two criteria derived from IGCSE exam results were found for a sample of 97 students.
These correlations are similar in magnitude to those previously found for SAT scores
and for IBDP exam scores in predicting first-year university grades.


References

Bangert-Drowns, R., & Bankert, E. (1990). Meta-analysis of effects of explicit instruction for
critical thinking. Paper presented at the annual meeting of the American Educational
Research Association, Boston.

Beard, J., & Marini, J. (2015). Validity of the SAT for predicting first-year grades: 2012
validity sample. Statistical Report No. 2015-2. New York: The College Board.

Common Core State Standards Initiative. (2018). Read the Standards - Preparing
America’s Students for College & Career. Retrieved from Common Core State Standards
by Council of Chief State School Officers (CCSSO) and National Governors Association
Center for Best Practices (NGA Center): http://www.corestandards.org/read-the-
standards/.

Conley, D. T. (2007). Redefining College Readiness. Eugene, OR: Bill & Melinda Gates
Foundation.

Conley, D. T. (2007, April). The Challenge of College Readiness. The Prepared Graduate,
pp. 23-29.

Cotton, K. (1991). Close-up #11: Teaching thinking skills. Retrieved from Northwest
Regional Educational Laboratory’s School Improvement Research Series: http://
www.nwrel.org/scpd/sirs/6/cu11.html.

Dweck, C. (2002). Beliefs that make smart people dumb. In R. Sternberg, Why smart
people can be so stupid. New Haven, CT: Yale University Press.

Halpern, D. F. (1998). Teaching critical thinking for transfer across domains:
Dispositions, skills, structure training, and metacognitive monitoring. American
Psychologist, Vol 53, No. 4, 449-455.

Halpern, D. F. (2003). Thought and knowledge: An introduction to critical thinking (4th
ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K., & Lam, W. (2009). Redesign of MCAS tests based on a consideration
of information functions. MCAS Validity Report No. 18; CEA-689. Amherst, MA: University
of Massachusetts, Center for Educational Assessment.

Hart Research Associates. (2013, Spring). It Takes More Than a Major: Employer
Priorities for College Learning and Student Success. Liberal Education, Vol 99, No. 2.

International Baccalaureate Organization. (2018). The IB Learner Profile. Retrieved
from International Baccalaureate Organization: http://www.ibo.org/benefits/learner-
profile/.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.

Kane, M. T. (2013). The argument-based approach to validation. School Psychology
Review, 42, 448-457.


Marina, L. M., & Halpern, D. F. (2010, September 16). Pedagogy for developing critical
thinking in adolescents: Explicit instruction produces greatest gains. Thinking Skills
and Creativity, 6, 1-13.

P21. (2018). Framework for 21st Century Learning. Retrieved from P21 Partnership for
21st Century Learning: http://www.p21.org/our-work/p21-framework.

Simister, C. (2007). How to Teach Thinking and Learning Skills. Thousand Oaks, CA:
SAGE Publications Ltd.

Sinharay, S., Haberman, S., & Jia, H. (2011). Fit of item response theory models: A survey
of data from several operational tests. Research Report No. RR-11-29. Princeton, NJ:
Educational Testing Service.

Suto, I. (2013). 21st Century skills: Ancient, ubiquitous, enigmatic? Cambridge, UK: A
Cambridge Assessment Publication.

Wilson, R. (2009, July). Jobs to Careers. Princeton, NJ: Robert Wood Johnson Foundation.

Young, J. W., Morgan, R., Rybinski, P., Steinberg, J., & Wang, Y. (2013, September).
Assessing the test information function and differential item functioning for TOEFL Junior
Standard. TOEFL Young Learners Research Report TOEFL-YL-01 and ETS Research
Report RR-13-17. Princeton, NJ: Educational Testing Service.


Appendix

This appendix contains the graphs from the psychometric analyses of the test items
and test forms, including the following information:

• Test characteristic curve for Form A and Form B
• Item characteristic curves for all items on Form A and Form B
• Item information functions for all items on Form A and Form B
• Test information function for Form A and Form B

Test Characteristic Curve for Form A

[Figure: test characteristic curve, Expected Score plotted against Theta.]

Figure 1: Test Characteristic Curve for Form A based on a 2-Parameter
Logistic Model with 966 observations.

This graph indicates the relationship between a test-taker’s ability level (on the X-axis)
and expected score (out of 40 items) on Form A. A test-taker’s ability is represented by the
variable Theta, which is reported on a z-score scale. Z-scores are centered with a mean of
zero and a standard deviation of one and represent the standard deviation units that an
individual is above or below the group mean. In a normal distribution of scores, a z-score
of zero is equivalent to being at the 50th percentile; a z-score of +1.00 is equivalent
to being at the 84th percentile; a z-score of +2.00 is equivalent to being at the 98th
percentile; and a z-score of -1.00 is equivalent to being at the 16th percentile.
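These percentile equivalences can be verified with the cumulative distribution function of the standard normal distribution, as in this brief Python sketch:

from scipy.stats import norm

for z in (-1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  ->  {100 * norm.cdf(z):.0f}th percentile")
# Prints the 16th, 50th, 84th, and 98th percentiles, matching the values quoted above.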


Test Characteristic Curve for Form B

[Figure: test characteristic curve, Expected Score plotted against Theta.]

Figure 2: Test Characteristic Curve for Form B based on a 2-Parameter
Logistic Model with 740 observations.

This graph indicates the relationship between a test-taker's ability level (on the
X-axis) and expected score (out of 40 items) on Form B. Theta is reported on the
same z-score scale described for Figure 1.


Item Characteristic Curves for Form A

[Figure: item characteristic curves, Probability plotted against Theta.]

Figure 3: Item characteristic curves for all items on Form A

This graph includes the item characteristic curves for the 40 items in Form A, one
curve for each item. For most of the items (38 out of 40), there is an increasing
relationship between a student’s ability level (as represented by Theta on the X-axis)
and the likelihood of answering that item correctly. The shape of the curves is
determined by the item parameters in the IRT 2-PL model.


Item Characteristic Curves for Form B

[Figure: item characteristic curves, Probability plotted against Theta.]

Figure 4: Item characteristic curves for all items on Form B

This graph includes the item characteristic curves for the 40 items in Form B, one
curve for each item. For most of the items (38 out of 40), there is an increasing
relationship between a student’s ability level (as represented by Theta on the X-axis)
and the likelihood of answering that item correctly. The shape of the curves is
determined by the item parameters in the IRT 2-PL model.


Item Information Functions for Form A

[Figure: item information functions, Information plotted against Theta.]

Figure 5: Item information functions for all items on Form A

This graph includes the item information functions for the 40 items in Form A, one
line for each item. For an item, the degree of information is determined by the
item discrimination value. Higher information levels are generally desirable, but
the higher an item's peak information, the narrower the region of the Theta scale
over which that information is concentrated. Item information contributes to the
test information function, which is the
equivalent of test reliability in an IRT context.


Item Information Functions for Form B
[Figure: item information functions, Information plotted against Theta.]

Figure 6: Item information functions for all items on Form B

This graph includes the item information functions for the 40 items in Form B, one
line for each item. For an item, the degree of information is determined by the
item discrimination value. Higher information levels are generally desirable, but
the higher an item's peak information, the narrower the region of the Theta scale
over which that information is concentrated. Item information contributes to the
test information function, which is the
equivalent of test reliability in an IRT context.


Test Information Function for Form A

[Figure: test information function, Information plotted against Theta.]

Figure 7: Test information function for Form A

This graph indicates the degree of information on Form A about a test-taker’s ability
level at various points on the Theta scale, which represents ability level. The test
information function is the sum of the item information functions for a particular
test form. The test information function is the IRT equivalent of test reliability and
a test information value of 5 is equivalent to a reliability coefficient of .80. Thus, for
Form A, in the Theta range of approximately -2.00 to +0.50 (from about the 2nd to
the 69th percentiles), there is a very high degree of reliability in the scores.


Test Information Function for Form B

[Figure: test information function, Information plotted against Theta.]

Figure 8: Test information function for Form B

This graph indicates the degree of information on Form B about a test-taker’s ability
level at various points on the Theta scale, which represents ability level. The test
information function is the sum of the item information functions for a particular
test form. The test information function is the IRT equivalent of test reliability and
a test information value of 5 is equivalent to a reliability coefficient of .80. Thus, for
Form B, in the Theta range of approximately -1.50 to +0.50 (from about the 7th to
the 69th percentiles), there is a very high degree of reliability in the scores.


