Models Of Online Assessment, Part 1: Computer Adaptive Testing

Computer Adaptive Testing (CAT): What It Is, Competencies, And Purpose

There are numerous advantages to online or technology-based assessments: They are efficient. They can be engaging for the learner (depending on their design). They can gather enough data to draw reliable conclusions without creating undue burdens on instructors, undue expense for an eLearning provider, and unduly delayed results for learners (for example, using Artificial Intelligence to assess the essay portion of state assessments for students).

Online or technology-based assessments can measure complex thinking and lower the cost differential because technology-based assessments take less time to score, store and share. They can enable quick turnaround of assessment data to the instructor and learners, allowing instructors to assess learner performance at a much more granular, detailed level. Finally, they can allow more reliable scoring and valid data interpretation (Burns et al., 2010; Burns, Christ, Kovaleski, Shapiro, & Ysseldyke, 2008: 18).

This article continues our assessment theme from April's article and focuses on technology-based assessments as part of a large-scale online course or program for students or teachers (versus short eLearning courses). This article focuses on Computer Adaptive Tests (CATs).

What is Computer Adaptive Testing?

Essentially, CATs are technology-based or online testing systems created by content specialists, psychometricians, programmers, and web designers. They contain large test item banks and, when designed well, can attain high discriminating power. They do this through constant calibration, or “adaptation.” First, the system sets an ability estimate (at a high confidence level, such as 95%), a threshold of where the examinee should be in terms of content knowledge (for a state assessment or a particular set of content in an online course). All measurement takes place against that estimate. The questions, pulled from the item bank, are then calibrated based on the examinee’s initial answers (either higher or lower than the ability estimate) to continuously assess the examinee’s true ability. The testing system repeatedly re-estimates the examinee’s ability and adjusts the difficulty of subsequent questions based on the examinee’s answers. Questions are continuously recalibrated, or “adapted,” for the examinee based on their previous answers. This leads to greater precision and less error in testing.
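The adapt-and-re-estimate loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any production CAT engine works: the item bank, the simulated examinee, and the shrinking step size are all invented for the example (real systems estimate ability with item response theory rather than a simple staircase).

```python
def run_cat(item_bank, answer_fn, n_items=6):
    """Toy staircase sketch of a CAT loop: pick the unused item whose
    difficulty is closest to the current ability estimate, then nudge the
    estimate up or down based on the answer, shrinking the step each round."""
    estimate, step = 0.0, 1.0            # start mid-scale
    used = set()
    for _ in range(n_items):
        # choose the most informative remaining item (difficulty ~ estimate)
        item = min((i for i in item_bank if i["id"] not in used),
                   key=lambda i: abs(i["difficulty"] - estimate))
        used.add(item["id"])
        correct = answer_fn(item)
        estimate += step if correct else -step   # re-estimate ability
        step *= 0.7                              # converge: smaller adjustments
    return estimate

# Invented bank and a deterministic simulated examinee of "true" ability 1.2
bank = [{"id": k, "difficulty": d} for k, d in
        enumerate([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5])]
examinee = lambda item: item["difficulty"] < 1.2
print(round(run_cat(bank, examinee), 2))   # converges near 1.2
```

After only six items the staircase has already narrowed in on the simulated examinee's ability, which is the efficiency argument made above.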

Computer Adaptive Tests: Why?

Computer adaptive tests (CATs) are advantageous for two reasons. First, they can collect sufficient data for highly reliable results in a relatively short time by using the power of technology. Second, and more important, they have high discriminating power, meaning they can distinguish between high- and low-performing examinees. In theory, examinees who answer correctly are higher performers than those who do not, and their answers reflect expertise rather than random guessing. Assessments with low discriminating power distinguish between high and low performers poorly or not at all; with such assessments, examinees may arrive at a correct answer through guessing.
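Discriminating power has a simple classical measure worth knowing: the upper-lower discrimination index, the difference in proportion correct on an item between the top and bottom scorer groups. A sketch follows; the 27% group fraction is a common convention, and the response data is invented for the example:

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Classic upper-lower discrimination index: proportion correct among the
    top scorers minus proportion correct among the bottom scorers (0/1 item
    scores). Values near +1 mean the item separates high and low performers
    well; values near 0 mean guessing could explain the responses."""
    n = max(1, int(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    p_high = sum(item_scores[i] for i in high) / n
    p_low = sum(item_scores[i] for i in low) / n
    return p_high - p_low

# 10 examinees: this item is answered correctly mostly by the top scorers
item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
totals = [10, 12, 15, 20, 22, 30, 34, 38, 41, 45]
print(discrimination_index(item, totals))   # a highly discriminating item
```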

As discussed in the previous section, CATs use item response theory, which measures the difficulty [1] of each test item as well as the probability that the learner will get it right. Because the testing system matches the difficulty of each question to the student’s previous performance, no two students, even if seated next to one another and assessed on the same content, would take the exact same test, though they would be assessed on the same constructs and their scores would remain comparable. Thus, CATs can eliminate redundant questions and questions that are too easy or too difficult, zero in on a student’s performance range, and reach reliable conclusions in a very short time (Burns et al., 2008: 18).
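A minimal illustration of the item response theory idea: under the one-parameter logistic (Rasch) model, the probability of a correct answer depends only on the gap between the examinee's ability and the item's difficulty. The numbers below are invented for the example:

```python
import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter logistic) model: the probability that an examinee
    of a given ability answers an item of a given difficulty correctly.
    Both values live on the same logit scale."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the item is maximally informative: p = 0.5
print(round(p_correct(1.0, 1.0), 2))   # 0.5
# An easy item for a strong examinee is answered correctly most of the time
print(round(p_correct(2.0, 0.0), 2))   # 0.88
```

This is why a CAT keeps selecting items near the current ability estimate: items far above or below it yield nearly certain outcomes and carry little information.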

Computer Adaptive Tests: How?

CATs can adjust or “adapt” the level of difficulty of a test based on a student’s responses and are thus more efficient, targeted, and precise than “regular” tests. However, they are extremely complex, time-intensive and resource-intensive to create.

Competencies And Test Items

First, they demand a set of competencies across subject areas with several indicators per competency. Competencies are often categorized as Level 1 (factual knowledge), Level 2 (understanding, logic, etc.), and Level 3 (knowledge transfer, application, and modeling).

Lower-level competencies (knowledge) are easier to measure than higher-level competencies (the ability to apply knowledge). As such, the latter often involve constructed-response (open-ended) items, versus the selected-response (closed) items used to measure lower-level knowledge. Higher-level items often involve "item cloning": test items designed to measure the exact same construct but with random surface elements (names, locations, etc.) substituted. Item cloning generates item pools so examiners have more questions, resulting in more cost-effective implementation of computerized adaptive testing.
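Item cloning can be sketched as simple template substitution. The item template, names, and numbers below are all hypothetical; the point is that every clone measures the same construct (a one-step percent calculation) while the surface details vary:

```python
import random
import string

# Hypothetical item template: the construct (one-step percent calculation)
# stays fixed; surface elements (name, store, numbers) vary per clone.
TEMPLATE = string.Template(
    "$name buys a jacket at $store for $$${price}. "
    "A ${pct}% discount is applied. What is the final price?")

def clone_item(rng):
    price = rng.choice([40, 50, 60, 80])
    pct = rng.choice([10, 20, 25])
    stem = TEMPLATE.substitute(
        name=rng.choice(["Ana", "Ben", "Chiara"]),
        store=rng.choice(["Rivertown Outfitters", "Main St. Clothing"]),
        price=price, pct=pct)
    return stem, price * (100 - pct) / 100   # stem plus its scored answer

for stem, answer in (clone_item(random.Random(s)) for s in (1, 2)):
    print(stem, "->", answer)
```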

It is hard to develop test items in general, and harder still to develop every type of item needed for each indicator. For example, some indicators cannot be tested in written form; others can be tested only through purely mathematical tasks disconnected from real-life situations. Some items have more than one correct answer, which makes scoring difficult and may require the use of rubrics.

Purpose Of The Test

Next, online course providers need to know the purpose of their test: diagnostic, formative, or summative. Diagnostic tests demand multiple items with one operation each (diagnosing gaps and mistakes and addressing needs). They may require simple tasks that give direct information about whether or not a student is able to do something. In contrast, summative tests are far more complex and require multiple items to measure each competency.

Type Of Test

Third, online programs need to know what type of test (fill-in-the-blank versus multiple choice) they will use. This goes back to the issue of test items: multiple-choice tests need more test items, while constructed-response tests need fewer. Highly precise tests need more test items than rough estimates of achievement, and if the range of skills to be measured is broad, such tests also need a larger pool of items, with increasingly difficult items for each domain.

Final Word

You've made it! Not the sexiest reading, but CATs are good to know about when it comes to online testing. Their rigor and complexity are what make computer adaptive testing so powerful, but also costly and hard to design well, which is why we see few of them outside large-scale online programs.

In the next series of articles, we look at other models of online, and technology-based, assessments.

For a complete set of references for this article, go here.

Footnote: 

[1] “Difficulty” indicates the level of challenge in answering a particular test item correctly. It is determined by the percentage of students likely to get the test item correct. Test items with values near 100 indicate easy test items, while test items with values near 0 indicate difficult ones. Most test items fall into the 60–70 percent range.
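In code, this classical difficulty measure (often called the p-value) is just the percentage of examinees who answered the item correctly; the response pattern below is invented for the example:

```python
def item_difficulty(responses):
    """Classical item difficulty (p-value): the percentage of examinees who
    answered the item correctly (1 = correct, 0 = incorrect).
    Higher values mean easier items, per the footnote above."""
    return 100 * sum(responses) / len(responses)

# 7 of 10 examinees answered correctly -> a fairly easy item
print(item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))   # prints 70.0
```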
