Evaluating Your Online Learning Program (Part 1)

Summary: Evaluation is often one of the weakest areas of any eLearning program. This article (the first of two parts) talks about ways to evaluate online programs. Here we examine evaluation—what it is, types of evaluations, measures, indicators, and instruments.

What You Need To Know About Evaluating Your Online Learning Program: Part 1

Evaluation is often one of the weakest areas of any eLearning program. There may be no standards against which to evaluate. Outcomes may not be defined. The purpose may not be determined, and questions about who benefits (teacher-learner, school, or student) may not be developed. Furthermore, the program may have been designed with no stated goals or objectives against which it can be measured, or the evaluation may have been designed after the program began. The capacity and resources to conduct an evaluation may be limited or nonexistent—and worse, high attrition rates may render any evaluation unreliable [1], invalid [2] and generally meaningless. Combine these issues within the nontraditional setting of eLearning, and the design and implementation of rigorous and meaningful evaluations are often severely handicapped.

Evaluations of any educational technology program often confront a number of methodological problems, including the need for measures other than standardized achievement tests, differences among students in the opportunity to learn, and differences in starting points and program implementation.

Many eLearning programs circumvent these issues by simply failing to evaluate their online offerings or by doing so in the most perfunctory fashion. Others, if funded by government or donor agencies (as in my work), may need to concern themselves only with monitoring and evaluation, which traditionally looks at inputs (number of learners trained) versus outcomes (number of learners who implement a strategy) or impact (how learner achievement has changed as a result of learners’ professional development).

The Importance Of Evaluation

Yet continual monitoring and rigorous, well-designed evaluations are critical to the success of any eLearning program (iNACOL, 2008). Well-designed and implemented evaluations inform eLearning policymakers, planners, funders, and implementers about the strengths and weaknesses of programs and indicate what assumptions, inputs, and activities should change and how.

Evaluation results help to improve programs and determine which ones should be maintained, changed, or closed. Without well-designed and rigorous evaluations, we cannot make claims about the effectiveness or ineffectiveness of a program. Without evaluation, we have no idea whether an eLearning program really works. And if a program does fail, a good evaluation can help planners and designers understand and learn from the failure [3].

Because evaluation is so critical to the success of eLearning programs, this article and the one that follows next month suggest several techniques for evaluating the effectiveness of any eLearning system. This article focuses on traditional "educational" online learning programs, such as university-level online courses or online teacher professional development programs, rather than short courses, informal training, or corporate training.

The Importance Of A Good Evaluation Design

Designing a good evaluation is critical. The design here refers to the set of specifications about which groups to study, how many units are in a group, by what means units are selected, at what intervals they are studied, and the kinds of comparisons that are planned (Weiss, 1998: 87). Well-designed evaluations with well-designed instruments and valid analyses of data generally provide valid and reliable results. Poorly designed evaluations do not.

Like instructional design, a good evaluation design begins with the end in mind. Backward mapping evaluation is a three-step evaluation design technique in which each step is integrated with and builds on the other two steps (Rossi, Lipsey, & Freeman, 2004: 91).

  • Step 1 (who?). This begins with audience and purpose: Who will use this information and for what purpose—not who is interested in the findings, but who will actually use them? Once this has been determined, evaluators and eLearning providers can move on to the second step.
  • Step 2 (what?). This focuses on question development: What will this audience want to know exactly? Once evaluation questions have been determined, they should be ranked in order of importance.
  • Step 3 (how?). Once the audience, purpose, and evaluation questions have been developed, online program stakeholders can determine what information is required to answer these questions, the source of such information (interviews, observations), the method for collecting information, and a plan for collecting and analyzing these data.
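The three steps above can be captured in a simple planning record. The following is only a minimal sketch: the class names, field names, and sample questions are hypothetical, invented for illustration, and "enrollment records" is an assumed data source that the steps above do not mention.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationQuestion:
    text: str
    priority: int  # Step 2: 1 = most important
    sources: list[str] = field(default_factory=list)  # Step 3: where the data come from


@dataclass
class EvaluationPlan:
    audience: str  # Step 1: who will actually use the findings
    purpose: str   # Step 1: what they will use them for
    questions: list[EvaluationQuestion] = field(default_factory=list)

    def ranked_questions(self) -> list[EvaluationQuestion]:
        """Return the questions in order of importance (Step 2)."""
        return sorted(self.questions, key=lambda q: q.priority)


# Hypothetical plan, for illustration only.
plan = EvaluationPlan(
    audience="Program funders",
    purpose="Decide whether to renew the online course",
    questions=[
        EvaluationQuestion("How did teachers change their practice?", 2,
                           ["observations", "interviews"]),
        EvaluationQuestion("What share of learners completed the course?", 1,
                           ["enrollment records"]),
    ],
)

for q in plan.ranked_questions():
    print(q.priority, q.text, "<-", ", ".join(q.sources))
```

Writing the plan down in this order makes it harder to skip a step: a question cannot be added without a priority, and a data source is always attached to the question it is meant to answer.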

As mentioned above, evaluations often begin with a question: What are we doing? How are we doing? Why are we doing what we are doing? How are we accomplishing a task? The type of evaluation essentially depends on the type(s) of evaluation question(s) asked. Straightforward “what” questions typically lend themselves to quantitative designs. Process-based questions such as “how” and “why” lend themselves to qualitative designs. Questions that ask for both types of information lend themselves to mixed-method designs.

Quantitative Evaluations

Quantitative evaluation designs are often concerned with one fundamental question: Are the resulting changes and outcomes, or lack thereof, the result of the particular intervention? In other words, were the outcomes due to the program, or would they have occurred anyway because of a number of other factors (Weiss, 1998)? One way to answer this question, that is, to eliminate any rival or confounding explanations[4], is to create an experimental design. Experimental designs often, though not always, use random or probabilistic sampling. For instance, when evaluating the efficacy of an online professional development program, an evaluator may randomly select one group of learners to participate in an online program. This is the treatment group. Another group of learners, the control group, may be randomly selected to participate in another kind of professional development. The results of each type of professional development are then compared. By randomly selecting both groups and comparing learners who receive the intervention with learners who do not, an experimental evaluation can answer with reasonable certainty whether the effects are the result of the program or due to some other explanation. This probabilistic sampling can help evaluators generalize and transfer findings from a small, randomly chosen sample to a whole population.
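To make the treatment and control logic concrete, here is a minimal sketch of random assignment followed by a comparison of group means. The function names, the learner identifiers, and the scores are all hypothetical; the scores are generated at random purely so the example runs, and a real evaluation would follow the comparison with a proper significance test.

```python
import random
import statistics


def randomly_assign(learners, seed=0):
    """Shuffle the learners and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(learners)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_difference(scores, treatment, control):
    """Mean outcome of the treatment group minus that of the control group."""
    treatment_mean = statistics.mean(scores[name] for name in treatment)
    control_mean = statistics.mean(scores[name] for name in control)
    return treatment_mean - control_mean


learners = [f"learner_{i}" for i in range(20)]
treatment, control = randomly_assign(learners, seed=42)

# Made-up post-program scores, for illustration only.
score_rng = random.Random(1)
scores = {name: score_rng.randint(50, 90) for name in learners}

print(len(treatment), "treatment learners;", len(control), "control learners")
print("Difference in mean score:", round(mean_difference(scores, treatment, control), 2))
```

Because chance alone decides who lands in which group, any systematic difference between the groups' outcomes is less likely to stem from who the learners were to begin with.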

In a quasi-experimental design, treatment learners are compared with control learners who match up with the treatment learners in all major indicators except the treatment. However, quasi-experimental designs cannot rule out rival explanations. Like experimental designs, quasi-experimental designs often, though not always, use probabilistic sampling.
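One simple way to build such a matched comparison group is nearest-neighbor matching. The sketch below is an illustration under strong simplifying assumptions: it matches on a single hypothetical indicator (years of teaching experience), whereas a real quasi-experimental design would match on all major indicators. The function name and the data are invented for the example.

```python
def match_controls(treated, pool, key):
    """Pair each treated learner with the closest unused learner from the pool."""
    available = list(pool)
    pairs = []
    for learner in treated:
        if not available:
            break  # no comparison learners left to match
        best = min(available, key=lambda c: abs(key(c) - key(learner)))
        available.remove(best)  # match without replacement
        pairs.append((learner, best))
    return pairs


# Hypothetical learners, for illustration only.
treated = [{"id": "T1", "years": 3}, {"id": "T2", "years": 12}]
pool = [
    {"id": "C1", "years": 11},
    {"id": "C2", "years": 4},
    {"id": "C3", "years": 20},
]

pairs = match_controls(treated, pool, key=lambda person: person["years"])
for t, c in pairs:
    print(t["id"], "matched with", c["id"])
```

Matching makes the two groups look alike on the indicators that were measured, but not on those that were not, which is one reason rival explanations remain possible.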

Qualitative Evaluations

In contrast, evaluation questions that focus on “why?” or “how?” involve a qualitative design. Qualitative evaluations typically seek to answer the questions, “How did ‘it’ happen?” or “Why did ‘it’ happen?” Qualitative evaluations are narrative, descriptive, and interpretive, focusing on in-depth analysis of an innovation through the use of a purposive sample. In contrast to random or probabilistic samples, purposive or purposeful samples are chosen because they promise to provide rich information that can inform the evaluation. Such samples, or cases, can be selected because they are representative of the group, are atypical of the group (outliers), or represent the maximum variation within the group. Every other component of the evaluation (methods, sampling, instruments, measures, analysis) flows from this basic design. However, unlike results from an experimental evaluation, results from a qualitative evaluation are not generalizable.
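The three selection strategies for purposive samples can be sketched as follows. This is a rough illustration, not a standard procedure: it assumes that each case can be summarized by a single hypothetical score and uses distance from the median to decide what counts as typical or atypical. In practice, evaluators choose cases on richer, often non-numeric grounds.

```python
import statistics


def typical_cases(scores, n=2):
    """Cases closest to the median, taken as representative of the group."""
    median = statistics.median(scores.values())
    return sorted(scores, key=lambda case: abs(scores[case] - median))[:n]


def outlier_cases(scores, n=2):
    """Cases farthest from the median, taken as atypical of the group."""
    median = statistics.median(scores.values())
    return sorted(scores, key=lambda case: abs(scores[case] - median), reverse=True)[:n]


def maximum_variation_cases(scores, n=3):
    """Cases spread across the range, from the lowest score to the highest."""
    ordered = sorted(scores, key=scores.get)
    if n >= len(ordered):
        return ordered
    if n == 1:
        return [ordered[len(ordered) // 2]]
    step = (len(ordered) - 1) / (n - 1)
    return [ordered[round(i * step)] for i in range(n)]


# Hypothetical end-of-course scores, for illustration only.
scores = {"A": 52, "B": 61, "C": 64, "D": 66, "E": 70, "F": 95}

print("Typical:", typical_cases(scores))
print("Outliers:", outlier_cases(scores))
print("Maximum variation:", maximum_variation_cases(scores))
```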

One common approach to, and output of, a qualitative evaluation is a case study: a rich descriptive analysis of a particular person, set of persons, or program; these elements are often known as “key informants.” Case studies attempt to understand how and why the program (online or otherwise) resulted in change, impact, or a set of outcomes. They attempt to do this by mining the experiences of these key informants.

Mixed-Method Evaluations

Mixed-method evaluations combine the designs of both quantitative and qualitative evaluations. They combine the “what” and numerical focus of a quantitative evaluation with the “how,” “why,” and narrative focus of a qualitative evaluation.

There is no one best evaluation method. The type of evaluation design used—quantitative, qualitative, or mixed-method—again depends on what the audience for the evaluation will want to know. It will depend on understanding how, why, when, and where to generalize findings, as well as on available resources and data analysis capacity. Analyzing quantitative data, especially for large datasets, demands statistical analysis software and a deep knowledge of statistics and quantitative methodologies. Analyzing qualitative data involves an understanding of inductive and/or theoretical (deductive) coding, pattern matching, and the use of qualitative analysis software.


Measures

All evaluations, whatever their design, need good measures. A measure is a source of information or data that can be expressed quantitatively to characterize a particular phenomenon. Performance measures may address the type or level of program activities conducted (process), the direct products and services delivered by a program (outputs), and/or the results of those products and services (outcomes). They may include customized program- or project-specific assessments. Measures may be poorly understood and therefore incorrectly analyzed, thus resulting in meaningless or misleading evaluation data.


Indicators

All evaluations, regardless of type, also require indicators. An indicator is a piece of information that communicates a certain state, trend, or progress to an audience. It defines the data to be collected to measure progress so that the actual results achieved can be compared with the originally designed results. Kozma and Wagner (2006: 21) note the importance of developing core indicators in evaluations. Core indicators are context-specific ways to understand inputs and outcomes of a program or project that we may or may not be able to observe directly, such as the following:

  • Input indicators—for example, the type of equipment and/or software and/or organizational design features of an eLearning program
  • Outcome indicators—for example, student and teacher impact (affective, cognitive, and behavioral)
  • Demographic and socioeconomic indicators—for example, enrollment rates, literacy, gender, etc.
  • Cost indicators—for example, fixed and recurrent costs
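Since an indicator exists so that actual results can be compared with planned ones, the four categories above can be recorded side by side with their targets. The sketch below is illustrative only: the class, the indicator names, and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    category: str  # "input", "outcome", "demographic", or "cost"
    name: str
    target: float  # the originally planned result
    actual: float  # the result actually achieved

    def gap(self) -> float:
        """Actual result minus planned result."""
        return self.actual - self.target


# Made-up values, for illustration only.
indicators = [
    Indicator("input", "computers per site", 20, 18),
    Indicator("outcome", "learners applying a new strategy (%)", 60, 45),
    Indicator("demographic", "female enrollment (%)", 50, 55),
    Indicator("cost", "recurrent cost per learner (USD)", 120, 135),
]

for ind in indicators:
    print(f"{ind.category:<12}{ind.name:<40}{ind.gap():+.1f}")
```

How a gap should be read depends on the indicator: a positive gap is good news for enrollment but bad news for cost.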


Instruments

Every evaluation is fraught with some level of error, and every instrument has its own set of intrinsic weaknesses. Therefore, all evaluations should use multiple types of instruments (surveys, focus groups, interviews, observations, and questionnaires) in order to capture and analyze data from as many different angles as possible to triangulate the data most effectively. This triangulation is critical for arriving at inferences or interpretations that are as valid and accurate as possible.
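A simple way to keep track of triangulation is to record which instruments support each finding and to flag findings that rest on a single instrument. This is a minimal sketch; the findings, the function name, and the threshold of two instruments are assumptions made for illustration, not an established rule.

```python
def triangulation_status(evidence, minimum=2):
    """Map each finding to True if at least `minimum` instruments support it."""
    return {finding: len(instruments) >= minimum
            for finding, instruments in evidence.items()}


# Hypothetical findings and the instruments that support them.
evidence = {
    "Learners found the discussion forum useful": {"survey", "interview", "focus group"},
    "Learners applied new strategies in class": {"survey"},
}

for finding, supported in triangulation_status(evidence).items():
    label = "triangulated" if supported else "single source, treat with caution"
    print(f"{finding}: {label}")
```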

This article has provided general background information on evaluations. We'll continue this topic of evaluating online programs next month with three specific evaluation frameworks.


For all references in this article, see:

Burns, M. (2011, November). Evaluating distance programs, pp. 252-269. In Distance Education for Teacher Training: Modes, Models and Methods.


  1. An evaluation instrument is considered reliable if the instrument can be used repeatedly with different groups of similar subjects and yield consistent results.
  2. Validity refers to the accuracy of an assessment, that is, whether or not it measures what it is supposed to measure. There are generally (at least) three types of validity. One is content validity: the extent to which the content of the test matches the instructional objectives. The second is construct validity: the extent to which a test, instrument, or assessment corresponds to other variables, as predicted by some rationale or theory. A third is criterion validity: the extent to which scores on the test are in agreement with some externally established criterion/criteria. Evaluators also talk about concurrent validity, predictive validity, and face validity.
  3. Increasingly, programs and projects are making their failures public in an effort to learn from, and help others learn from, such failures.
  4. Rival explanations may include maturation (for example, a student just gets better because she becomes more experienced), attendance at another class, or contact with a mentor. Without eliminating such rival explanations, interpretations and explanations become confounded, that is, they are attributed to one cause when in fact they may be the result of several causes.