Feb. 8

Setting the Record Straight

By Stephen Klein, Richard Shavelson and Roger Benjamin

As participants in the debate regarding appropriate strategies for assessing learning in higher education, we agree with some of the statements Trudy Banta made in her Inside Higher Ed op-ed: “A Warning on Measuring Learning Outcomes.” For example, she says that “it is imperative that those of us concerned about assessment in higher education identify standardized methods of assessing student learning that permit institutional comparisons.” We agree. Where we part company is on how that can best be achieved.

Banta recommends two strategies, namely electronic portfolios and measures based in academic disciplines. One of the many problems with the portfolio strategy is that it is anything but standardized and therefore unable to support institutional comparisons. For instance, the specific items in a student’s portfolio and the conditions under which those items were created (including the amount and types of assistance the student received) will no doubt differ across students within and between colleges. In short, the portfolio is not standardized and therefore cannot function as a benchmark for institutional comparisons.

The problem with Banta’s second strategy, discipline specific measures, stems from the vast number of academic majors for which such measures would have to be created, calibrated to each other (so results can be combined across majors), and updated, as well as the wide differences of opinion within and between institutions as to what should be assessed in each academic discipline. Banta is concerned that “if an institution’s ranking is at stake [as a result of its test scores], faculty may narrow the curriculum to focus on test content.” However, that problem is certainly more likely to arise with discipline specific measures than it is with the types of tests that she says should not be used, such as the critical thinking and writing exams employed in the Collegiate Learning Assessment (CLA) program, with which we are affiliated.

Thus, while we agree with Banta that there is a place for discipline specific measures in an overall higher education assessment program, the CLA program continues to focus most of its efforts on the broad competencies that are mentioned in college and university mission statements. These abilities cut across academic disciplines and, unlike the general education exams Banta mentions, the CLA — which she does not mention by name, but is implicitly criticizing — assesses these competencies with realistic open-ended measures that present students with tasks that all college graduates should be able to perform, such as marshalling evidence from different sources to support a recommendation or thesis (see Figure 1 for sample CLA scoring criteria and this page for details).

We suspect that Banta’s criticism of the types of measures used in the CLA program stems from a number of misperceptions about their true characteristics. For example, Banta apparently believes that scores on tests of broad competencies would behave like SAT scores simply because they are moderately correlated with each other. However, the abilities measured by the CLA are quite different from those assessed by the general education tests discussed in Banta’s article, such as the SAT, ACT and the MAPP. Consequently, an SAT prep course would not help a student on the CLA and instruction aimed at improving CLA scores is unlikely to have much impact on SAT or ACT scores.

Moreover, empirical analyses with thousands of students show that the CLA’s measures are sensitive to the effects of instruction. For example, even after holding SAT scores constant, seniors tend to earn significantly higher CLA scores than freshmen. Differences are on the order of 1.0 to 1.5 standard deviation units. These very large effect sizes demonstrate that the CLA is not simply assessing general intellectual ability.

Banta also is concerned about score reporting methods, such as those used by the CLA, that adjust for differences among schools in the entering abilities of their students. In our view, score reporting methods that do not make this adjustment face very difficult (if not insurmountable) interpretative challenges. For example, without an adjustment for input, it would not be feasible to tell schools whether their students are generally doing better, worse, or about the same as would be expected given their entering abilities, or whether the amount of improvement between the freshman and senior years was more, less or about the same as would be expected.

The expected values for these analyses are based on the school’s mean SAT (or ACT) score and the relationship between mean SAT and CLA scores among all of the participating schools. This type of “value added” score reporting focuses on the school’s contribution to improving student learning by controlling for the large differences among colleges in the average ability of their entering students.
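To make the arithmetic concrete, here is a minimal sketch of a residual-based calculation of this kind, assuming an ordinary least-squares line fitted across schools. The school means are invented for illustration and the function names are hypothetical; the procedure the CLA program actually uses is described in Klein et al. (2007) and may differ in its details.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def residuals(xs, ys):
    """Observed minus expected score for each school."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical school-level means (NOT real CLA or SAT data).
mean_sat = [950, 1020, 1100, 1180, 1250, 1330]
mean_cla = [1010, 1090, 1120, 1230, 1260, 1370]

# A positive residual means the school's students scored above what
# their entering ability would predict; a negative one means below.
value_added = residuals(mean_sat, mean_cla)
for sat, res in zip(mean_sat, value_added):
    print(f"mean SAT {sat}: residual {res:+.1f}")
```

In this sketch the expected score comes from the line fitted across all schools, so the residuals sum to zero: a school is only ever judged relative to the other participating schools.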

Banta objects to adjusting for input. She says that “For nearly 50 years measurement scholars have warned against pursuing the blind alley of value added assessment. Our research has demonstrated yet again that the reliability of gain scores and residual scores — the two chief methods of calculating value added — is negligible (i.e., 0.1).”

We suspect the research she is referring to is not applicable to the CLA. For example, the types of measures she employed are quite different from those used in the CLA program. Moreover, much of the research Banta refers to uses individual-level scores, whereas the CLA program uses scores that are much more reliable because they are aggregated up to the program or college level.
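The effect of aggregation can be illustrated with the Spearman-Brown formula for the reliability of a mean. This is a sketch under stated assumptions, not the CLA program's own computation: `icc` is a hypothetical intraclass correlation (the share of score variance that lies between schools), and the value 0.2 is chosen only for illustration.

```python
def reliability_of_mean(icc, n):
    """Spearman-Brown reliability of the mean of n students' scores,
    given the intraclass correlation of a single student's score."""
    return n * icc / (1 + (n - 1) * icc)

# Hypothetical intraclass correlation; not an estimate from CLA data.
icc = 0.2
for n in (1, 10, 100):
    print(f"n = {n:3d}: reliability of school mean = "
          f"{reliability_of_mean(icc, n):.2f}")
```

Under these assumptions a school mean based on more students is more reliable than any single student's score, which is the general point about aggregated scores.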

Nevertheless, it is certainly true that difference scores (and particularly differences between residual scores) are less reliable than are the separate scores from which the differences are computed. But how much less? Is the reliability close to the 0.1 that Banta found with her measures or something else?

It turns out that Banta’s estimates are way off the mark when it comes to the CLA. For example, analyses of CLA data reveal that when the school is the unit of analysis, the reliability of the difference between the freshman and senior mean residual scores (the value added metric of prime interest) is a very healthy 0.63, and the reliabilities of institutional level residual scores for freshmen and seniors are 0.77 and 0.70, respectively. All of these values are conservative estimates (see Klein et al., 2007, for details). Even so, these values are far greater than the 0.1 predicted by Banta, and they are certainly sufficient for the purpose for which CLA results are used, namely obtaining an indication of whether a college’s students (as a group) are scoring substantially (i.e., more than one standard error) higher or lower than what would be expected relative to their entering abilities.
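Why difference scores tend to lose reliability can be seen from the classical test theory formula for the reliability of a difference, sketched below. It assumes uncorrelated measurement errors, and the input values are hypothetical; it is not necessarily the method used by Banta or by Klein et al. (2007).

```python
def difference_reliability(r_xx, r_yy, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical reliability of the difference X - Y.

    r_xx, r_yy: reliabilities of the two scores
    r_xy:       correlation between the two scores
    sd_x, sd_y: their standard deviations
    """
    cov = r_xy * sd_x * sd_y
    true_var = sd_x ** 2 * r_xx + sd_y ** 2 * r_yy - 2 * cov
    total_var = sd_x ** 2 + sd_y ** 2 - 2 * cov
    return true_var / total_var

# Hypothetical reliabilities of 0.8 for both scores.
for r_xy in (0.0, 0.5, 0.7):
    rel = difference_reliability(0.8, 0.8, r_xy)
    print(f"correlation {r_xy:.1f}: difference reliability {rel:.2f}")
```

The more strongly the two scores are correlated, the less reliable their difference, so how reliable a value added score is depends on the particular measures and the level of aggregation.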

Banta concludes her op-ed piece by saying that “standardized tests of generic intellectual skills [which she defines as ‘writing, critical thinking, etc.’] do not provide valid evidence of institutional differences in the quality of education provided to students. Moreover, we see no virtue in attempting to compare institutions, since by design, they are pursuing diverse missions and thus attracting students with different interests, abilities, levels of motivation, and career aspirations.”

Some members of the academy may buy into Banta’s position that no standardized test of any stripe can be used productively to assess important higher education outcomes. However, the legislators who allocate funds to higher education, college administrators, many faculty, college bound students and their parents, the general public, and employers may have a different view. They are likely to conclude that regardless of a student’s academic major, all college graduates, when confronted with the kinds of authentic tasks the CLA program uses, should be able to do the types of things listed in Figure 1. They also are likely to want to know whether the students at a given school are generally making more or less progress in developing these abilities than are other students.

In short, they want some benchmarks to evaluate progress given the abilities of the students coming into an institution. Right now, the CLA is the best (albeit not perfect) source of that information.

Stephen Klein is director of research and Roger Benjamin president of the Council for Aid to Education, and Richard Shavelson is Margaret Jacks Professor of Education at Stanford University.

Comments

Measuring What and For What Purpose?

It is important to consider the ceiling effect when attempting to measure value added. Any assessment, including the CLA, is designed to assess skills and abilities to a certain level. Consequently, it cannot detect student abilities beyond that level. Given that students arrive at college at varying levels of ability, the test can only detect gains to a certain point. It is one thing to “control for input” and it is another thing to validly assess gain. The CLA cannot validly assess value-added at institutions that admit high ability students. Moreover, it cannot validly assess gain for any student capable of performing beyond the level to which the CLA assesses ability.

But the issue of validity has more important dimensions. Even if the CLA or any other assessment could accurately assess gains in abilities of college students, its design and implementation do not readily lend themselves to providing information back to colleges regarding how students are gaining those abilities. Assessments that cannot readily be tied to program and process effectiveness are not cost-effective.

Also with regard to validity, it is important to question whether general learning outcomes, per se, should be the primary focus of external accountability. Perhaps colleges and universities should be held more accountable for assessing learning outcomes with integrity and rigor, and the policy-making and consuming public should focus more on the general consequences of learning (e.g., leading productive, constructive, and engaged lives as citizens of a democracy) rather than the general learning outcomes. In this way, faculty can take proper responsibility for ensuring a high quality teaching and learning environment while leaving some room for students to take responsibility for how they take advantage of that environment.

With regard to specific proficiencies, such as for flying planes, performing surgery, and leading a congregation, I’d prefer that the expert practitioners in those areas determine proficiency rather than public policy-makers.

Vic Borden, AVP, Planning, IR, & Accountability at Indiana University, at 10:25 am EST on February 8, 2007

A couple questions

I took a look at the CLA test criteria, and unless I’m reading them wrong they only seem to involve language and reasoning skills. There are no basic math or scientific competencies assumed. I don’t know what to make of this. Does this mean that we’re not worried about how our students are doing in science and math, or that we respect those disciplines too much to develop non-discipline-specific tests for them, but that there’s no such respect for the humanities or social science? Or is it that we are nervous about disciplines whose knowledge production isn’t easily measured in quantitative terms?

Speaking of the last possibility: I notice that the writing exams are computer graded by something called “E-Rater.” I wonder if our students don’t appreciate the beauty of writing so much when they discover we just want them to write good enough for the machines (in every sense).

RM, at 10:25 am EST on February 8, 2007

Reply and question

Does this mean that the CLA is now in favor of its test being used as a comparative analytical tool to measure differences between schools? Also, does it mean CLA test results should be made public? If so, is this a change in CLA policy? It seems that if the answers are yes, it would be a big change in the CLA position.

robert morse, Director of Data Research at U.S. News, at 10:25 am EST on February 8, 2007

Luckily, I’m in a scientific field (math) where content (and, hopefully, its understanding) are still paramount. As has been stated in comments on previous articles, standardized testing is easier in such cases. There is also less of the value-added issue: students with deficiencies take more courses, take longer to finish, and at the end are supposed to roughly exit with similar content knowledge, although their abilities and creativity may differ quite a bit. There were a few caveats about content not being standard even in the sciences. While true, this can easily be handled with modern technology — a common core can be easily agreed upon, and then the institution can add applicable sections to create custom exams. Scores can be reported according to sections or even down to the problem level. I personally would support such testing, as there is way, way too much sliding by going on these days. The only accountability is how many students get through the courses and how many graduate — and we’re getting way too good at meeting those kinds of standards.

On the other hand, it is hard to trust nebulous general, conceptual, intellectual exams such as those discussed in the article. To use those as precise instruments, as the gold standard of academic accountability, would be foolish indeed.

Bob at State U., at 11:16 am EST on February 8, 2007

It looks like most of CLA is about SAT

If I understand correctly, the other document (Klein et al., 2007, p. 9) says: “When the school is the unit of analysis, the SAT by itself accounts for about 70% of the variance in performance task scores. The SAT plus self-reported effort accounts for another 3% to 7%, depending on the sample (e.g., freshmen vs. seniors).” So, if we know the incoming students’ average SAT score, we don’t need to worry about which school they are attending. If a school requires a higher SAT score, its CLA scores will be higher. The school doesn’t need to make many changes on campus. Just increase the SAT scores... No curricular improvement is needed.

Holy F., at 11:16 am EST on February 8, 2007

We’re talking about specifics

“.. Perhaps colleges and universities should be held more accountable for assessing learning outcomes with integrity and rigor ..”

Having interviewed hundreds of people, I can say that basic skills (math, spelling, grammar, history) are lacking. It indicates lack of preparation from K-12 and college.

I don’t know how much more simply that can be explained. Also, “looking up the answer” on the Internet takes time, and that assumes the answer is found correctly.

At bottom: few people want to do the hard work (reading, memorization, thinking). Constructing portfolios over long periods of time can be helpful, except that there are not unlimited amounts of time to do tasks.

As to this: “.. with regard to specific proficiencies, such as for flying planes, performing surgery .. I’d prefer that the expert practitioners in those areas determine proficiency ..” Yes, expert pilots and expert surgeons handle those. No one suggested otherwise. I can’t think of anything more frightening than having pilots without basic math skills.

C. Bigsby, at 11:46 am EST on February 8, 2007

SAT and CLA

Benjamin and Chun (2003) reported that “a school’s average score on the CLA measures also correlated highly with the school’s average SAT score (r =.90), yet we found statistically significant institutional effects after controlling on SAT” (p. 28). It seems that the easiest way for a school to raise their CLA scores would be to raise the average SAT scores of entering students. Benjamin, R. & Chun, M. (2003). A new field of dreams: The collegiate learning assessment project. Peer Review (5), 26-29.

Jeremy, at 11:55 am EST on February 8, 2007

So Vic, if I read your comments correctly, you would be in favor of federal and state policy-makers holding institutions accountable based on measuring the lifetime outcomes of students? For example, based on what you say, “general consequences of learning (e.g., leading productive, constructive, and engaged lives as citizens of a democracy),” relevant measures might include:

* net changes in personal wealth over time as a measure of productivity;

* net contribution or rate of contribution to an idealized society based on type of occupation or levels of volunteer effort or levels of financial contribution;

* crime rates, in various categories, by graduation cohort;

* measures of occupation change, frequency of change.

These are just really rough ideas to work with, but this is the first time I have heard you suggest institutions should be held accountable for what happens after students leave. If this is what you mean, I think a lot of policy folk would jump at your offer. After all, higher ed has been saying for years that education reduces crime, enhances personal wealth, creates a more engaged citizenry, improves governing... unfortunately, no one can really demonstrate that those things happen.

So, let’s talk....

Tod Massa, at 11:55 am EST on February 8, 2007

Methods and Sample

Based on Klein’s (2007) article (http://www.cae.org/content/pdf/CLAMethodarticle_012607.pdf): There can be as few as 10 students in the freshman or senior sample from a school. How can that small group represent a school? Moreover, without random sampling how can the voluntary group represent the school?

If the students’ performance on the CLA is not counted or evaluated by the school, why would students spend so much time and effort to do their best, or even to put in their normal effort? There is a nice article on this issue: “Association Between Motivation and General Education Standardized Test Scores” from http://arc.missouri.edu/NewsandArticles/AIR_2005_Cole_Bergin.pdf

Holy F., at 12:00 pm EST on February 8, 2007
