Friday, December 05, 2008

Are Value-Added Effectiveness Measures Good Enough to Use for Compensation Decisions?

There’s a great deal of attention being given to using test scores to measure teacher performance these days, and recent announcements from the Gates Foundation ensure this will be high on the national agenda in coming years. But recent studies show that value-added measures contain a significant amount of error, which raises questions: how can imperfect measures be incorporated into high-stakes decisions like teacher pay? How good is good enough?

Reformers have been waiting for longitudinal data systems to be implemented to provide value-added data to support improvements to compensation and retention decisions. The data is now there in several states, but the quality of the new information may not be as good as many of us had hoped. Just before Thanksgiving, two new studies were released showing the instability of value-added measures of teacher effectiveness over time. The first, by Dan Goldhaber, looks at North Carolina data to see whether pre-tenure teacher effectiveness (measured by the value-added gains of a teacher’s students) is a good predictor of post-tenure effectiveness (here). The study showed that a teacher ranked in the bottom quintile of effectiveness pre-tenure has a 32 percent chance of being in the bottom quintile post-tenure. That is better than random (random would be about 20 percent), but not by much. At the same time, 11 percent of the poor performers pre-tenure (bottom quintile) end up in the top quintile post-tenure. The measure is a little more consistent at identifying top performers: 46 percent of pre-tenure top performers remain top performers post-tenure (see Table 1 for all measures of pre- and post-tenure value-added effectiveness). Goldhaber also looked at using the first three years of data to predict outcomes, and the predictive power does not change much.
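The kind of quintile churn Goldhaber documents is roughly what you would expect if a large share of each year's measured value-added is classroom-level noise layered on top of a stable teacher effect. Here is a minimal simulation of that idea; the noise-to-signal ratio is an illustrative assumption of mine, not a figure estimated in either study:

```python
import random

random.seed(0)

N = 100_000   # simulated teachers
NOISE = 1.5   # assumed classroom-noise SD, relative to a true-effect SD of 1

# Each teacher has a fixed "true" effect; each year's measured value-added
# adds an independent classroom-level noise draw.
true_effect = [random.gauss(0, 1) for _ in range(N)]
year1 = [t + random.gauss(0, NOISE) for t in true_effect]
year2 = [t + random.gauss(0, NOISE) for t in true_effect]

def quintile(scores):
    """Return each teacher's quintile (0 = bottom, 4 = top) for one year."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = [0] * len(scores)
    for rank, i in enumerate(order):
        q[i] = rank * 5 // len(scores)
    return q

q1, q2 = quintile(year1), quintile(year2)

bottom = [i for i in range(N) if q1[i] == 0]
stay_bottom = sum(q2[i] == 0 for i in bottom) / len(bottom)
jump_to_top = sum(q2[i] == 4 for i in bottom) / len(bottom)

print(f"bottom-quintile teachers staying in the bottom quintile: {stay_bottom:.0%}")
print(f"bottom-quintile teachers jumping to the top quintile:    {jump_to_top:.0%}")
```

With this much noise, the simulated year-to-year persistence lands in the same neighborhood as the numbers in the studies: somewhat better than the 20 percent you would get from pure chance, but far from a reliable ranking of individual teachers.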

[Table 1: pre- and post-tenure value-added effectiveness (Table_1a.pdf)]

A second paper, by Tim Sass, shows similar results from California and Florida studies (here). This paper focuses on whether value-added measures of teacher quality are stable enough to use for compensation decisions, and it finds year-to-year instability similar to the Goldhaber study. The lack of stability over time may not be surprising, given that the group of students a teacher gets each year is essentially a random draw. The data cannot capture whether a teacher had a particularly disruptive class in the first year and a better group of students the next, so the randomness of classroom make-up may have a lot to do with these results. The Sass study shows that measurable student characteristics explain some of the differences in value-added effectiveness, but most of the variation across time is unexplained (see Table 2 for complete effectiveness measures).

[Table 2: effectiveness measures across years (figure_2.pdf)]

The part of the Sass study that caused me the greatest concern was how inconsistent these value-added measures are across tests. Students in Florida take two tests annually: a low-stakes norm-referenced test and a high-stakes standards-aligned test. Sass looks at how stable the value-added measures are across these two tests, so for this comparison the draw of students is the same for any given teacher. While these results look a little more stable (43 percent of bottom-quintile teachers remain in the bottom quintile on the other exam), they are not as stable as you would hope. If just switching the exam moves 5 percent of teachers from the bottom of the distribution to the top, teachers will likely question whether the measure reflects true effectiveness.

These papers, and a few others, suggest that value-added measures are not very consistent over time and may not be the panacea some reformers have been hoping for.

How Good is Good Enough?

Now, you would think that the bar for improving teacher compensation and tenure decisions would be pretty low. The current compensation structure is based almost exclusively on a teacher’s years of experience and college credits/advanced degrees. Advanced degrees have been consistently shown to have no impact on teacher effectiveness. As for experience, teachers appear to improve their craft slightly over the first two to three years, but additional experience does not seem to have any impact. Clearly, moving to value-added compensation could reward effective teachers more accurately than the current system does. However, if a compensation system were based even partially on these value-added measures, I think teachers would perceive the outcomes above as too arbitrary.

It also makes me think that principals and mentor teachers could do a better job of predicting effectiveness than last year’s test results. (See Brian Jacob on this question: principals seem to do pretty well at identifying teachers at the top and bottom of the distribution, but their judgments are less predictive than the prior year’s value-added scores (here).) Of course, this is not an either/or choice. Could principals armed with value-added test results do a better job than either measure alone? What about a combination of principal evaluations, mentor-teacher evaluations, and value-added? Are more rigorous evaluation methods, like those of the Teacher Advancement Program, better predictors than the value-added measures? (See the Ed Sector report on teacher evaluation here.) As with all good research, it leads to more research. And with Gates interested in these topics, more research is likely on its way.

4 comments:

Anonymous said...

Wonderful, wonderful post! I'll print out the studies and study them carefully before commenting. It's clear, however, that Manwaring has a great head on his shoulders. He's demonstrating wisdom as well as social science professionalism.

This is just one more reason why I'm growing more optimistic about NCLB II. Finally we are asking the right questions.

Anonymous said...

The predictive values of pre-tenure VA are actually quite high if you consider the fact that about 55% of bottom quintile teachers pre-tenure stay in the bottom TWO quintiles post-tenure and about 75% of top quintile teachers pre-tenure stay in the top TWO quintiles post-tenure. That's pretty good reliability of post-tenure impact that is above or below average. In addition, the fact that about 20% of top quintile teachers pre-tenure end up in the bottom quintile post-tenure is a strong argument against tenure as a concept. Why should teachers have a decision-point early in their careers that guarantees job security indefinitely thereafter as opposed to a rigorous but fair evaluation system that applies throughout their careers (like other professionals)?

Anonymous said...

How good is good enough? Your statement "Advanced degrees have been consistently shown to have no impact on teacher effectiveness" is interesting, and I'd like to see the source of that. This seems to run counter to the research that teachers with more extensive education in their content (advanced degrees, I assume) are more effective at teaching that content.

The feeling I get is that the recommended approach might be to identify early the quality teachers and fire the rest (since they don't improve with time or education).

Regarding 'evaluations': these are uniformly the (biased) opinions/judgments of poorly trained observers (administrators or mentors). I would rather see the observations based on objective data on teacher and student behavior such as gathered by the eCOVE Observation Software. At least it would be consistent and an objective basis for decisions.

Anonymous said...

Regarding Dan's comment --

Taking the outliers (bottom quintile) and saying that the results are "not bad" because *half* of bottom-scorers stayed in the bottom *2* quintiles (nearly half the distribution)....well, that's just spin. It is NOT "high predictive value."