Wednesday, January 23, 2008

Value-Added Comes of Age

About four and a half years ago, I was working on a policy paper focused on a developing and controversial method of measuring teacher effectiveness called "value-added." Created by Dr. Bill Sanders in Tennessee in the mid-1990s, value-added is pretty simple in essence: Using annual standardized test scores, look at the prior achievement history of a given teacher's students and, based on that, statistically predict how well they're likely to do in the current year. Then calculate the ratio of their actual performance to the predicted performance. Teachers with a ratio greater than 1.0 are more effective than average; those with a ratio less than 1.0, less so. This gives teachers credit for making a lot of progress with previously underperforming students, and doesn't give them credit for coasting with previously high-performing students. Whatever external factors impact performance--poverty, family life, etc.--are implicitly controlled for, because the prediction model is based on the prior performance of the students themselves.
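
To make the arithmetic concrete, here's a minimal sketch of that ratio. Everything in it is a hypothetical stand-in: a toy linear fit on invented district data, not the far more sophisticated longitudinal mixed models the actual Sanders system uses.

```python
# A toy illustration of the value-added ratio: predict each student's
# current-year score from their prior-year score via a simple linear
# model, then compare a class's actual average to its predicted average.
# All data here is invented.

def fit_linear(prior, current):
    """Ordinary least squares fit: current ~= a + b * prior."""
    n = len(prior)
    mean_x = sum(prior) / n
    mean_y = sum(current) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, current))
    var = sum((x - mean_x) ** 2 for x in prior)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

def value_added_ratio(a, b, class_prior, class_actual):
    """Ratio of a class's actual mean score to its predicted mean score.
    Above 1.0 suggests above-average effectiveness; below 1.0, the reverse."""
    predicted = [a + b * x for x in class_prior]
    return (sum(class_actual) / len(class_actual)) / (sum(predicted) / len(predicted))

# Fit the prediction model on district-wide data (hypothetical numbers).
district_prior = [52, 60, 71, 80, 45, 66, 73, 58]
district_current = [55, 63, 73, 82, 49, 68, 74, 60]
a, b = fit_linear(district_prior, district_current)

# A class of previously low-performing students who made big gains scores
# above its prediction, so the teacher gets credit for the progress.
print(f"ratio: {value_added_ratio(a, b, [48, 50, 55], [56, 58, 62]):.2f}")
```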

Around the same time, I was reading Michael Lewis's new book, Moneyball, which I had bought the day it was published, being a big fan of both the author and the Bill Jamesian approach to thinking about baseball. And at some point I realized that the underlying premise of Moneyball and the promise of value-added were the same: using empirical data to fundamentally change and improve a labor market. Instead of relying on human observations of characteristics, with all the biases and errors that result, focus on outcomes instead. The paper was released the following year, my first and only real contribution to the teacher quality debate, and while in retrospect I'm kind of embarrassed by the length, I think the ideas hold up pretty well. (The original draft included a whole section explicitly drawing the Moneyball parallel, but it was excised in the editing process, and yes, I'm still bitter.)

So it's interesting to see (I take no credit for this) the New York City school system announcing a plan to start calculating value-added scores for some of its teachers. Just like in Tennessee, the idea is pretty straightforward:

The city’s pilot program uses a statistical analysis to measure students’ previous-year test scores, their numbers of absences and whether they receive special education services or free lunch, as well as class size, among other factors. Based on all those factors, that analysis then sets a “predicted gain” for a teacher’s class, which is measured against students’ actual gains to determine how much a teacher has contributed to students’ growth.
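
Here's a hedged sketch of what a "predicted gain" model along those lines might look like. The factor list mirrors the article; the linear specification and all of the numbers are assumptions of mine, not the city's actual model.

```python
# Regress student gains on prior score, absences, special-ed and
# free-lunch status, and class size, then compare a class's actual mean
# gain to its predicted mean gain. All data is invented.
import numpy as np

# Columns: prior score, absences, special ed (0/1), free lunch (0/1), class size
X = np.array([
    [60, 3, 0, 1, 24],
    [72, 1, 0, 0, 24],
    [55, 8, 1, 1, 28],
    [68, 2, 0, 1, 28],
    [80, 0, 0, 0, 22],
    [49, 6, 1, 1, 22],
    [63, 4, 0, 1, 26],
    [75, 2, 0, 0, 26],
], dtype=float)
gains = np.array([5.0, 3.5, 1.0, 4.0, 2.5, 2.0, 3.0, 2.0])  # invented score gains

# Least-squares fit with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, gains, rcond=None)

# Pretend the first three students share a teacher: the teacher's
# "contribution" is the class's actual mean gain minus its predicted one.
predicted = X1[:3] @ coef
print(f"contribution: {np.mean(gains[:3] - predicted):+.2f} points")
```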

What they're going to do with the data, however, is unclear. One option is to use it in making tenure decisions, which research has shown pretty conclusively to be a good idea: if your value-added scores put you among the very worst teachers in your first few years--the bottom 3%, say--the odds of you ever becoming a good teacher are quite low. As Harvard's Tom Kane notes, "It seems hard to know who is going to be effective in the classroom until they are actually in the classroom."

Or you could simply put the data out there and let market forces work. Deputy School Chancellor Chris Cerf said:

“If the only thing we do is make this data available to every person in the city — every teacher, every parent, every principal, and say do with it what you will — that will have been a powerful step forward. If you know as a parent what’s the deal, I think that whole aspect will change behavior.”

Crucially, this would be good for the best teachers. One of the biggest problems with the teacher labor market is that the top teachers--the ones who are one or more standard deviations above the mean in terms of effectiveness--are criminally underpaid, and have no way of demonstrating their real value to the labor market. Their unions, however, are totally aghast at the prospect. Randi Weingarten, head of the United Federation of Teachers (and rumored to be the next head of the national AFT), said:

“Any real educator can know within five minutes of walking into a classroom if a teacher is effective."

This is the equivalent of the scouts and general managers in Moneyball who were always on the lookout for the "good body," the "five-tool guy," the player who just looked like a major leaguer. As everyone now knows, they were profoundly mistaken, and people like the Oakland A's Billy Beane were able to exploit the market distortions that resulted.

What we're seeing in New York City today is all the major challenges of 21st-century K-12 teacher policy being played out in real time. Value-added methods are still very much in development, subject to the limitations of standardized tests, among other things. But in the long run, there will only be more, better information about student performance, along with newer, faster ways of analyzing that information and drawing increasingly accurate conclusions about how well teachers are doing their jobs. At some point the methodological debates will be resolved and the margins of error whittled down to the satisfaction of reasonable people.

That will have profound implications for the way teachers are hired, paid, trained, assigned--perhaps for the nature of the profession itself. Much current teacher policy is logically derivative of extremely limited or absent information--if we can't accurately measure teacher effectiveness, then pay everyone the same. If we can't know how well teachers will perform when they arrive in the classroom, throw up a lot of regulatory and process barriers to entry in terms of training and certification. The shift from information scarcity to abundance will change that logic, and eventually the policies themselves. New York City is a sign of things to come.

Update: Eduwonkette compares this to the infamous Tuskegee syphilis experiment, and then says she's not actually making the comparison she just made. The privileges of anonymity, I suppose. Sherman Dorn chooses a different horrible disease (botulism) to make his point--which is that the NYC value-added process may or may not have severe methodological flaws. It might, I don't know, I guess we'll find out. But, per above, methodological issues can be worked out, and anyone who thinks the hysterical reaction to the value-added initiative stems from a deep and abiding concern for statistical integrity is willfully not paying attention.

Update II: Dorn updates and points out that the "botulism" reference wasn't hysterical. Fair enough; I was referring to the folks at the UFT, but that wasn't clear. He also says:

The claim that "methodological issues can be worked out" is evidence that Carey hasn't read the writings of professional researchers who point out that growth models are no holy grail.

I've read the research (which the UFT habitually misrepresents) pretty carefully. The people who've looked at the Sanders model have generally concluded that it does what it says it does: identify teacher effects, given appropriate caveats about statistical margins of error. It's true that they say it's no holy grail, which is unsurprising in that there's no such thing as a holy grail. There is not now nor will there ever be perfect information about teacher effectiveness; teaching is far too complicated for that. The only responsible approach to using value-added data--or any other data that purports to gauge teacher effectiveness--is to be cognizant of the amount of likely error and craft policies accordingly. And of course there should also be diligent work to improve the methods themselves, which nobody believes have reached an apex of refinement.
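
To illustrate what "cognizant of the amount of likely error" might mean in practice, here's a small sketch that attaches a confidence interval to each estimated teacher effect and declines to label anyone whose interval straddles zero. The effects, standard errors, and the 95% cutoff are hypothetical placeholders.

```python
# Only classify a teacher when the estimate is precise enough to support
# the classification; otherwise report that the data can't say.

def classify(effect, std_err, z=1.96):
    """Label a teacher effect only when its ~95% CI excludes zero."""
    lo, hi = effect - z * std_err, effect + z * std_err
    if lo > 0:
        return "above average"
    if hi < 0:
        return "below average"
    return "inconclusive: the margin of error is too wide to say"

estimates = {  # teacher -> (estimated effect, standard error)
    "Teacher A": (0.30, 0.10),   # precise and positive
    "Teacher B": (-0.25, 0.08),  # precise and negative
    "Teacher C": (0.15, 0.20),   # too noisy to call either way
}
for name, (effect, se) in estimates.items():
    print(f"{name}: {classify(effect, se)}")
```

A real system would, presumably, take the standard errors from the statistical model itself rather than from a lookup table, but the policy logic is the same: the wider the error, the less weight the estimate should carry.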

Given that, a fair response to the NYC announcement would be something like this: "We support efforts to fairly evaluate teacher effectiveness and recognize that objective evidence of student learning growth must play an important role in that process. We emphasize that the results of evaluation methods based on standardized test scores will be subject to significant degrees of statistical error, which must be appropriately taken into account, particularly when such information is used in the context of employment matters such as tenure and compensation. The best process will combine information from multiple methods, including peer and principal evaluation, and will preserve teachers' professional rights. We look forward to working in concert with management to develop such policies, which should include rewarding the most effective teachers for the vital work they do."

The actual response from the UFT was nothing like that. Rather, it reflects a principled opposition to the use of test scores of any kind in evaluating teachers. Again, this is not an argument about methodology; it goes much deeper than that. Talking about methods in a good faith attempt to reach the goal of better information is one thing, but the holy grail standard is all about making perfection the enemy of the good.
