Thursday, August 21, 2008

Margins of Error

Kevin Drum wrote a good post a couple of weeks ago about statistical illiteracy in the media, viz. the widespread tendency to characterize election poll results in which one candidate's percentage point lead is equal to or less than the poll's statistical margin of error (MOE) as a "statistical tie" or "dead heat." Kevin notes:

...probability isn't a cutoff, it's a continuum: the bigger the lead, the more likely that someone is ahead and that the result isn't just a polling fluke. So instead of lazily reporting any result within the MOE as a "tie," which is statistically wrong anyway, it would be more informative to just go ahead and tell us how probable it is that a candidate is really ahead. Here's a table that gives you the answer to within a point or two:
[Drum's table, not reproduced here, maps the size of a candidate's lead relative to the MOE to the probability that the candidate is actually ahead.]

As Kevin notes, if Obama is up by three points, and the MOE is three points, it's 84% likely that he'll be the next President of the United States. That's very different from 50% likely, i.e. an actual tie.
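For readers who want to see where the 84% comes from: under a normal approximation, a few lines of Python reproduce it. This is a sketch, assuming a two-candidate race and a 95% MOE on a single candidate's share (the function name is mine, not Drum's):

```python
from math import erf, sqrt

def prob_ahead(lead_pct, moe_pct, z=1.96):
    """Probability the leading candidate is really ahead, given a lead
    and the poll's margin of error (a 95% MOE on one candidate's share).

    In a two-candidate race the standard error of the *lead* (the
    difference between the two shares) is twice the standard error of
    one share, so SE_lead = 2 * (moe / z).
    """
    se_lead = 2 * (moe_pct / z)
    z_score = lead_pct / se_lead
    # Standard normal CDF, written via the error function
    return 0.5 * (1 + erf(z_score / sqrt(2)))

print(round(prob_ahead(3, 3), 2))  # lead equal to MOE -> 0.84
```

A lead equal to the MOE works out to about one standard error on the lead itself, and the normal CDF at roughly 1.0 is about 84% -- which is why "within the margin of error" is nothing like a coin flip.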

This is directly relevant to education because most states use precisely the same statistical techniques when deciding whether a school has made Adequate Yearly Progress (AYP) under No Child Left Behind. If, say, 65% of students need to pass the test in order to make AYP, and only 62% pass, but the state determines an MOE of 4 percentage points, then the school makes AYP because the score was "within the margin of error."

This is silly for two reasons. First, unlike opinion polls, NCLB doesn't test a sample of students. It tests all students. The only way states can even justify using MOEs in the first place is with the strange assertion that the entire population of a school is a sample of some larger universe of imaginary children who could, theoretically, have taken the test. In other words, the message to parents is "Yes, it is true that your children didn't learn very much this year, but we're pretty sure, statistically speaking, that had we instead been teaching another group of children who do not actually exist, they'd have done fine. So there's nothing to worry about."

Second, per Kevin's chart above, the idea that scores that fall below the cutoff but within the margin of error are statistically indistinguishable from actual passing scores is incorrect. This is particularly true given that, while opinion polls almost always use a 95% confidence interval to establish their MOEs, most states use a 99% confidence interval for NCLB, which results in substantially larger margins of error around the passing score. But states do it anyway, because many of them basically see NCLB accountability as a malevolent force emanating from Washington, DC from which schools need to be shielded by any means necessary.
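The same arithmetic can be run on the AYP example above: how likely is it that a school observed at 62% truly meets a 65% bar, if the state built its 4-point MOE with a 99% confidence interval? A sketch under the same normal approximation (the illustrative numbers are from the post; the helper name is mine):

```python
from math import erf, sqrt

def prob_truly_passing(observed, cutoff, moe, z=2.576):
    """P(true pass rate >= cutoff) given an observed pass rate and a
    MOE built with critical value z (2.576 for a 99% confidence
    interval), under a normal approximation. Illustrative only."""
    se = moe / z  # back out the standard error from the reported MOE
    return 0.5 * (1 + erf((observed - cutoff) / (se * sqrt(2))))

print(round(prob_truly_passing(62, 65, 4), 3))  # roughly 0.03
```

In other words, a school that scores 62 against a 65 cutoff with that MOE has on the order of a 3% chance of actually having met the bar -- yet under this scheme it is reported to parents as having made AYP.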

Think of it this way: let's say your child is sick and you bring him to the doctor. After the diagnosis is complete, you and the doctor have the following conversation:

Doctor: My diagnosis is that your son has pneumonia and needs to be hospitalized.

You: That's terrible! Are you sure?

Doctor: Well, there are few absolute certainties in medicine. It's possible that he only has bronchitis. But I'm pretty sure it's pneumonia.

You: How sure?

Doctor: 84% sure.

What would you do? Would you (A) Check your son into the hospital? Or would you (B) Say "Hey, there's a 16 percent chance this whole thing will work itself out with bedrest and chicken soup. Let's go that way."

States implementing NCLB nearly always choose option (B). That's because they see the law as a process for making the lives of educators worse, not what it actually is: a process for making the lives of students better.

6 comments:

Corey Bunje Bower said...

It wouldn't be 84% likely that Obama is the next president, it would be 84% likely that more people in the population currently plan on voting for him.

And the MOE around test results doesn't reflect the fact that not every kid was tested; it reflects the fact that any one test is only a sample of their ability -- if 100 different tests were given, the scores would differ each time -- and recognizes that a student's score would vary each time they took a test based on how they're feeling and what questions happen to be on the test.

Kevin Carey said...

When you ask states themselves why they use confidence intervals, they refer to sampling error, not testing error.

It's true that tests only assess a subset of what students need to know, and test results are subject to measurement error. But the proper way to account for measurement error is in setting cut scores on the test. No state requires students to get 100% correct to pass. So when policymakers decide what score is good enough, that's the place to make allowances for the imprecision of the instrument. If you think students need 80% correct to be proficient, set the cut score at 75% to account for error. But having done so, don't then proceed to tell parents that schools have met AMOs under NCLB when in fact they have not.

Anonymous said...

As a statistician, I refer you to the very important Todd Snider song, "Statisticians Blues". It's Friday, after all:

Live with band: http://www.youtube.com/watch?v=3d_VU2XUP-E&feature=related

Live by himself: http://www.youtube.com/watch?v=BMQdtyot38s&feature=related

Enjoy.

Anonymous said...

I found this post amusing given your past attempts at statistical wizardry. Let's re-tell the story, applying the Carey 51% action rule.

You: That's terrible! Are you sure?

Dr: Well, there are few absolute certainties in medicine...But I'm pretty sure it's pneumonia.

You: How sure?

Dr: 51% sure.

What would you do? Hey, what matters is your kid is sick! If there's a 51% chance he needs to be treated for pneumonia, then we begin treatment--now! It sure beats the status quo.

Another lesson learned from Q&E. When it comes to statistics--kids, don't try this at home.

Corey Bunje Bower said...

Regardless of what they call it, it's still the right idea. Imagine that the cut-off for AYP is that 50% of students in a school are able to pass a test. If you test everybody in the school ten times, you might have 45% pass one time and 55% pass another. But the school is only tested once. So if 45% of students in the school pass, we can't really be sure that the school didn't live up to expectations -- it might have just been a testing day that was at the bottom of the distribution.
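[Corey's test-day variation can be illustrated with a tiny simulation -- a sketch with made-up numbers: a hypothetical 400-student school where every student truly has a 50% chance of passing on any given day, retested on ten independent days:]

```python
import random

random.seed(1)
N = 400   # hypothetical school size
p = 0.50  # each student's true chance of passing on a given day

# Simulate ten independent "testing days" for the same school and
# record the observed schoolwide pass rate (in percent) each time.
rates = [100 * sum(random.random() < p for _ in range(N)) / N
         for _ in range(10)]
print([round(r, 1) for r in rates])
```

On most seeds the simulated pass rates land a few points on either side of 50, which is the point: a 45% showing on one testing day is consistent with a school whose true rate meets the 50% bar.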

Ian said...

If you think students need 80% correct to be proficient, set the cut score at 75% to account for error. But having done so, don't then proceed to tell parents that schools have met AMOs under NCLB when in fact they have not.

So are you saying that the schools have already done this, or that they should do this?

It's true that if you have a 100% sample of test scores you don't need a confidence interval. You have sampled your entire population. That said, no one is interested in the population of test scores - they are interested in what the students know. You have only sampled what it is that students know. Ignoring the systematic error associated with the instrument, we still have a random error associated with the measurement. Different versions of a test may be identical on average, but individuals will perform differently on different versions of a given test. Even with a given test, individuals will perform differently at different times. You might be hungry one day, sleepy another, and in peak mental condition the third.

One could, of course, fudge the standards up front, as you suggested, and build in a "margin of error". But why build junk like that into the system? Why not use something more reliable - like a margin of error? You know - something that you can actually calculate from the data...