Hypothesis Testing

You may not know much about Type 1 and Type 2 errors in statistics. You’ve almost certainly never heard them referred to in the context of carbon offsets and additionality testing. That’s unfortunate, because Type 1 and Type 2 errors (false positives and false negatives, respectively) are an unavoidable feature of almost any testing we do on any topic where certainty is not an option.

Almost any test in almost any field that yields a “yes” or “no” result, whether we’re talking about a jury trial, a pregnancy test, or social welfare program eligibility, will in practice yield false positive and false negative results in addition to correct results. How well a particular test works is defined by how often it yields a correct result relative to the proportions of false positives and false negatives.

Let’s look at the jury trial example. We can see that the likelihood of an innocent person being found guilty involves both a probability distribution and a decision about where to set the “standard of judgment.” Being found guilty when you are innocent is a Type 1 error (a false positive). We can all agree that this is a bad thing.

The other side of the equation, a guilty person who is found not guilty, also involves a probability distribution. This is a Type 2 error, a false negative. We can all agree that too is a bad thing. But is it more or less bad than a Type 1 error? You can see in the figure that the evidentiary threshold as drawn results in roughly equal amounts of Type 1 and Type 2 error. That is not inevitable, however. The evidentiary threshold could be moved further to the left or to the right, resulting in very different magnitudes of Type 1 and Type 2 error. Note that there is no way in statistical hypothesis testing to simultaneously shrink both errors, much less eliminate them. As you squeeze down on Type 1 error, you will magnify Type 2 error, and the reverse is also true. This is an immutable rule of statistical hypothesis testing.
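The threshold trade-off can be made concrete with a small simulation. This is an illustrative sketch, not data from any real trial system: it assumes evidence strength is normally distributed, centered at 0 for innocent defendants and 2 for guilty ones (made-up numbers), and the jury convicts whenever the evidence exceeds a threshold.

```python
import math

def normal_cdf(x, mu, sigma):
    """Probability that a N(mu, sigma) draw falls at or below x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical evidence-strength distributions (illustrative numbers only).
INNOCENT_MEAN, GUILTY_MEAN, SIGMA = 0.0, 2.0, 1.0

def error_rates(threshold):
    """Convict when evidence strength exceeds the threshold."""
    type1 = 1 - normal_cdf(threshold, INNOCENT_MEAN, SIGMA)  # innocent convicted (false positive)
    type2 = normal_cdf(threshold, GUILTY_MEAN, SIGMA)        # guilty acquitted (false negative)
    return type1, type2

for t in (0.5, 1.0, 1.5):
    t1, t2 = error_rates(t)
    print(f"threshold={t:.1f}  Type 1={t1:.3f}  Type 2={t2:.3f}")
```

Sliding the threshold to the right shrinks Type 1 error at the direct expense of Type 2 error; when the two distributions overlap, no threshold makes both errors small at once.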

What does this have to do with carbon offsets and additionality? Well, everything. Approving carbon offsets involves EXACTLY the same statistical testing challenge as establishing an evidentiary threshold for trials.

We all want to see an additionality test that gets additionality right. In reality, there will always be false positives, in other words, projects allowed into the offset pool as “additional” when they really aren’t. There will almost always be false negatives, in other words, projects excluded from the offset pool as non-additional despite actually being additional. The challenge is to figure out the right balance of those two errors given policy objectives.

Additionality testing will produce both Type 1 and Type 2 errors, and it is often very difficult to determine the size of those errors. What’s key to recognize is one absolute rule: as you reduce the proportion of “fake offsets” through stricter additionality testing, you will inevitably increase the number of “real but excluded offsets.” In other words, if you’re most concerned about the environmental integrity of offsets, and you design your additionality testing to minimize the number of non-additional reductions allowed into the market, be prepared for a lot of false negatives (reductions that really are additional but fail your screening criteria). A lot of false negatives will mean a more expensive offset program, since more and more legitimate reductions will be excluded.
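The same dynamic can be sketched for offset screening. Everything here is hypothetical: assume each project receives a noisy “additionality score” from the registry’s test, with truly additional projects scoring higher on average, and any project above a cutoff is approved. The distributions, pool sizes, and cutoffs are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical project pool: overlapping score distributions mean the
# screen can never separate the two groups perfectly.
def make_projects(n_additional=1000, n_non_additional=1000):
    pool = [(random.gauss(1.0, 1.0), True) for _ in range(n_additional)]
    pool += [(random.gauss(-1.0, 1.0), False) for _ in range(n_non_additional)]
    return pool

def screen(projects, cutoff):
    """Approve projects scoring at or above the cutoff; count both error types."""
    false_pos = sum(1 for s, add in projects if s >= cutoff and not add)  # fake offsets approved
    false_neg = sum(1 for s, add in projects if s < cutoff and add)       # real reductions excluded
    return false_pos, false_neg

projects = make_projects()
for cutoff in (-0.5, 0.0, 0.5, 1.0):
    fp, fn = screen(projects, cutoff)
    print(f"cutoff={cutoff:+.1f}  fake offsets approved={fp:4d}  real offsets excluded={fn:4d}")
```

Raising the cutoff (a stricter additionality test) drives the count of approved non-additional projects toward zero, but only by excluding more and more genuinely additional projects, which is exactly the cost trade-off described above.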

Moreover, what matters is not just the absolute numbers of potentially additional and non-additional offsets: it’s also the relative magnitude of supply and demand for those offsets. Without additionality testing, an offset market dominated by non-additional reductions is pretty much inevitable.