Tuesday, May 28, 2019

Oh No, This Wasn't Just An "Average Polling Error"

As previously noted, Australian opinion polling has just experienced its first clear predictive failure, in pick-the-winner terms, in a federal election since 1980.  Every campaign poll by four different pollsters (one of them polling under two different brands) had the Labor Opposition ahead of the Liberal-National Coalition (as it had been for the entire term), and yet the Coalition has won an outright majority.  Moreover, polls in the final weeks were extremely clustered, with 17 consecutive polls (plus an exit poll) landing in the 51% to 52% two-party preferred range after rounding, a result that is vanishingly unlikely by chance.  No pollster has yet made any remotely useful contribution to explaining this clustering - those who have even commented have generally said they didn't do it and it must have been somebody else.
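To give a sense of just how implausible that clustering is under independent sampling, here is a minimal back-of-envelope sketch. The assumed sample size and assumed "true" 2PP are my own round numbers, not anything reported by the pollsters:

```python
# Rough sketch (not from the pollsters): chance that 17 independent polls all round
# to 51 or 52 under pure sampling error. Assumptions are mine: effective sample size
# of about 1000 per poll and a true Labor 2PP of 51.5 during the period.
from math import sqrt, erf

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = 1000                                  # assumed effective sample size per poll
p = 0.515                                 # assumed true Labor 2PP
sd = sqrt(p * (1 - p) / n) * 100          # sampling SD of one poll, in points (~1.6)

# Probability one poll rounds to 51 or 52, i.e. lands in [50.5, 52.5)
prob_in_band = normal_cdf((52.5 - p * 100) / sd) - normal_cdf((50.5 - p * 100) / sd)
prob_17 = prob_in_band ** 17

print(f"P(single poll rounds to 51 or 52) ≈ {prob_in_band:.2f}")   # about 0.47
print(f"P(17 independent polls all do so) ≈ {prob_17:.1e}")        # a few in a million
```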

The general reaction has been dismay at this unusual level of pollster error in a nation where national polls have a proud record of accuracy.  The Ninefax press, as I call them (SMH/The Age), have even announced that they now have no contract with their pollster, Ipsos, or with any other pollster.  (This may just be for show, since in the past Fairfax often took long breaks in polling after elections.)  News Corp is, for now, standing by Newspoll.  The Association of Market and Social Research Organisations has announced a review, although this may be of little value as its only member who is involved is Ipsos.


However, as previously noted briefly, there has been a bit of a what's-all-the-fuss-about response from some overseas observers used to large polling failures.  In particular, there is one article by Nate Silver that I think I should respond to, along with the article in The Economist and the tweets by Harry Enten that he links to.  This isn't because I'm generally inclined to have a go at a website that I tend to enjoy and find useful (within a 95% confidence level, anyway), but because I feel that if I don't refute this train of thought, pollsters may get away with using it to cover their rear ends and go on giving the Australian public the mushroom treatment.

The core claims being made by the what's-all-the-fuss-about brigade are:

1. Pre-election polls were around 51-49 on two party preferred (Silver)
2. The result was around 49-51 giving a spread error of about four points, or a 2PP miss of two points. (Silver, The Economist)
3. The historical average primary vote spread error for major parties in Australian polling is about five points. (Enten)
4. Since the actual 2PP spread miss was only four points, this is all nothing exceptional and the Coalition should have been given about a 1/3 chance of winning. (Silver)
5. The correct read of the contest was that it was pretty much a tossup, and the pundits should have known this. (The Economist)
6. Journalists (Silver) and/or pundits (The Economist) misread it as a very likely win to Labor and are overreacting and showing ignorance by calling it a massive polling failure.

Some of these claims are clearly incorrect and others are severely missing nuance.  All the errors and simplifications cut in the direction of reducing what has happened to a simple hot take that is wrong.

What the pre-election polls said

The final pre-election polls by the five pollsters were two 51.5s (Newspoll (the last and largest) and Essential), two 51s (Ipsos and YouGov-Galaxy) and a 52 (Morgan).  It may seem like nitpicking to object to rounding a range with an average of 51.4 to 51, but doing so understates the error on the 2PP gap between the parties by 0.8 points, making the what's-all-the-fuss-about case look just a little better than it is.  However you model it, the final average of the polls is not just 51, it's 51-point-something.

(Australians usually talk about the 2PP miss as the difference between the expected 2PP and the actual one, so if the polls have 52 and the result is 50, that's a 2-point miss for us.  But Americans don't have 2PP, so they are used to talking about the error on the estimate of the difference between the parties.  They would treat the difference between 52-48 and 50-50 as a four-point spread error, +4 vs 0.)
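For readers switching between the two conventions, here is a trivial sketch of the conversion, using the hypothetical 52-48 poll and 50-50 result from the paragraph above:

```python
# Minimal illustration of the two error conventions discussed above, using the
# hypothetical figures from the text (poll 52-48, result 50-50).
def two_pp_miss(poll_labor_2pp: float, result_labor_2pp: float) -> float:
    """Australian convention: error on the 2PP estimate itself."""
    return abs(poll_labor_2pp - result_labor_2pp)

def spread_error(poll_labor_2pp: float, result_labor_2pp: float) -> float:
    """US-style convention: error on the gap between the parties (double the 2PP miss)."""
    poll_spread = poll_labor_2pp - (100 - poll_labor_2pp)
    result_spread = result_labor_2pp - (100 - result_labor_2pp)
    return abs(poll_spread - result_spread)

print(two_pp_miss(52, 50))    # 2.0 - a "2-point miss" in Australian terms
print(spread_error(52, 50))   # 4.0 - a "four-point spread error" in US terms
```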

What the actual result was, and what the error was

I would have thought Nate Silver would be familiar enough with the gradual creep in the US Presidential popular vote caused by late counting, especially in California, to know that not all electoral systems give you the national vote overnight, or even five days later.

In Australia the two-party count is especially slow, partly because postal votes can still be accepted until thirteen days after polling day.  These tend to favour the conservatives, and counting them in divisions that have been won by large margins tends to be a low priority.  Not only does the post-count tend to produce a degree of Coalition drift because of postals, but many divisions are not included in the two-party count at all for several weeks after the election, because their top two candidates are not Labor vs Coalition.

At this election there are fifteen such "non-classic" seats.  These broke 10-5 to the Coalition on 2PP in 2016, when their combined 2PP was 54.8 to the Coalition compared with 49.9 in the rest of the country.  Even without these divisions, the 2PP has been steadily climbing and is now at 51.71% to the Coalition.  Antony Green has projected it to finish around 52.1, and even 52.2 is still possible.  However, sometimes the swing in non-classic divisions differs from elsewhere, because one party or the other will not be making any effort.  Also, @sorceror43 has pointed out on Twitter that several of the remaining seats are inner-city seats where we are likely to see a swing to Labor.
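As a rough illustration of why the post-count matters, here is a crude projection of the final national 2PP from the figures above. The uniform-swing assumption and seat-count weighting are my own simplifications (and, as just noted, several of the omitted seats may well swing the other way), so this is not Antony Green's method:

```python
# Crude projection of the final Coalition 2PP once the 15 "non-classic" divisions are
# added back in. Assumptions (mine): a uniform swing applied to their 2016 figure, and
# weighting by seat count (15 of 151) rather than by enrolment. As the text notes,
# several of these seats are likely to swing to Labor instead, so this may overshoot.
classic_2pp_now = 51.71        # current Coalition 2PP in the divisions already counted
classic_2pp_2016 = 49.9        # Coalition 2PP in those divisions in 2016
nonclassic_2pp_2016 = 54.8     # Coalition 2PP in the 15 omitted divisions in 2016

swing = classic_2pp_now - classic_2pp_2016
nonclassic_2pp_now = nonclassic_2pp_2016 + swing

weight = 15 / 151
projected = (1 - weight) * classic_2pp_now + weight * nonclassic_2pp_now
print(f"Projected national Coalition 2PP ≈ {projected:.1f}")   # about 52.2, the top of the range above
```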

We won't know the size of the error for several weeks, but what Nate Silver calls a four-point spread error on the polled 2PP is likely to end up around a six-point spread error.  It might be seven, but I doubt it. [Update 18/6: It finished at 5.86%.]

As for the spread error on the primary votes of the major parties, that's more settled (though still not final) and is currently running at 5.5 points. [Update 11/6: it finished at 5.3.]

The historical record of Australian poll errors

Harry Enten has downloaded data from the database used in a well-known paper by Jennings and Wlezien on polling errors through time, and says:

"If you go back since 1943(!), the average error in the final week of polling between the top two parties in the first round has been about 5 points. The average was off by 5 this year"

We don't have a first round in Australia, but never mind that.

Let's start by talking about 1943, exclamation mark and all.  The Jennings-Wlezien dataset's data point for 1943 comes from a single reading, dated about three weeks out from the election, and presumably an Australian Gallup Poll (later Morgan Gallup, then Morgan).  It has an 18-point spread error. I'm not familiar with that result, but I'm aware of another Gallup dated May-June 1943 that had "only" a 15-point spread error.  In any case, that data point, apparently from one poll taken weeks out from an election while the country was at war, carries the same weight in a crude average of errors by year as more recent cases where there have been many polls in the field in the final days.  And this one ancient outlier by itself distorts the mean annual spread error in the final polls by half a point.  The median spread error based on aggregated voting intention in the database on election day is only 3.3 points.
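To illustrate the mean-versus-median point, here is a toy example. The error values below are made up (very roughly the magnitudes under discussion), so this shows the arithmetic only, not the actual dataset:

```python
# Illustrative only: how a single ancient outlier drags up a mean of per-election
# errors while barely touching the median. These error values are hypothetical.
from statistics import mean, median

other_errors = [3.0, 2.5, 4.0, 1.5, 6.0, 2.0, 3.5, 5.0, 2.5, 1.0,
                4.5, 3.0, 2.0, 6.5, 1.5, 2.5, 3.0, 4.0, 2.0, 3.5,
                1.0, 2.5, 5.5, 3.0, 2.0, 1.5, 4.0]       # 27 hypothetical elections
outlier_1943 = 18.0                                      # the single wartime data point

with_outlier = other_errors + [outlier_1943]
print(f"mean without 1943: {mean(other_errors):.2f}, with 1943: {mean(with_outlier):.2f}")
print(f"median without 1943: {median(other_errors):.2f}, with 1943: {median(with_outlier):.2f}")
# The mean jumps by roughly half a point; the median barely moves.
```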

I thought I'd also look at another outlier in the database, 1984, for which it has a 12.54% spread error.  Only, that should be 12.22, because the Country Liberal Party has been incorrectly omitted from the Coalition tally.  Also, the last change to the aggregated standing of the parties occurs 16 days out from the election, from which point the database holds Labor at 53.55 and the Coalition at 38.15.  But there was a Morgan of November 24-5 (6-7 days out) that had Labor at 51 and the Coalition at 40, "only" an 8.46% spread error, so I can only suspect that poll is missing from the database.   Also, while 1984 was a very bad miss on the margin from any perspective (it was much closer than pollsters expected), there was a special contributor to the failure in that year.  The introduction of above-the-line voting in the Senate produced a 4.7-point spike in mostly unintended informal voting in the Reps, generally accepted to have hurt Labor more than the Coalition.

Even assuming we take this database completely at face value, it's not credible to just take the long-term view when considering how bad an error the 2019 failure is.  The reason is that, unlike polling industries in many other places, which have struggled to deliver much improvement over many decades, Australia has shown dramatic improvement over time (at least according to this dataset!).  Enten does note this improvement, but I don't think he gives it enough credit:



(The improvement is statistically significant at about p=0.02, and remains so if you ditch the three obvious outliers.)

But there's more: there was a sudden shift in polling accuracy for the elections since 1984 that happens to correspond exactly with the introduction of the major telephone pollster Newspoll.  The average spread error (according to this database) was 6.2 points before Newspoll and 2.3 points thereafter.  That probably overstates the gap, given outliers and data-quality issues in the dataset for the older polls, but anyone used to looking at the old polls knows they were getting better over time yet were still pretty ropey compared to, well, what we were used to until this one.  Here's the graph again with the Newspoll era demarcated and this year's howler shown in red:



For the previous 11 elections the average spread error on the major party primaries had been 2.3 points +/- 1.4.  This year's miss is 2.3 standard deviations outside that average.  It's not an average miss, by the standards of recent decades; it's a rogue miss.
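The arithmetic behind that claim, using the figures just quoted:

```python
# How far outside the recent-era distribution the 2019 miss sits, using the figures above.
mean_err, sd_err = 2.3, 1.4    # average and SD of major party spread errors, previous 11 elections
this_year = 5.5                # the 2019 spread error as it stood when this was written

z = (this_year - mean_err) / sd_err
print(f"2019 is {z:.1f} standard deviations above the recent-era average")   # ≈ 2.3
```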

On the 2PP, things get even worse, because two of the larger errors on the primary spread according to the dataset (1990 and 2013 - re which see below) were offset by substantial shifts in preference flow from minor parties - in 2013's case pretty much exactly.  But in 2019, there appears to have been a substantial preference shift in the Coalition's favour, regarding which Newspoll's/Galaxy's much-maligned (including in one case by me) adjustments look likely to have been pretty close to the mark. 

And finally, the more I've looked at the Jennings-Wlezien dataset, the more I doubt that it's completely fit for purpose.  I've found many minor replication issues and substantial missing-data problems. Looking at its error for 2013 I thought "that's not right!" It had both the Coalition and Greens final primaries wrong (albeit by only a tenth of a point apiece), but also there were no entries in its poll field beyond 12 April 2013, meaning that for the rest of the term it gives a model with Coalition 47.5, Labor 32, Greens 10.5 (none of the final polls had such a gap). I get the correct figures for 2013 at about 44.2-33.6 (a 1.6-point spread error, not 3.3), but it depends on which polls you use.  A time-cutoff issue, maybe?

The Coalition's chances of winning? Not the point

It doesn't really matter whether you think the Coalition's chances of winning should have been set at around one in four (my estimate) or one in three (Nate Silver's estimate) or nearly one in two, or one in eight, or anything other than one in a very large number.  The point is that the poll failure goes way beyond just the fact that the wrong party won.  Had the Coalition polled, say, 50% 2PP and snuck over the line in minority after weeks of negotiation, I wouldn't have called that an exceptional poll failure (apart, that is, from the herding.)  Such an outcome was within the realistic range based on recent polling.  The upset is not just that the wrong party won, but that they won well, with a swing in their favour that looks like being almost the largest 2PP swing to a sitting government since 1966.

What should expectations have been?

The Economist's article bases its case that pundits should have seen this coming on a model by Peter Ellis (Free Range Stats).  Ellis's Bayesian model (which came complete with win probability estimates for individual seats) had the election as basically a tossup, which obviously looks a lot better now than models that gave the Coalition some chance of winning but had a small Labor majority as the most likely outcome. (Another site, Buckleys & None, also had a rather cagey outlook, though I am much less clear on its methods.) However, The Economist's article seems unaware that polling-based models forecasting a likely ALP victory even existed, let alone that they were the dominant view among Australian psephologists: "Where America’s leading election forecasters gave Mrs Clinton anywhere from a 70% to 85% chance of winning the presidency, the odds of Mr Shorten becoming PM were much closer to 50%."  (Never mind also that some prominent election forecasters in the US gave Clinton much higher chances of winning than 85%.)

Ellis's model differs from the models that forecast a probable Labor victory in one important respect: its treatment of house effects.  A house effect is a tendency of a given poll, judged from past results or from comparison with other pollsters, to produce results more favourable to one side than the other, compared with what actually happens.  In this case, his model's calculations had the pollster consensus as likely to overestimate Labor by about a point, leading it to project a 95% confidence range of 48.6% to 52% for Labor's two-party preferred.  The midpoint of this range is 50.3% to Labor, compared to the final polls averaging 51.4, so Labor has been docked about a point.  And off such a 2PP, Labor would indeed have had a small edge, but the outcome would have been very uncertain.
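As a rough sense-check of why that midpoint still left the outcome so uncertain, here is a back-of-envelope normal approximation using the numbers quoted above. This is not Ellis's actual model, and treating "Labor 2PP above 50" as "Labor wins" is itself only a crude proxy, since seats don't map perfectly onto national 2PP:

```python
# Back-of-envelope translation of the Free Range Stats numbers quoted above
# (midpoint 50.3, 95% range 48.6-52 for Labor's 2PP) into a crude probability,
# using a normal approximation. NOT Ellis's actual model.
from math import sqrt, erf

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

midpoint = 50.3
sd = (52.0 - 48.6) / (2 * 1.96)        # back out the SD from the 95% range (~0.87)

p_labor_2pp_majority = 1 - normal_cdf((50.0 - midpoint) / sd)
print(f"P(Labor 2PP > 50) ≈ {p_labor_2pp_majority:.2f}")    # ~0.64: a small edge, far from certain
```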

It's not that other modellers ignored the question of house effects. It's rather that we looked at it when revising models at the start of the term and didn't see that much in it.  Different modellers can legitimately take different approaches here because of the large number of subjective decisions involved.  For instance: should we consider just federal elections (meaning our house effect assessments are based on a tiny sample), or should we also consider performance by the same pollsters at state elections?  (All the pollsters in the field for the recent Victorian state election had a major miss in the other direction.) Should we treat Newspoll as still the same poll as it was pre-2015, when in 2015 it was transferred to a different company, switched from live landline polling to a mix of robopolling and online polling, increased its sample size, became curiously under-dispersed, and then over time changed its preferencing methods?  The only thing that didn't change at Newspoll was the questions.

What should we do when the same company (eg Morgan) constantly chops and changes its methods, even within the same term, or when a pollster with a history of skew to one side suddenly produces a run of polls that skew the other way (a la Morgan following Malcolm Turnbull's ascension)?  And so on.   Models that assumed strong house effects did better at this election than those that did not, but at the previous two elections this was not the case.  Going back several elections there is a history, which I sometimes call the "Labor Fail Factor", of Labor often doing worse than its polling (with rare exceptions such as 1969 and 1993), but given that the pollsters who used to poll and those who poll now are almost totally different, how much weight should be placed on that?

As I write, the Labor 2PP is on 48.29 and is projected to, if anything, finish lower, so even after the Free Range Stats model docked Labor a point based on its assessment of house effects, the 2PP (but not the seat tally) still looks likely to fall outside its 95% confidence range.  That's a sign of how severe a polling failure this is by our modern standards.  [2021 Edit: I initially wrote here that the model was assuming similar error rates to previous elections, as, eg, FiveThirtyEight's models do, but Ethan in comments has corrected this and suggested the issue was assuming polls were failing independently rather than together - see comments.]

Maybe we should all (as the Free Range Stats model in effect did) have taken the sudden late-campaign agreement of the ultimate dinosaur poll, Morgan Face-to-face, with the pollster consensus as a stronger sign that the pollster consensus was wrong.  But Morgan had so many red flags (small sample size, poor accuracy at past elections, tendency to ad hoc and post hoc release of results etc) that my response was to downweight it so that it had very little influence.

What analysts really should have done, and I regret not doing it, was to stop playing this game completely and go on strike.  The recent record of accuracy in the national polls is based on aggregation across multiple poll series, with polls behaving independently of each other both internally and externally.  This was violated by cross-pollster herding in the final week in 2016, but they got away with it that time, perhaps because nothing was happening. When the whole of the final three weeks was blighted by blatant clustering, and when voting intention had been shifting rapidly before that, we should have just refused to even try to project this election off polling at all.

Who called it a massive polling failure?

It wasn't "journalists", it was me!  The words "massive polling failure" were written by me in my Saturday night live blogging and picked up by journalists.  Antony Green likewise called it a "a bit of a spectacular failure of opinion polling".  Having a go at journalists because they quoted local experts who know our polling industry's record and know how reliable it has been for decades is a cheap shot.  There are plenty of other cheap shots in what Nate Silver in fairness does describe as a rant, so here's one in the opposite direction: when you've been hammering the theme of sloppy media coverage of routine polling errors for long enough, everything looks like a nail.

Silver suggests that "journalists" misread the polls because of a cosmopolitan bias perhaps connected to climate change and notes this election had a strong urban-rural divide (newsflash: they always do).  I suggest with equally little evidence that confirmation bias from a lifetime of jumping on sloppy reporting of poll projections of close results in the US has caused some overseas analysts to assume it's the same thing without looking closely enough.  As I wrote on Twitter, having Americans telling us how to interpret our polling failures is almost as bad as having Brits tell us how to manage our forests.  We will decide what polls have failed in this country and the circumstances in which they have failed!  (Actually, I welcome any input from anyone anywhere, but would appreciate it if they made a proper effort.)

Did "unthinkability bias" cause many forecasters and some pundits to underweight evidence that the election was much less predictable than it seemed?  Perhaps, but I don't think this was down to any inner-city partisan bias among the commentariat, or to views on climate change.  Indeed the commentariat generally (like the voters it seems) were often unenthused with the performance of Bill Shorten as Opposition Leader, and with Labor's performance on issues of commentariat concern.  Rather, it was more that the Government had violated strong priors, like the supposed golden rule of politics "disunity is death", by ripping itself apart over leadership for the second term in a row, and the previous government that did that had been harshly punished.  Also, the government's record of polling atrociously was hard to ignore fully, because previously when long-trailing governments have suddenly improved in polling it has often been temporary and followed by reversion towards the term mean.  It's as hard to throw away some of the priors involved as it was for US observers to dispense with "the party decides" when it came to Donald Trump.

The overseas articles have also employed the term "pundit" strangely.  As I understand it from Nate's book, a pundit is the opposite of an expert.  A pundit predicts outcomes via insider talk, the vibe from party sources, general instinctual feels and often a heavy loading of partisan bias.  The pundit's approach isn't data-driven. But the only example of a pundit explicitly named in the coverage is Andy Marks, a political scientist who declared a "virtually unquestionable" Labor win.  Many media pundits actually played up a close election (as they usually do, even when it isn't).  Australia's most notorious pundit, Alan Jones, not only correctly predicted this outcome but also publicly staked his career on it.  Pundits who were right may have been in the stopped-clock category ("person who nearly always predicts Coalition win predicted Coalition win!"), a la almost-always-wrong pundit Bob Ellis predicting Labor's win in Queensland 2015. However, claims about what "pundits" thought should be based on proper content analysis of the media, not generalisation from one example who was actually a qualified expert.

Coming Soon (Probably)

This article is more than long enough for now, but I am intending future articles on:

1. What data pollsters should publish about their polling to help recover public confidence that their methods are appropriate.

2. My own response to this predictive failure in terms of future directions for polling coverage on this site and media comments.  (I don't regard predictions as a major part of my business here and neither, it seems, do readers.)

12 comments:

  1. I wonder if you've seen this article, Kevin: https://www.theage.com.au/federal-election-2019/nation-s-most-influential-pollster-can-t-explain-election-disaster-20190527-p51rhc.html

  2. Comment from Michael Quinnell:

    *******************************************************************


    Brilliant article, Kevin. I read two sites a lot - yours and FiveThirtyEight. So it has been interesting to compare your in-depth analysis to the so far shallow analysis of the FiveThirtyEight team.

    I actually e-mailed FiveThirtyEight on 14 April pleading for some preview coverage (as they regularly do, at least at a high level, of UK general elections) and mentioned that it might be useful to their US readers even to shine a light on what nationwide "instant runoff" voting looks like (what Australians call compulsory preferential voting). However, I never heard back.


    I then contacted them on the evening of Thursday 23 May, after hearing a (perhaps half) joking reference to doing a live Australian podcast (I said I could arrange a venue for them). I also linked to your original polling error article, where you referenced Nate Silver's initial tweet (something I highlighted). But alas, still no response! Perhaps your new article will cause them to at a minimum re-think the haste of their initial views and look through their e-mail traffic at the same time.

  3. Hi Kevin,

    State by state polling aggregates before the election showed only small swings to Labor except in Queensland. This suggests a huge error in Queensland (perhaps around 15 points using Nate's calculation) and pretty small ones elsewhere. Was the problem just the one state?

  4. Kevin,
    Andrew Gelman has promised an article on it. Can't wait, although after your efforts I can't see what he can add now.

  5. Hi Kevin,
    Interesting take, especially the point about the introduction of Newspoll having significantly improved polling accuracy (till 2019).
    I'm curious about the data you showed in this article - from recollection, the 2004 polling average was about as wrong as 2019 (due to Morgan polls showing leads for Labor when the Coalition would win by 52-48).
    The only polls I've been able to dig up from that time are Newspoll (50-50), Morgan (51-49 and 51.5-48.5 in favour of Labor), Galaxy (52-48 in favour of the Coalition), which when averaged give a 2pp of 50.8% for Labor. Using the method you've described in the article, that's a polling error of about 7 points, which seems much larger than the 2019 error of about 5.9 points.
    My guess is that the data set you used either contains some Coalition leaning polls, which would reduce the error, or it didn't include the Morgan polls. As I've been unable to find ReachTEL or Nielsen polling for 2004, I lean towards the former explanation; even so, it still seems like 2004 was a significant polling error approaching that of 2019.

    Replies
    1. ReachTEL didn't exist in 2004. Nielsen was 54-46 to Coalition off primaries of 49-37. It's also worth noting that the database being discussed used the major party spread error, not the 2PP error, as its estimate of how wrong polls were. This is significant with Newspoll at least, which had primaries of 45-39, a major party spread error of 3.1 points, but its error on 2PP spread was 5.5 points. This was largely because Newspoll used respondent preferences that year. By previous-election preferences, the 2004 final Newspoll 2PP would have been 51-49.

    2. (note the error on 2PP spread is double the error on 2PP - so a 50-50 poll when the 2PP is 52.74 is a 2PP spread error of 5.5).

    3. Hi Kevin,
      Good point about the error being major party spread rather than 2pp spread - I missed that in the article.

      Still, I wonder if major party spread is really the best way to measure polling errors in a single-member, preferential-voting electoral system like Australia's - if a pollster got the major primaries right, but completely bungled up the minor parties' relative shares of the vote (e.g. massively over/under-estimating the Greens), the 2pp could still be significantly off.

      On your note about respondent vs previous-election preferences, I think it's up to the pollster which version they use, and they should be judged on the published 2pp as well as the primary votes. After all, it is the 2pp which is reported on by the media, and it is the 2pp which makes or breaks governments.

    4. I agree with all that. I tend to weight the 2PP accuracy strongly in assessing the accuracy of Australian polls because it is the figure that is most used to predict the outcome and that pollsters are judged on. But for the purposes of this article I was pointing out how the dataset used by overseas observers to say our poll failure was no big deal did not actually support that conclusion.

      Ipsos overestimating the Greens' primary instead of Labor's - something it had a monotonous habit of doing - is a good example of why major party spreads are not always ideal for judging poll errors in Australia. Another problem can be when the pollsters' reading of all the party votes is largely accurate but their preferencing assumptions are wrong, eg Queensland 2015 where most of the 2PP error was caused by preference shifting. That said the extreme preference shift seen in that election seems to be more of a risk in optional preferential voting elections than compulsory preferencing ones.

  6. Hi Kevin,

    I thought you might find this interesting, but I don't think the Free Range Stats model's error had much to do with incorrect assumptions on polling accuracy. Their model actually doesn't seem to use past poll accuracy as far as I can tell (in fact, if I'm not mistaken, they calculate the uncertainty in their model using sqrt(2) * theoretical sampling error, i.e. sqrt(2) * sqrt(p * (1 - p) / n)).

    The primary reason why their model was overconfident (as far as I can tell) is that the model assumed there was little correlated error between polls. I've been back-testing my vote model, and if I set the final 2pp prediction for 2019 to 50.3% ALP, the MoEs are, respectively:

    Assuming no correlated error:
    ALP 48.8% - 52%

    Pretty similar to what the final FRS model said. Now, if I assume the amount of correlated error is equal to the average correlated error in polling from 1990 - 2016:

    Assuming a historically average amount of correlated error:
    ALP 47.9% - 52.9%

    Some extra over-confidence probably also came from under-estimating or failure to model shifts in preference flows. If I turn on my preference flow shifts model (assumes that there is uncertainty in pref flows roughly equal to what they have historically been), the MoE changes to:

    Assuming historically average amount of correlated error + historically average amount of preference flow uncertainty:
    ALP 47.5% - 53.2%

    You can probably dispute how accurate historical uncertainty is - the Greens' pref flows are probably not going to shift as much as they have in the past - but the broad idea of keeping in mind that 2pp is usually estimated from past preference flows which can and do shift (instead of treating it as if it was a sampled proportion) is probably correct.

    Hence, I think that in this case, the FRS model was probably roughly correct on the accuracy of the individual poll (although multiplying the theoretical sampling error by a factor to account for non-sampling error, instead of using historical poll accuracy is certainly an interesting approach, and I'd be interested in seeing how that plays out in future elections).

    The problem with the FRS model (and almost certainly with the Buckleys and None model too, their 2pp distribution was very overconfident and they were saved by a vote-to-seat model which assumed the map was biased against Labor: https://www.buckleysandnone.com/how-it-works-part-two/) was more of an assumption which is common in statistical theory, i.e. that repeated measurements of an unknown variable are mostly independent. In reality, when polls stuff up, they all tend to do so in the same direction, even if that direction is not predictable prior to the election, and I think it's important to account for that in election modelling.


The comment system is unreliable. If you cannot submit comments you can email me a comment (via email link in profile) - email must be entitled: Comment for publication, followed by the name of the article you wish to comment on. Comments are accepted in full or not at all. Comments will be published under the name the email is sent from unless an alias is clearly requested and stated. If you submit a comment which is not accepted within a few days you can also email me and I will check if it has been received.