Thursday, September 22, 2016

Federal Election 2016: Best And Worst Pollsters

It's been a long time coming but the recent finalisation of the 2016 House of Reps election results means it's time to present my in-depth review of how the various pollsters did.  At the 2013 election there was a widespread belief that the polls might be totally wrong, but it turned out they were accurate.  Since that election there has been a massive turnover in Australian polling methods and companies (such that only two pollsters went to this election doing the same thing as last time) and there were more reasons for concern, but the miracle has continued.  Australian national opinion polls have again proved highly accurate.  However, the picture in individual seat polling was not such a pretty story.

As usual I will present my awards in three categories.  This article is quite numbery of course, and is rated 3/5 on the Wonk Factor scale.

Best Final Poll

Final polls are what most people look at in judging which poll is the best, but as I noted last time, this is risky.  A pollster might just be lucky, but worse still, election eve is the time when the most other polls are in the field, making it easiest in theory to fine-tune assumptions or even blatantly herd results by looking at what everyone else is getting.

In Australia, coming up with a good two-party preferred vote estimate is extremely important, because that is the part that is most used to predict the seat result.  Getting the primaries for Labor, the Coalition, Greens and the generic Others right is also important.  A lot of pollsters polled the NXT vote specifically at this election but this was rather hard to do (because they ran only in selected seats outside SA), and I don't want to mark pollsters down if they failed to predict a vote of 2% correctly.  So I've just measured errors for all Others combined.

Again, I've used root mean square error (the RMSQ(2) column) to assess the accuracy of the polls - the lower the better.  As before, I've included the 2PP figure, with a weighting of four so that errors in the 2PP make as much difference as errors in all the primaries put together.  As it turns out, just using the primaries produces the same order and similar ratios between the different polls.  Here's the ranking:

(L = landline, M = mobile. Blue for within 1%, blue in bold for closest to the pin.  Notes: Community Engagement published no 2PP estimate, so one was derived using 2013 election preferences.  Also if Ipsos' respondent-preferences score of 51% to Labor is used then Ipsos' RMSQ(2) score becomes 1.70).  
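For anyone wanting to reproduce this sort of score, here is a minimal sketch of a weighted root mean square error.  It assumes four primary terms (Coalition, Labor, Greens, Others) each counted once and the 2PP error counted four times, with the total divided by eight before taking the square root.  The exact normalisation behind the RMSQ(2) column is not spelled out above, so treat the divisor as an assumption.  The function name and the sample figures are made up for illustration.

```python
import math

PRIMARIES = ["coalition", "labor", "greens", "others"]

def weighted_rms_error(poll, result, tpp_weight=4):
    """RMS error over the four primaries plus a 2PP term counted tpp_weight times.

    Assumption: squared errors are summed and divided by the total weight
    (4 primaries + tpp_weight); the article does not state the divisor.
    """
    primary_sq = sum((poll[p] - result[p]) ** 2 for p in PRIMARIES)
    tpp_sq = (poll["tpp"] - result["tpp"]) ** 2
    total_weight = len(PRIMARIES) + tpp_weight
    return math.sqrt((primary_sq + tpp_weight * tpp_sq) / total_weight)

# Hypothetical numbers, not any pollster's real figures.
result = {"coalition": 42.0, "labor": 35.0, "greens": 10.0, "others": 13.0, "tpp": 50.5}
poll = {"coalition": 43.0, "labor": 34.0, "greens": 11.0, "others": 12.0, "tpp": 51.5}

print(weighted_rms_error(poll, result))
```

With a weight of four, a one-point miss on the 2PP costs as much as a one-point miss on every primary at once, which is the balance described above.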

Overall, the national polls were about as accurate as in 2013.  However at the pointy end they were slightly better: Newspoll did even better than the old Newspoll in 2013, and ReachTEL and Essential both narrowly beat the 2013 second-place-getter.  Excluding Morgan (which stopped polling a month out apart from one 2PP result released later without the primaries) and one new pollster, everyone did at least fairly well.  That includes the lone Lonergan taken nearly two months from election day.

The new Newspoll's performance was just stunning and it is an easy winner here.  Indeed, without publishing decimals, it was the most accurate result possible, not just on some parties but every party and the 2PP as well.  Overall the results should be the final end of the anti-robodial argument in Australia.  If it wasn't clear enough last time, it is quite obvious now that robopolling can produce excellent results if the pollster knows what to do with their data.

Best Tracking

As noted above, I don't like just using final polls to assess a pollster's performance.  It's useful to know if a poll provided reliable figures throughout the election leadup, and not just at the death.  The latter applies especially if there is suspicion a poll might be herding.

As discussed last time, ideally a poll should provide a good guide throughout the campaign, should produce neutral readings rather than readings biased to one side, and should avoid excursions away from the average.

This election, however, has provided more problems than 2013 in trying to evaluate whose tracking was the best, and it's hard to avoid it becoming a slightly subjective exercise.  Against a backdrop of relatively little change through the whole campaign, some pollsters that had been producing readings leaning to one side relative to the others stopped doing so in the last two weeks.  Also, the variation in 2PP figures between the different polls (which was low throughout the campaign) reduced at the end, with everyone polling in the final week releasing 50, 50.5 or 51 to Coalition.  This all produced a strong appearance of herding.

Aggregation sites generally had the Coalition's 2PP a shade higher than it ended up.  In my case the difference was 0.4 points, as my final aggregate to two decimal places was 50.76.  0.17 points of this difference was caused by preference-shifting.  Was the remainder caused by sample error alone, by the pollsters being less Labor-friendly following the change in Prime Minister than they appeared, or by herding by some pollsters?  It is impossible to say.  Here I'll treat the pollsters as innocent of herding until proven guilty, and just dock all my aggregate readings by 0.4 for this purpose.

Another issue is that my aggregate contains weightings in favour of certain polls based on their past performance, but if the quality of polls has changed markedly then those weightings might count against picking that up.

Anyway, these are the ratings I came up with for the six pollsters to produce at least three polls, by the same method as last time and using polls from the first week in May onwards:

For Skew (the tendency to lean to one side or the other on average), the closer to zero the better (negative equals skew to Labor).  The ratings flag Morgan's 2PPs as skewed to Labor but the others as close enough to neutral.  For Error (the difference from the aggregate) and Departure (the tendency to fall on one side or the other over consecutive polls), the lower the better.  These scores pick up that despite its good final poll, Essential does not seem to have been so accurate through the campaign, and also seems to have leant to one side (Labor) over consecutive polls more than the other pollsters bar Morgan.  To some degree this will be because Essential is downweighted in my aggregate for its poor 2013 performance, and therefore its readings didn't affect the aggregate as strongly as other pollsters, but a completely neutral method would still pick up the same issue to some degree.  All the same, Essential's tracking was much better than in 2013.  Indeed, tracking generally was better than in 2013, but that's no surprise because there was so little happening on the 2PP front!
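As a rough illustration of how these three scores can be computed, here is a minimal sketch.  Skew is taken as the mean signed difference between a pollster's Coalition 2PP and the (docked) aggregate at the same date, and Error as the mean absolute difference.  The Departure formula is not given above, so the version here, the average size of the mean difference over each pair of consecutive polls, is only an assumed stand-in.  The function name and inputs are hypothetical.

```python
from statistics import mean

def tracking_scores(poll_2pps, aggregate_2pps):
    """Return (skew, error, departure) for one pollster's Coalition 2PP series.

    poll_2pps and aggregate_2pps are equal-length lists in date order.
    """
    diffs = [p - a for p, a in zip(poll_2pps, aggregate_2pps)]
    skew = mean(diffs)                       # negative = leans to Labor
    error = mean([abs(d) for d in diffs])    # average distance from aggregate
    # Assumed proxy for Departure: how far consecutive pairs of polls sit,
    # on average, on one side of the aggregate.
    pairs = [(a + b) / 2 for a, b in zip(diffs, diffs[1:])]
    departure = mean([abs(x) for x in pairs]) if pairs else abs(diffs[0])
    return skew, error, departure

# Hypothetical series: one pollster always half a point below the aggregate,
# another bouncing half a point either side of it.
steady = tracking_scores([50.0, 50.0, 50.0], [50.5, 50.5, 50.5])
bouncy = tracking_scores([50.0, 51.0, 50.0, 51.0], [50.5, 50.5, 50.5, 50.5])
print(steady, bouncy)
```

The two made-up series have the same Error but differ on Skew and Departure, which is why the scores are worth reporting separately.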

Galaxy tops this category for the second election in a row, but it only published three polls all campaign, two of them in the final weeks, and this undermined its use in tracking the election.  This was doubtless due to Galaxy also running the Newspoll brand, which was the second-best performer on tracking and ran through the campaign.  As both are run by the same company I declare Galaxy/Newspoll joint winners of the best tracking category. It's not clearcut though: under some possible assumptions (such as late swing), ReachTEL could beat one or both.

Best Seat Polls

This is always where things get messy!

Seat polls were published by Galaxy, Newspoll, ReachTEL, Morgan, Community Engagement and Lonergan.  However the last two were commissioned polls only, and some of the ReachTELs were commissioned.  Including commissioned polls in a survey of accuracy is impossible because we don't know what polls by the same pollster might have not been released by clients.  Indeed some clients are probably more likely to release commissioned polling if it's wrong.

Morgan's seat polls were very messy: they consisted of compiled electorate sampling over a long period, and lacked any 2PP figures.  A high proportion of them focused on non-classic seats (or seats that looked like they might be non-classic seats), which made assessing their accuracy even harder.

There is also the eternal question of how we judge the accuracy of a seat poll.  A good example is ReachTEL's 10 June poll of Grey, which showed NXT leading the Liberals 54:46.  But in the end the Liberals retained the seat 52:48.  Does this mean the poll was badly wrong? Not necessarily.  It could well be that the Liberals had been taking the seat for granted, and that poll results like this gave them the kick up the pants they needed to get serious about retaining the seat.

Yet there were also some seats where the pattern of multiple polls showed no sign of the eventual outcome.  Bass had Coalition 2PPs of 51 (ReachTEL 11 May), 49 (GetUp! ReachTEL 31 May), 52 (Newspoll 15 June), and 50 (ReachTEL 23 June).  On 2 July the Coalition scored just 44. Macarthur had Coalition 2PPs of 49 from ReachTEL and Galaxy in May and 50 from Galaxy and Newspoll in June.  Labor won with 58.3% of the two-party vote.  Seats just don't sit flat all campaign then blow out by 6-8 points in the final two weeks.  The polling in these seats, by multiple pollsters, was systematically wrong.

In William Bowe's excellent polling reviews (PollBludger, Crikey - latter may be paywalled) he drew attention to a few aspects of the seat polls.  Firstly, the public polls tended to skew to the Coalition compared to the results.  Secondly, they had a higher margin of error compared to the actual result (about seven points) than their sample size suggested.  Thirdly, they tended to look "too much like the last election and too little like this one".  The bigger the swing to Labor, the more wrong they were.

I am going to provide more detail on an aspect of the third point.  The most striking characteristic of the public seat polls for me, especially of those from Galaxy/Newspoll, is that their projections of seat swing were underdispersed (not variable enough).  The various public Galaxy/Newspoll seat polls for classic seats gave an average projected swing to Labor per survey of 2.1 points with a standard deviation of 1.57.  Those from ReachTEL gave 1.45 +/- 2.29 (by published respondent preferences) or 1.38 +/- 2.40 (by last-election preferences).  The actual swing at the election per sample (so if a seat was polled twice by a pollster I include it twice) was 3.7 +/- 4.14 for the Galaxy/Newspoll-surveyed-seats (n=36) and 4.31 +/- 3.93 for the ReachTEL-surveyed-seats (n=19).

This is really odd.  Even with a really large sample size per seat, there would be a standard deviation of about three points in swings between seats, because this is the sort of variation that we get on election day.  But even if that variation between seats didn't exist, random sample noise in polls of the size taken by Galaxy/Newspoll should have created a standard deviation of around 2.15 points.  The projected Galaxy/Newspoll seat poll swings were even less variable than would have been expected based on random chance if there was a uniform swing nationwide.  It is the sort of behaviour that might be expected if these polls were not pure polls but employed some form of outlier prevention, such as including a weighting for the national picture or the previous result.  If there is an explanation for how this can happen with purely random polling I'd be interested to see it.
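The arithmetic behind that comparison can be sketched as follows.  This assumes simple random sampling with no design effect and a true 2PP near 50-50.  The sample size of 540 is an assumption chosen because it gives a figure close to the 2.15 points quoted above; the actual sample sizes are not stated here.  The function names are made up for illustration.

```python
import math

def sampling_sd_points(sample_size, share=0.5):
    """Standard deviation, in percentage points, of a polled 2PP from sample noise alone."""
    return 100 * math.sqrt(share * (1 - share) / sample_size)

def combined_sd(between_seat_sd, sampling_sd):
    """SD of polled swings when real seat-to-seat variation and sample noise are independent."""
    return math.hypot(between_seat_sd, sampling_sd)

noise = sampling_sd_points(540)     # 540 is an assumed sample size
total = combined_sd(3.0, noise)     # 3.0 = rough election-day variation between seats

print(round(noise, 2), round(total, 2))
```

Because the previous election's result in a seat is a known number, the noise in a polled swing is the same as the noise in the polled 2PP.  On these assumptions the spread of projected swings should be wider than either component on its own, yet the 1.57 points observed for Galaxy/Newspoll is below even the sampling term.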

On to my attempts at measuring seat poll accuracy for the public pollsters.  I am treating Galaxy/Newspoll as a single poll, and I am just using the last poll for each pollster for any seat for which that pollster had multiple public polls.  For Morgan, because they failed to publish 2PPs I have quickly calculated 2013-election 2PPs for their seats, except for the seat of Grey (where they stated that their numbers pointed to an NXT win, although the numbers actually didn't do this, so I have treated it as a projected 2PP of 50.01% to NXT.)

The groundrules that apply are similar to last time, but I am including classic and non-classic seats together since all pollsters polled some of the latter.  To avoid rewarding fence-sitting, a poll with a 2PP of 50-50 is treated as having the right winner only if the actual 2PP was 52-48 or closer.  Also a poll which points to the wrong final two candidates is marked as wrong even if it had the correct winner, unless the incorrect candidate was within four points of making the final two.  (This is to mark down stuff like Morgan's sample projecting the Greens as second in Gellibrand.)  2PPs by the Liberals against the Greens or Labor are treated as interchangeable.
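These groundrules can be written down as a small function.  This is only a sketch of the rules as stated: it reads the 50-50 rule as referring to the actual result, treats "within four points" as four or less, and uses hypothetical parameter names.

```python
def poll_counts_as_correct(poll_2pp, actual_2pp,
                           right_final_two=True, missed_final_two_by=0.0):
    """Apply the groundrules to one seat poll.

    poll_2pp and actual_2pp are percentages for the same candidate.
    missed_final_two_by is how many points the wrongly projected candidate
    fell short of making the final two (ignored if right_final_two is True).
    """
    # Wrong final pairing: marked wrong unless the miss was four points or less.
    if not right_final_two and missed_final_two_by > 4.0:
        return False
    # Fence-sitting: a 50-50 poll only counts if the result was 52-48 or closer.
    if poll_2pp == 50.0:
        return abs(actual_2pp - 50.0) <= 2.0
    # Otherwise the poll must simply have the winner on the right side of 50.
    return (poll_2pp > 50.0) == (actual_2pp > 50.0)

# Grey, from NXT's side: polled 54, actual 48.
print(poll_counts_as_correct(54.0, 48.0))
# A 50-50 poll in a seat the Coalition lost 44-56.
print(poll_counts_as_correct(50.0, 44.0))
```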

I am also including an average error figure as well as a skew figure, so that the size of errors in non-classic 2PPs can be considered.

This is what I get.  I've treated Galaxy and Newspoll as separate for this table - although they are basically the same thing for seat poll purposes it's always possible there is some difference in requirements from The Australian on one hand and the tabloids on the other.  The "Skew" column gives the average 2PP skew to Coalition compared to the results (for classic seats), the "Error" column gives the average 2PP error for all seats, the "Correct" column gives the percentage of polls that had the right winner(s) by the standards mentioned above, and the "Easiness" column gives an idea of how easy it would have been for a perfect poll with a typical sample size to get the right result.  (The higher, the easier - a score of 100 means a perfect poll would never fail on any seat).
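One way to think about an easiness score is as the average chance that an unbiased poll would land on the right side of 50-50 in each seat polled.  The sketch below is an assumption about how such a score could be built, not necessarily the calculation behind the table.  It uses a normal approximation, simple random sampling, an assumed sample size of 600, and ignores the special rule for 50-50 polls.  The function names are hypothetical.

```python
import math

def chance_of_right_winner(actual_2pp, sample_size):
    """Probability that an unbiased poll puts the actual winner above 50%."""
    se = 100 * math.sqrt(0.25 / sample_size)      # SD in points, near 50-50
    z = abs(actual_2pp - 50.0) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF at z

def easiness(actual_2pps, sample_size=600):
    """Average chance of the right winner across the polled seats, out of 100."""
    chances = [chance_of_right_winner(x, sample_size) for x in actual_2pps]
    return 100 * sum(chances) / len(chances)

# Hypothetical set of seat results: one very safe, one comfortable, one on a knife edge.
print(round(easiness([65.0, 54.0, 50.5]), 1))
```

A pollster who mostly samples safe seats will score close to 100 here, which is why a high raw percentage of correct calls needs to be read against this column.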

Polls are listed in alphabetical order.  This is not a ranking.

Overall, Morgan surprisingly seems to have published the least skewed seat polls, but their polls were also the most erratic.  Indeed when I converted their polls to 2PP I found that nine of them were correct within a point, but there were also massive howlers like an 18 point error in Wentworth.  Morgan had the highest proportion of correct seat polls, but this is meaningless because the seats they tackled were, on average, much easier than those tackled by the other pollsters.  They're also not allowed to crow about it, because they didn't publish seat 2PPs.

Of the others, Newspoll has a slightly higher percentage of correct results, and the closest gap between "correct" and "easiness" scores, but also a larger skew.  Galaxy is best on error, second-best on skew and second-best on relative correctness, but the average age of its included data was 10 days compared to 17 for Newspoll, 20 for ReachTEL and goodness knows what for Morgan.  ReachTEL also probably had more than its fair share of hard-to-poll seats and seats where its polls caused a campaign reaction, so being slightly shaded by Galaxy and Newspoll doesn't really tell the whole story.

(A brief word about the released commissioned seat-polls - William includes some data for them here, finding that they actually performed no worse on the whole than the public polls at this election.  For those his spreadsheet includes, I get their success rate in predicting the outcome by the same standards used in the public polls at 60% - although the release of polls without defined 2PPs in some cases again doesn't help.)

Reluctantly, because of the underdispersion issue, I'm giving Galaxy a narrow win in this division.

Now it's time for ...

Pollster(s) of the election!

The Galaxy/Newspoll stable has come out on top of all three categories here in some form or other.  The final Newspoll was incredibly good and the differences between the leading pollsters on everything else were not so great.  Seat polling has still not redeemed itself after the 2013 disaster, but national polling is in excellent shape.  As Galaxy and Newspoll are both administered by Galaxy, these brands are the joint winners for this year.

Of the remainder, ReachTEL has performed rather well and is clearly in the top three with the Galaxy and Newspoll brands.  Essential produced a good final poll but strong reservations remain about its tracking behaviour.  Ipsos' debut was not bad, but it has work to do on its persistently inflated Greens vote.  The minor pollsters didn't greatly embarrass themselves, while Morgan was a hot mess with the odd redeeming feature.  I have to give Morgan the wooden spoon just for pulling out of national polling with a month to go, leaving me with no yardstick to measure its performance for the future (if it has one).

I will very soon be restarting my 2PP aggregate, and adjustments will be made based on these results.
