Wednesday, June 27, 2018

Is Seat Polling Utterly Useless?

Advance Summary

1. Seat polls have received bad publicity because of poor results at the 2013 and 2016 federal elections, and in some other recent elections such as the WA Darling Range state by-election.

2. Because it is clear that seat polls are not very accurate, it is common for posters on social media to dismiss them out of hand as useless or so misleading as to be worse than useless.

3. Indeed, analysis of seat polls at the 2016 federal election shows they were so inaccurate that they had greater average 2PP errors than simple models based on uniform swing and national polling.

4. However, in 2016 hybrid models combining seat polling and uniform swing would have been more accurate than either seat polling or a uniform swing model alone.

5. The correct use of publicly available seat polling seems to be not to ignore it entirely, but rather to aggregate it with other sources of information including national modelling.

6. Seat polling should be most useful for races that are difficult to predict by normal means, but seat polls may be unusually inaccurate in those races too.

7. The biggest problem with seat polling is the reporting that treats unreliable seat polls as definitive verdicts that one side or the other is "winning".  In fact they are weak indicators and need to be reported in the context of other evidence.



---------------------------------------------------------------------------------------------------------
(Warning: This article has a fair bit of dry mathsy stuff in it and has been rated Wonk Factor 4/5.  It helps to be aware there is a difference between average error (what it says on the label) and margin of error (the margin within which 95% of results should fall).)
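
As a rough illustration of that difference, here is a minimal Python sketch that converts an average (mean absolute) error into a 95% margin of error. It assumes errors are normally distributed around zero, which is an assumption made for the example and not something established in this article; the function name is hypothetical. Real poll errors need not behave this way, so figures worked out from actual results can differ from this rule of thumb.

```python
import math


def moe_from_mean_abs_error(mean_abs_error: float, z: float = 1.96) -> float:
    """Convert a mean absolute error to a 95% margin of error.

    ASSUMPTION: errors are normal with mean zero. For such a distribution
    the mean absolute error equals sigma * sqrt(2 / pi).
    """
    sigma = mean_abs_error / math.sqrt(2 / math.pi)
    return z * sigma


# Example with an invented average error of 2 points.
print(round(moe_from_mean_abs_error(2.0), 2))
```

Under that assumption the margin of error is roughly two and a half times the average error, which is why the two numbers should never be quoted interchangeably.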

In the leadup to the "Super Saturday" federal by-elections, some unexpected seat-polling results have been met with widespread derision on social media.  Results like a 54-46 to Liberal ReachTEL in Braddon, and a 52-48 to LNP ReachTEL in Longman, fly in the face of a historic pattern that governments don't win opposition seats in by-elections (especially not when they're already polling badly).  Given the recent history of seat polls performing badly at federal elections, while national polls have performed well, it's been very easy for people to dismiss these findings out of hand.

The widespread disbelief in seat polls has been advanced by the recent publication of Simon Jackman and Luke Mansillo's excellent analysis (PDF download) of the performance of seat polls at the 2016 election.  Jackman and Mansillo found, among other things, that seat polls at that election were so bad that they should be treated as if their sample size was one-sixth what it actually was.  Errors on the primary votes were especially severe:

"These are extremely large inflation factors; for instance, a seat specific poll estimating Labor support that claims to have a margin of error of ±3 per cent ought to be considered as having a confidence interval of ±8.1 per cent."

At the 2013 federal election, seat polls were even worse than in 2016, displaying a massive average pro-Coalition skew.  A big miss by ReachTEL in the Darling Range by-election (about 7% out two-party preferred, for a poll taken only one week from the election) won't do wonders for any confidence still out there in seat polling.

So we know seat polls are pretty bad, and there are plenty of reasons being suggested as to why that might be so, but that doesn't answer an important question: are they so bad that they're actually useless?

I'll give an example of why seat poll data shouldn't automatically be thrown away just on account of it being more erratic than it seems it should be.  Suppose I know nothing about two candidates for an upcoming election and I do a poll with sample size 800 and get a 50-50 result.  Now, someone else comes along using exactly the same methods with a sample size 200 and gets 55-45 to one candidate over the other.  Obviously, no-one should believe the smaller sample.  But if the two polls are combined, the expected result is 51-49 to the candidate who led in the smaller poll.  So long as the smaller poll was conducted using the same methods, it is better to combine the two samples rather than to throw away data.
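
The arithmetic in that example can be sketched in a few lines of Python. The function name is hypothetical, and the pooling simply weights each poll by its sample size, which is only appropriate if both polls really were conducted using the same methods.

```python
def pool_polls(polls):
    """Pool polls of the same contest, weighting each by sample size.

    polls: list of (sample_size, percent_for_candidate) tuples.
    Returns the combined percentage for that candidate.
    """
    total_n = sum(n for n, _ in polls)
    return sum(n * share for n, share in polls) / total_n


# The two polls from the example: 800 respondents at 50%, 200 at 55%.
pooled = pool_polls([(800, 50.0), (200, 55.0)])
print(pooled)  # 51.0
```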

However, suppose we find out that the smaller poll might not have been conducted by the same method, but might have been conducted using some method that would have caused a skew to one of the candidates.  Then, unless we could establish how large that skew was, we should ignore the smaller poll.

Even if (per Jackman and Mansillo) a seat poll with sample size 750 should be treated as really having a sample size of 125, that alone is not enough reason to ignore it totally.  It might still be reasonable to allow it to nudge our opinion about the seat slightly one way or the other.
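
To put rough numbers on that, the sketch below applies the standard margin-of-error formula for a simple random sample to both the nominal and the deflated sample size. The function names are hypothetical, and the formula assumes a result near 50-50 and ignores any systematic skew, so treat the output as indicative only.

```python
import math


def margin_of_error(n: float, share: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error, in percentage points, for a simple random sample."""
    return 100 * z * math.sqrt(share * (1 - share) / n)


def effective_sample_size(n: float, inflation: float = 6.0) -> float:
    """Deflate a nominal sample size by a variance inflation factor.

    The default of 6 reflects the 'one-sixth' finding quoted above.
    """
    return n / inflation


nominal = 750
effective = effective_sample_size(nominal)  # 125.0
print(round(margin_of_error(nominal), 1))    # about 3.6 points
print(round(margin_of_error(effective), 1))  # about 8.8 points
```

A poll with a margin of nearly nine points is a weak signal, but a weak signal is still a signal.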

Seat Polls Vs Uniform Swing

If seat polls were really good face-value sources of information about upcoming results, one thing we should expect them to do is to out-predict a really simple model such as uniform swing.  The simplest version of a uniform swing prediction is to estimate a nationwide swing based on polling, and assume exactly that swing will occur in every seat.

I had a look at the 2016 seat polls to see if they did that.  I looked at 70 seat polls from 38 classic-2PP (ie Liberal/National vs Labor) seats taken during the (long) 2016 campaign itself.  I only looked at the 2PP estimates because that is what determines who is claimed to be winning the seat (thus, polls without a published 2PP estimate were ignored).  I ignored who commissioned the polls and what 2PP preferencing method each pollster used.

On this basis the average absolute error of these polls (ignoring which direction it went in in each case) was 3.21 points.  That means a margin of error (not the same thing) of about 6.3 points.

However, if we plug the national 2PP swing that happened (3.13 points) into a uniform swing model and match it to each of the seat polls (which means counting several seats more than once), the average error drops to 2.97 points.  That is, the seat polls performed worse than assuming there was no variation between the different seats at all.
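
For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch of the method. The seat figures are invented purely to exercise the functions and are not the 2016 data; the function and variable names are likewise hypothetical.

```python
from statistics import mean

# HYPOTHETICAL records, NOT the 2016 data:
# (previous government 2PP, seat poll government 2PP, actual government 2PP)
SEATS = [
    (53.0, 51.0, 49.5),
    (56.0, 50.0, 53.5),
    (51.5, 52.0, 47.0),
    (58.0, 53.0, 55.5),
]


def uniform_swing_prediction(previous_2pp: float, swing_against: float) -> float:
    """Apply the same national swing against the government to a seat."""
    return previous_2pp - swing_against


def mean_abs_error(predictions, actuals) -> float:
    """Average absolute error, ignoring the direction of each miss."""
    return mean(abs(p - a) for p, a in zip(predictions, actuals))


swing = 3.0  # invented national swing against the government
actuals = [actual for _, _, actual in SEATS]
poll_mae = mean_abs_error([poll for _, poll, _ in SEATS], actuals)
swing_mae = mean_abs_error(
    [uniform_swing_prediction(prev, swing) for prev, _, _ in SEATS], actuals
)
print(poll_mae, swing_mae)
```

Where a seat has been polled several times, it would appear once per poll in a list like this, as in the comparison above.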

Of course, that's not a fair test.  Nobody can know exactly what the national swing will be in advance, and if someone was doing the same thing using a national swing estimate that was a bit wrong, presumably they would make worse errors and might not beat the seat polls?

Well, actually, no.  In fact, most remotely reasonable 2PP swing estimates (and some remotely unreasonable estimates too!) would have still beaten the seat polls by the humble uniform swing method.  Any estimate of the national swing between 2.36 points (ie 51.15% 2PP for Coalition) and 5.39 points (ie 48.12% for Coalition) would have done the job.  The smallest error would have been 2.81 points, off a swing estimate of 4.10 points (49.41%), on account of the seat polls including a few seats where the Government got mugged by monster swings.

Free Data Outperforms Seat Polling!

So suppose you were a newspaper staffer and you commissioned a seat poll of one of these seats in 2016.  You then wrote up an article about the seat using no source of predictions except for the seat poll.  It turns out that on average, in 2PP predictive terms, you spent money to obtain a 2PP result from a pollster that was less accurate than if you'd just said "National polling aggregates say there is a 3% swing against the government, based on which the 2PP result expected in (seat blah) is (blah)".  If being accurate in predicting election results is what media are purchasing seat polling for, then they're wasting their money on achieving a worse result.  (And lest anyone think 2016 was especially bad for seat polling, 2013 was much worse.)

Moreover, a uniform swing model is one of the more primitive models freely available.  On average, it in turn will be outperformed by models that also include retirement and sophomore effects for sitting members, and possibly statewide federal polling breakdowns (though I haven't tested the latter over multiple elections).

I don't think media sources that commission polling really care that much about this - I think they mainly commission polls to have something interesting to report, and that news sources might even prefer an inaccurate but startling result to an accurate but boring one.  Some media (or activist groups) might well even be happy if their seat polls tended to skew to one side or other by a point or two.  But if news sources do care about using seat polls to do forecasting then they need to think about using them correctly.

How Seat Polls Should Be Used

Seat polls are unreliable data, but that doesn't mean they are worthless data.  In the case of the 2016 election, it's easy to test this by comparing the predictions of (i) the seat polls and (ii) the uniform swing model with (iii) a hybrid model that is a weighted average of both.

So, plugging in the actual 2016 swing of 3.13%, we already know the uniform swing model beats the seat polls when it comes to average error.  However, the hybrid model using each given seat poll in turn beats the uniform swing model provided that the weighting given to the seat polls is not more than 77%.  The best result in that case (an average error of 2.85 points) would have come from giving the seat polls a 52% weighting and the national swing model a 48% weighting.  By playing around with both the national swing and the seat poll weighting it's possible to get the average error down a little further (eg a 4.4% swing with a 41% weighting for the seat polls gets it down to 2.77 points).

However, the exact "best" assumption set for an election in retrospect doesn't mean anything (and using a particular election to find what would have been the best value for that election, then applying it in the future, creates a big danger of overfitting).  The point is that using two sources of imperfect data (seat polls and some kind of national or state swing based model - hopefully a better one than just uniform swing) is likely, with any reasonable set of assumptions, to work better than just using one.
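
A hybrid of this kind is just a weighted average, and the effect of different weightings can be scanned with a few lines of Python. The records below are invented for illustration and are not the 2016 data, and the function names are hypothetical. Any "best" weight found this way describes only the data fed in, which is the overfitting danger just mentioned.

```python
from statistics import mean

# HYPOTHETICAL records, NOT the 2016 data:
# (seat poll 2PP, uniform swing 2PP, actual 2PP)
RECORDS = [
    (51.0, 50.0, 49.5),
    (50.0, 53.0, 53.5),
    (52.0, 48.5, 47.0),
    (53.0, 55.0, 55.5),
    (47.0, 51.0, 48.0),
]


def hybrid(poll: float, model: float, poll_weight: float) -> float:
    """Weighted average of a seat poll and a swing-model prediction."""
    return poll_weight * poll + (1 - poll_weight) * model


def hybrid_mae(records, poll_weight: float) -> float:
    """Average absolute error of the hybrid for a given seat poll weight."""
    return mean(abs(hybrid(p, m, poll_weight) - a) for p, m, a in records)


def scan_weights(records, steps: int = 20):
    """Return (weight, average error) pairs for weights from 0 to 1."""
    return [(k / steps, hybrid_mae(records, k / steps)) for k in range(steps + 1)]


best_weight, best_mae = min(scan_weights(RECORDS), key=lambda pair: pair[1])
print(best_weight, round(best_mae, 2))
```

A weight of 0 reproduces the swing model alone and a weight of 1 reproduces the seat polls alone, so the scan shows directly whether any mixture beats both.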

So, for instance, supposing that modelling done without seat polling suggests Party X is likely to win 54:46, but a seat poll shows Party Y leading 51:49, a news report on the seat poll shouldn't say that the seat poll shows Party Y headed for victory.  Rather, it should say that the seat poll result, while inconclusive, raises a question about how secure the seat is for Party X.

On the other hand, if the existing modelling suggests Party X is expected to win 51:49 but a seat poll shows Party Y ahead 58:42, a news report on the seat can say that Party Y appears to be headed for a much stronger than expected result in a seat which established modelling does not provide much of a guide to.

And if the existing modelling says, say, 52.5:47.5, but the seat poll is 52:48 the other way, then the correct reading of the two in combination may be that the seat is anyone's guess.
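
The three scenarios above can be worked through numerically. The sketch below uses an equal weighting of modelling and seat poll, which is an assumption chosen for simplicity and not a recommended value; the function name is hypothetical and all figures are Party X's 2PP.

```python
def blend_2pp(model_x: float, poll_x: float, poll_weight: float = 0.5) -> float:
    """Blend modelled and polled 2PP for Party X.

    ASSUMPTION: the default 50% seat poll weight is illustrative only.
    """
    return poll_weight * poll_x + (1 - poll_weight) * model_x


# (modelled 2PP for Party X, seat poll 2PP for Party X)
scenarios = {
    "modelling 54:46 to X, poll 51:49 to Y": (54.0, 49.0),
    "modelling 51:49 to X, poll 58:42 to Y": (51.0, 42.0),
    "modelling 52.5:47.5 to X, poll 52:48 to Y": (52.5, 48.0),
}

for label, (model_x, poll_x) in scenarios.items():
    print(label, "->", blend_2pp(model_x, poll_x))
# Blended Party X 2PP: 51.5, 46.5 and 50.25 respectively.
```

On these assumptions Party X stays narrowly ahead in the first case, falls clearly behind in the second, and is left at close to 50-50 in the third, matching the readings suggested above.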

Is Repeat Seat Polling More Accurate?

One might also normally expect that where there are multiple seat polls of the same seat, averaging their samples will provide a more accurate result than otherwise.  Therefore if we had a result that was not what modelling off the national polls expected, that result would be weighted more highly if it was the average of five or six different seat polls rather than just one.  This is what I assumed in my own 2016 modelling of individual seat results, so that when a seat had had many seat polls, my prediction was mainly determined by them.

Unfortunately 2016 just didn't support that view.  Bass was polled four times and the average of the four 2PPs was wrong by 7.6 points.  Macarthur was polled three times and the average was wrong by 8 points.  Dobell was polled four times and the average was wrong by 4.06 points.  Lindsay was polled six times and five of the six polls had the wrong winner, with the average still being wrong by 2.6 points (which may not sound like much but is much higher than would be randomly expected).  Overall there was no difference in 2016 between the error for single seat polls in seats polled only once (2.83 points) and the error for the average of multiple polls in seats polled more than once (2.82 points).  Which is another way of saying that individual polls in seats that received more attention tended to be worse than those in less heavily polled seats!
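
To see what "randomly expected" would look like, the sketch below computes the 95% margin of error for one poll and for six pooled polls, treating them as independent simple random samples. The sample size of 600 per poll is an assumption made for the example (actual sample sizes are not given here), and the function name is hypothetical.

```python
import math


def pooled_margin_of_error(sample_sizes, share: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error, in points, for pooled independent random samples."""
    total_n = sum(sample_sizes)
    return 100 * z * math.sqrt(share * (1 - share) / total_n)


# ASSUMPTION: 600 respondents per poll, purely for illustration.
one_poll = pooled_margin_of_error([600])
six_polls = pooled_margin_of_error([600] * 6)
print(round(one_poll, 2))   # about 4.0 points
print(round(six_polls, 2))  # about 1.63 points
```

If sampling noise were the only problem, six polls should shrink the margin by a factor of about 2.4. That the 2016 averages did no better than single polls suggests the errors were shared between polls of the same seat and not independent.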

Cases Where Seat Polling Should Be More Useful

Seat polling should in theory be more useful than otherwise in cases where the modelling otherwise available is worse.  The first example of this is non-classic contests.  It is very difficult to model Coalition vs Independent or Labor vs Green contests in particular seats from the state of national polling.  Sometimes, as with the Nick Xenophon Team in 2016 or One Nation in the 2017 Queensland state election, one can have a go at it by using a combination of polling and the results of some other election.  Unfortunately, here the track record of seat polling seems to be even worse than for classic seats.  Independent challengers sometimes surge as polling day approaches, their profile increases and the feeling of an upset grows.  Labor vs Green (and for that matter Liberal vs Green) seats are often inner-city seats with high levels of enrolment churn and a lot of uncontactable voters.

The other example, since the matter is so topical, ought to be by-elections.  So for instance a current challenge is to try to predict Longman and Braddon, which are both opposition-held seats where both government and opposition are contesting.  However, the standard deviation of 2PP swings in such contests (historically) is six points, so the margin of error on the average swing to oppositions goes into double digits.  Even taking out factors that explain some of the variation (such as whether the federal government is polling well or poorly at the time), a seat poll taken reasonably close to the by-election should still be more accurate than such a model.
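
A back-of-envelope comparison makes the point. The sketch below sets the 95% margin implied by a six-point standard deviation against the margin of a seat poll whose sample has been deflated to one-sixth. The nominal sample of 750 is borrowed from the earlier example and is an assumption, as are the function names and the treatment of swings as roughly normal.

```python
import math

Z_95 = 1.96  # multiplier for a 95% interval under a normal assumption


def history_margin(swing_sd: float) -> float:
    """95% margin for one by-election swing around the historical average."""
    return Z_95 * swing_sd


def deflated_poll_margin(nominal_n: float, inflation: float = 6.0) -> float:
    """95% margin, in points, for a seat poll with its sample size deflated."""
    effective_n = nominal_n / inflation
    return 100 * Z_95 * math.sqrt(0.25 / effective_n)


print(round(history_margin(6.0), 1))        # about 11.8 points
print(round(deflated_poll_margin(750), 1))  # about 8.8 points
```

Even on that pessimistic treatment of the poll, its margin is narrower than the one from by-election history alone.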

While ReachTEL's disaster in Darling Range is getting a lot of bad social media press at the moment, it is worth bearing in mind that the company's poll of Canning after the removal of Tony Abbott had a 2PP error of less than two points, and this was also true of both its polls in Bennelong.

The Mayo by-election is interesting because it is both a by-election and a non-classic contest, making it completely unmodellable by normal means.  The best one could attempt by way of a model was to assume that the Centre Alliance vote might decline in accord with what happened in the SA state election (on which basis Rebekha Sharkie could have been in trouble).  However multiple polls showing very large leads for Sharkie are enough evidence to destroy this narrative and replace it with a view that Sharkie should easily hold the seat.

3 comments:

  1. "we know seat polls are pretty bad, and there are plenty of reasons being suggested as to why that might be so". Could you do a page summarising those reasons, and giving your evaluation of which ones are plausible, please Kevin? How can random, or well-selected, samples of 1600 people across the country be somewhere near accurate while similar samples from a smaller geographic area have much bigger errors? Nothing I learned in Theory of Stats in 1958 (yes, 1958!) does anything to explain that. (Of course the examples in that course were things like the diameter of ball bearings rather than the idiosyncratic things called people, but even so - same sample size should give same MoE - shouldn't it???)

    Replies
    1. Excellent suggestion. I'll aim to do that in the near future (maybe the next week or so).

  2. I'll be looking forward to it. (And not that it matters but I think I did Stats in 1959 not 58!)
