But now ...
This was, in general, a pretty good election for the pollsters (most of them anyway) and a bad election for those who wanted to outguess them.
For basic information about the different polls and their methods, and a lot of general stuff about polling methods and debate surrounding them, you may find the recently updated Field Guide To Australian Opinion Polls useful.
I'll consider pollster quality in three categories:
1. Best Final Poll
A lot of prestige is attached to getting a final poll right, because it is the one time when it is easiest to compare the actual results with the polls and work out which pollster was closest. But in a way it's a bit silly, since the last poll is also the easiest one to get right: it is the one for which the most other polls are available to enable fine-tuning of assumptions.
There are two important tasks for a final poll: getting the primary votes of the various forces right, and coming up with a good 2PP estimate by whatever method. Some pollsters polled the PUP vote (and the vote for certain other parties) separately from the generic Others category, but the PUP vote surged in the final days and was difficult to poll accurately, so I have chosen to treat all polls the same for this comparison by measuring primary totals as Coalition, Labor, Green and (all) Others. I've considered the 2PP the most important figure of all, so I have attached quadruple weight to it in the calculation (meaning that an error in the 2PP carries the same weight as the same error made in each of the other four figures at once). The indicator of error used is otherwise a common one in assessing the accuracy of mathematical models: root mean square error (RMSQ).
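For those who like to see the arithmetic spelled out, here is a rough sketch of that weighted error calculation in Python. The poll and result figures are placeholders rather than any pollster's actual numbers, and the normalisation by total weight is my own assumption about how to express the final figure:

```python
from math import sqrt

def final_poll_error(poll, result, weight_2pp=4.0):
    """Weighted RMSQ error across the four primary figures plus the 2PP,
    with the squared 2PP error given quadruple weight."""
    primaries = ["COA", "ALP", "GRN", "OTH"]
    squared = [(poll[k] - result[k]) ** 2 for k in primaries]
    squared.append(weight_2pp * (poll["2PP"] - result["2PP"]) ** 2)
    # Divide by the total weight (four primaries plus the 4x-weighted 2PP)
    return sqrt(sum(squared) / (len(primaries) + weight_2pp))

# Hypothetical poll vs hypothetical result (not real figures)
poll   = {"COA": 45.0, "ALP": 34.0, "GRN": 9.0, "OTH": 12.0, "2PP": 53.0}
result = {"COA": 45.5, "ALP": 33.5, "GRN": 8.5, "OTH": 12.5, "2PP": 53.5}
print(round(final_poll_error(poll, result), 2))
```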
Only the last poll published by each pollster is used, even if that pollster published polls by a range of formats. Only the pollster's headline method of computing the 2PP is used, whether I think it was a good one or not. Also, the pollster's published figures are used, even if the pollster may have had more accurate figures that they rounded.
Some may also notice that there is no mention of sample size in these calculations. There's a reason for that: pollsters should strive to minimise error by using large samples. I'm not interested in the argument that a given poll was within its margin of error and should hence be considered accurate, if the only reason it was within MOE is that it used a small sample. Using big sample sizes on election eve is an important part of getting a good result.
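As a reminder of why sample size matters so much, here is the textbook margin-of-error calculation (which, if anything, understates real polling error, since it ignores design effects and house effects):

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points, for a simple
    random sample of size n on a proportion near p."""
    return 100 * z * sqrt(p * (1 - p) / n)

for n in (400, 1400, 3500):
    print(n, round(margin_of_error(n), 1))
# roughly +/-4.9 points at n=400, +/-2.6 at n=1400, +/-1.7 at n=3500
```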
Here's the outcome:
Possibly the use of RMSQ is a bit harsh on errors of this kind, but using the plain average of errors without squaring them (with the same quadruple loading for the 2PP) compresses the ratings without altering the order, except that AMR and Galaxy become tied.
Overall, the final Newspoll primary and 2PP poll was extremely accurate; AMR, Galaxy, ReachTEL, Morgan and Nielsen were all good to varying degrees (Morgan had a brilliant 2PP off mediocre primaries); Essential's result was just plain mediocre; and the Lonergan mobile-only exercise was an interesting experiment but predictively a disaster. The pollsters taken together (excluding Lonergan) had an average skew of 0.28 points to Labor in their final 2PPs. It's notable that every pollster overestimated the Green vote, a long-running theme in Australian state and federal polling.
I declare Newspoll the easy winner in this category.
2. Best 2PP Tracking In The Rudd-Return Era
This category measures how closely each pollster's output tracked the general body of polling released in the leadup to the election, specifically from the return of Kevin Rudd as Prime Minister, since that is the period over which I kept data.
There are three things I would ideally like a pollster to do:
1. To accurately measure the 2PP through the period measured, so that not only is the final poll near the mark but also the pollster's output throughout is a good guide. A pollster that produces ratings that are often close to the aggregated average is more useful than one that bounces too much. If the average is steady at 50% 2PP then 49-51-51-50-49 is better than 47-53-54-48-48.
2. To produce readings that do not show a bias to either side. A pollster that has an average close to the aggregated average is more useful than one that has an average above or below it. If the average is steady at 50% 2PP then 49-51-50-49-51 is better than 50-51-51-51-51.
3. To avoid excursions away from the aggregated average. If the average is steady at 50% 2PP, then 49-51-49-51-49-51 may be bouncy, but is better than 49-49-49-51-51-51 which creates a false impression of a sudden shift in the middle.
Providing a 2PP figure is not just a polling exercise but also a modelling exercise. While the usual standard is to use preferences from the previous election (which in this case resulted in overestimates), it is up to the pollster to decide how to convert primary votes into 2PPs and also whether to publish 2PPs to decimal places. Although in some cases it is possible to convert the primary figures published into 2PPs to decimal places, I am only using the 2PP figure published.
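To make the modelling side of this concrete, here is a minimal sketch of the last-election-preferences approach to converting primaries into a Coalition 2PP. The preference flows below are illustrative placeholders only, not the actual 2010 flows any pollster applied:

```python
def coalition_2pp(primaries, pref_flow_to_coalition):
    """Estimate the Coalition 2PP from primary votes by splitting each minor
    party's vote according to an assumed preference flow."""
    coalition = primaries["COA"]
    for party, vote in primaries.items():
        if party not in ("COA", "ALP"):
            coalition += vote * pref_flow_to_coalition[party]
    return 100 * coalition / sum(primaries.values())

primaries = {"COA": 45.0, "ALP": 33.0, "GRN": 9.0, "OTH": 13.0}  # hypothetical poll
flows     = {"GRN": 0.20, "OTH": 0.55}  # assumed flows, for illustration only
print(round(coalition_2pp(primaries, flows), 1))  # prints roughly 54 (Coalition 2PP)
```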
To measure aspect 1 I've used RMSQ error against my aggregate, as in the first section. To measure aspect 2 I've simply used the average signed error of the Coalition 2PP vote compared to my aggregate over the full campaign (expressed as a positive or negative figure - the Skew column below). To measure aspect 3 I've taken the cumulative error of each set of three consecutive polls by the same pollster and found the RMSQ error of those (RMSQC). I've also found a rating which is a single compound of the three forms of error, through the probably gimmicky method of taking the RMSQ of the three error ratings - the lower the rating the better. (It doesn't really matter, since the compound produces the same order as aspects 1 and 3, which also produce the same ordering as each other.)
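For anyone who wants to check my working, here is a sketch of those three measures, run on made-up numbers against a flat aggregate rather than the real series (the example readings are the 49-51-51-50-49 sequence from point 1 above):

```python
from math import sqrt

def tracking_measures(poll_2pps, aggregate_2pps):
    """Return (skew, RMSQ, RMSQC, compound) for a pollster's Coalition 2PP
    series against the aggregate readings at the same dates."""
    errors = [p - a for p, a in zip(poll_2pps, aggregate_2pps)]
    skew = sum(errors) / len(errors)                       # signed average: + means lean to Coalition
    rmsq = sqrt(sum(e * e for e in errors) / len(errors))  # per-poll tracking error
    runs = [sum(errors[i:i + 3]) for i in range(len(errors) - 2)]
    rmsqc = sqrt(sum(r * r for r in runs) / len(runs))     # error of three-poll cumulative runs
    compound = sqrt((skew ** 2 + rmsq ** 2 + rmsqc ** 2) / 3)
    return skew, rmsq, rmsqc, compound

polls = [49.0, 51.0, 51.0, 50.0, 49.0]   # hypothetical pollster readings
agg   = [50.0] * len(polls)              # hypothetical flat aggregate
print(tracking_measures(polls, agg))
```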
The Skew column measures skew to Coalition; a negative figure means skew to Labor. Here, the much-lampooned Essential actually came up with the least skewed average in this period, although Galaxy was also very close to neutral. But while Galaxy tracked the aggregate with uncanny precision, Essential was one of the worst pollsters in terms of average error and average cumulative three-poll error. To say that it was accurate because it was unbiased on average is a little bit like saying that a clock stopped at midday displays the time more accurately across a 24-hour period than one that is ten minutes slow.
Newspoll doesn't do nearly as well on this one as on the final poll question. It generally displayed a lean to the Coalition during the election leadup, though none of its polls were more than two points from the aggregate. It's probable that Newspoll's primary vote tracking was actually more accurate than many (if not all) of the others, but that the difference in preference behaviour between the 2013 election and the 2010 election meant that Newspoll's modelled 2PPs were a little too high. In this sense their performance was better than the above table shows.
Galaxy, however, was no more than 0.6 points from the aggregate on any of its polls, and was also the only pollster for which the average cumulative error was much less than the average single-poll error. Its performance is so good that it should probably be tested for drugs. Nielsen also performed very well from a limited sample, while Morgan's respondent-allocated readings for the SMS/online/face-to-face version of multi-mode were completely useless, and Morgan's last-election-preference readings for the same polls were not too great on average either.
That said, if we just look at Morgan's five three-mode polls after the election was called, their last-election preference 2PPs were brilliant, with a skew of -0.2 and mean errors better than Galaxy's.
Also, this category is limited to pollsters who produced at least three polls using the same methods. Morgan produced two polls late in the campaign that did not use Face-to-Face (one used SMS, online and phone surveying, while the other was described simply as "Telephone and Internet" and may have been the same). Had there been a third such poll it is likely that the respondent-allocated version of that method would have scored very well on this assessment.
So, while Morgan's performance considered over the whole period from the return of Rudd to the election was ALP-skewed and uneven, that doesn't tell the whole story. The problem with Morgan is that with so many methods, and two different preference-reporting conventions for each method, it's hard to know which one poll-watchers should follow. Respondent-allocated preferences, which performed poorly, are the company's chosen headline figure.
Anyway, Galaxy is the Usain Bolt of this category.
3. Best Seat And Local Multi-Seat Polling
This category measures the quality of individual seat and multi-seat polling, both of which were very contentious at this election because they differed from the national polls so much. Here, there are again competing objectives:
1. To correctly project the seat winners.
2. To provide a basis for a correct projection that takes into account changes between when the poll was published and election day.
3. To provide a fairly accurate indication of the 2PP result of the seat, again with changes between the poll date and election day considered.
However, because of the risks involved in single-seat polling, especially small sample sizes, it would be a bit harsh to use RMSQ in these assessments. I'm therefore instead using just the average of raw errors to get a bias figure for each pollster.
I also think it's important not to over-credit pollsters if they just pick easy seats to poll (or are commissioned to poll such seats) that end up being won by massive margins. Pollsters should be rewarded if they have polled difficult seats despite the increased risk of being wrong. Newspoll, for instance, shouldn't get more than a smidgin of credit for polling easy marks like Lyne and New England and getting them correct. So for each poll I measured the notional chance of getting it right based on the election result if one used an unbiased poll with a sample size of 580 (the median size for all the seat polls). The lower the total Easiness score, the harder the seats sampled by that pollster, in theory.
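In rough terms, the per-seat calculation looks something like this (a normal-approximation sketch with illustrative figures rather than my exact workings). Summing the per-seat chances across the seats a pollster attempted gives that pollster's Easiness total:

```python
from math import erf, sqrt

def chance_correct_call(winner_2pp, n=580):
    """Notional probability that an unbiased poll of n voters picks the right
    winner in a seat whose winning 2PP was winner_2pp (per cent), using a
    normal approximation to the sampling error."""
    p = winner_2pp / 100.0
    se = sqrt(p * (1 - p) / n)           # standard error of the sampled share
    z = (p - 0.5) / se                   # distance of the true result from 50:50
    return 0.5 * (1 + erf(z / sqrt(2)))  # probability the sampled share exceeds 50%

print(round(chance_correct_call(58.0), 2))  # a safe seat: almost a sure thing
print(round(chance_correct_call(51.0), 2))  # a knife-edge seat: far from certain
```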
I've used only those polls published during the election campaign proper. I've not used seats that were serious non-classic 2PP contests (Indi, Melbourne, Denison), since only ReachTEL attempted any of these in a non-commissioned poll during the campaign (and they are particularly difficult to get right).
I've calculated the percentage of seats that each pollster correctly tipped, but I've also calculated an adjusted 2PP for each seat (adjusted by the change in the national 2PP between the taking of the poll and the final election) and looked at how well pollsters did once their polls were adjusted in this manner. An example is Franklin (Tas). ReachTEL, which picked the correct winner in every seat in Tasmania, had Franklin at 51:49 to Labor at a time when the national 2PP was about 52.0 to Coalition. All else being equal, this seat would have swung by 1.5 points to the Coalition like the rest of the nation did, and the Liberals would have had a better than 50% chance of winning it had the original poll been perfectly accurate. A person using the Franklin poll to try to predict the outcome of Franklin, and taking the national swing into account, might thus have been fooled into tipping a Liberal victory (and indeed I overrode my own model when it did this). As it turned out, Labor retained Franklin easily with 55.1%.
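The adjustment itself is just a uniform-drift shift; in sketch form, with Franklin as the worked example:

```python
def adjusted_seat_2pp(seat_coalition_2pp, national_2pp_at_poll, national_2pp_result):
    """Shift a seat poll's Coalition 2PP by the movement in the national 2PP
    between the poll date and the election (a uniform-drift assumption)."""
    return seat_coalition_2pp + (national_2pp_result - national_2pp_at_poll)

# Franklin: ReachTEL had the Liberals on 49.0 when the national aggregate was about 52.0;
# the nation finished about 1.5 points higher, so the adjusted reading is roughly 50.5 -
# a notional Liberal lead in a seat Labor actually won with 55.1%.
print(adjusted_seat_2pp(49.0, 52.0, 53.5))
```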
To discourage fence-sitting unless strictly justified, if a poll had a 2PP result of 50:50, I counted that as a hit if the result was within two points of 50:50 and a miss otherwise. For batched seat polls I derived an expected number of seat wins from the given 2PP and allocated credit proportionally. So, for instance, if a poll points to a party winning 3 of 7 seats and they actually win 5 of 7, that counts as 71% correct (1 - 2/7) for that batch of seats.
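In code, that proportional-credit rule amounts to something like this:

```python
def batch_credit(expected_wins, actual_wins, seats_polled):
    """Credit for a batched multi-seat poll: full marks minus the share of
    seats by which the implied tally missed the actual tally."""
    return 1 - abs(expected_wins - actual_wins) / seats_polled

print(round(batch_credit(3, 5, 7), 2))  # the 3-of-7 vs 5-of-7 case: about 0.71
```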
To get a figure for the predictive usefulness of each pollster's seat tips, I took the average of their raw strike rate and the modified strike rate, and then scaled it up or down according to the difficulty of the seats surveyed by that pollster.
Finally I found the average skew (Coalition 2PP in poll - actual 2PP result) on both a raw and adjusted basis.
ReachTEL had the best raw strike rate, with 11/13 correct calls, but their adjusted strike rate is less good because of Franklin (mentioned above) and also Blaxland, which becomes a projected tie when adjusted for national 2PP tracking (Labor won this seat extremely easily). The set of ALP-vs-Coalition seats tackled by ReachTEL was also, on the final results, a little easier than those tackled by others. It can also be seen that Galaxy just shades JWS on adjusted score: while their raw score was lower, their adjusted strike rate was marginally better and they polled slightly harder seats. But overall, given the small sample sizes, the differences between Galaxy, Newspoll, JWS and ReachTEL in their success at predicting seat results are insignificant (and for Lonergan there aren't enough data). Also, while Newspoll emerges the winner on adjusted score, it is possible my methods of assigning partial credit for batched seat polls are on the kind side.
The differences in polling skew are more significant and decisive. Every pollster that conducted seat or local multi-seat polls skewed to the Coalition on average, and 2PP drift to the Coalition between when the polls were taken and the election made a lot of the seat polls look better than they actually were. On the whole, seat polls at this election were around 3.7 points too friendly to the Coalition at the time they were taken; the average raw difference between the seat polls and the election results, 2.3 points, flatters them because of that drift.
Seat polls are the most challenging to assess in this way. The argument is always advanced that all the pollster has captured is voting intention at the time and that if the result is different on election day that probably means one side carpet-bombed, sandbagged or just lost interest. But that applies to every pollster in a range of seats, and there must be some reason why some of them get much more accurate 2PPs than others. (And no, it's not about polling method exclusively - Galaxy were robopolling for their seat polls.)
It's not as clearcut as the first two categories but in my view Galaxy has earned the title of the best seat pollster at this election.
And now ... *drumroll* ...
Pollster Of The Election!
It's a little difficult to compare results across categories but Newspoll produced by far the best final poll while Galaxy produced by far the best tracking through the campaign. Although not all pollsters engaged in seat polling, Galaxy and Newspoll were the best two seat pollsters, with Galaxy displaying the least skew to the Coalition by nearly a point, albeit with a slightly less impressive rate of picking the winner. I also think that the section won most comfortably by Galaxy, 2PP tracking, is the most important of the three. It is therefore my view that Galaxy Research wins the title of best overall voting intention pollster of the 2013 federal election, with Newspoll a fairly close second.
Of the others, Nielsen tracked well but it would be nice if they had polled more often; ReachTEL's national results were good but they have work to do on local-level skewing (ditto the latter for JWS); AMR performed credibly given their limited experience; Morgan were a mixed and messy bag; and Essential were quite disappointing. As for Lonergan, the above omits one national poll that was right on the 2PP at the time, but I don't think that can save them from the dreaded wooden spoon.