Friday, August 3, 2018

"Margin Of Error" Polling Myths

A lot is said about "margin of error" in public discussion of opinion polls, and nearly all of it is wrong.

This is another piece I've been meaning to write for quite a while.  There are many other articles about this on the web, but I'm not aware of any that make all the points I'd like to make and make them in an Australian context.

The concept of "margin of error" is one that is commonly talked about in discussion of polls.  It is often used to (rightly) deflate breathless media babble about movements in polls that are supposedly caused by whatever the media pundit wants to see as the cause, but in practice are often nothing more than random variation from sample to sample.  Possum's "Trends, The Horserace and Random Numbers" (2012) was a classic debunking of that sort of commentariat nonsense, written in the days of the older, much bouncier Newspoll.

Unfortunately, of all the things that get talked about in polling, margin of error is probably the concept that best shows that "a little learning is a dangerous thing".  People grasp the basic principle and misapply it constantly - and are assisted by pollsters, the media and even the Australian Press Council's otherwise good (and too often disregarded) poll-reporting guidelines in doing so.



What is margin of error?  

Margin of error is a way of describing the uncertainty involved in random sampling.  If we want to know how common a certain character  is among members of a population we can sample random individuals one by one and keep a running total.   Early in the trial the total percentage of individuals that have been recorded as having that character will bounce around wildly but over time, if our sample is large enough, the percentage will gradually settle down.

Here's a fake-poll example of this.  Suppose that at a certain time Australians are exactly evenly split between the Liberal and Labor Parties.  A pollster does a poll of 1000 completely randomly selected voters, asking them which party they prefer and keeping a running percentage.  Every voter has a preference and answers the question.  As the poll progresses the running percentage might change something like this:


Early in this fake poll example the running percentage is wildly inaccurate. After two voters the Liberal Party has a running 2PP of zero and after five voters it is still only on 20%.  After 29 voters it has shot up to 58.7%.  Even after 200 voters it is still at 55.5%, but it gradually settles down closer to the correct 50%.  In this case it finished up at exactly 52%, for a two-point random sample error compared to the right answer.  As it happened, by sheer bad luck it was less accurate at the end than after about 700 samples, but it only got slightly worse towards the end. Early in the sample, if the running percentage hit the right answer by chance it would quickly bounce away from it.
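
For readers who want to watch this settling-down behaviour themselves, here is a minimal sketch of such a running tally. It assumes Python's standard `random` module is a fair stand-in for perfectly random sampling; the function name, seed and checkpoints are illustrative choices, and a run will not reproduce the exact figures quoted above.

```python
import random

def running_percentages(n_voters=1000, true_share=0.5, seed=1):
    """Poll n_voters one by one; return the running Liberal % after each voter."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    liberal = 0
    running = []
    for i in range(1, n_voters + 1):
        if rng.random() < true_share:  # this voter prefers the Liberal Party
            liberal += 1
        running.append(100.0 * liberal / i)
    return running

tally = running_percentages()
for checkpoint in (2, 5, 29, 200, 700, 1000):
    print(f"after {checkpoint:4d} voters: {tally[checkpoint - 1]:.1f}%")
```

Changing the seed gives a different path each time, which is the point: the early figures swing widely and the later ones much less.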

Because it's a random sample, it could have finished up more accurate than it did, or it could have been worse.  It's just as possible Labor could have been in front for most of the running tally.  It's possible even that all 1000 voters selected randomly could have been Labor voters, but it's astronomically unlikely.

Margin of error, as used to refer to polls, answers the following question: based on the number of items I've sampled, and the result I've got so far, how close to the true value do I have a 95% chance of being?  So, for instance, with a result of 55.5% after 200 voters, the pollster might be thinking that the voters prefer the Liberal Party.  But plugging it into a margin of error calculator (here's a nice one, here's a good Australian one or there are plenty of simpler ones around) the pollster finds that the margin of error on that result after that many samples is 6.9%, so a 50-50 result is still within the margin of error.  So it's too early to be even mildly confident that the voters really do prefer the Liberal Party.

After 1000 interviews, the Liberal Party is still in front (52-48) but the margin of error is 3.1%, so a 50-50 result is still within the margin of error.  If we said, based on these 1000 interviews, that we were confident the Liberal Party was more popular, we would be jumping to conclusions prematurely - and in this case, we'd be wrong.  However, if we still had 52-48 after 5000 interviews, we'd be getting more seriously confident (provided our methods were right) that the Liberal Party was the more popular choice.  The more respondents we sample, the closer to the right percentage we will usually expect to get.
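
The figures above can be checked with the standard textbook formula for a simple random sample. This is a sketch only: it assumes the usual normal approximation with a critical value of 1.96, and the function name is illustrative rather than taken from any of the calculators mentioned.

```python
import math

Z_95 = 1.96  # normal critical value for a 95% confidence level

def margin_of_error(share, n, z=Z_95):
    """Theoretical margin of error, in percentage points, for a simple random sample."""
    return 100.0 * z * math.sqrt(share * (1.0 - share) / n)

print(round(margin_of_error(0.555, 200), 1))   # 6.9
print(round(margin_of_error(0.52, 1000), 1))   # 3.1
print(round(margin_of_error(0.52, 5000), 1))   # 1.4
```

At 5000 interviews the theoretical margin has shrunk to about 1.4 points, so a 52-48 result would no longer have 50-50 inside it.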

Which sounds like (and is) a handy concept, but when it comes to the concrete details of applying it to opinion polling, there are a great many ways to go wrong.

Below I give examples of the kind of incorrect statements that are often made about margin of error in discussions of polling, whether by pollsters, the media or people following the polls.  In each case I explain why the statement is wrong.

1. "The poll of (insert number) voters has a margin of error of (insert percentage)"

Among the more reputable media sources it is common to see this form of reporting, but alas it is very misleading.  Firstly, it's oversimplifying the concept of margin of error, which actually varies depending on the percentage support for a party as well as the sample size.  Variation is higher for results around 50-50 (highest at exactly 50-50 if it is a pure sample of voter choice between two alternatives) and lower for results close to 0 or 100.  So for instance while the MOE for a sample size of 1000 and a 50% result is 3.1%, for a 10% result (say, what the Greens often poll) it's only 1.9%.  Some pollsters deal with this issue by reporting the MOE for a 50% result as the maximum margin of error, though uninformed readers might think that means the poll can't possibly vary more than that.

The MOE also varies a little based on the size of the population from which the sample is drawn, though that really only starts to bite when sampling small state-level seats.  And another source of minor error in working out the MOE is rounding - most pollsters will round their percentages to the nearest whole number, whereas the MOE concept applies to the unrounded figure.
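
Both effects can be seen in a short sketch. It uses the textbook formula plus the standard finite population correction; the electorate of 10,000 voters is a made-up figure chosen only to make the correction visible, and the function name is illustrative.

```python
import math

def margin_of_error(share, n, population=None, z=1.96):
    """Theoretical 95% margin of error in percentage points.

    If population is given, apply the finite population correction
    sqrt((N - n) / (N - 1)).
    """
    moe = z * math.sqrt(share * (1.0 - share) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return 100.0 * moe

print(round(margin_of_error(0.50, 1000), 1))  # 3.1
print(round(margin_of_error(0.10, 1000), 1))  # 1.9
# Hypothetical small electorate of 10,000 voters, sample of 1000
print(round(margin_of_error(0.50, 1000, population=10_000), 1))  # 2.9
```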

The bigger problem is that opinion polls in the real world violate the assumptions that margin of error calculations depend on.

Firstly, polls are not true random samples.  Some respondents are much easier to contact than others (whether the polling method is by landline phone, mobile phone, online or face to face).  Of those who can be contacted, some are more willing to take part than others.  Even if the pollster sees nothing unusual in terms of the demographic characters of those who take the survey (after controlling for age, gender and whatever else they decide to control for) it's still possible that those who are taking the survey vary from the norm.  For instance they may be more politically engaged and it may be that people with more interest in politics are more likely to vote for a given party.

Purely online polls like Essential don't even sample the whole population - they sample from a very large panel, and the various non-random ways these panels are assembled might also result in a political skew at certain times.  It may be that on given issues, a person who likes filling out surveys on the internet is more likely to prefer a given party's view than the public at large.   There isn't any way to know how much these issues might create errors in the final poll.

Secondly, pollsters in fact find that the respondents willing to take the survey are demographically unrepresentative.  For instance for landline polling they are far more likely to be older than younger, and one robopoll I've seen reports from (Lonergan) had twice as many female voters as male.  The pollster can and ideally should get around this by using quotas (eg once they have enough voters in certain demographics they then ask specifically for voters in the age and gender groups they're missing) but this is expensive and more time-consuming.  Alternatively, they can use scaling.  However, scaling (which weights some respondents more highly than others) means that variation in a small portion of the overall sample has a larger impact on the outcome, and this has the effect of reducing the effective sample size and hence increasing the effective margin of error.
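
One common way to put a number on the cost of scaling is Kish's approximation for effective sample size. The sketch below is an assumption-laden illustration, not a description of what any pollster actually does: the raw sample of 667 women and 333 men is invented to match the "twice as many female voters as male" case, and the weights simply rebalance it to 50-50.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

# Hypothetical raw sample: 667 women and 333 men, weighted to a 50-50 split.
n_women, n_men = 667, 333
target = (n_women + n_men) / 2
weights = [target / n_women] * n_women + [target / n_men] * n_men

n_eff = kish_effective_sample_size(weights)
print(round(n_eff))  # 888
```

On these made-up numbers a nominal sample of 1000 behaves more like one of about 888, and weighting on several variables at once would usually cost more.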

Thirdly, polling error can be caused by factors other than random noise, unrepresentativeness and scaling.  It can also be caused by faulty poll design - in the way questions are asked, the list of possible parties the respondent hears or sees, or (for a 2PP figure) the way the pollster distributes preferences from minor parties.

Fourth, while the points above suggest that polls should have a somewhat larger effective margin of error than their in-theory MOE, some polls don't actually behave like that.  Some show less poll-to-poll movement than would be expected even if they were truly random samples and there was no change in underlying voting intention.  This may be because such polls are not pure polls but are using unpublished modelling methods to adjust their results and make them less bouncy.  Or, depending on the poll's methods, it could be caused by a design issue such as over-frequently re-polling the same people.  Whatever the cause, we call this sort of poll "underdispersed".

So there is a very great deal wrong with the most basic statements of "margin of error" as applied by pollsters.  It's a theoretical figure, based on an inapplicable model, that may not have anything to do with reality.  The best takeaway is that all else being equal, a large sample by a given pollster is better than a small sample by that pollster.  However a large sample by a bad pollster isn't necessarily better than a smaller sample by a good one.

Also see my comments on this as applied to seat polls in Why Is Seat Polling So Inaccurate?

2. "The vote for (insert party) increased by (insert percentage) which is (within/outside) the poll's margin of error"

This is also quite often seen from more reputable media, who try to stop readers from jumping to conclusions that a poll shows real movement when the movement might well be random noise.  Unfortunately this one doesn't scrub up too well either.  Margin of error is used for determining the 95% confidence interval of a single poll - not for determining the statistical significance of a statement about change between two different polls.  When you are comparing two polls, there is the possibility that one varies from the real value by close to the margin of error one way, and the other varies from it by a modest amount the other way - the net difference can be larger than the margin of error of one poll, but it can still be down to an unremarkable level of random noise across the two of them.

So for instance, if one perfectly conducted poll with a sample size of 800 returns a 50-50 result and another returns a 54-46 result, each poll is outside the margin of error of the other.  But the change from poll to poll is actually not statistically significant even at the "weak" significance level (p=0.05).  And the use of such p-values to provisionally demonstrate significant change is contentious in statistics anyway, and needs to be handled carefully.
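
The 800-sample example can be checked with a standard two-sided z-test for the difference between two independent proportions. This is a sketch assuming both polls are simple random samples; the function names are illustrative.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_poll_z_test(share_a, n_a, share_b, n_b):
    """Two-sided z-test for a difference between two independent proportions."""
    se_diff = math.sqrt(share_a * (1 - share_a) / n_a + share_b * (1 - share_b) / n_b)
    z = (share_b - share_a) / se_diff
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

z, p = two_poll_z_test(0.50, 800, 0.54, 800)
print(round(z, 2), round(p, 2))  # 1.6 0.11
```

The four-point gap exceeds the roughly 3.5-point margin of either poll alone, yet the p-value of about 0.11 is above 0.05.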

If one watches polls over a long time, sooner or later an apparently significant shift is highly likely to happen by chance anyway, even when there is no change in voting intention.  Sudden large shifts that are connected to events that seem like they should change people's minds are one thing, but sudden jumps out of the blue might still be meaningless poll-to-poll bouncing.
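
A rough feel for how quickly this happens comes from the usual multiple-comparisons arithmetic. The sketch assumes each comparison is independent with a 5% false-alarm rate, which is a simplification, since consecutive poll-to-poll changes share a poll and so are not strictly independent.

```python
def chance_of_false_alarm(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent tests is 'significant' by chance."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

for n in (1, 10, 20, 50):
    print(n, round(chance_of_false_alarm(n), 2))
```

Under these assumptions, twenty comparisons give roughly a 64% chance of at least one apparently significant shift with no real change at all.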

3. "All the polls are within the margin of error of 50-50 so the 2PP race is a statistical tie"

"Statistical tie" is a largely American term that refers to the parties being close enough that there is not statistically significant evidence one party is ahead.

As noted above, margin of error refers to a single poll.  If you have multiple polls that have similar results then this is like having a much larger combined sample, and the effective margin of error of the body of combined polls is smaller than for a single poll.  If you have many polls showing the same party modestly ahead (say 52-48) and those polls don't have any design flaws that would cause a skew to that party, then there is significant evidence that that party really is ahead.  If the average result of many polls is very close, you might not have statistically significant evidence that that party is in front, but it may still be the most likely explanation.  After all, unlike my fake poll example at the top, there's no reason for a strong assumption that the underlying score is in fact 50-50.
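
A minimal sketch of the pooling idea follows. It treats several polls as one big simple random sample, which assumes identical methods, no house effects and no change in voting intention between them; the five 52-48 polls of 1000 voters are hypothetical.

```python
import math

def pooled_estimate(polls, z=1.96):
    """Pool (share, sample_size) pairs as if they were one big simple random sample."""
    total_n = sum(n for _, n in polls)
    share = sum(s * n for s, n in polls) / total_n
    moe = z * math.sqrt(share * (1 - share) / total_n)
    return share, moe

# Five hypothetical polls of 1000 voters, each showing 52-48
share, moe = pooled_estimate([(0.52, 1000)] * 5)
print(round(100 * share, 1), round(100 * moe, 1))  # 52.0 1.4
```

Each poll on its own has 50-50 inside its theoretical margin, but the pooled margin of about 1.4 points does not.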

4. "As the fall in party X's vote was well within the margin of error, it's just random movement and there's nothing to see here"

This sort of claim is very common.  It has the same problem as mentioned in statement 2, but even after that's addressed, it has another problem.  Testing for significant differences between two samples determines whether there's significant evidence of a change in the underlying vote.  A negative result doesn't prove there is no change; it just shows that there is not enough evidence for confidence that underlying voting intentions have changed.

Actually, all non-tiny shifts in the average of recent polling (as corrected for recency, pollster quality and the house effects of given poll series) are meaningful in a way.  The reason for that is that we can never truly know what the underlying vote share of the political parties is.  Even on the day after an election (when nobody much polls anyway), it may be significantly different to what it was on election day.

When a new result comes out that is different to other recent polls, there often isn't enough evidence to know that voting intention has changed.  That's not the point - the point is that the best estimate of current voting intention is different, and that can affect projections of party chances at an upcoming election.

For instance, suppose the only thing I know about a contest is a poll by a reliable pollster with a sample size of 1000 and the poll result is 50-50.  Now a new poll comes out and one of the parties is ahead 52-48.  That is not significant evidence that voting intention has changed between the two polls.  But while my best estimate of the underlying voter intention was 50-50 after the first poll, after the new evidence from the second, my best estimate will be at least 51-49 (perhaps slightly higher because the second poll is more recent.)

Indeed it's quite common for those who like to naysay modest changes in polls to ignore a sustained and significant decline by their party over some time, just because that decline consists of poll-to-poll changes that are each, in themselves, insignificant.

5. "This poll is different from the average of all other polls out at the moment by way more than its margin of error.  It's plainly a rogue poll, so I'll ignore it."

A rogue poll is a poll whose result is further from the true value than the poll's theoretical margin of error.  In theory this should happen to 1 poll in every 20.  In practice it should be somewhat commoner because of the issues mentioned in point 1.  In underdispersed polls (also see point 1) it very rarely or never happens.

We certainly shouldn't take a rogue-looking poll as a reliable sign of where voting intention is at: if a poll differs sharply from the average of all other polls at the time that's a very good sign that that poll isn't right.  But if its rogueness is down to nothing but sample error, then its data are still valid and should still be taken on board to some degree.  Consider the following string of polls by the same pollster:

50-49-57-51-50

Unless there is something amazing going on that could cause voting intention to go up and down that quickly, then the 57 is a rogue poll.  But provided that the methods were the same each time, if we pool all five polls together the average is 51.4 instead of 50.  Perhaps we shouldn't give the rogue poll equal weighting to the others, since its rogueness might have been caused by a spreadsheet error, or a short-lived interviewer quality issue if it is a live phone poll.  But to give it no weighting at all is to throw away data without being sure there was anything wrong with that data collection.
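
The three options (full weight, no weight, reduced weight) can be compared directly. In this sketch the half weight given to the rogue poll is an arbitrary illustrative choice, not a recommended value.

```python
def weighted_average(values, weights):
    """Weighted mean of poll results."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

polls = [50, 49, 57, 51, 50]

equal = weighted_average(polls, [1, 1, 1, 1, 1])    # rogue poll counted in full
dropped = weighted_average(polls, [1, 1, 0, 1, 1])  # rogue poll ignored
half = weighted_average(polls, [1, 1, 0.5, 1, 1])   # 0.5 is an arbitrary illustrative weight

print(round(equal, 1), round(dropped, 1), round(half, 1))  # 51.4 50.0 50.8
```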

The other issue here is that sometimes sharp movements in polling have obvious causes.  When something major has happened between the previous polls and a recent one that is very different to them, the recent poll might just be the first poll to have captured the response to that.  Caution is necessary here because most things that are thought of as influencing voting intention don't.  However we can only be sure a poll is rogue after seeing the polls after it, as well as the polls before.

6. "This poll says 53-47 but its margin of error is 4.2 points.  It might just as easily be 49-51 or 57-43.  What a useless poll!" 

The problem here is the "just as easily".  All else being equal, polls will follow a bell-curve distribution, meaning that while values about one MOE away from the published result are within the 95% confidence range, they're also a lot less likely than values closer to the published result.  Assuming this poll was all we knew about a race and the poll was perfectly conducted, there would still be a 92% chance that the party shown as ahead by the poll was indeed ahead.
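
The 92% figure can be reproduced with a short sketch. It assumes the poll result is normally distributed around the true value, that the quoted margin corresponds to 1.96 standard errors, and that nothing else is known about the race; the function name is illustrative.

```python
import math

def chance_leader_ahead(lead_share, moe_points, z=1.96):
    """Probability the true share exceeds 50%, treating the estimate as normally distributed."""
    std_error = moe_points / z                    # convert the 95% margin to a standard error
    z_score = (lead_share - 50.0) / std_error     # distance from 50-50 in standard errors
    return 0.5 * (1.0 + math.erf(z_score / math.sqrt(2.0)))

print(round(chance_leader_ahead(53.0, 4.2), 2))  # 0.92
```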

7. "The polls in this election were mostly wrong by more than the margin of error. But that's just because the real margins of error are much larger than the pollsters say they are!"

This is a close relative of error 3.  Sometimes this is used to explain polling failures in cases where the polls expect a certain outcome and are nearly all wrong by a large amount in the same direction.  Even after allowing for real margins of error being larger than in theory, we'd expect random error to result in some polls being well above the outcome, some being well below, and some being more or less correct.  If the polls are heavily clustered on one side and well away from the result, there has been some form of systematic poll failure.  It might be caused by assumption errors common across all polls (a larger risk in voluntary voting systems), by very late voter intention change (unlikely) or by pollsters engaging in "herding" to make their polls more like those other pollsters are releasing.
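
A back-of-envelope sketch shows why one-sided clustering is suspicious. It assumes each poll's error is independent of the others and equally likely to fall on either side of the result, which is exactly the "random error only" scenario; the choice of six polls is hypothetical.

```python
def chance_all_same_side(n_polls):
    """Chance that n independent, unbiased polls all miss on the same side."""
    return 2 * 0.5 ** n_polls

print(chance_all_same_side(6))  # 0.03125
```

So if random error were the whole story, six polls all missing in the same direction would happen only about 3% of the time, however large the real margins were.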

8. (added) "This poll publishes data to one decimal place.  But its margin of error is 3% so it should round to the nearest whole number."

This misunderstanding results from misapplying the concept of significant figures, in which one doesn't report something measured by a device (like a set of scales) beyond the level that device can measure to.  One also typically avoids creating an impression of false accuracy by taking a number previously rounded to a small number of significant figures, dividing it by something and giving the result with far more decimals.

Pollsters often like to publish data to whole numbers, because using decimal points can look untidy and can lead to commentators banging on about trivial changes in voting percentages, like a party being up 0.3%.  But in doing so they're actually throwing data away and publishing an estimate of a party's standing that, all else being equal, is less accurate than it was before rounding.  And while the rounded figure isn't different to a statistically significant level from the original, it still describes a different distribution (see statement 4).  Poll numbers are the outcome of complex calculations involving three and four digit integers and scaling factors that may be expressed to several decimal places.  There is no set precision level for such an instrument and no reason data well below margin-of-error level shouldn't be retained.
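
The size of the accuracy loss can be sketched for an idealised case. The code below assumes a perfectly random sample of 1000 and a true share of 52% (both chosen only for illustration), and computes the exact root-mean-square error of the published figure with and without rounding to whole numbers.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def rmse_raw_and_rounded(true_share=0.52, n=1000):
    """Exact RMSE (in points) of the unrounded and whole-number-rounded poll result."""
    truth = 100.0 * true_share
    mse_raw = mse_rounded = 0.0
    for k in range(n + 1):
        prob = binomial_pmf(k, n, true_share)
        result = 100.0 * k / n  # unrounded percentage for k supporters out of n
        mse_raw += prob * (result - truth) ** 2
        mse_rounded += prob * (round(result) - truth) ** 2
    return math.sqrt(mse_raw), math.sqrt(mse_rounded)

raw, rounded = rmse_raw_and_rounded()
print(round(raw, 2), round(rounded, 2))  # 1.58 1.61
```

The loss is small next to the sampling error, but it is a loss all the same.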

I have found that some very educated people get this one wrong!

I may add more errors later.

So what should the media do?  

My suggestions are:

1. If you want to report on "margin of error", at least use the expression "theoretical margin of error" so that the audience has some idea the application of the concept is more than a little bit woolly.

2. Don't refer to "margin of error" when discussing poll-to-poll changes.  Instead, where the changes are minor (eg 1-2 points for major parties), note that the poll-to-poll change is "not statistically significant".  This is especially important when reporting, or making, any claim that any particular issue has caused or may have caused a difference between two poll results.

3. Avoid making claims that a party is "set to win", "on track for victory" or anything else confident-sounding off the back of close seat poll leads (51-49s and 52-48s especially).

4. Always note if a poll being reported on has different results to other recent polls covering the same seat or the same election.

5. Avoid discussing the poll-to-poll shift in a single poll without discussing the course of other polling over the same time.  This especially applies to polls on long orbits such as Ipsos and federal ReachTELs.

6. Always state the sample size and provide the full primary vote details (not just two or three parties and the 2PP) so that those interested can make their own calculations.  In the case of ReachTELs, include both the so-called "undecided" vote and the breakdowns of it on a leaning-to basis.

6 comments:

  1. Thanks Kevin for a brilliant article - your analysis is always very insightful and well-written

  2. Excellent analysis, Kevin. It would be nice if media outlets took note but not only do they make mistakes in understanding polling, it is at times deliberate as they'd like to push a narrative, like in the leadup to Super Saturday.

  3. Yep you are on fire. A definite must read. I will highlight it on monday.

    Well done and keep up the good work




  4. Great article Kevin
    I have given you some publicity via my modest v blog.

    Can I ask for the margin of error 'prize'?

    I nominate the OZ's article on the eve of the Braddon re-election that the Libs had a tracking poll of 500 that was favorable to the libs.
    The MOE was 4%. Tracking polls are usually daily so a lot of money wasted OR it was all propaganda.

    Keep it up

    Replies
    1. There would be so many equal winners in the silly reporting stakes that I would not know where to start. As for the Liberal tracking poll it turned out to be utter garbage, as they claimed to have bombed Garland's vote down from 9% to below 5% with their negative campaigning only for him to poll 10.6% anyway. I believe it was "released" solely to try to discourage internal tensions about the wisdom of attacking Garland in the way they did. Whether the figures were even genuine survey results who knows.
