What 'Studies' Actually Show

My wife has always had pretty bad morning sickness during pregnancy.  A few years back she was prescribed Zofran for the nausea and it worked pretty well for her.  Two years later, during her next pregnancy, she had more bad morning sickness and went back for more Zofran.  The doctor told her that no, that might cause heart defects, so she should lay off it.  Instead they gave her some B vitamin complex with a sleep aid.  That 'helped' in the sense that she could keep food down if she timed the sleep aid to knock her out right after the meal, but it didn't help anywhere near as well as the Zofran.

Naturally, after hearing about this I went straight to PubMed to find out how much of a risk we'd unwittingly taken during the last pregnancy.  I found the original publication saying Zofran might cause heart problems.  They were looking for something else (teratomas) in pregnant women taking Zofran and didn't find it.  Instead they found that children whose mothers took Zofran were more likely to be born with heart defects, and cautioned that this might be a problem.

I was suspicious of this type of result (for reasons that will become apparent at the end of this post) and looked to see if anyone had been able to replicate it.  Sure enough, there was a second study done a few years later looking specifically at whether Zofran caused heart defects and it didn't find anything.  A third study said the same thing.  I quickly dismissed the whole business.  The Zofran scare, to my mind, was debunked.

But wait.  What about that first study?  Doesn't it count for anything?

No.  I want you to intuitively understand why that is, because it's central to being able to dismiss nearly every bogus news article based on some study that "shows X causes Y", or that "you can do W to prevent Z", or whatever.  In order to get there, we need to wade through some basic principles of statistics, but trust me that it's worth it - even if you know the statistics already.  It's important to understand the basic assumptions in order to make sure we end up where the math tells us we're going. 

You might want to just trust the people who know how to do the math.  Unless they're running the calculations wrong (maybe they forgot to carry the three - but they're probably using computers that don't forget to carry 3s) they should come to the right answer.  But as with all math, the calculations are only as good as the assumptions you feed into them.  So I'm going to skip most of the math in the explanation below, focusing instead on the assumptions behind the math.  If the assumptions are wrong, no amount of math matters.  All you have to understand is the assumptions and you can determine whether the conclusion is legitimate without knowing how to do the math. 

Flip a Coin

You have a friend who loves to gamble.  He only gambles on coin tosses, his lucky side is heads, and he is only willing to use his own coin.  Let's call him Harvey Dent.  He asks if you want to bet a dollar on his next toss - he'll take heads of course.  Suspicious, you ask to see the coin, but he refuses.  You're pretty sure he's using a trick coin - heads on both sides - but how can you be sure?  You put down a dollar to test your hypothesis.

Heads.

That doesn't tell you much.  There was a 50% chance of heads from the beginning.  So you do the experiment again.  Heads again.  It's possible to get heads twice in a row.  In fact, there's a 25% chance of that happening with a fair coin.  You keep going with a third dollar, heads (12.5% chance), a fourth dollar (6.25% chance) and a fifth dollar (3.125% chance).  You could keep losing money to him, but by now you figure you've caught him cheating.  An honest coin would come up heads five times in a row only about 3% of the time, or 1 in 32.  You're confident enough after just five rounds to call him on his crooked ways and he confesses.
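If you want to check those numbers yourself, here's a quick Python sketch - nothing fancy, just the 0.5^n formula for n heads in a row with a fair coin:

```python
# Chance of a fair coin coming up heads n times in a row: 0.5 ** n
for n in range(1, 6):
    p = 0.5 ** n
    print(f"{n} heads in a row: {p * 100:.3f}% (1 in {round(1 / p)})")
```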

What you've done is essentially an exercise in Fisherian statistics.  You created a null hypothesis (that Dent is NOT a cheating cheater) and tried to disprove that hypothesis.  You computed how likely your observation would be if the null hypothesis were true - that number is the p-value - and set a threshold of 5% for deciding the result is too unlikely to be chance.  Any time you see a 'p-value', it means "the probability of getting results at least this extreme if nothing unusual is actually going on".  Scientists often get excited about a 'really low p-value', by which they mean a result that chance alone would almost never produce.  Just as it's possible to get heads five times in a row using an honest coin, it's possible to get heads 500 times in a row with an honest coin.  But it's really unlikely.  You could flip coins your entire life and never see heads 500 times in a row.  The p-value is what we use to help us know when to stop putting money on the table and accept the result as so unlikely it probably didn't just accidentally happen.  Often the threshold is set at a p-value of 0.05, meaning the result is something chance alone would produce less than 1 time in 20.
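Another way to get a feel for what that p-value is measuring is to simulate it: play a pile of five-flip games against a coin you know is honest and count how often it comes up all heads anyway.  This is just an illustrative sketch (the 100,000 games and the random seed are arbitrary choices on my part):

```python
import random

random.seed(0)

GAMES = 100_000   # five-flip games against a known-honest coin
FLIPS = 5

all_heads = 0
for _ in range(GAMES):
    # A fair coin: heads with probability 0.5 on every flip.
    if all(random.random() < 0.5 for _ in range(FLIPS)):
        all_heads += 1

# The fraction of honest games that come up all heads is the p-value for the
# "five heads in a row" observation: how often chance alone would hand you a
# result at least that extreme.
print(all_heads / GAMES)   # hovers around 0.03125
```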

There's a problem with this story, and one that trips up a lot of science journalists, as well as many scientists themselves.  Let's go back to Harvey for a minute.  After you called him out, he mended his ways.  He still likes gambling on coin tosses, but he swears he'll never run a crooked game again.  In turning over a new leaf, he opened a circus of 32 professional, not-crooked coin flipping clowns.  As a gesture of good will, you're free to play any number of games with the coin flippers to confirm they're all legitimate.

You're still skeptical.  So you grab a stack of $1 bills and start playing against the first clown.  You win some, you lose some. After five games you come out $1 behind, but that's well within random chance.  He definitely doesn't have a two-headed coin.  But maybe Dent peppered his circus with both honest and dishonest clowns.  It could be he's using legitimate clowns to cover for shady ones.  The man already showed you he cheats at coin tosses, so you'd better test all 32 clowns.

You keep making the rounds, sometimes coming out ahead and other times coming out behind.  At the 28th clown you get suspicious after you lose three coin flips in a row.  You're down $3 with this clown, but you play two more times to be certain.  Sure enough, you lose two more times.  You calculate the odds just like you did above and determine there's only a 3.125% probability of this happening to you with a fair coin.  You're confident this guy cheated and you call him on it.  He hands over the coin with a contemptuous clown-scowl.  How dare you question his integrity as a professional harlequin?

Ashamed, you shrink away.  The first time you challenged your friend, you suspected him of cheating and tested your hypothesis through experiment.  There was nothing wrong with the way you went about that experiment, but when you tried to repeat it this time it didn't work.  What went wrong?

This second test, at the circus, was different from the first experiment.  You weren't testing one pre-chosen suspect; you were working through all 32 clowns, and 27 of them had already come up clean.  The hypothesis you were effectively testing wasn't "this particular clown is cheating" but "at least one of these clowns is cheating", and that changes the rules and the statistics.  If you play a five-round coin toss game once, there's a 3.125% probability the coin will come up heads five times in a row.  Play a second game and the probability of seeing that happen at least once rises above 5% - past your threshold - and it keeps climbing the more games you play.  There's no difference between this and rolling the dice over and over until you get 'snake eyes'.  There's only a small chance you'll roll it on the first try, but a large chance you'll eventually roll it if you keep going long enough.

With the clowns, you assumed you'd know who was cheating by treating each game separately, but in reality they're all connected, like rolling the dice over and over again.  By the time you got to your 28th game, it was more likely than not that at least one game would come out all heads through chance alone.
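Here's the arithmetic behind that, treating each clown's five-flip game as independent (the game counts are just the ones from the story):

```python
P_STREAK = 0.5 ** 5   # chance an honest clown beats you five times straight

for games in (1, 2, 5, 28, 32):
    # Probability that AT LEAST ONE of `games` honest clowns hands you an
    # all-heads game, plus how many such games you'd expect on average.
    at_least_one = 1 - (1 - P_STREAK) ** games
    expected = games * P_STREAK
    print(f"{games:>2} games: P(at least one all-heads game) = {at_least_one:.1%}, "
          f"expected all-heads games = {expected:.2f}")
```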

This phenomenon is part of what led to the current replication crisis in many branches of science.  Often, when scientists try to repeat an experiment published in a peer-reviewed journal, it doesn't replicate.  In other words, the original results weren't real discoveries of scientific phenomena; they were statistical accidents reported as real.  Just like the honest clown we accused of cheating, except nature can't hand over the coin and let us in on the secret.

How did scientists publish results that won't replicate?  Well, if you keep running your experiment many times, tweaking some minor variables each time in order to find out what 'works', eventually you'll get a positive result through random chance.  This is just like rolling the dice long enough to get the number you wanted.  Why would any careful researcher do this, though?  Are they stupid, or are they corrupt?

Neither.  Science is really complicated.  Sometimes all your cells die because it's springtime and the AC comes on, and there's a vent next to the laminar flow hood that shouldn't cause bacterial contamination in your cultures, but it does.  So you block the vent and suddenly all your cultures survive.  It's the little things.

Or perhaps you're trying to grow a new strain of bacteria, but nothing works until someone switches which brand of broth they use to grow them in.  (Maybe Fisher is having a sale.)  The new brand, it turns out, doesn't have trace amounts of molybdenum in it, because the water at the factory where it's made is slightly different.  You discover something about the bacteria (they can't grow in the presence of molybdenum), but the whole thing feels a little like magic.  You never know what tiny detail is going to throw everything off.

The presence of those tiny details doesn't make the science less valid.  Indeed, many major scientific discoveries come from scientists not ignoring, but instead pursuing exactly these tiny details and discovering something profound.  I had a friend as an undergrad who discovered a whole branch of new bacterial species because she made a slight change to the media she grew the cells in on a whim.  She published her paper in Science, one of the most prestigious science journals.

The other day I read about a group of scientists who were researching some beetle.  They moved labs from the UK to the USA and suddenly the beetles wouldn't grow anymore.  This isn't some new group trying to replicate findings; this is the same lab trying to keep using the same techniques in a different country.  What went wrong?  After agonizing over the problem, they finally discovered the issue was the paper towels they grew the beetles on.  Apparently the US trees used to make paper towels produce an insecticide the beetles are sensitive to.  UK trees make the same insecticide, but the UK version is destroyed by the heat of the paper-making process, whereas the US version is still active in the finished paper towels.  This is the kind of high-precision insanity that would make Rube Goldberg throw up his hands and give up!

These problems present themselves to scientists all the time.  It's almost impossible to tell the difference between tinkering to get the experiment to work right and tinkering that's basically the same as rolling the dice until you get the result you want.

So a researcher is looking at some hypothesis.  Let's say it's that people who are primed by reading a list of words you'd associate with old people will subsequently start acting slower: if people's brains are thinking about old people, they will act more like old people.  They run the experiment a few times, but don't see people writing slower, or breathing slower.  Maybe those activities aren't close enough to consciousness to be affected, so they have the subjects walk down a hall instead and notice that the people who were primed walk slower than those who weren't.  Eureka!  They publish, but don't mention all the other tests they did - not because they're trying to hide anything, but because that was just part of the process of figuring out how to measure the effect they were looking for.

Or maybe they looked for multiple other effects, such as whether priming would impact speech habits, or grooming standards, and instead found it impacted movement speed.  There are dozens of ways to make multiple analyses - increasing the number of tests and therefore increasing the probability of finding something by pure chance - and many of them are easily missed.  Some things are so small they don't even feel like tests (performing experiments in the morning instead of at night might change the results).
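To put rough numbers on how fast those extra looks add up, here's a sketch that treats each additional measure as an independent test run at the usual 0.05 threshold.  (Real measures are usually correlated, so take this as an illustration rather than an exact calculation.)

```python
ALPHA = 0.05   # the usual significance threshold

for k in (1, 3, 5, 10, 20):
    # Chance that at least one of k independent tests on pure noise comes
    # out "significant" by luck alone.
    spurious = 1 - (1 - ALPHA) ** k
    print(f"{k:>2} measures examined: {spurious:.1%} chance of a spurious finding")
```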

Keeping this problem in check is a constant battle, but it's getting more attention these days.  Quality journals guard against it by requiring scientists to repeat their experiments multiple times before they publish, which drives the odds of a purely chance result way down.  They can also make researchers disclose all the different analyses they ran on the data, which tells us how many times they rolled the dice.
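The arithmetic behind both safeguards is simple.  A chance result has to get lucky on every independent repetition, and disclosing how many analyses were run lets readers demand a stricter per-test threshold (the Bonferroni correction below is one common, if conservative, way to do that - it's my example, not something the journals require by name):

```python
ALPHA = 0.05

# Requiring the result to show up in r independent repetitions: a fluke has
# to get lucky every single time.
for r in (1, 2, 3):
    print(f"{r} independent run(s): chance a pure fluke survives = {ALPHA ** r:.6f}")

# Disclosing all k analyses lets readers tighten the per-test threshold
# (Bonferroni: alpha / k) so the overall chance of a fluke stays near 5%.
for k in (1, 10, 100):
    print(f"{k:>3} analyses disclosed: per-test threshold becomes {ALPHA / k:.5f}")
```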

Rolling Infinite Dice

There's a more insidious way to get 'positive' results, and it's where we finally get at the intuition I mentioned at the beginning.  It's easy to spot once you know about it, but nearly invisible if you don't.  Let's go back to Dent's circus one more time to find it.

After you left, Dent expanded his operation.  He now has 1,000 clowns, all running coin flipping games.  You never finished testing those last four clowns last time, and you're still suspicious.  The problem is you only have five dollars left to play with.  You really want to bust Dent and his crooked circus gambling operation, but you can't go through and play all the clowns from the beginning.  For one, you'd need to lose far more than five times in a row to be confident, having played a thousand games, that you weren't observing something merely by chance.  Remember that the games are connected.  If you played five rounds against each of the 1,000 clowns, you'd expect around 30 of the honest ones to beat you five times in a row through chance alone.  What can you do to catch the cheaters?
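A quick back-of-the-envelope for that claim, assuming five flips against each clown and that every clown is honest (the loop at the end finds roughly how long a losing streak you'd need before it means anything across a thousand games):

```python
CLOWNS = 1000
P_STREAK = 0.5 ** 5   # chance one honest clown wins five flips straight

print(f"Expected five-loss streaks from honest clowns: {CLOWNS * P_STREAK:.1f}")
print(f"P(at least one such streak): {1 - (1 - P_STREAK) ** CLOWNS:.6f}")

# How long a losing streak against a single clown before a streak that long
# is unlikely (< 5%) to appear anywhere among 1,000 honest clowns by chance?
n = 1
while 1 - (1 - 0.5 ** n) ** CLOWNS >= 0.05:
    n += 1
print(f"You'd need to lose {n} times in a row, not 5")
```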

You recognize one of the four clowns you didn't test last time.  He always had a suspicious look about him.  He meets your gaze and looks away, hiding behind another clown.  You tell Dent you'd like to test one clown this time, not all 1,000.  He agrees, but gets nervous when you go up to that suspicious-looking clown.  You play five times, you lose five times.  Immediately you accuse the clown of cheating.  He hands over the coin and, sure enough, it's heads on both sides.  What happened?

Just like when you played against Dent himself back in the beginning, you started with a hypothesis: this one person is cheating.  When you tested that hypothesis by playing five times, there was only a 3.125% probability of getting that result by chance.  You rolled the dice one time.  You don't know anything about the other 999 clowns - they were never part of your hypothesis - but you do know this one clown was cheating.

The clown gives you back the ill-gotten gains and you pick another suspicious-looking clown.  Once again, you expose a cheater and get your money back.  You're successful with this method three more times.  This is going well!  Dent is upset that you're ruining his crooked circus operation, but he allows you to continue.  He must be betting that your hunches will eventually point you at the wrong clown.

You need a better way to identify which clowns are sporting double-headed coins.  You go back to look at the five clowns you've already exposed for some clues.  You notice they're all left-handed.  That's it!  All the left-handed clowns are cheaters.  After all, what are the chances that five clowns would all be left-handed?  On average, only about one in ten people is left-handed.  Picking five left-handers at random is a one-in-100,000 occurrence.  There are only 1,000 clowns at the whole circus.

You pick a suspicious-looking left-handed clown and play five games against him.  You lose the first four throws, but the fifth one comes up tails.  This clown wasn't cheating.  How did this happen?  Once again, you made an assumption that was wrong.  You calculated the probability as though handedness were the only possible thing that could distinguish the clowns from one another.  But they're clowns.  There are dozens, hundreds, possibly an infinite number of ways you could tell one clown from another.  Some are Italian, others have long hair, some wear wigs, some wear red shoes while others wear blue, some are old, bald, fat, tall, have crooked teeth, and on and on.  You didn't test one hypothesis when you looked at those five clowns to figure out what set them apart.  You tested countless hypotheses.  You didn't roll the dice once, you rolled a giant bag of dice, then looked through the pile for the results you wanted.  Of course you were going to find something that looked significant through random chance alone.
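You can watch this happen in a simulation.  Below, each clown gets 50 random yes-or-no traits (wig, red shoes, long hair, and so on - the trait count and the 50/50 odds are made-up numbers for illustration), five clowns are picked at random, and we check how often at least one trait happens to line up across all five:

```python
import random

random.seed(1)

TRAITS = 50      # distinguishing features per clown (illustrative number)
PICKED = 5       # clowns you examine, looking for a pattern
TRIALS = 20_000

matches = 0
for _ in range(TRIALS):
    # Each picked clown has or lacks each trait independently, 50/50.
    clowns = [[random.random() < 0.5 for _ in range(TRAITS)]
              for _ in range(PICKED)]
    # Does ANY single trait come out the same way for all five clowns?
    if any(len({clown[t] for clown in clowns}) == 1 for t in range(TRAITS)):
        matches += 1

# Any ONE pre-named trait lining up across five clowns is rare (about 6%
# under these assumptions), but with 50 traits to rummage through, some
# "pattern" turns up most of the time.
print(matches / TRIALS)
```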

This is the same thing that happened with the Zofran study.  Originally, they thought the drug might cause teratomas, but when they tested for that specific thing they didn't find anything.  They rolled the dice once and came up empty.  Then they looked at all the data they collected throughout the clinical trial.  From the hundreds of lab tests, ECGs, vital signs, patient reports, and medical records they saw one result that was worse in the Zofran group compared to the control group.  That's the result they reported.  Looking at all that data was no different from pouring out a massive bag of dice and picking only the ones that matched what they were looking for.  It was still chance, but they fooled themselves into believing it was real.

To be fair to the authors of the original study, they weren't the ones who claimed they had proof the drug caused heart defects.  They made the honest claim any careful researcher should make in this type of situation: they said more research was necessary to test whether this was true.  There is a saying careful scientists have: you can't find your hypothesis in your results.  This is what they mean.  In order to know something, you have to start with the hypothesis, then test it.  You can run a test and go fishing through the results for a hypothesis, but finding one doesn't prove anything.  All you've done is generate an idea.  You still have to test it again.

This creates a seemingly strange situation where you could run an experiment and get results the first time, but since you didn't specifically name the result you expected, you can't draw any conclusion from it.  It's not until you do the same experiment a second time - you don't need to change anything from the first experiment except to name the predicted result in advance - that you have solid evidence the effect is real.
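Here's a sketch of that whole arc under the assumption that there's no real effect anywhere: an exploratory pass over 100 null measurements flags whatever crosses p < 0.05, then a confirmatory repeat re-tests only the flagged measurement.  (Under the null hypothesis a p-value is just a uniform random number between 0 and 1, which is all this simulation relies on.)

```python
import random

random.seed(2)

MEASUREMENTS = 100   # endpoints examined in the exploratory pass
ALPHA = 0.05
TRIALS = 10_000

flagged = 0
replicated = 0
for _ in range(TRIALS):
    # With no real effect, every test's p-value is uniform on [0, 1).
    exploratory = [random.random() for _ in range(MEASUREMENTS)]
    if any(p < ALPHA for p in exploratory):
        flagged += 1
        # Repeat ONE pre-named test on fresh data: its p-value is again
        # uniform, so it comes out "significant" only about 5% of the time.
        if random.random() < ALPHA:
            replicated += 1

print(f"Exploratory pass found 'something': {flagged / TRIALS:.1%}")     # ~99%
print(f"...and it survived the repeat:      {replicated / TRIALS:.1%}")  # ~5%
```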

When they went back and tested Zofran again, the hypothesis didn't pan out.  There was little reason to believe it would.  During the course of a clinical trial, thousands of possible data points are collected and compared against each other.  It would be surprising if there wasn't some difference between two groups through chance alone.  And you can be sure that any difference is going to be highlighted when the study is published.  Journalists often aren't as savvy as scientists.  Later, you'll sit down to read some article reporting on the result.  "Shocking study shows that X causes Y!"

By now, you should be able to figure out whether there's anything there.  Go look at the original study - the abstract usually has enough information without requiring too much scientific background - and see whether they found their hypothesis in their results.  Most of these garbage studies do exactly that.  If they didn't name the result they expected before the study began - or if they started by naming dozens of results, of which only one or two panned out - you can pretty much ignore the whole business until they run the experiment again.

Comments

  1. A delightful example of something very similar to the paper towel story showed up on a This American Life episode a while ago. I think the audio is only a few minutes long:

    https://www.thisamericanlife.org/241/20-acts-in-60-minutes/act-fourteen-8
