By David Tuller, DrPH
David Tuller is academic coordinator of the concurrent master’s degree program in public health and journalism at the University of California, Berkeley.
A few years ago, Dr. Racaniello let me hijack this space for a long piece about the CDC’s persistent incompetence in its efforts to address the devastating illness the agency itself had misnamed “chronic fatigue syndrome.” Now I’m back with an even longer piece about the U.K.’s controversial and highly influential PACE trial. The $8 million study, funded by British government agencies, purportedly proved that patients could “recover” from the illness through treatment with one of two rehabilitative, non-pharmacological interventions: graded exercise therapy, involving a gradual increase in activity, and a specialized form of cognitive behavior therapy. The main authors, a well-established group of British mental health professionals, published their first results in The Lancet in 2011, with additional results in subsequent papers.
Much of what I report here will not be news to the patient and advocacy communities, which have produced a voluminous online archive of critical commentary on the PACE trial. I could not have written this piece without the benefit of that research and the help of a few statistics-savvy sources who talked me through their complicated findings. I am also indebted to colleagues and friends in both public health and journalism, who provided valuable suggestions and advice on earlier drafts. Yesterday, Virology Blog posted the first half of the story. Today’s installment was supposed to be the full second half. However, because the two final sections are each 4,000 words long, we decided to make it easier on readers, split the remainder into two posts, and publish them on successive days instead. I was originally working on this piece with Retraction Watch, but we could not ultimately agree on the direction and approach.
This examination of the PACE trial of chronic fatigue syndrome identified several major flaws:
*The study included a bizarre paradox: participants’ baseline scores for the two primary outcomes of physical function and fatigue could qualify them simultaneously as disabled enough to get into the trial but already “recovered” on those indicators–even before any treatment. In fact, 13 percent of the study sample was already “recovered” on one of these two measures at the start of the study.
*In the middle of the study, the PACE team published a newsletter for participants that included glowing testimonials from earlier trial subjects about how much the “therapy” and “treatment” helped them. The newsletter also included an article informing participants that the two interventions pioneered by the investigators and being tested for efficacy in the trial, graded exercise therapy and cognitive behavior therapy, had been recommended as treatments by a U.K. government committee “based on the best available evidence.” The newsletter article did not mention that a key PACE investigator was also serving on the U.K. government committee that endorsed the PACE therapies.
*The PACE team changed all the methods outlined in its protocol for assessing the primary outcomes of physical function and fatigue, but did not take necessary steps to demonstrate that the revised methods and findings were robust, such as including sensitivity analyses. The researchers also relaxed all four of the criteria outlined in the protocol for defining “recovery.” They have rejected requests from patients for the findings as originally promised in the protocol as “vexatious.”
*The PACE claims of successful treatment and “recovery” were based solely on subjective outcomes. All the objective measures from the trial (a walking test, a step test, and data on employment and the receipt of financial benefits) failed to provide any evidence to support such claims. Afterwards, the PACE authors dismissed their own main objective measures as non-objective, irrelevant, or unreliable.
*In seeking informed consent, the PACE authors violated their own protocol, which included an explicit commitment to tell prospective participants about any possible conflicts of interest. The main investigators have had longstanding financial and consulting ties with disability insurance companies, having advised them for years that cognitive behavior therapy and graded exercise therapy could get claimants off benefits and back to work. Yet prospective participants were not told about any insurance industry links and the information was not included on consent forms. The authors did include the information in the “conflicts of interest” sections of the published papers.
Top researchers who have reviewed the study say it is fraught with indefensible methodological problems. Here is a sampling of their comments:
Dr. Bruce Levin, Columbia University: “To let participants know that interventions have been selected by a government committee ‘based on the best available evidence’ strikes me as the height of clinical trial amateurism.”
Dr. Ronald Davis, Stanford University: “I’m shocked that the Lancet published it…The PACE study has so many flaws and there are so many questions you’d want to ask about it that I don’t understand how it got through any kind of peer review.”
Dr. Arthur Reingold, University of California, Berkeley: “Under the circumstances, an independent review of the trial conducted by experts not involved in the design or conduct of the study would seem to be very much in order.”
Dr. Jonathan Edwards, University College London: “It’s a mass of un-interpretability to me…All the issues with the trial are extremely worrying, making interpretation of the clinical significance of the findings more or less impossible.”
Dr. Leonard Jason, DePaul University: “The PACE authors should have reduced the kind of blatant methodological lapses that can impugn the credibility of the research, such as having overlapping recovery and entry/disability criteria.”
The PACE Trial is Published
Trial recruitment and randomization into the four arms began in early 2005. In 2007, the investigators published a short version of their trial protocol in the journal BMC Neurology. They promised to provide the following results for their two primary measures:
*“Positive outcomes” for physical function, defined as achieving either an SF-36 score of 75 or more, or a 50% increase in score from baseline.
*“Positive outcomes” for fatigue, defined as achieving either a Chalder Fatigue Scale score of 3 or less, or a 50% reduction in score from baseline.
*“Overall improvers,” defined as participants who achieved “positive outcomes” for both physical function and fatigue.
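The three protocol definitions above can be written out as a short Python sketch. This is only an illustration, not the investigators’ analysis code: the function names are hypothetical, and reading “a 50% increase” as a final score of at least 1.5 times baseline (and “a 50% reduction” as at most half of baseline) is an assumption about how the thresholds would be applied.

```python
def positive_physical_function(baseline, final):
    """Protocol 'positive outcome' for SF-36 physical function (0-100, higher is better)."""
    # Absolute criterion from the protocol: a score of 75 or more.
    if final >= 75:
        return True
    # Relative criterion, read here (an assumption) as final >= 1.5 x baseline.
    # A baseline of 0 is excluded because a 50% increase from 0 is undefined.
    return baseline > 0 and final >= 1.5 * baseline


def positive_fatigue(baseline, final):
    """Protocol 'positive outcome' for the Chalder Fatigue Scale (lower is better)."""
    # Absolute criterion from the protocol: a score of 3 or less.
    if final <= 3:
        return True
    # Relative criterion, read here (an assumption) as final <= 0.5 x baseline.
    return baseline > 0 and final <= 0.5 * baseline


def overall_improver(pf_baseline, pf_final, fatigue_baseline, fatigue_final):
    """'Overall improver': a positive outcome on BOTH primary measures."""
    return (positive_physical_function(pf_baseline, pf_final)
            and positive_fatigue(fatigue_baseline, fatigue_final))


# Hypothetical participant: physical function 40 -> 60, fatigue 10 -> 5.
example = overall_improver(40, 60, 10, 5)
```

The point of the sketch is that each definition yields a yes-or-no answer per participant, which is what makes it possible to report how many people in each arm improved.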
The investigators also promised to provide results for what they defined as “recovery,” a secondary outcome that included four components:
*A physical function score of 85 or more.
*A fatigue score of 3 or less.
*A score of 1 (“very much better”) out of 7 on the Clinical Global Impression scale, a self-rated measure of overall health change.
*Not fulfilling any of the three case definitions used in the study (the Oxford criteria, the CDC criteria for chronic fatigue syndrome, and the myalgic encephalomyelitis criteria).
Tom Kindlon scrutinized the protocol for details on the promised objective outcomes. He knew that self-reported questionnaire responses could be influenced by extraneous factors like affection for the therapist or a desire to believe the treatment worked. He also knew that in previous studies of rehabilitative treatments for the illness, objective measurements had often failed to show improvement even when a study reported improvements in subjective measures.
“I’d make the analogy that if you’re measuring weight loss, you wouldn’t ask people if they think they’d lost weight, you’d measure them,” he said.
The protocol’s objective measures of physical capacity and function included:
*A six-minute walking test;
*A self-paced step-test (i.e. on a short stool);
*Data on employment, wages, and the receipt of benefits.
On the trial website, the PACE team posted occasional “participants newsletters,” which featured updates on funding, recruitment and related developments. The third newsletter, dated December 2008, included words of praise for the trial from Prime Minister Gordon Brown’s office as well as an article about the government’s release of new clinical treatment guidelines for chronic fatigue syndrome.
The new U.K. clinical guidelines, the newsletter told participants, were “based on the best available evidence” and recommended treatment with cognitive behavior therapy and graded exercise therapy, the two rehabilitative approaches being studied in PACE. The newsletter did not mention that one of the key PACE investigators, physiotherapist Jessica Bavington, had also served on the U.K. government committee that endorsed the PACE therapies.
The same newsletter included a series of testimonials from participants about their positive outcomes from the “therapy” and “treatment,” although it did not mention the trial arms by name. The newsletter did not balance these positive accounts by including any comments from participants with poor outcomes. At that time, about a third of the participants (200 or so out of the final total of 641) still had one or more assessments to undergo, according to a recruitment chart in the same newsletter.
“The therapy was excellent,” wrote one participant. Another was “so happy that this treatment/trial has greatly changed my sleeping!” A third wrote: “Being included in this trial has helped me tremendously. (The treatment) is now a way of life for me.” A fourth noted: “(The therapist) is very helpful and gives me very useful advice and also motivates me.” One participant’s doctor wrote about the “positive changes” in his patient from the “therapy,” declared that the trial “clearly has the potential to transform [the] lives of many people,” and congratulated the PACE team on its “successful programme”—although no trial findings had yet been published.
Arthur Reingold, the head of epidemiology at the University of California, Berkeley, School of Public Health (and a colleague of mine), has reviewed innumerable clinical trials and observational studies in his decades of work and research with state, national and international public health agencies. He said he had never before seen a case in which researchers themselves had disseminated, mid-trial, such testimonials and statements promoting therapies under investigation. The situation raised concerns about the overall integrity of the study findings, he said.
Although specific interventions weren’t named, he added, the testimonials could still have biased responses in all of the arms toward the positive, or exerted some other unpredictable effect—especially since the primary outcomes were self-reported. (He’d also never seen a trial in which participants could be disabled enough for entry and “recovered” on an indicator simultaneously.)
“Given the subjective nature of the primary outcomes, broadcasting testimonials from those who had received interventions under study would seem to violate a basic tenet of research design, and potentially introduce substantial reporting and information bias,” said Reingold. “I am hard-pressed to recall a precedent for such an approach in other therapeutic trials. Under the circumstances, an independent review of the trial conducted by experts not involved in the design or conduct of the study would seem to be very much in order.”
As soon as the Lancet article was released, Kindlon began sharing his impressions with others online. “It was like a hive mind,” he said. “Gradually people spotted different problems and would post those points, and you could see the flaws in it.”
In addition to asserting that cognitive behavior therapy and exercise therapy were modestly effective, the Lancet paper declared these treatments to be safe—no signs of serious adverse events, despite patients’ concerns. The pacing therapy proved little or no better than the baseline condition of specialist medical care. And the results for the two subgroups defined by other criteria did not differ significantly from the overall findings.
It didn’t take long for Kindlon and the others to notice something unusual—the investigators had made a great many mid-trial changes, including in both primary measures. Facing lagging recruitment eleven months into the trial, the PACE authors explained in The Lancet, they had decided to raise the physical function entry threshold, from the initial 60 to the healthier threshold of 65. With the fatigue scale, they had decided to abandon the 0 or 1 bimodal scoring system in favor of continuous scoring, with each answer ranging from 0 to 3; the reason, they wrote, was “to more sensitively test our hypotheses.” (As collected, the data allowed for both scoring methods.)
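The difference between the two scoring systems is easier to see in a small Python sketch. It assumes the questionnaire has 11 items (the “6 out of 11” entry requirement implies as much) and that, under bimodal scoring, the two lower response options count as 0 and the two higher ones as 1. The function names are hypothetical.

```python
def check_answers(answers):
    """Validate one completed questionnaire: 11 answers, each coded 0, 1, 2 or 3."""
    if len(answers) != 11 or any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("expected 11 answers, each coded 0-3")


def bimodal_score(answers):
    """Bimodal scoring: each item counts as 0 or 1, giving a total from 0 to 11."""
    check_answers(answers)
    # Assumption: options 0 and 1 score 0; options 2 and 3 score 1.
    return sum(1 for a in answers if a >= 2)


def continuous_score(answers):
    """Continuous scoring: each item counts 0-3, giving a total from 0 to 33."""
    check_answers(answers)
    return sum(answers)


# Two hypothetical participants with the same bimodal score of 6.
mild = [2] * 6 + [0] * 5
severe = [3] * 6 + [1] * 5
mild_scores = (bimodal_score(mild), continuous_score(mild))        # (6, 12)
severe_scores = (bimodal_score(severe), continuous_score(severe))  # (6, 23)
```

Because both totals are computed from the same recorded answers, a trial that collected the item-level responses could report results under either method.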
They did not explain why they made the decision about the fatigue scale in the middle of the trial rather than before, nor why they didn’t simply provide the results with both scoring methods. They did not mention that in 2010, the FINE trial, a smaller study for severely disabled and homebound ME/CFS patients that tested a rehabilitative intervention related to those in PACE, reported no significant differences in final outcomes between study arms, using the same physical function and fatigue questionnaires as in PACE.
The analysis of the Chalder Fatigue Scale responses in the FINE paper was bimodal, like the one promised in the PACE protocol. However, the FINE researchers later reported that a post-hoc analysis, in which they rescored the Chalder Fatigue Scale responses using the continuous scale of 0 to 3, had found modest benefits. The following year, the PACE team adopted the same revised approach in The Lancet.
The FINE study also received funding in 2003 from the Medical Research Council, and the PACE team referred to it as its “sister” trial. Yet the text of the Lancet paper included nothing about the FINE trial and its negative findings.
Besides these changes, the authors did not include the promised protocol data: results for “positive outcomes” for fatigue and physical function, and for the “overall improvers” who achieved “positive outcomes” on both measures. Instead, noting that changes had been approved by oversight committees before outcome data had been examined, they introduced other statistical methods to assess the fatigue and physical function scores. All of their results showed modest advantages for cognitive behavior therapy and graded exercise therapy.
First, they compared the one-year changes in each arm’s average scores for physical function and fatigue. Yet unlike the method outlined in the protocol, this new mean-based measure did not provide information about a key factor of interest—the actual numbers or proportion of participants in each group who reported having gotten better or worse.
In another approach, which they identified as a post-hoc analysis, they determined the proportion of participants in each arm who achieved what they defined as a “clinically useful” benefit: an increase of eight points on the physical function scale and a decrease of two points on the revised fatigue scale. Unlike the first analysis, this post-hoc analysis did provide individual-level rather than aggregate responses. Yet post-hoc results never enjoy the level of confidence granted to pre-specified ones.
Moreover, the improvements required for what the researchers now called a “clinically useful” benefit were smaller than the minimum improvements needed to achieve the protocol’s threshold scores for “positive outcomes”—an increase of ten points on the physical function scale, from the entry threshold of 65 to 75, and a drop of three points on the original fatigue scale, from the entry threshold of 6 to 3.
A third method in the Lancet paper was another post-hoc analysis, this one assessing how many participants in each group achieved what the researchers called the “normal ranges” for fatigue and physical function. They calculated these “normal ranges” from earlier studies that reported the responses of large population samples to the SF-36 and Chalder Fatigue Scale questionnaires. The authors reported that 30 and 28 percent of participants in, respectively, the cognitive behavior therapy and graded exercise therapy arms scored within the “normal ranges” of representative populations for both fatigue and physical function, about double the rate in the other groups.
Of the key objective measures mentioned in the protocol, the Lancet paper included only the results of the six-minute walking test. Those in the exercise arm averaged a modest increase in distance walked of 67 meters, from 312 at baseline to 379 at one year, while those in the other three arms, including cognitive behavior therapy, made no significant improvements, from similar baseline values.
But the exercise arm’s performance was still evidence of serious disability, lagging far behind the mean performances of relatively healthy women from 70 to 79 years (490 meters), people with pacemakers (461 meters), patients with Class II heart failure (558 meters), and cystic fibrosis patients (626 meters). About three-quarters of the PACE participants were women; the average age was 38.
In reading the Lancet paper, Kindlon realized that Trudie Chalder was highlighting the post-hoc “normal range” analysis of the two primary outcomes when she spoke at the PACE press conference of “twice as many” participants in the cognitive behavior and exercise therapy arms getting “back to normal.” Yet he knew that “normal range” was a statistical construct, and did not mean the same thing as “back to normal” or “recovered” in medical terms.
The paper itself did not include any results for “recovery” from the illness, as defined using the four criteria outlined in the protocol. Given that, Kindlon believed Chalder had created unneeded confusion in referring to participants as “back to normal.” Moreover, he believed the colleagues of the PACE authors had compounded the problem with their claim in the accompanying commentary of a 30 percent “recovery” rate based on the same “normal range” analysis.
But Kindlon and others also noticed something very peculiar about these “normal ranges”: They overlapped with the criteria for entering the trial. While a physical function score of 65 was considered evidence of sufficient disability to be a study participant, the researchers had now declared that a score of 60 and above was “within the normal range.” Someone could therefore enter the trial with a physical function score of 65, become more disabled, leave with a score of 60, and still be considered within the PACE trial’s “normal range.”
The same bizarre paradox bedeviled the fatigue measure, in which a lower score indicated less fatigue. Under the revised, continuous method of scoring the answers on the Chalder Fatigue Scale, the 6 out of 11 required to demonstrate sufficient fatigue for entry translated into a score of 12 or higher. Yet the PACE trial’s “normal range” for fatigue included any score of 18 or below. A participant could have started the trial with a revised fatigue score of 12, become more fatigued to score 18 at the end, and yet still been considered within the “normal range.”
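The overlap can be checked mechanically. The Python sketch below uses only the thresholds given in this article: entry at a physical function score of 65 or lower and a continuous fatigue score of 12 or higher, and a “normal range” of 60 or higher for physical function and 18 or lower for fatigue. The constant names are hypothetical. One simplification to note: entry was actually judged on the bimodal fatigue score, so a continuous score of 12 is the minimum an eligible participant could have, not a guarantee of eligibility.

```python
# Thresholds as reported in the article (constant names are illustrative).
ENTRY_MAX_PHYSICAL = 65    # disabled enough to enter: 65 or lower
NORMAL_MIN_PHYSICAL = 60   # "normal range": 60 or higher
ENTRY_MIN_FATIGUE = 12     # lowest continuous score compatible with entry
NORMAL_MAX_FATIGUE = 18    # "normal range": 18 or lower

# SF-36 physical function scores run from 0 to 100 in steps of 5.
physical_overlap = [s for s in range(0, 101, 5)
                    if NORMAL_MIN_PHYSICAL <= s <= ENTRY_MAX_PHYSICAL]

# Continuous Chalder Fatigue Scale totals run from 0 to 33.
fatigue_overlap = [s for s in range(0, 34)
                   if ENTRY_MIN_FATIGUE <= s <= NORMAL_MAX_FATIGUE]


def eligible_and_normal(physical, fatigue):
    """True if a pair of scores meets the entry thresholds and the 'normal range' at once."""
    return physical in physical_overlap and fatigue in fatigue_overlap


# A participant who worsens from (65, 12) to (60, 18) stays inside both ranges.
start_ok = eligible_and_normal(65, 12)
end_ok = eligible_and_normal(60, 18)
```

Any non-empty overlap list means a participant could count as sufficiently disabled for entry and within the “normal range” on that measure simultaneously.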
“It was absurd that the criteria for ‘normal’ fatigue and physical functioning were lower than the entry criteria,” said Kindlon.
That meant, Kindlon realized, that some of the participants whom Chalder described as having gotten “back to normal” because they met the “normal range” threshold might have actually gotten worse during the study. And the same was true of the Lancet commentary accompanying the PACE paper, in which participants who met the peculiar “normal range” threshold were said to have achieved “recovery” according to a “strict criterion”—a definition of “recovery” that apparently survived the PACE authors’ pre-publication discussion of the commentary’s content.
Tom Kindlon wasn’t surprised when these “back to normal” and “recovery” claims became the focus of much of the news coverage. Yet it bothered him tremendously that Chalder and the commentary authors were able to generate such positive publicity from what was, after all, a post-hoc analysis that allowed participants to be severely disabled and “back to normal” or “recovered” simultaneously.
Perplexed at the findings, members of the online network checked out the population-based studies cited in PACE as the sources of the “normal ranges.” They discovered a serious problem. In those earlier studies, the responses to both the fatigue and physical function questionnaires did not form the symmetrical, bell-shaped curve known as a normal distribution. Instead, the responses were highly skewed, with many values clustered toward the healthier end of the scales—a frequent phenomenon in population-based health surveys. However, to calculate the PACE “normal ranges,” the authors used a standard statistical method—taking the mean value, plus/minus one standard deviation, which identifies a range that includes 68% of the values in a normally distributed sample.
A 2007 paper co-authored by White noted that this formula for determining normal ranges “assumed a normal distribution of scores” and yielded different results given “a violation of the assumptions of normality”—that is, when the data did not fall into a normal distribution. White’s 2007 paper also noted that the population-based responses to the SF-36 physical function questionnaire were not normally distributed and that using statistical methods specifically designed for such skewed populations would therefore yield different results.
To determine the fatigue “normal range,” the PACE team used a 2010 paper co-authored by Chalder, which provided population-based responses to the Chalder Fatigue Scale. Like the population-based responses to the SF-36 questionnaire, the responses on the fatigue scale were also not normally distributed but skewed toward the healthy end, as the Chalder paper noted.
Despite White’s caveats in his 2007 paper about “a violation of the assumption of normality,” the PACE paper itself included no similar warnings about this major source of distortion in calculating both the physical function and fatigue “normal ranges” using the formula for normally distributed data. The Lancet paper also did not mention or discuss the implications of the head-scratching results: having outcome criteria that indicated worse health than the entry criteria for disability.
Bruce Levin, the Columbia biostatistician, said there are simple statistical formulas for calculating ranges that would include 68 percent of the values when the data are skewed and not normally distributed, as with the population-based data sources used by PACE for both the fatigue and physical function “normal ranges.” To apply the standard formula to data sources that have highly skewed distributions, said Levin, can lead to “very misleading” results.
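A toy example shows why the standard formula misbehaves on skewed data. The Python sketch below uses invented scores on a 0-to-100 scale, clustered at the healthy end; they are not drawn from any of the population studies cited by PACE and serve only to illustrate the arithmetic.

```python
import statistics

# Invented, heavily skewed sample of 100 scores on a 0-100 scale
# (most people at or near the top, a small tail of low scores).
scores = ([100] * 60 + [95] * 15 + [90] * 8 + [80] * 5
          + [70] * 4 + [50] * 3 + [30] * 3 + [10] * 2)

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population standard deviation

# "Normal range" by the standard formula: mean plus or minus one SD.
lower, upper = mean - sd, mean + sd

# Share of the sample that actually falls inside that range.
fraction_inside = sum(lower <= s <= upper for s in scores) / len(scores)

print(f"mean={mean:.2f} sd={sd:.2f} range=({lower:.2f}, {upper:.2f})")
print(f"fraction inside range: {fraction_inside:.2f}")
```

In this invented sample the upper bound lands above the maximum possible score of 100, and the range captures far more than 68 percent of the values. The long tail of low scores inflates the standard deviation and drags the lower bound well below where most of the sample sits.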
Raising tough questions about the changes to the PACE protocol certainly conformed to the philosophy of the journal that published it. BioMed Central, the publisher of BMC Neurology, notes on its site that a major goal of publishing trial protocols is “enabling readers to compare what was originally intended with what was actually done, thus preventing both ‘data dredging’ and post-hoc revisions of study aims.” The BMC Neurology “editor’s comment” linked to the PACE protocol reinforced the message that the investigators should be held to account.
Unplanned changes to protocols are never advisable, and they present particular problems in unblinded trials like PACE, said Levin, the Columbia biostatistician. Investigators in such trials might easily sense the outcome trends long before examining the actual outcome data, and that knowledge could influence how they revise the measures from the protocol, he said.
And even when changes are approved by appropriate oversight committees, added Levin, researchers must take steps to address concerns about the impacts on results. These steps might include reporting the findings under both the initial and the revised methods in sensitivity analyses, which can assess whether different assumptions or conditions would cause significant differences in the results, he said.
“And where substantive differences in results occur, the investigators need to explain why those differences arise and convince an appropriately skeptical audience why the revised findings should be given greater weight than those using the a priori measures,” said Levin, noting that the PACE authors did not take these steps.
Some PACE trial participants were unpleasantly surprised to learn only after the trial of the researchers’ financial and consulting ties to insurance companies. The researchers disclosed these links in the “conflicts of interest” section of the Lancet article. Yet the authors had promised to adhere to the Declaration of Helsinki, an international human research ethics code mandating that prospective trial participants be informed about “any possible conflicts of interest” and “institutional affiliations of the researcher.”
The sample participant information and consent forms in the final approved protocol did not include any of the information. Four trial participants interviewed, three in person and one by telephone, all said they were not informed before or during the study about the PACE investigators’ ties to insurance companies, especially those in the disability sector. Two said they would have agreed to be in the trial anyway because they lacked other options; two said it would have impacted their decision to participate.
Rhiannon Chaffer said she would likely have refused to be in the trial, had she known beforehand. “I’m skeptical of anything that’s backed by insurance, so it would have made a difference to me because it would have felt like the trial wasn’t independent,” said Chaffer, in her mid-30s, who became ill in 2006 and attended a PACE trial center in Bristol.
Another of the four withdrew her consent retroactively and forbade the researchers from using her data in the published results. “I wasn’t given the option of being informed, quite honestly,” she said, requesting anonymity because of ongoing legal matters related to her illness. “I felt quite pissed off and betrayed. I felt like they lied by omission.”
(None of the participants, including three in the cognitive behavior therapy arm, felt the trial had reversed their illness. I will describe these participants’ experiences at a later point.)
Tomorrow: The Aftermath