By David Tuller, DrPH
I thought it might be helpful to re-post a list of questions I wanted to ask Professor White and his PACE colleagues in September, 2015–more than a month before Virology Blog posted the first installment of “Trial By Error: The Troubling Case of the PACE Chronic Fatigue Syndrome Study.” I originally posted this list on January 4, 2016.
In the years since, the PACE authors claim to have repeatedly answered everything. In truth, they have evaded hard questions and provided non-responsive responses. That’s why last year more than 100 experts signed an open letter to The Lancet calling out the PACE study’s “unacceptable methodological lapses.”
Last week, the beleaguered GET/CBT ideological brigades seized on a limited but positive-sounding report from an arm of the National Health Service to claim some sort of vindication–with media choreography provided by their enablers at the Science Media Centre. As tends to happen, the frantic spinning inevitably spun off misleading information. BMJ’s now-amended article about this report from the Health Research Authority became a salient example.
The initial version of the BMJ article identified me as “a US activist” while omitting my academic credentials. It also failed to mention that not only “activists” but many other “academics” share my view that PACE stinks. BMJ has made some fixes and added a “correction/clarification” notice that makes no mention of what was actually corrected or clarified. That the notice is opaque suggests that BMJ places a greater priority on minimizing embarrassment than on maximizing transparency in its editorial processes. Good to know.
In a rapid response, a senior lecturer offered this supportive observation: “It [PACE] was not a perfect trial, but whenever it has been scrutinised by organisations experienced in judging research (initial research council review, Cochrane review, and most recently the HRA) the process, methodology, analysis and the research team have been found to be well within, and acting within, the accepted norms of clinical research.”
This statement, which is true, encapsulates much of the problem. The fact that such organizations have given PACE good marks can be interpreted in more than one way. One interpretation is that distinguished oversight bodies found PACE to be a good trial because it was indeed a good trial. Another interpretation is that those who run such institutions might share some of the biases and delusions of the PACE authors, and have therefore overlooked, dismissed or ignored flaws that first-year epidemiology students at Berkeley can pick apart.
Regarding the HRA specifically, I have pointed out that its remit is narrow. While I have concerns about the report within its own remit, I very much appreciated that it explicitly noted the limits of its review and called for continuing robust debate about the quality of the science.
In the context of this renewed pubic discussion about whether PACE is a robust or quality study, I thought re-visiting my list of initial questions for the investigators would be a useful reality check.
Below is the list, exactly as I drew it up on September 1, 2015. (One exception: I have corrected Question 26, which contained an inaccurate description of the cost-effectiveness analysis in the 2012 PLoS One study. I have posted the original version of Q. 26 after the revised one.)
Some are outdated. Many are questions the investigators would claim to have answered; in some cases they have, but with inadequate explanations. It’s been 3+ years, and I’m still waiting for legitimate responses. The HRA’s report does not do much to resolve matters.
1) In June, a report commissioned by the National Institutes of Health declared that the Oxford criteria should be “retired” because the case definition impeded progress and possibly caused harm. As you know, the concern is that it is so non-specific that it leads to heterogeneous study samples that include people with many illnesses besides ME/CFS. How do you respond to that concern?
2) In published remarks after Dr. White’s presentation in Bristol last fall, Dr. Jonathan Edwards wrote: “What Dr White seemed not to understand is that a simple reason for not accepting the conclusion is that an unblinded trial in a situation where endpoints are subjective is valueless.” What is your response to Dr. Edward’s position?
3) The December 2008 PACE participants’ newsletter included an article about the UK NICE guidelines. The article noted that the recommended treatments, “based on the best available evidence,” included two of the interventions being studied–CBT and GET. (The article didn’t mention that PACE investigator Jessica Bavington also served on the NICE guidelines committee.) The same newsletter included glowing testimonials from satisfied participants about their positive outcomes from the trial “therapåy” and “treatment” but included no statements from participants with negative outcomes. According to the graph illustrating recruitment statistics in the same newsletter, about 200 or so participants were still slated to undergo one or more of their assessments after publication of the newsletter.
Were you concerned that publishing such statements would bias the remaining study subjects? If not, why not? A biostatistics professor from Columbia told me that for investigators to publish such information during a trial was “the height of clinical trial amateurism,” and that at the very least you should have assessed responses before and after disseminating the newsletter to ensure that there was no bias resulting from the statements. What is your response? Also, should the article about the NICE guidelines have disclosed that Jessica Bavington was on the committee and therefore playing a dual role?
4) In your protocol, you promised to abide by the Declaration of Helsinki. The declaration mandates that obtaining informed consent requires that prospective participants be “adequately informed” about “any possible conflicts of interest” and “institutional affiliations of the researcher.” In the Lancet and other papers, you disclosed financial and consulting ties with insurance companies as “conflicts of interest.” But trial participants I have interviewed said they did not find out about these “conflicts of interest” until after they completed the trial. They felt this violated their rights as participants to informed consent. One demanded her data be removed from the study after the fact. I have reviewed participant information and consent forms, including those from version 5.0 of the protocol, and none contain the disclosures mandated by the Declaration of Helsinki.
Why did you decide not to inform prospective participants about your “conflicts of interest” and “institutional affiliations” as part of the informed consent process? Do you believe this omission violates the Declaration of Helsinki’s provisions on disclosure to participants? Can you document that any PACE participants were told of your “possible conflicts of interest” and “institutional affiliations” during the informed consent process?
5) For both fatigue and physical function, your thresholds for “normal range” (Lancet) and “recovery” (Psych Med) indicated a greater level of disability than the entry criteria, meaning participants could be fatigued or physically disabled enough for entry but “recovered” at the same time. Thirteen percent of the sample was already “within normal range” on physical function, fatigue or both at baseline, according to information obtained under a freedom-of-information request.
Can you explain the logic of that overlap? Why did the Lancet and Psych Med papers not specifically mention or discuss the implication of the overlaps, or disclose that 13 percent of the study sample were already “within normal range” on an indicator at baseline? Do you believe that such overlaps affect the interpretation of the results? If not, why not? What oversight committee specifically approved this outcome measure? Or was it not approved by any committee, since it was a post-hoc analysis?
6) You have explained these “normal ranges” as the product of taking the mean value +/- 1 SD of the scores of representative populations–the standard approach to obtaining normal ranges when data are normally distributed. Yet the values in both those referenced source populations (Bowling for physical function, Chalder for fatigue) are clustered toward the healthier ends, as both papers make clear, so the conventional formula does not provide an accurate normal range. In a 2007 paper, Dr. White mentioned this problem of skewed populations and the challenge they posed to calculation of normal ranges.
Why did you not use other methods for determining normal ranges from your clustered data sets from Bowling and Chalder, such as basing them on percentiles? Why did you not mention the concern or limitation about using conventional methods in the PACE papers, as Dr. White did in the 2007 paper? Is this application of conventional statistical methods for non-normally distributed data the reason why you had such broad normal ranges that ended up overlapping with the fatigue and physical function entry criteria?
7) According to the protocol, the main finding from the primary measures would be rates of “positive outcomes”/”overall improvers,” which would have allowed for individual-level. Instead, the main finding was a comparison of the mean performances of the groups–aggregate results that did not provide important information about how many got better or worse. Who approved this specific change? Were you concerned about losing the individual-level assessments?
8) The other two methods of assessing the primary outcomes were both post-hoc analyses. Do you agree that post-hoc analyses carry significantly less weight than pre-specified results? Did any PACE oversight committees specifically approve the post-hoc analyses?
9) The improvement required to achieve a “clinically useful benefit” was defined as 8 points on the SF-36 scale and 2 points on the continuous scoring for the fatigue scale. In the protocol, categorical thresholds for a “positive outcome” were designated as 75 on the SF-36 and 3 on the Chalder fatigue scale, so achieving that would have required an increase of at least 10 points on the SF-36 and 3 points (bimodal) for fatigue. Do you agree that the protocol measure required participants to demonstrate greater improvements to achieve the “positive outcome” scores than the post-hoc “clinically useful benefit”?
10) When you published your protocol in BMC Neurology in 2007, the journal appended an “editor’s comment” that urged readers to compare the published papers with the protocol “to ensure that no deviations from the protocol occurred during the study.” The comment urged readers to “contact the authors” in the event of such changes. In asking for the results per the protocol, patients and others followed the suggestion in the editor’s comment appended to your protocol. Why have you declined to release the data upon request? Can you explain why Queen Mary has considered requests for results per the original protocol “vexatious”?
11) In cases when protocol changes are absolutely necessary, researchers often conduct sensitivity analyses to assess the impact of the changes, and/or publish the findings from both the original and changed sets of assumptions. Why did you decide not to take either of these standard approaches?
12) You made it clear, in your response to correspondence in the Lancet, that the 2011 paper was not addressing “recovery.” Why, then, did Dr. Chalder refer at the 2011 press conference to the “normal range” data as indicating that patients got “back to normal”–i.e. they “recovered”? And since you had input into the accompanying commentary in the Lancet before publication, according to the press complaints commission, why did you not dissuade the writers from declaring a 30 percent “recovery” rate? Do you agree with the commentary that PACE used “a strict criterion for recovery,” given that in both of the primary outcomes participants could get worse and be counted as “recovered,” or “back to normal” in Dr. Chalder’s words?
13) Much of the press coverage focused on “recovery,” even though the paper was making no such claim. Were you at all concerned that the media was mis-interpreting or over-interpreting the results, and did you feel some responsibility for that, given that Dr. Chalder’s statement of “back to normal” and the commentary claim of a 30 percent “recovery” rate were prime sources of those claims?
14) You changed your fatigue outcome scoring method from bimodal to continuous mid-trial, but cited no references in support of this that might have caused you to change your mind since the protocol. Specifically, you did not explain that the FINE trial reported benefits for its intervention only in a post-hoc re-analysis of its fatigue data using continuous scoring.
Were the FINE findings the impetus for the change in scoring in your paper? If so, why was this reason not mentioned or cited? If not, what specific change prompted your mid-trial decision to alter the protocol in this way? And given that the FINE trial was promoted as the “sister study” to PACE, why were that trial and its negative findings not mentioned in the text of the Lancet paper? Do you believe those findings are irrelevant to PACE? Moreover, since the Likert-style analysis of fatigue was already a secondary outcome in PACE, why did you not simply provide both bimodal and continuous analyses rather than drop the bimodal scoring altogether?
15) The “number needed to treat” (NNT) for CBT and GET was 7, as Dr. Sharpe indicated in an Australian radio interview after the Lancet publication. But based on the “normal range” data, the NNT for SMC was also 7, since those participants achieved a 15% rate of “being within normal range,” accounting for half of the rate experienced under the rehabilitative interventions.
Is that what Dr. Sharpe meant in the radio interview when he said: “What this trial wasn’t able to answer is how much better are these treatments and really not having very much treatment at all”? If not, what did Dr. Sharpe mean? Wasn’t the trial designed to answer the very question Dr. Sharpe cited? Since each of the rehabilitative intervention arms as well as the SMC arm had an NNT of 7, would it be accurate to interpret the “normal range” findings as demonstrating that CBT and GET worked as well as SMC, but not any better?
16) The PACE paper was widely interpreted, based on your findings and statements, as demonstrating that “pacing” isn’t effective. Yet patients describe “pacing” as an individual, flexible, self-help method for adapting to the illness. Would packaging and operationalizing it as a “treatment” to be administered by a “therapist” alter its nature and therefore its impact? If not, why not? Why do you think the evidence from APT can be extrapolated to what patients themselves call “pacing”? Also, given your partnership with Action4ME in developing APT, how do you explain the organization rejection of the findings in the statement issued after the study was published?
17) In your response to correspondence in the Lancet, you acknowledged a mistake in describing the Bowling sample as a “working age” rather than “adult” population–a mistake that changes the interpretation of the findings. Comparing the PACE participants to a sicker group but mislabeling it a healthier one makes the PACE results look better than they were; the percentage of participants scoring “within normal range” would clearly have been even lower had they actually been compared to the real “working age” population rather than the larger and more debilitated “adult” population. Yet the Lancet paper itself has not been corrected, so current readers are provided with misinformation about the measurement and interpretation of one of the study’s two primary outcomes.
Why hasn’t the paper been corrected? Do you believe that everyone who reads the paper also reads the correspondence, making it unnecessary to correct the paper itself? Or do you think the mistake is insignificant and so does not warrant a correction in the paper itself? Lancet policy calls for corrections–not mentions in correspondence–for mistakes that affect interpretation or replicability. Do you disagree that this mistake affects interpretation or replicability?
18) In our exchange of letters in the NYTimes four years ago, you argued that PACE provided “robust” evidence for treatment with CBT and GET “no matter how the illness is defined,” based on the two sub-group analyses. Yet Oxford requires that fatigue be the primary complaint–a requirement that is not a part of either of your other two sub-group case definitions. (“Fatigue” per se is not part of the ME definition at all, since post-exertional malaise is the core symptom; the CDC obviously requires “fatigue,” but not that it be the primary symptom, and patients can present with post-exertional malaise or cognitive problems as being their “primary” complaint.)
Given that discrepancy, why do you believe the PACE findings can be extrapolated to others “no matter how the illness is defined,” as you wrote in the NYTimes? Is it your assumption that everyone who met the other two criteria would automatically be screened in by the Oxford criteria, despite the discrepancies in the case definitions?
19) None of the multiple outcomes you cited as “objective” in the protocol supported the subjective outcomes suggesting improvement (excluding the extremely modest increase in the six-minute walking test for the GET group)? Does this lack of objective support for improvement and recovery concern you? Should the failure of the objective measures raise questions about whether people have achieved any actual benefits or improvements in performance?
20) If wearing the actometer was considered too much of a burden for patients to wear at the end of the trial, when presumably many of them would have been improved, why wasn’t it too much of a burden for patients at the beginning of the trial? In retrospect, given that your other objective findings failed, do you regret having made that decision?
21) In your response to correspondence after publication of the Psych Med paper, you mentioned multiple problems with the “objectivity” of the six-minute walking test that invalidated comparisons with other studies. Yet PACE started assessing people using this test when the trial began recruitment in 2005, and the serious limitations–the short corridors requiring patients to turn around more than was standard, the decision not to encourage patients during the test, etc.–presumably become apparent quickly.
Why then, in the published protocol in 2007, did you describe the walking test as an “objective” measure of function? Given that the study had been assessing patients for two years already, why had you not already recognized the limitations of the test and realized that it was apparently useless as an objective measure? When did you actually recognize these limitations?
22) In the Psych Med paper, you described “recovery” as recovery only from the current episode of illness–a limitation of the term not mentioned in the protocol. Since this definition describes what most people would refer to as “remission,” not “recovery,” why did you choose to use the word “recovery”–in the protocol and in the paper–in the first place? Would the term “remission” have been more accurate and less misleading? Not surprisingly, the media coverage focused on “recovery,” not on “remission.” Were you concerned that this coverage gave readers and viewers an inaccurate impression of the findings, since few readers or viewers would understand that what the Psych Med paper examined was in fact “remission” and not “recovery,” as most people would understand the terms?
23) In the Psychological Medicine definition of “recovery,” you relaxed all four of the criteria. For the first two, you adopted the “normal range” scores for fatigue and physical function from the Lancet paper, with “recovery” thresholds lower than the entry criteria. For the Clinical Global Impression scale, “recovery” in the Psych Med paper required a 1 or 2, rather than just a 1, as in the protocol. For the fourth element, you split the single category of not meeting any of the three case definitions into two separate categories–one less restrictive (‘trial recovery’) than the original proposed in the protocol (now renamed ‘clinical recovery’).
What oversight committee approved the changes in the overall definition of recovery from the protocol, including the relaxation of all four elements of the definition? Can you cite any references for your reconsideration of the CGI scale, and explain what new information prompted this reconsideration after the trial? Can you provide any references for the decision to split the final “recovery” element into two categories, and explain what new information prompted this change after the trial?
24) The Psychological Medicine paper, in dismissing the original “recovery” threshold of 85 on the SF-36, asserted that 50 percent of the population would score below this mean value and that it was therefore not an appropriate cut-off. But that statement conflates the mean and median values; given that this is not a normally distributed sample and that the median value is much higher than the mean in this population, the statement about 50 percent performing below 85 is clearly wrong.
Since the source populations were skewed and not normally distributed, can you explain this claim that 50 percent of the population would perform below the mean? And since this reasoning for dismissing the threshold of 85 is wrong, can you provide another explanation for why that threshold needed to be revised downward so significantly? Why has this erroneous claim not been corrected?
25) What are the results, per the protocol definition of “recovery”?
[The following question has been corrected. I have posted the original version of the question below. In the original, I misunderstood which assumptions for informal care were being used in the study and for the cited sensitivity analyses.]
26) The PLoS One paper reported that a sensitivity analysis found that the findings of the societal cost-effectiveness of CBT and GET would be “robust” even when informal care was measured not by replacement cost at the national mean wage but using alternative assumptions–the cost of a home care worker or minimum wage. When readers challenged this claim that the findings would be “robust” under alternative assumptions, the lead author, Paul McCrone, agreed in his responses that changing the value for informal care would, in fact, change the outcomes. He then criticized lower-cost alternative assumptions for not adequately valuing the family’s caregiving work–even though such lower-cost assumptions had been included in the PACE statistical analysis plan.
Why did the PLoS One paper include an apparently inaccurate sensitivity analysis that claimed the societal cost-effectiveness findings for CBT and GET were “robust” under alternative assumptions, even though that wasn’t the case? And if lower-cost alternative assumptions were “controversial” and “restrictive, as the lead author wrote in one of his posted responses, then why did the PACE team include them in the statistical plan in the first place?
[Below is the uncorrected version]
26) The PLoS One paper reported that a sensitivity analysis found that the findings of the societal cost-effectiveness of CBT and GET would be “robust” even when informal care was measured not by replacement cost of a health-care worker but using alternative assumptions of minimum wage or zero pay. When readers challenged this claim that the findings would be “robust” under these alternative assumptions, the lead author, Paul McCrone, agreed in his responses that changing the value for informal care would, in fact, change the outcomes. He then criticized the alternative assumptions because they did not adequately value the family’s caregiving work, even though they had been included in the PACE statistical plan.
Why did the PLoS One paper include an apparently inaccurate sensitivity analysis that claimed the societal cost-effectiveness findings for CBT and GET were “robust” under the alternative assumptions, even though that wasn’t the case? And if the alternative assumptions were “controversial” and “restrictive, as the lead author wrote in one of his posted responses, then why did the PACE team include them in the statistical plan in the first place?