By David Tuller, DrPH
Trudie Chalder, a professor of cognitive behavior therapy (CBT) at King’s College London, has recently published yet another high-profile paper: the main results for “efficacy” from a trial of CBT for patients with so-called “persistent physical symptoms” (PPS) in secondary care. As usual with this group of investigators, things haven’t turned out well. But despite null results for the primary outcome, Professor Chalder and her like-minded colleagues have cast the findings in a positive light in their article, published in Psychological Medicine.
(Psychological Medicine also published the bogus 2013 “recovery” paper from the PACE team; Professor Chalder was one of the three lead investigators for this classic of likely research misconduct, in which participants could get worse on the primary outcomes and still be deemed to be “recovered.” When I complained to the editors about it a few years ago, I was advised to replicate the PACE trial; instead, I wrote a letter to the journal that demanded an immediate retraction and garnered more than 100 signatories.)
The new study, called the PRINCE Secondary Trial, is separate from another trial of people with persistent physical symptoms in primary care—the PRINCE Primary Trial. Both are part of the ongoing campaign by Professor Chalder and her colleagues to provide an evidentiary base to justify the expansion of psychological services to anyone suffering from PPS, a category also frequently referred to as “medically unexplained symptoms” (MUS). For Professor Chalder and her colleagues, PPS and MUS include chronic fatigue syndrome, irritable bowel syndrome, fibromyalgia, and pretty much anything else that resists easy clinical assessment and diagnosis and could be inferred to be psychologically driven and/or perpetuated by experts predisposed toward such an interpretation.
First, let’s note that PRINCE Secondary is an unblinded trial relying on self-reported outcomes—a study design fraught with potential and actual bias. This is not just the opinion of people who dislike the PACE trial and believe that Psychological Medicine publishes a lot of crap research. As I’ve noted, the current editor-in-chief of the Journal of Psychosomatic Research, along with his two predecessors, published an editorial earlier this year in which they clearly indicated that subjective outcomes were subject to enormous bias in studies that were not rigorously blinded. (That hasn’t stopped the journal from continuing to publish such problematic research, such as recent output from Professor Chalder’s PACE colleague, Professor Peter White.)
On the basis of this laudable stance from the Journal of Psychosomatic Research, any positive findings from PRINCE Secondary would be suspect from the start. But that’s not even at issue here, given the null results for the primary outcome—the Work and Social Adjustment Scale (WSAS) at 52 weeks. (The invaluable blog CBT Watch has published a critique of the trial.)
PRINCE Secondary, according to the trial protocol, was “an RCT designed to evaluate the efficacy and cost-effectiveness of a transdiagnostic cognitive behavioural intervention for adults with PPS in secondary care.” The intervention—“therapist-delivered transdiagnostic” CBT, or TDT-CBT—was offered in addition to standard medical care (SMC). It was specifically developed to address the following concerns, as outlined in the published paper: “Patients with PPS can develop unhelpful cognitions and behaviour which can consequently lead to a reduction in daily functioning, reduced quality of life, and an increased susceptibility towards developing depression and anxiety.” The comparison arm received SMC alone.
The protocol touted the trial as a major initiative: “The PRINCE Secondary study will be the first trial worldwide to address the efficacy and cost-effectiveness of a manual-based, transdiagnostic, approach…If it proves to be efficacious, this treatment approach could significantly improve overall functioning in patients with PPS and may lead to substantial long-term economic benefits to the NHS.”
That last point is important. Professor Chalder and many of her colleagues have promoted the expansion of the National Health Service’s Improving Access to Psychological Therapies program to people with MUS. After this publication, it will be hard to cite PRINCE Secondary as proof that CBT for MUS is an “efficacious” treatment–on the contrary, the results undermine any such claim. But that probably won’t make Professor Chalder or any other members of the CBT ideological brigades question their own assertions about their favored interventions. [In research, “efficacy” and “efficacious” refer to how interventions perform in controlled studies like clinical trials; “effectiveness” and “effective” refer to how interventions perform in the real world.]
Let’s forget about “efficacy” and cite “helpfulness” instead
The protocol for the PRINCE Secondary Trial, published in BMC Psychiatry, clearly stated: “Efficacy will be assessed by examining the difference between arms in the primary outcome Work and Social Adjustment Scale (WSAS) at 52 weeks after randomisation.” The trial itself notes that the WSAS “was chosen as the primary outcome because the focus of therapy was on targeting processes which might result in a reduction of the impact of symptoms.” In other words, the TDT-CBT was aimed specifically at influencing the cognitive and behavioral factors that were presumed to be preventing PPS patients from full engagement in their work and social lives.
Ok, then. After conducting their due diligence and assessing all the earlier studies and possible outcome measures in the process of developing an authoritative protocol, the investigators determined how they wanted everyone to definitively measure the impact of their intervention. The WSAS is a 40-point scale. The investigators calculated that a difference of 3.6 points or more on the scale would be considered clinically significant. That is, any change of less than 3.6 points would be of insignificant clinical benefit to an individual—it would be an essentially meaningless blip that would not translate into a noticeable improvement.
At 52 weeks, the mean WSAS score of those who received the intervention was only 1.48 points lower than that of those who did not. (Lower WSAS scores represent improvement.) The p-value was 0.139—well above the 0.05 threshold required for the result to be considered statistically significant. So the WSAS 52-week findings were both clinically and statistically insignificant. Moreover, the entire confidence interval (-3.44 to 0.48) fell short of the designated 3.6-point threshold for clinical significance. These are pretty unequivocal results. They are definitely not useful to those seeking to promote the use of CBT as a treatment for PPS and MUS.
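The arithmetic here is simple enough to restate as a sanity check. Below is a minimal sketch using only the numbers reported above, under one labeled assumption: that the difference is expressed as intervention minus comparison, so improvement shows up as a negative number and clinical significance requires reaching -3.6 or lower.

```python
# Sanity check on the reported WSAS primary-outcome numbers at 52 weeks.
# Assumption: difference = intervention minus comparison, so improvement is
# negative, and a clinically significant benefit requires <= -3.6 points.

mcid = -3.6                    # minimum clinically important difference
difference = -1.48             # reported mean between-group difference
ci_low, ci_high = -3.44, 0.48  # reported 95% confidence interval

# The point estimate does not reach the threshold...
assert difference > mcid
# ...and even the most favorable end of the interval falls short of it,
# so no value compatible with the data would be clinically significant.
assert ci_low > mcid
print("any clinically significant value in CI?", ci_low <= mcid)
```

In other words, it is not just that the average effect missed the mark; the data rule out a clinically significant effect across the whole plausible range.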
That’s why the conclusion of the paper’s abstract is so striking, and so bizarre. An abstract’s conclusion is what many people who scan a paper will likely remember the most. Here is the entire conclusion of this abstract: “We have preliminary evidence that TDT-CBT + SMC may be helpful for people with a range of PPS. However, further study is required to maximise or maintain effects seen at end of treatment.”
Anyone who takes the time to review the paper should be mystified by this conclusion. This full-scale trial was approved because a lot of earlier research, as outlined in the protocol, had produced ample “preliminary evidence” of the kind mentioned in the conclusion. The protocol, unless I misread it, did not propose to produce more “preliminary evidence” that the TDT-CBT intervention “may be helpful.” PRINCE Secondary was presented in the protocol and received funding based on the notion that it would produce hard data about “the efficacy and cost-effectiveness” of the intervention. (The Psychological Medicine paper did not include the “cost-effectiveness” data.)
It should be noted that “helpfulness” is not the same as “efficacy” and is not defined in the protocol or the trial itself. An intervention might be “helpful” in some way as a supportive strategy while having no “efficacy” as an actual treatment. In this trial, the method of assessing the “efficacy” of the treatment was clearly designated; the results did not achieve that metric, so the treatment cannot be described as “efficacious.” As a vague stand-in, “helpfulness” sounds positive but can mean more or less anything—as it seems to here.
In the paper, the investigators designate eight secondary outcomes. They tout marginal improvements in three of them as indicating possible “helpfulness.” But the results suggest at best the following: Giving people eight sessions of encouragement and attention could prompt them to upgrade their answers by one step or two on some—but not most—questionnaires, compared to those who do not receive eight weeks of such encouragement and attention. That’s it. Expansive interpretations of “helpfulness” are not justified.
Let’s examine these secondary results in a bit more detail. The first is the WSAS at 20 weeks, which reported a 2.41-point difference between the groups. This is still below the 3.6-point threshold for clinical significance. And as expected, given the bias inherent in self-reported outcomes in unblinded studies, even this minimal apparent effect was not maintained by the 52-week point. (The WSAS at 20 weeks was not in fact listed as a secondary outcome in the protocol.)
Other results are also unimpressive. Five of the eight listed secondary outcomes did not produce statistically significant findings. Two others did. The intervention group achieved a 1.51-point difference from the comparison group on the 30-point Patient Health Questionnaire 15 (PHQ-15) and a 0.55-point difference on the 9-point Clinical Global Impression (CGI) scale. These minimal reported improvements do not provide convincing evidence in favor of the intervention, since they are well within the range of responses one might expect, in an unblinded study with subjective outcomes, from the kind of bias noted by the editors of the Journal of Psychosomatic Research.
The problem of multiple comparisons
And there’s another issue here. When authors engage in multiple comparisons, they increase the likelihood of obtaining some results that reach statistical significance by chance alone. To account for that, it is standard practice to correct or adjust the results with well-established statistical methods—the Bonferroni correction is the best known. Yet the PRINCE investigators don’t like that approach. It is too stringent for their needs, so they decided not to use it. Here’s what they say:
“Throughout this paper, we present unadjusted p values. Methods for adjusting the family-wise error by methods such as the Bonferroni correction are known to be conservative. However, if one were to use a method that controlled the false-discovery rate such as the Benjamini–Hochberg procedure then the differences on PHQ-15, WSAS at 20 weeks and CGI remained statistically significant and would therefore be considered as discoveries after correction for all nine outcomes (eight secondary plus primary outcome).”
My creative interpretation of this statement: Our findings are weak, and they’re even weaker than they appear as reported. That’s why we didn’t bother to calculate and present p-values that accounted and corrected for how many tests we ran in trying to find statistically significant results. Also, the standard method of correcting for multiple tests in a study like this is really, really tough to get through, so we’re not going to use it. But we can assure you that a correction method we like better still lets us call these results “discoveries”! (We’re not presenting those corrected results, but trust us on this one–the study is a success.)
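For readers unfamiliar with the two correction methods the investigators invoke, here is a minimal sketch of how each works. The p-values below are invented purely for illustration, since the paper does not report the full set of nine; the point is the mechanics, not the trial’s actual numbers.

```python
# Bonferroni vs. Benjamini-Hochberg on nine HYPOTHETICAL p-values.
# (These numbers are invented solely to show how the two methods differ.)

def bonferroni(pvals, alpha=0.05):
    """Family-wise error control: reject H0_i only if p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """False-discovery-rate control: sort p-values ascending, find the
    largest rank i with p_(i) <= (i / m) * alpha, reject that many."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

pvals = [0.139, 0.001, 0.004, 0.01, 0.2, 0.4, 0.5, 0.6, 0.7]  # invented
print(sum(bonferroni(pvals)))          # Bonferroni: alpha/9, the strictest bar
print(sum(benjamini_hochberg(pvals)))  # BH relaxes the bar as rank grows
```

Because the (i/m) factor loosens the per-test threshold as the rank grows, Benjamini–Hochberg never rejects fewer hypotheses than Bonferroni at the same alpha, which is exactly why citing only the FDR-style correction is the more flattering choice.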
An abstract’s conclusion should at least make an effort to incorporate the findings for the primary outcome—the main results of interest. The conclusion from Chalder and colleagues should have forthrightly noted that the intervention was not efficacious. In this context, to prioritize exceedingly modest unadjusted results from a minority of secondary outcomes over the null results of the primary outcome is an insult to readers. This tactic for hiding bad news reeks of desperation—like a bald man’s comb-over.
Furthermore, it is disingenuous for investigators to assert that these meager data represent “preliminary evidence” for the intervention when the primary outcome was a bust. The call for further research to study how to maintain or extend these effects is unwarranted. The intervention failed to produce the predicted and desired effects. That’s the only credible interpretation of these disastrous results.
Peer reviewers and journal editors are supposed to act as safeguards against misrepresentations. In this case, the system failed. A conclusion that does not mention the null results for the primary outcome is unacceptable—and it should have been unpublishable.