A Comparison of Individual Change using Item Response Theory and Sum Scoring on the Patient Health Questionnaire-9: Implications for Measurement-Based Care

Research Article

Ann Depress Anxiety. 2019; 6(1): 1098.

Jones SMW¹*, Crane PK² and Simon G³

¹Fred Hutchinson Cancer Research Center, US

²Department of Medicine, University of Washington, US

³Senior Investigator, Kaiser Permanente Washington Health Research Institute, US

Corresponding author: Jones SMW, Assistant Member, Fred Hutchinson Cancer Research Center, Fairview Ave, Seattle, WA 98109, US

Received: January 22, 2019; Accepted: March 25, 2019; Published: April 01, 2019

Abstract

We examined change over time in depression using standard sum scoring vs. Item Response Theory (IRT) scoring. Patient Health Questionnaire-9 item responses were extracted from the electronic health records of 5,405 people receiving depression treatment, at the start of treatment and 30 to 180 days later. We used four methods to classify change: the Reliable Change Index (RCI), 5-point change, and 50% change from baseline for sum scores, and the z-test for IRT scoring. The 5-point change and 50% change from baseline are both Healthcare Effectiveness Data and Information Set (HEDIS) measures. The z-test mostly agreed with the RCI, 5-point change and 50% change. When the methods disagreed, it was more common for a person to show change by the 5-point or 50% criterion but not by IRT scoring than to show change by IRT scoring but not by the 5-point or 50% criterion. Kappas between change on IRT and sum scores ranged from 0.620 to 0.813. This difference in agreement is likely meaningful at the individual patient level. People classified differently between IRT and sum scoring had moderate symptom change. Differences in conclusions from IRT and sum scoring may be most relevant in challenging clinical situations such as small or moderate symptom change.

Keywords: Depression; Treatment response; Item response theory; Change scores

Introduction

Item Response Theory (IRT) models have been increasingly used as an alternative to classical test theory in measure development and validation for psychiatric outcomes such as depression and anxiety [1]. IRT scoring may have more precision in distinguishing statistically significant individual differences in change over time [2]. A cross-sectional study found that even among people with the same standard sum score, IRT scores were associated with external criteria in the hypothesized direction [3], suggesting that IRT scoring may be more informative about the actual level of depression or other symptoms in treatment than standard sum scores. Simulation studies demonstrate that IRT scoring may reduce bias in estimating rates of change over time compared to standard sum scoring [4]. Part of this reduction in bias may stem from IRT models not assuming that error is constant along the continuum of a measure, unlike classical test theory [2]. Although IRT scores and sum scores are highly correlated, even the small amount of disagreement between the scores may have an impact at the individual patient level [5].

While there may be some psychometric advantages of IRT compared to classical test theory, different scoring methods may differ in their usefulness for measurement-based care. Measurement-based care is the use of patient-reported data, primarily Patient-Reported Outcomes (PROs), in healthcare treatment [6-9]. Adoption in community settings is variable and below 20% [10]. Although new Healthcare Effectiveness Data and Information Set (HEDIS) quality metrics emphasizing measurement-based care [11] are expected to accelerate its use, assessing individual change in measurement-based care is particularly difficult and remains a barrier to implementation [5,10]. IRT may be one way to address this challenge. But the benefits of IRT scoring in measurement-based care need to be weighed against the practical advantages of standard sum scoring (simpler, easier, more transparent to clinicians). For example, nearly half of practicing clinical psychologists are in private practices [12] and only 15% of psychiatric hospitals have electronic medical records [13]. Implementing a complicated scoring system like IRT would be challenging in these settings as they do not have the infrastructure of large medical-surgical hospitals or academic centers. Research on different scoring methods for individual change has been mixed [14-18].

The aim of this study was to examine agreement between IRT scoring and standard sum scoring in classifying change from depression treatment initiation to follow-up on the Patient Health Questionnaire-9 (PHQ9). HEDIS focuses on simple, easy-to-compute measures of change such as 50% change from baseline [11]. More sophisticated measures of statistically significant change, such as those from IRT, would only be needed if the two approaches disagreed substantially and were not interchangeable. Measurement-based care includes evaluating whether the initial treatment choice was successful, so determining change in symptoms from treatment initiation could help inform clinical decision making. We therefore focused on whether IRT and standard scoring gave different results on whether the initial treatment was effective. We also specified two sets of change measures for sum scores for comparison to IRT scoring: statistically significant change by the Reliable Change Index [19] and general guidelines [20].
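
To make the change rules named above concrete, the following is a minimal Python sketch, not the authors' code. It assumes that improvement means a decrease in score, that the 5-point and 50% thresholds are inclusive, that the Reliable Change Index takes its standard form (change divided by the standard error of the difference), and that the IRT z-test divides the change in the latent score by the pooled standard error of the two estimates. All function names, and the standard deviation and reliability values in the usage note, are hypothetical and not taken from the study.

```python
import math


def five_point_change(baseline: int, follow_up: int) -> bool:
    """True if the sum score dropped by at least 5 points (inclusive: an assumption)."""
    return (baseline - follow_up) >= 5


def fifty_percent_change(baseline: int, follow_up: int) -> bool:
    """True if the follow-up sum score is at most half of baseline (inclusive: an assumption)."""
    if baseline <= 0:
        return False  # no symptoms at baseline, so no reduction can be shown
    return follow_up <= 0.5 * baseline


def reliable_change_index(baseline: float, follow_up: float,
                          sd: float, reliability: float) -> float:
    """Change divided by the standard error of the difference.

    sd and reliability describe the scale in a reference sample; the caller
    must supply them (they are not given in this section of the paper).
    """
    sem = sd * math.sqrt(1.0 - reliability)   # standard error of measurement
    s_diff = math.sqrt(2.0) * sem             # standard error of a difference score
    return (follow_up - baseline) / s_diff


def irt_z(theta_baseline: float, se_baseline: float,
          theta_follow_up: float, se_follow_up: float) -> float:
    """z statistic for change in IRT scores, using each score's own standard error (assumed form)."""
    pooled_se = math.sqrt(se_baseline ** 2 + se_follow_up ** 2)
    return (theta_follow_up - theta_baseline) / pooled_se


def is_significant(stat: float, critical: float = 1.96) -> bool:
    """Two-sided test at the conventional 0.05 level."""
    return abs(stat) > critical
```

As a usage example with illustrative values only, is_significant(reliable_change_index(16, 10, sd=5.0, reliability=0.84)) evaluates a 6-point drop against a standard error of the difference of about 2.83. Unlike the sum-score rules, irt_z lets the standard error vary from person to person and from occasion to occasion, which is the property of IRT scoring highlighted in the Introduction.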

Methods

Population and procedures

Data were collected from the Electronic Health Records (EHR) of people starting treatment for depression (psychotherapy or antidepressants) in three integrated health systems: Kaiser Permanente Washington (formerly Group Health), Kaiser Permanente Colorado, and HealthPartners (n=5,420, see flow chart in Figure 1). A new episode of either antidepressant medication or psychotherapy was defined by a psychotherapy visit or a filled antidepressant prescription associated with a diagnosis of depression, preceded by at least 180 days without a psychotherapy visit or antidepressant prescription. Data were extracted for the period between January 1, 2010 and December 31, 2012. PHQ9 item responses were collected at baseline, defined as when participants were first starting depression treatment within our time window, and at a follow-up health care visit that occurred at least 30 days after baseline but no more than 180 days after baseline. Limited demographic information was collected from the EHR including age, sex, race/ethnicity and presence of medical comorbidities. As this study was a secondary analysis of data collected from another study, we did not have specific diagnoses for comorbidities nor number of comorbidities, though we had data on whether medical comorbidities were present as measured by the Charlson Comorbidity Index [21]. Responsible Institutional Review Boards for each health system reviewed all study procedures and approved a waiver of consent for use of de-identified records data for this research (IRB#213058). Study procedures complied with all ethical standards including the Helsinki Declaration.
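
The episode and follow-up definitions above can be sketched as simple date rules. This is an illustrative Python sketch under stated assumptions, not the study's extraction code: the function names are hypothetical, the 30-day and 180-day follow-up limits are treated as inclusive, and a prior treatment date fewer than 180 days before the index date is taken to rule out a new episode. The text does not say which visit was used when several fell inside the window, so the sketch returns all eligible dates.

```python
from datetime import date
from typing import Iterable, List


def is_new_episode(index_date: date, prior_treatment_dates: Iterable[date],
                   washout_days: int = 180) -> bool:
    """True if no earlier psychotherapy visit or antidepressant fill falls
    within washout_days before index_date (boundary handling is an assumption)."""
    for prior in prior_treatment_dates:
        gap = (index_date - prior).days
        if 0 < gap < washout_days:
            return False
    return True


def eligible_follow_ups(baseline_date: date, visit_dates: Iterable[date],
                        min_days: int = 30, max_days: int = 180) -> List[date]:
    """Visit dates at least min_days and at most max_days after baseline, in date order."""
    return sorted(
        visit for visit in visit_dates
        if min_days <= (visit - baseline_date).days <= max_days
    )
```

For a baseline of date(2010, 1, 1), a visit on date(2010, 1, 20) falls before the window, while visits on date(2010, 1, 31) and date(2010, 6, 30) sit on its lower and upper limits and are kept under the inclusive assumption.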