Development and validation of coding algorithms to identify patients with incident lung cancer in United States healthcare claims data

Pharmacoepidemiol Drug Saf. 2020 Oct 4. doi: 10.1002/pds.5137. Online ahead of print.


PURPOSE: Our aim was to develop and validate a practical US healthcare claims algorithm for identifying incident lung cancer that improves on positive predictive value (PPV) and sensitivity observed in past studies.

METHODS: Patients newly diagnosed with lung cancer in Surveillance, Epidemiology, and End Results (SEER) (gold standard) were linked with Medicare claims. A 5% Medicare "other cancer" sample and a noncancer sample served as controls. A split-sample validation approach was used. Rules-based, regression, and machine learning approaches to algorithm development were explored. Algorithms were developed in the model-building subset. Rules-based algorithms and those with the highest F scores were evaluated in the validation subset. F scores were compared across 1000 bootstrap samples. Misclassification was evaluated by calculating the odds of selection by the algorithm among true positives and true negatives.
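The bootstrap comparison of F scores can be illustrated with a minimal sketch. The abstract does not state how the 1000 bootstrap samples were drawn or compared, so the paired resampling and the percentile interval below are assumptions, as are the function names `f_score_from_counts` and `bootstrap_f_difference`. The F score is taken here to be the harmonic mean of PPV and sensitivity (F1).

```python
import random


def f_score_from_counts(y_true, y_pred):
    """F1 score: harmonic mean of PPV and sensitivity (assumed definition)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def bootstrap_f_difference(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """Percentile 95% interval for F(a) - F(b) over paired bootstrap samples.

    Hypothetical helper: patients are resampled with replacement, and both
    algorithms are scored on the same resampled patients each time.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f_score_from_counts(truth, a) - f_score_from_counts(truth, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

Here `y_true` would hold the SEER gold-standard status (1 = incident lung cancer) and `pred_a`, `pred_b` the flags produced by two candidate algorithms. An interval that excludes zero would suggest that one algorithm has a higher F score than the other in the validation subset.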

RESULTS: A practical single-score algorithm derived from a logistic regression model had sensitivity = 78.22% and PPV = 78.50% (F score: 78.36). The algorithm was most likely to misclassify patients who were older (ages ≥80 years), had missing data in the SEER registry, had shorter follow-up time in Medicare (<3 months), were insured through Veterans Affairs, had >1 cancer in SEER, or had certain Charlson comorbidities (dementia, chronic pulmonary disease, liver disease, or myocardial infarction).
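The reported F score is consistent with the harmonic mean of sensitivity and PPV (the F1 score). The abstract does not spell out the formula, so that definition is an assumption here, and `f_score_from_rates` is a hypothetical helper name.

```python
def f_score_from_rates(sensitivity, ppv):
    """Harmonic mean of sensitivity and PPV (F1), in the units of the inputs."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Values reported in RESULTS, in percent.
reported_f = f_score_from_rates(78.22, 78.50)
print(round(reported_f, 2))  # 78.36
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an algorithm scores well on this measure only when sensitivity and PPV are both high.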

CONCLUSION: In this dataset, a practical point-based algorithm for identifying incident lung cancer demonstrated significant and substantial improvement (7.9% and 23.9% absolute improvement in sensitivity and PPV, respectively) compared with a current standard.