February 10, 2024
12:52 pm
Artificial Intelligence

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study

Background: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources.

Objective: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets, and to compare their performance against ground truth labels from 2 independent human reviewers.

Methods: We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to call the API with the screening criteria expressed in natural language and a corpus of title and abstract data sets labeled by at least 2 human reviewers. We compared the performance of our model against the human reviewers' labels across 6 review papers, screening over 24,000 titles and abstracts.
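
The screening loop described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the prompt wording, the hypothetical `Paper` record, the function names (`build_prompt`, `parse_decision`, `screen`), and the three-way include/exclude/unclear parsing are all illustrative choices. The actual LLM request is left as a caller-supplied `ask_model` function, which in practice would wrap a call to the OpenAI API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumed prompt wording; the study's real prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are screening papers for a clinical systematic review.\n"
    "Screening criteria:\n{criteria}\n\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n\n"
    "Reply with exactly one word: INCLUDE or EXCLUDE."
)


@dataclass
class Paper:
    """One title/abstract record from a review corpus (hypothetical structure)."""
    paper_id: str
    title: str
    abstract: str


def build_prompt(criteria: str, paper: Paper) -> str:
    """Combine the natural-language screening criteria with one title/abstract pair."""
    return PROMPT_TEMPLATE.format(
        criteria=criteria, title=paper.title, abstract=paper.abstract
    )


def parse_decision(reply: str) -> str:
    """Map the model's free-text reply to 'include', 'exclude', or 'unclear'."""
    word = reply.strip().upper()
    if word.startswith("INCLUDE"):
        return "include"
    if word.startswith("EXCLUDE"):
        return "exclude"
    return "unclear"  # anything else is flagged for a human reviewer


def screen(
    papers: Iterable[Paper], criteria: str, ask_model: Callable[[str], str]
) -> dict[str, str]:
    """Screen each paper with one model call; ask_model wraps the actual LLM API request."""
    return {
        p.paper_id: parse_decision(ask_model(build_prompt(criteria, p)))
        for p in papers
    }
```

Routing unparseable replies to "unclear" instead of silently excluding them is one way to keep a human in the loop, consistent with the authors' framing of the model as an aid rather than a replacement.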

Results: Our results show an accuracy of 0.91, a macro F₁-score of 0.60, a sensitivity of 0.91 for excluded papers, and a sensitivity of 0.76 for included papers. The interrater agreement between 2 independent human screeners was κ=0.46, and the prevalence-adjusted and bias-adjusted κ (PABAK) between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models were able to provide reasoning for their decisions and corrected their initial decisions when asked to explain their reasoning for incorrect classifications.
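
The statistics reported in the Results can all be derived from a 2×2 confusion matrix of include/exclude decisions. The sketch below assumes binary labels (True = include) and uses the standard definitions: Cohen's κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each rater's marginal include/exclude rates, and, for two categories, PABAK = 2p_o - 1. The function name `screening_metrics` and the toy labels are illustrative only and do not reproduce the study's data.

```python
import math
from typing import Sequence


def _safe_div(num: float, den: float) -> float:
    # Returns 0.0 instead of raising when a class is absent.
    return num / den if den else 0.0


def _f1(tp: int, fp: int, fn: int) -> float:
    # F1 = 2TP / (2TP + FP + FN) for one class treated as "positive".
    return _safe_div(2 * tp, 2 * tp + fp + fn)


def screening_metrics(truth: Sequence[bool], pred: Sequence[bool]) -> dict[str, float]:
    """Compare model decisions (pred) with human labels (truth); True = include."""
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("truth and pred must be non-empty and of equal length")
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t and p)          # both include
    tn = sum(1 for t, p in pairs if not t and not p)  # both exclude
    fp = sum(1 for t, p in pairs if not t and p)      # model includes, humans exclude
    fn = sum(1 for t, p in pairs if t and not p)      # model excludes, humans include
    n = len(pairs)

    p_o = (tp + tn) / n  # observed agreement (equals accuracy)
    # Chance agreement from each rater's marginal include/exclude proportions.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    # Cohen's kappa is undefined when chance agreement is 1 (both raters use one class).
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else math.nan

    return {
        "accuracy": p_o,
        "sensitivity_included": _safe_div(tp, tp + fn),  # recall on included papers
        "sensitivity_excluded": _safe_div(tn, tn + fp),  # recall on excluded papers
        # Macro F1: unweighted mean of the per-class F1 scores.
        "macro_f1": (_f1(tp, fp, fn) + _f1(tn, fn, fp)) / 2,
        "cohen_kappa": kappa,
        # PABAK for two categories: 2 * p_o - 1.
        "pabak": 2 * p_o - 1,
    }


if __name__ == "__main__":
    # Toy labels: 2 included and 8 excluded papers according to the human reviewers.
    truth = [True, True] + [False] * 8
    pred = [True, False] + [False] * 7 + [True]
    print(screening_metrics(truth, pred))
```

Because PABAK depends only on observed agreement, it can be considerably higher than Cohen's κ on screening data sets where most records are excluded; in the toy example, accuracy is 0.8, κ is roughly 0.375, and PABAK is roughly 0.6.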

Conclusions: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.

Keywords: Chat GPT; GPT; GPT-4; LLM; NLP; abstract screening; classification; extract; extraction; free text; language model; large language models; natural language processing; nonopioid analgesia; review methodology; review methods; screening; systematic; systematic review; unstructured data.
