
Research

Estima Scientific has published pioneering research on how GenAI is transforming patient healthcare. Our peer-reviewed articles have been cited by NICE in its 'Use of AI in evidence generation' position statement, and we have presented award-winning abstracts at ISPOR, the leading global conference for health economics and outcomes research (HEOR).


Peer Reviewed Articles


‘The “Artificial Intelligence Statistician”: Utilizing Generative Artificial Intelligence to Select an Appropriate Model and Execute Network Meta-Analyses’

Reason 2025
Value in Health
Tim Reason MSc, Yunchou Wu PhD, Cheryl Jones PhD, Emma Benbow PhD, Kasper Johannesen PhD, Bill Malcolm MSc
Highlights

Large language models (LLMs) have been successfully applied to automate key aspects of health economics and outcomes research with high accuracy, including tasks integral for network meta-analyses (NMAs).

This study demonstrates the potential for an LLM-based process to automate key components of NMA workflows and produce accurate results.

An LLM-based automated process could transform NMA practices and offer a potential solution toward addressing the increasing demands of health technology assessment.

Objectives

This exploratory study aimed to develop a large language model (LLM)-based process to automate components of network meta-analysis (NMA), including model selection, analysis, output evaluation, and results interpretation. Automating these tasks with LLMs can enhance efficiency, consistency, and scalability in health economics and outcomes research, while ensuring that analyses adhere to established guidelines required by health technology assessment agencies. Improvements in efficiency and scalability may potentially become relevant as the European Union Health Technology Assessment Regulation comes into force, given anticipated analysis requirements and timelines.

Methods

Using Claude 3.5 Sonnet (V2), a process was designed to automate statistical model selection, NMA output evaluation, and results interpretation based on an “analysis-ready” data set. Validation was assessed by replicating examples from the National Institute for Health and Care Excellence Technical Support Document (TSD2), replicating results of non-Decision Support Unit-published NMAs, and generating comprehensive outputs (eg, heterogeneity, inconsistency, and convergence).

Results

The automated LLM-based process produced accurate results. Compared with TSD2 examples, differences were minimal, within expectations (given differences in the sampling frameworks used), and comparable to those observed between the estimates produced by the R vignettes and those in TSD2. Similar consistency was noted for the non-Decision Support Unit-published NMA examples. Additionally, the LLM process generated and interpreted comprehensive NMA outputs.

Conclusions

This exploratory study demonstrates the feasibility of LLMs to automate key components of NMAs, determining the requisite NMA framework based only on input data. Further exploring these capabilities could clarify their role in streamlining NMA workflows.

Cite This Article

Reason, Tim, Yunchou Wu, Cheryl Jones, Emma Benbow, Kasper Johannesen, and Bill Malcolm. 2025. ‘The “Artificial Intelligence Statistician”: Utilizing Generative Artificial Intelligence to Select an Appropriate Model and Execute Network Meta-Analyses’. Value in Health, August, S1098301525025112. https://doi.org/10.1016/j.jval.2025.08.001.

Read Paper
Automating NMA
Proof of Concept Studies

‘Using Generative Artificial Intelligence in Health Economics and Outcomes Research: A Primer on Techniques and Breakthroughs’

Reason 2025
PharmacoEconomics
Tim Reason, Sven Klijn, Will Rawlinson, Emma Benbow, Julia Langham, Siguroli Teitsson, Kasper Johannesen & Bill Malcolm
Abstract

The emergence of generative artificial intelligence (GenAI) offers the potential to enhance health economics and outcomes research (HEOR) by streamlining traditionally time-consuming and labour-intensive tasks, such as literature reviews, data extraction, and economic modelling. To effectively navigate this evolving landscape, health economists need a foundational understanding of how GenAI can complement their work. This primer aims to introduce health economists to the essentials of using GenAI tools, particularly large language models (LLMs), in HEOR projects.

For health economists new to GenAI technologies, chatbot interfaces like ChatGPT offer an accessible way to explore the potential of LLMs. For more complex projects, knowledge of application programming interfaces (APIs), which provide scalability and integration capabilities, and prompt engineering strategies, such as few-shot and chain-of-thought prompting, is necessary to ensure accurate and efficient data analysis, enhance model performance, and tailor outputs to specific HEOR needs. Retrieval-augmented generation (RAG) can further improve LLM performance by incorporating current external information.
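As an illustration of the prompting strategies mentioned above, the sketch below assembles a few-shot, chain-of-thought prompt in the chat-message format common to most LLM APIs. The task, example content, and the `build_messages` helper are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch: building a few-shot, chain-of-thought message list in the
# chat format used by most LLM APIs. Task and examples are hypothetical.

def build_messages(system_role, examples, query):
    """Assemble a system role, worked examples (few-shot), then the new
    query with an explicit request for step-by-step reasoning."""
    messages = [{"role": "system", "content": system_role}]
    for question, worked_answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": worked_answer})
    # Chain-of-thought: ask the model to reason before giving its answer.
    messages.append({
        "role": "user",
        "content": query + "\n\nThink step by step, then state the final answer.",
    })
    return messages

examples = [(
    "Extract the ICER from: 'The ICER was $52,000 per QALY.'",
    "The text states the ICER directly. Final answer: 52000 USD/QALY",
)]
msgs = build_messages(
    "You are an HEOR data-extraction assistant.",
    examples,
    "Extract the ICER from: 'Treatment B yielded an ICER of $31,500 per QALY.'",
)
```

The resulting list would be passed to an API client as the `messages` payload; the worked example anchors the output format, and the final instruction elicits intermediate reasoning.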

LLMs have significant potential in many common HEOR tasks, such as summarising medical literature, extracting structured data, drafting report sections, generating statistical code, answering specific questions, and reviewing materials to enhance quality. However, health economists must also be aware of ongoing limitations and challenges, such as the propensity of LLMs to produce inaccurate information (‘hallucinate’), security concerns, issues with reproducibility, and the risk of bias. Implementing LLMs in HEOR requires robust security protocols to handle sensitive data in compliance with the European Union’s General Data Protection Regulation (GDPR) and the United States’ Health Insurance Portability and Accountability Act (HIPAA). Deployment options such as local hosting, secure API use, or cloud-hosted open-source models offer varying levels of control and cost, each with unique trade-offs in security, accessibility, and technical demands. Reproducibility and transparency also pose unique challenges.

To ensure the credibility of LLM-generated content, explicit declarations of the model version, prompting techniques, and benchmarks against established standards are recommended. Given the ‘black box’ nature of LLMs, a clear reporting structure is essential to maintain transparency and validate outputs, enabling stakeholders to assess the reliability and accuracy of LLM-generated HEOR analyses. The ethical implications of using artificial intelligence (AI) in HEOR, including LLMs, are complex and multifaceted, requiring careful assessment of each use case to determine the necessary level of ethical scrutiny and transparency. Health economists must balance the potential benefits of AI adoption against the risks of maintaining current practices, while also considering issues such as accountability, bias, intellectual property, and the broader impact on the healthcare system.

As LLMs and AI technologies advance, their potential role in HEOR will become increasingly evident. Key areas of promise include creating dynamic, continuously updated HEOR materials, providing patients with more accessible information, and enhancing analytics for faster access to medicines. To maximise these benefits, health economists must understand and address challenges such as data ownership and bias. The coming years will be critical for establishing best practices for GenAI in HEOR. This primer encourages health economists to adopt GenAI responsibly, balancing innovation with scientific rigor and ethical integrity to improve healthcare insights and decision-making.

Cite This Article

Reason, Tim, Sven Klijn, Will Rawlinson, et al. 2025. ‘Using Generative Artificial Intelligence in Health Economics and Outcomes Research: A Primer on Techniques and Breakthroughs’. PharmacoEconomics - Open 9 (4): 501–17. https://doi.org/10.1007/s41669-025-00580-4.

Read Paper
GenAI methods

‘Generative Artificial Intelligence to Automate the Adaptation of Excel Health Economic Models and Word Technical Reports’

Rawlinson 2025
Value in Health
William Rawlinson MPhysPhil, Siguroli Teitsson MSc, Tim Reason MSc, Bill Malcolm MSc, Andy Gimblett PhD, Sven L. Klijn MSc
Highlights

Many countries require a health technology assessment of new health interventions, which often requires adapting a “global” economic model and technical report from the setting of the reference country to the setting of the new country.

We developed robust methods to automatically adapt global Excel-based health economic models and associated Word technical reports using large language models. Using these methods, routine adaptations of Excel-based models and technical reports were performed accurately and rapidly at a low cost.

Large language models have huge potential for automating tasks that are currently performed manually in health economics and outcomes research, which could greatly expedite the health technology assessment process and improve timely patient access.

Objectives

In health economics and outcomes research (HEOR), many repetitive tasks could be performed by large language models (LLMs), including adapting Excel-based health economic models and associated Word technical reports to a new setting. However, it is vital to develop robust methods so that the LLM delivers at least human-level accuracy.

Methods

We developed LLM-based pipelines to automate parameter value adaptations for Excel-based models and subsequent reporting of the model results. Chain-of-thought prompting, ensemble shuffling, and task decomposition were used to enhance the accuracy of the LLM-generated content. We tested the pipelines by adapting 3 Excel-based models (2 cost-effectiveness models [CEMs] and 1 budget impact model [BIM]) and their associated technical reports. The quality of reporting was evaluated by 2 expert health economists.
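Of the accuracy techniques named above, ensemble shuffling can be illustrated with a minimal sketch: the same extraction is run several times with the inputs presented in a different order each time, and the majority answer is kept. The `extract` callable stands in for an LLM call; this is an assumption about the general idea, not the authors' implementation.

```python
import random
from collections import Counter

# Sketch of 'ensemble shuffling': repeat the same task with the source rows
# presented in a different order per run, then keep the majority answer.
# `extract` stands in for an LLM call (hypothetical stub for illustration).

def ensemble_shuffle(rows, extract, n_runs=5, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_runs):
        shuffled = rows[:]
        rng.shuffle(shuffled)          # vary presentation order per run
        votes[extract(shuffled)] += 1  # one 'vote' per run
    value, count = votes.most_common(1)[0]
    return value, count / n_runs       # consensus value and agreement rate
```

In practice the agreement rate gives a rough signal of how sensitive the model's answer is to input ordering; low agreement would mark an output for human review.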

Results

The accuracy of parameter value adaptations was 100% (147 of 147), 100% (207 of 207), and 98.7% (158 of 160) for the 2 CEMs and the BIM, respectively. The parameter value adaptations were performed without human intervention in 195 seconds, 245 seconds, and 189 seconds. For parameter value adaptations, the application programming interface costs associated with running the pipeline were $13.36, $6.48, and $2.65. The accuracy of report adaptations was 94.4% (17 of 18), 100% (54 of 54), and 95.1% (39 of 41), respectively. The report adaptations were performed in 128 seconds, 336 seconds, and 286 seconds. For report adaptations, the application programming interface costs associated with running the pipeline were $1.53, $4.24, and $4.05.

Conclusions

LLM-based toolchains have the potential to accurately and rapidly perform routine adaptations of Excel-based CEMs and technical reports at a low cost. This could expedite health technology assessments and improve patient access to new treatments.

Cite This Article

Rawlinson, William, Siguroli Teitsson, Tim Reason, Bill Malcolm, Andy Gimblett, and Sven L. Klijn. 2025. ‘Generative Artificial Intelligence to Automate the Adaptation of Excel Health Economic Models and Word Technical Reports’. Value in Health, June, S109830152502399X. https://doi.org/10.1016/j.jval.2025.05.020.

Read Paper
Automating Economic Models
Proof of Concept Studies

‘Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models.’

Reason 2024
PharmacoEconomics - Open
Tim Reason, Emma Benbow, Julia Langham, Andy Gimblett, Sven L. Klijn, Bill Malcolm
Background

The emergence of artificial intelligence, capable of human-level performance on some tasks, presents an opportunity to revolutionise the development of systematic reviews and network meta-analyses (NMAs). In this pilot study, we aim to assess the use of a large language model (LLM, Generative Pre-trained Transformer 4 [GPT-4]) to automatically extract data from publications, write an R script to conduct an NMA, and interpret the results.

Methods

We considered four case studies involving binary and time-to-event outcomes in two disease areas, for which an NMA had previously been conducted manually. For each case study, a Python script was developed that communicated with the LLM via application programming interface (API) calls. The LLM was prompted to extract relevant data from publications, to create an R script to be used to run the NMA and then to produce a small report describing the analysis.

Results

The LLM had a > 99% success rate of accurately extracting data across 20 runs for each case study and could generate R scripts that could be run end-to-end without human input. It also produced good quality reports describing the disease area, analysis conducted, results obtained and a correct interpretation of the results.

Conclusions

This study provides a promising indication of the feasibility of using current-generation LLMs to automate data extraction, code generation and NMA result interpretation, which could result in significant time savings and reduce human error, provided that routine technical checks are performed, as recommended for human-conducted analyses. Whilst not currently 100% consistent, LLMs are likely to improve with time.

Cite This Article

Reason, Tim, Emma Benbow, Julia Langham, Andy Gimblett, Sven L. Klijn, and Bill Malcolm. 2024. ‘Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models.’ PharmacoEconomics - Open (Switzerland) 8 (2): 205–20. https://doi.org/10.1007/s41669-024-00476-9.

Read Paper
Automating NMA
Proof of Concept Studies

‘Artificial Intelligence to Automate Health Economic Modelling: A Case Study to Evaluate the Potential Application of Large Language Models.’

Reason 2024
PharmacoEconomics - Open
Tim Reason, William Rawlinson, Julia Langham, Andy Gimblett, Bill Malcolm, Sven Klijn
Background

Current generation large language models (LLMs) such as Generative Pre-Trained Transformer 4 (GPT-4) have achieved human-level performance on many tasks including the generation of computer code based on textual input. This study aimed to assess whether GPT-4 could be used to automatically programme two published health economic analyses.

Methods

The two analyses were partitioned survival models evaluating interventions in non-small cell lung cancer (NSCLC) and renal cell carcinoma (RCC). We developed prompts which instructed GPT-4 to programme the NSCLC and RCC models in R, and which provided descriptions of each model’s methods, assumptions and parameter values. The results of the generated scripts were compared to the published values from the original, human-programmed models. The models were replicated 15 times to capture variability in GPT-4’s output.

Results

GPT-4 fully replicated the NSCLC model with high accuracy: 100% (15/15) of the artificial intelligence (AI)-generated NSCLC models were error-free or contained a single minor error, and 93% (14/15) were completely error-free. GPT-4 closely replicated the RCC model, although human intervention was required to simplify an element of the model design (one of the model’s fifteen input calculations) because it used too many sequential steps to be implemented in a single prompt. With this simplification, 87% (13/15) of the AI-generated RCC models were error-free or contained a single minor error, and 60% (9/15) were completely error-free. Error-free model scripts replicated the published incremental cost-effectiveness ratios to within 1%.

Conclusion

This study provides a promising indication that GPT-4 can have practical applications in the automation of health economic model construction. Potential benefits include accelerated model development timelines and reduced costs of development. Further research is necessary to explore the generalisability of LLM-based automation across a larger sample of models.

Cite This Article

Reason, Tim, William Rawlinson, Julia Langham, Andy Gimblett, Bill Malcolm, and Sven Klijn. 2024. ‘Artificial Intelligence to Automate Health Economic Modelling: A Case Study to Evaluate the Potential Application of Large Language Models.’ PharmacoEconomics - Open (Switzerland) 8 (2): 191–203. https://doi.org/10.1007/s41669-024-00477-8.

Read Paper
Automating Economic Models
Proof of Concept Studies

‘Automated Mass Extraction of Over 680,000 PICOs from Clinical Study Abstracts Using Generative AI: A Proof-of-Concept Study’

Reason 2024
Pharmaceutical Medicine
Tim Reason, Julia Langham, Andy Gimblett
Background

Generative artificial intelligence (GenAI) shows promise in automating key tasks involved in conducting systematic literature reviews (SLRs), including screening, bias assessment and data extraction. This potential automation is increasingly relevant as pharmaceutical developers face challenging requirements for timely and precise SLRs using the population, intervention, comparator and outcome (PICO) framework, such as those under the impending European Union (EU) Health Technology Assessment Regulation 2021/2282 (HTAR). This proof-of-concept study aimed to evaluate the feasibility, accuracy and efficiency of using GenAI for mass extraction of PICOs from PubMed abstracts.

Methods

Abstracts were retrieved from PubMed using a search string targeting randomised controlled trials. A PubMed clinical study ‘specific/narrow’ filter was also applied. Retrieved abstracts were processed using the OpenAI Batch application programming interface (API), which allowed parallel processing and interaction with Generative Pre-trained Transformer 4 Omni (GPT-4o) via custom Python scripts. PICO elements were extracted using a zero-shot prompting strategy. Results were stored in CSV files and subsequently imported into a PostgreSQL database.
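A sketch of how one abstract might be turned into a single line of a Batch API input file (JSONL) with a zero-shot PICO prompt is shown below; the prompt wording and the `pmid`-based `custom_id` scheme are assumptions for illustration, not the study's actual scripts.

```python
import json

# Sketch: turning one PubMed abstract into a single JSONL line for a batch
# chat-completions request. Prompt text and custom_id scheme are illustrative.

PROMPT = (
    "Extract the Population, Intervention, Comparator, and Outcomes "
    "(PICO) from the abstract below. Reply as JSON with keys "
    "'population', 'intervention', 'comparator', 'outcomes'.\n\n"
)

def batch_line(pmid, abstract, model="gpt-4o"):
    request = {
        "custom_id": f"pmid-{pmid}",      # maps each result back to its abstract
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": PROMPT + abstract}],
        },
    }
    return json.dumps(request)
```

One such line per abstract, written to a `.jsonl` file and submitted as a batch, is what allows hundreds of thousands of abstracts to be processed in parallel rather than one request at a time.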

Results

The PubMed search returned 682,667 abstracts. PICOs from all abstracts were extracted in < 3 h, with an average processing time of 200 s per 1000 abstracts. A total of 395,992,770 tokens were processed, with an average of 580 tokens per abstract. The total cost was $3390. On the basis of a random sample of 350 abstracts, human verification confirmed that GPT-4o accurately and comprehensively extracted 342 (98%) of all PICOs, with only outcome elements rarely missed.

Conclusions

Using GenAI to extract PICOs from clinical study abstracts could fundamentally transform the way SLRs are conducted. By enabling pharmaceutical developers to anticipate PICO requirements, this approach allows for proactive preparation for the EU HTAR process, or other health technology assessments (HTAs), streamlining efficiency and reducing the burden of meeting these requirements.

Cite This Article

Reason, Tim, Julia Langham, and Andy Gimblett. 2024. ‘Automated Mass Extraction of Over 680,000 PICOs from Clinical Study Abstracts Using Generative AI: A Proof-of-Concept Study’. Pharmaceutical Medicine 38 (5): 365–72. https://doi.org/10.1007/s40290-024-00539-6.

Read Paper
PICO
Proof of Concept Studies

Reports


Generative Artificial Intelligence (AI) in Health Economic Modelling: HTA Innovation Laboratory Report

NICE 2025

Estima Scientific Ltd are excited to share the results from our collaboration with the NICE Health Technology Assessment (HTA) Lab focused on using generative AI (GenAI) to produce de novo cost-effectiveness models. Details of this work can be found in the final report: https://www.nice.org.uk/what-nice-does/our-research-work/hta-lab/hta-lab-projects#generative

Read Paper
Automating Economic Models

Conference Abstracts


HTA156 Assessing the Generalizability of Automating Adaptation of Excel-Based Cost-Effectiveness Models Using Generative AI

ISPOR EU 2024
Value in Health
Rawlinson et al.

Objectives

A previous study (ISPOR 2024, P48) described a method ‘LLMAdapt’ that uses a large language model (LLM) to automatically adjust an Excel-based cost-effectiveness model (CEM) from the setting of one country to another. The authors found a high level of accuracy (97%) for one test case. Assessment of generalizability is an important step for uptake and acceptance of AI-based methods by decision-makers. The objective of this study was to assess the generalizability of LLMAdapt across two distinct disease areas and countries.

Methods

LLMAdapt (powered by Generative Pre-trained Transformer 4 [the gpt-4-1106-preview model]) was used to automatically adjust two HTA-ready Excel CEMs from the setting of one country to the setting of another. To support the adaptations, GPT-4 was provided with tabular data for each of the target countries in a format that mimicked the output of a targeted literature review. Prior to conducting the study, each CEM received minor updates to improve its interpretability, such as clarifying vague descriptive text. The models spanned the following disease areas: muscle-invasive urothelial carcinoma (MIUC) and myelodysplastic syndrome (MDS), and were adapted to the following countries: the Czech Republic and the United States. All automated adaptations were manually checked by a human health economist to assess accuracy.

Results

The adaptations were performed without human intervention in 132 and 207 seconds. LLMAdapt performed 101/102 and 198/199 required updates successfully, resulting in accuracy scores of 99.0% and 99.4%. Two errors were identified, in which required parameter value changes were missed.

Conclusions

We found that the accuracy of LLMAdapt was maintained across two distinct disease areas and countries, demonstrating the generalizability of LLM-based methods to automate the adaptation of Excel-based CEMs. This is an important step towards uptake of these methods.

Cite This Article

HTA156 Assessing the Generalizability of Automating Adaptation of Excel-Based Cost-Effectiveness Models Using Generative AI. Rawlinson, W., et al. Value in Health, Volume 27, Issue 12, S384. https://doi.org/10.1016/j.jval.2024.10.1981.

Read Paper
Automating Economic Models

RWD137 Automated Non-Interventional Research Protocol Generation: A Case Study in Melanoma

ISPOR US 2024
Value in Health
Langham et al.
Objectives

To assess the potential of large language models (LLMs), such as GPT-4, to automate the drafting of Non-Interventional Research (NIR) study protocols and thereby improve research efficiency.

Methods

To automate the development of specific sections of a protocol, a Python script was used to send prompts to GPT-4 via its application programming interface (API) and receive output. Prompts were developed to provide specific inputs for each protocol, such as the population of interest, the aims and objectives, and the data source. For each protocol section, further information about the required structure and content, together with a template or example text for GPT-4 to modify, was also developed and provided. The accuracy and completeness of GPT-4’s outputs were qualitatively assessed against the original human-produced protocol content, focusing on the identification of critical points and noting any omissions or inaccuracies.

Results

Two protocols for retrospective cohort studies with objectives to describe patient characteristics, treatment patterns, and clinical outcomes for melanoma patients were autogenerated. Overall, there was close alignment between the original text and autogenerated text for the Study Design and Study Population sections. GPT-4 gave general aspects of data collection but lacked specifics related to the data sources and their use unless it was specified in the prompt. There was a substantial match in the description of statistical methods, with GPT-4 following the overall guidelines and providing clear methodology for analysis for each objective.

Conclusions

GPT-4 demonstrates potential in automating the drafting of sections of NIR protocols, with a high degree of alignment with the original human-generated content. No inaccurate text was identified. Where details were missing, the GPT-4 text could be enhanced by incorporating more specific details in the prompts, for example, subgroup analyses, how patients are selected from a data source, and the definition of the index date.

Cite This Article

RWD137 Automated Non-Interventional Research Protocol Generation: A Case Study in Melanoma. Langham, J., et al. Value in Health, Volume 27, Issue 6, S384. https://linkinghub.elsevier.com/retrieve/pii/S1098301524019065.

Read Paper
RWE

EE205 Automating Economic Modeling: Potential of Generative AI for Updating Modeling Reports

ISPOR US 2024
Value in Health
Rawlinson et al.
Objectives

Using large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) to edit Microsoft Word files could revolutionize the reporting of health economic models. This study aimed to assess GPT-4’s capabilities in automatically updating a Word technical report for a cost-utility model that was used in health technology assessments (HTAs) for muscle-invasive urothelial carcinoma (MIUC).

Methods

The MIUC model was first manually edited. Then, utilizing GPT-4 and deterministic programming, the Word technical report was automatically updated to reflect the Excel model. GPT-4 updated text in the results and conclusion sections, and all automated edits were captured in tracked changes. Two experienced health economists then blindly assessed the AI-adapted report alongside a manually adapted report that was developed by a third health economist. Accuracy was evaluated based on correct/incorrect changes, correct/incorrect retainment of original text, and instances of missing information.

Results

Both reviewers identified more than 30 instances for scoring. Accuracy was 94.3% for the AI-adapted report and 98.5% for the manually adapted report. The reviewers agreed there were 2 incorrect changes in the AI-adapted report: a rounding error and an incorrect description of a scenario analysis. Qualitatively, the reviewers generally approved of the tone of edits made by GPT-4. However, there were a small number of factually correct edits where the reviewers preferred language chosen by the human health economist.

Conclusions

This study is a promising early indication that LLMs can be leveraged as a part of a reviewer-friendly pipeline for automatically updating model technical reports in Microsoft Word. The accuracy achieved in our study suggests suitability as a first editor prior to human review. Utilizing AI to adapt technical reports for HTA submissions could accelerate dissemination of health technologies around the world.

Cite This Article

EE205 Automating Economic Modeling: Potential of Generative AI for Updating Modeling Reports. Rawlinson, W., et al. Value in Health, Volume 27, Issue 6, S95. https://doi.org/10.1016/j.jval.2024.03.504.

Read Paper
Automating Economic Models

P53 Exploring the Ability of Generative AI to Interpret and Report Comprehensive NMA Results: A Step Towards the Automation of NMA Reports

ISPOR EU 2025
Value in Health
Benbow et al.

A previous study developed a system to automate network meta-analyses (NMAs) using large language models (LLMs). To maximize the benefits of automation, it is essential to automate the production of NMA reports, which include assessment and interpretation of results. This study aims to determine whether LLMs can assess and interpret comprehensive NMA outputs and accurately report these.

Cite This Article

P53 Exploring the Ability of Generative AI to Interpret and Report Comprehensive NMA Results: A Step Towards the Automation of NMA Reports. Emma Benbow et al. Value in Health, Volume 28, Issue 12, Supplement 1, 2025, Pages S37–S38, ISSN 1098-3015. https://doi.org/10.1016/j.jval.2025.09.092.

Read Paper
Automating NMA

RWD66 Utilizing Generative AI to Automate Model Selection and Network Meta-Analyses

ISPOR US 2025
Value in Health
Reason et al.
Objectives

The automation of network meta-analyses (NMAs) through the use of large language models (LLMs) provides a significant opportunity for healthcare technology developers to optimize workflows critical for health technology assessment (HTA). This is particularly pertinent to ensure requirements of the joint clinical assessment (JCA) are fulfilled. The objective was to develop a system leveraging LLMs to automate substantial elements involved with conducting de novo NMAs.

Methods

An automated system, utilising the Claude 3.5 Sonnet (V2) LLM, was designed to process analysis-ready datasets. Based on each dataset, the LLM was prompted to select a suitable statistical model, write code and execute analyses, evaluate outputs, and interpret results. The automated results were validated by: 1) replicating examples from the National Institute for Health and Care Excellence (NICE) Technical Support Document (TSD 2); 2) reproducing results from two non-DSU published NMAs; and 3) generating and assessing comprehensive outputs, including heterogeneity, inconsistency, and convergence.

Results

For all 14 TSD 2 examples (seven models, fixed and random effects), the LLM produced executable code without the need for human correction or intervention. The mean values produced were within 0.02 of the published results, and credible interval limits were within the expected range. For the non-DSU published NMA examples, the automated process replicated consistent results. Each model (DSU and non-DSU) was run five times, and the LLM consistently selected the appropriate statistical model and generated correct results. Finally, the LLM successfully generated and interpreted heterogeneity, inconsistency, and convergence assessments.

Conclusions

Starting with analysis-ready datasets, this study has demonstrated LLMs can execute essential steps required for de novo NMAs, paving the way for an automated NMA system. Automation of NMAs can streamline workflows and significantly reduce the amount of time and resource required. This is especially relevant when considering the extensive analyses to be completed within tight timeframes to comply with JCA requirements.

Cite This Article

RWD66 Utilizing Generative AI to Automate Model Selection and Network Meta-Analyses. Reason, Tim, et al. Value in Health, Volume 28, Issue 6, S375. https://doi.org/10.1016/j.jval.2025.04.1650.

Read Paper
Automating NMA

RWD136 Integration of Survival Analysis Outputs Into Excel-Based Cost-Effectiveness Models Using Generative AI

ISPOR US 2025
Value in Health
Rawlinson et al.
Objectives

Survival analysis is commonly performed outside of Excel-based cost-effectiveness models (CEMs). As such, the output data are manually transferred into the CEMs, an error-prone and time-consuming process. This study’s objective was to assess the performance of a large language model (LLM)-driven pipeline in automatically ingesting survival analysis outputs and integrating the datapoints into Excel-based CEMs.

Methods

The pipeline was used to automatically insert survival analysis outputs from two trials (parameter estimates, Cholesky decompositions, knot positions, and AIC/BIC values) into an HTA-ready Excel CEM. The outputs were provided in two different, non-standardized Excel formats exported from R. The Excel CEM used a layout to facilitate ‘compression’ of the spreadsheet, optimizing for AI integration. An advanced self-consistency process flagged updates with less than 100% ‘AI-confidence’, facilitating subsequent human review. Accuracy was assessed through manual review.
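The self-consistency flagging described above can be sketched as a unanimity check across independent runs: a cell update counts as 100% 'AI-confidence' only when every run proposes the same value, and anything less is flagged for human review. The data layout below (dicts mapping cell addresses to proposed values) is an assumption for illustration, not the paper's implementation.

```python
from collections import Counter

# Sketch of a self-consistency check: each cell update is proposed by
# several independent runs; any cell whose runs disagree is flagged for
# human-in-the-loop review. Run outputs are dicts of cell -> proposed value.

def flag_updates(runs):
    """Return (confident, flagged): unanimous updates vs. cells needing review."""
    confident, flagged = {}, {}
    for cell in runs[0]:
        values = [run[cell] for run in runs]
        top_value, top_count = Counter(values).most_common(1)[0]
        if top_count == len(runs):       # unanimous -> 100% 'AI-confidence'
            confident[cell] = top_value
        else:                            # disagreement -> flag for review
            flagged[cell] = values
    return confident, flagged
```

Requiring unanimity rather than a simple majority errs on the side of surfacing updates to a human reviewer, which matches the study's finding that its single error was caught by exactly this kind of flag.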

Results

The survival analysis outputs were integrated into the CEM without human intervention in 122 and 150 seconds, respectively. Of the required updates, 81/82 (98.8%) and 51/51 (100.0%) were performed successfully. The single error was correctly flagged by the system as uncertain and requiring human review.

Conclusions

We found that an LLM-driven pipeline could accurately transfer survival analysis outputs from non-standardized export formats into an Excel-based CEM, which has implications for the automation and/or quality control of current manual processes. The LLM made accurate insertions over a spreadsheet area including >70,000 cells, demonstrating the power of compression as a route to integration of AI in Excel modelling. The self-consistency approach successfully flagged the single error as an uncertain action, highlighting this method as a useful tool for facilitating human-in-the-loop review of AI-driven tasks in HEOR.

Cite This Article

RWD136 Integration of Survival Analysis Outputs Into Excel-Based Cost-Effectiveness Models Using Generative AI. Rawlinson, W. et al. Value in Health, Volume 28, Issue 6, S375. https://doi.org/10.1016/j.jval.2025.04.1719

Read Paper
Survival Analysis
Automating Economic Models

P24 Innovations in Automated Survival Curve Selection and Reporting of Survival Analyses Through Generative AI

ISPOR EU 2024
Value in Health
Wu et al.
Objectives

Survival analyses are a core part of many HTA submissions where extrapolation of time-to-event clinical endpoints is required. The purpose of this research was to explore automation of survival analysis reporting using Generative Artificial Intelligence (GenAI). Following published best practices for curve selection, GenAI was leveraged to recommend an appropriate extrapolation curve and provide justifications.

Methods

Data were taken from a previously accepted HTA survival analysis report (NICE TA817) for patients treated for resectable urothelial cancer (PD-L1 ≥1%), with a minimum follow-up of 11 months. GPT-4o was provided with survival analysis outputs, including statistical tests, survival probability estimates, and figures, to assess proportional hazards (PH) and goodness-of-fit. Prompted with relevant content, GPT-4o was asked to: 1) assess PH, 2) select suitable extrapolation models (dependent vs. independent), 3) consider external data, and then 4) select an appropriate curve. To validate accuracy, GPT-4o’s results were compared with the results in the original report and the report published by NICE, and assessed against the opinion of three expert health economists.

Results

GPT-4o’s interpretation of log-cumulative hazard plots, Schoenfeld residual plots, and Grambsch-Therneau test results aligned with interpretations made by the three health economic experts, the human-produced report, and the NICE committee. GPT-4o concluded that the PH assumption might be violated, therefore suggesting consideration of both dependent and independent parametric models. Based on a comprehensive analysis of goodness-of-fit, visual fit, and long-term external survival data, GPT-4o recommended the same survival curves as those selected in the original report and by the NICE Committee. Notably, 13/13 statements or decisions made by GPT-4o were consistent with the original report or expert opinion.

Conclusions

The results suggest automation of curve selection and reporting of survival analyses is possible. However, more research is required to determine generalizability with differing levels of data maturity and to test the performance of alternative GenAI models.

Cite This Article

P24 Innovations in Automated Survival Curve Selection and Reporting of Survival Analyses Through Generative AI. Wu, Y. et al. Value in Health, Volume 27, Issue 12, S6. https://doi.org/10.1016/j.jval.2024.10.030

Read Paper
Survival Analysis

CO161 Use of Generative AI for Rapid and Accurate Extraction of PICOs at Scale

ISPOR EU 2024
Value in Health
Reason et al.
Objectives

The impending Joint Clinical Assessment (JCA) regulations will result in the need for precise and comprehensive extraction of Patient, Intervention, Comparison, and Outcome (PICO) elements from a vast number of clinical trials. The recent rise of Generative AI (GenAI) has shown great promise in a wide range of fields, including health economics and outcomes research (HEOR), with GenAI models becoming faster, more flexible, and more accurate. The objective of this study was therefore to evaluate the feasibility, accuracy, and efficiency of using GenAI for mass extraction of PICOs from PubMed abstracts.

Methods

Relevant abstracts were identified using the search string corresponding to the PubMed clinical trials filter. PICOs were extracted from all individual abstracts using GPT-4 Omni (GPT-4o). The extraction was performed using the OpenAI batch API, which allowed parallel processing of 682,667 abstracts. To determine the accuracy of GenAI PICO extraction, a random subsample of 274 clinical trials was selected for human checking.
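The batch submission step can be illustrated by building one line of the JSONL input file consumed by the OpenAI batch API (the request-line format is as documented by OpenAI at the time of writing; the prompt wording and the `build_batch_line` helper are hypothetical):

```python
import json

def build_batch_line(pmid, abstract, model="gpt-4o"):
    """Construct one JSONL request line for the OpenAI Batch API.
    Each line carries a custom_id so results can be matched back to
    the source abstract after parallel processing."""
    return json.dumps({
        "custom_id": f"pmid-{pmid}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system",
                 "content": "Extract the Population, Intervention, Comparator "
                            "and Outcomes (PICO) from this clinical trial abstract."},
                {"role": "user", "content": abstract},
            ],
        },
    })

line = build_batch_line(12345678, "Randomized trial of X vs Y in ...")
```

One such line per abstract, written to a single file and uploaded, is what allows hundreds of thousands of requests to run in parallel rather than sequentially.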

Results

From the subsample of 274 clinical trials, GPT-4o comprehensively and accurately extracted 269 (98%) of all PICOs. In the remaining 5 cases, the model occasionally missed some outcome elements but always extracted the population, intervention, and comparator accurately. The entire process resulted in the extraction of 682,667 PICOs in less than 3 hours.

Conclusions

This study highlights the potential of GenAI, specifically GPT-4o, for large-scale, rapid PICO extraction and to our knowledge represents the first instance of PICO extraction at this speed, scale, and accuracy. While accuracy is high, with 98% of extractions being fully correct, human validation may still be necessary to support adoption in clinical research settings. This promising application of AI will likely be beneficial for meeting JCA requirements and may potentially open up a new paradigm for how systematic literature review (SLR) is conducted.

Cite This Article

CO161 Use of Generative AI for Rapid and Accurate Extraction of PICOs at Scale. Reason, T. et al. Value in Health, Volume 27, Issue 12, S45. https://doi.org/10.1016/j.jval.2024.10.237

Read Paper
PICO
Data Extraction

HTA410 Generative AI: A Novel Approach to Data Extraction for NMAs in EU JCA

ISPOR EU 2024
Value in Health
Wu et al.
Objectives

The EU HTA Regulation’s Joint Clinical Assessments (JCA) are likely to require health technology developers (HTDs) to conduct a large number of comparative clinical analyses within a tight timeframe (≤100 days). Large language models (LLMs) have previously demonstrated proficiency in extracting data from text; therefore, the purpose of this study was to explore whether LLMs could also be leveraged to extract data from tables and figures. Such automation could help HTDs meet JCA requirements.

Methods

Python was utilized to screen publications and identify pages containing relevant tables and figures. A deep learning model was then used to extract tables and figures into separate images, and GPT-4o was employed to identify the type of figure/table and label it accordingly (e.g., patient characteristics, forest plot). Labelled images of 13 figures and 5 tables were submitted to two LLMs (Claude 3 Opus and GPT-4o) for data extraction. The data extracted from each table or figure were assessed for inclusiveness and accuracy.

Results

The deep learning model achieved 100% accuracy when extracting tables and figures from relevant pages in a publication and saving them as images; this was inclusive of situations where tables/figures were split across multiple pages. GPT-4o and Claude achieved 100% accuracy when extracting data from the images of tables/figures, including in cases where figures comprised multiple subfigures. Only in instances where the text size in the image was very small (<6 pt), and not easily readable by a human, were the LLMs unable to extract data, becoming susceptible to hallucinations.

Conclusions

The results show that a combination of Python, deep learning and LLMs can be used to automatically extract images of tables and figures from within publications. Using these images, LLMs have also demonstrated the ability to extract data accurately. Such automation has the potential to significantly reduce burden on HTDs preparing for JCA submissions.

Cite This Article

HTA410 Generative AI: A Novel Approach to Data Extraction for NMAs in EU JCA. Wu, Y. et al. Value in Health, Volume 27, Issue 12, S436. https://doi.org/10.1016/j.jval.2024.10.3247

Read Paper
Data Extraction
JCA

HTA99 Matching Insights From Clinical Experts and Generative AI for JCA PICO Validation

ISPOR EU 2024
Value in Health
Benbow et al.
Objectives

The JCA process uses the Patient, Intervention, Comparator, Outcome (PICO) framework and requests that each EU member state put forward its PICO requirements. This could potentially introduce many PICO sets that need consideration within a JCA submission. Thus, it would be beneficial to have an automated process able to quickly determine which PICO sets align with a registrational trial’s PICO. We have investigated whether large language models (LLMs) can determine the alignment of JCA PICO populations, as predicted by clinical experts, with the population of a target registrational trial, using a case study in patients with relapsed refractory multiple myeloma (RRMM).

Methods

Twenty predicted JCA PICO populations were identified for patients with RRMM and ≥1 prior line of therapy. We used a modal approach and provided prompts and contextual information to two LLMs (Claude 3 Opus, GPT-4), accessing their APIs through Python. Alignment was defined as “Full” (trial population = JCA population), “Partial (subgroup)” (trial population is a subgroup of the JCA population), “Partial (overlap)” (trial population overlaps with the JCA population), or “None” (no overlap). The accuracy of alignment categorization for the populations was determined by comparing the LLM outputs to the alignment categorization made by clinical experts.
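The four alignment categories map naturally onto set relations. A minimal sketch, under the hypothetical simplification that populations can be modelled as sets of qualifying patient profiles (in practice the LLMs judge this from free-text epidemiological definitions):

```python
def classify_alignment(trial, pico):
    """Classify how a trial population aligns with a predicted JCA PICO
    population, both modelled as sets of patient profiles."""
    if trial == pico:
        return "Full"                  # trial population = JCA population
    if trial < pico:
        return "Partial (subgroup)"    # trial is a strict subgroup of the JCA population
    if trial & pico:
        return "Partial (overlap)"     # populations overlap without containment
    return "None"                      # no overlap

# Hypothetical RRMM profiles: (prior lines, lenalidomide-refractory?)
trial = {(2, True), (3, True)}
pico = {(1, True), (2, True), (3, True)}
print(classify_alignment(trial, pico))  # -> Partial (subgroup)
```

The interesting part of the study is precisely that the LLMs had to recover these set relations from ambiguous natural-language definitions, which is where both misclassifications arose.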

Results

Human classification of the alignment of the 20 populations was partial (subgroup) (“PS”) for three and partial (overlap) (“PO”) for 17. Claude was correct for 18/20, with 2 misclassifications (Full instead of PO; PO instead of PS). GPT was also correct for 18/20, with 2 misclassifications (both PO instead of PS). Potential ambiguity in the population definition for the two populations was likely to have caused the mis-categorization.

Conclusions

If appropriate context is provided, LLMs are capable of understanding complex epidemiological concepts and categorizing the alignment of two populations. Thus, LLMs can be used to automate this categorization of PICOs within the JCA process.

Cite This Article

HTA99 Matching Insights From Clinical Experts and Generative AI for JCA PICO Validation. Benbow, E. et al. Value in Health, Volume 27, Issue 12, S372. https://linkinghub.elsevier.com/retrieve/pii/S1098301524047879

Read Paper
JCA
PICO

EPH28 Revolutionizing Systematic Reviews: The Precision of LLMS in Screening Observational Studies

ISPOR EU 2024
Value in Health
Langham et al.
Objectives

We previously reported the accuracy of GPT-4 in screening titles, abstracts, and full publications for a systematic review of randomized controlled trials, showing specificity and sensitivity of 95.9% and 86.7%, respectively. Our objective was to assess GPT-4’s accuracy in selecting studies for systematic literature reviews (SLRs) of real-world evidence (RWE) compared to traditional double screening by human reviewers.

Methods

Two case studies were selected in which two human reviewers had screened titles and abstracts, and full-text publications. The SLRs had different criteria: one studied the epidemiology of solid tumor sites harboring NTRK fusion mutations, and the other compared outcomes from oncology therapies that have both an IV and SC formulation. GPT-4 was used via a Python API to identify titles and abstracts that fulfilled the eligibility criteria. We compared the screening results of GPT-4 and human reviewers to determine agreement and successful identification of publications.
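The agreement statistics reported for this comparison are standard proportions with normal-approximation (Wald) confidence intervals, which can be computed as follows (a sketch; the counts used in the example are assumptions for illustration):

```python
import math

def screening_metrics(tp, fp, fn, tn, z=1.96):
    """Sensitivity and specificity of AI screening against the human
    reference standard, each with a Wald (normal-approximation) 95% CI,
    returned as percentages (point, lower, upper)."""
    def prop_ci(k, n):
        p = k / n
        half = z * math.sqrt(p * (1 - p) / n)
        return 100 * p, 100 * (p - half), 100 * (p + half)
    return {"sensitivity": prop_ci(tp, tp + fn),
            "specificity": prop_ci(tn, tn + fp)}

# Hypothetical confusion counts with 7 true positives and 1 false negative
m = screening_metrics(tp=7, fp=100, fn=1, tn=600)
print(m["sensitivity"])
```

With 7/8 relevant records detected, the sensitivity line evaluates to 87.50 (64.58 to 110.42), consistent with the figures reported for case study 2 below, and illustrates how a Wald interval can extend beyond 100% when counts are small.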

Results

The sensitivity and specificity (with 95% confidence intervals) of GPT-4 compared to humans were 91.07 (83.60 to 98.54) and 74.38 (71.56 to 77.20), respectively, in case study 1 (n=977), and 87.50 (64.58 to 110.42) and 85.86 (83.28 to 88.44), respectively, in case study 2 (n=708). GPT-4 required approximately 1 hour to process 500 titles and abstracts.

Conclusions

Searching and screening for observational studies is more difficult due to a lack of adherence to reporting guidelines, which inflates search output. However, GPT-4 quickly and accurately summarized relevant study characteristics from the title and abstract to determine study design eligibility in two diverse RWE SLRs. Further prompt refinement and fine-tuning of GPT-4 would increase accuracy, particularly for the more complex decisions. Testing on further SLRs and a full publication review will be required to improve prompting and demonstrate generalizability.

Cite This Article

EPH28 Revolutionizing Systematic Reviews: The Precision of LLMS in Screening Observational Studies. Langham, J. et al. Value in Health, Volume 27, Issue 12, S227-S228. https://doi.org/10.1016/j.jval.2024.10.1159

Read Paper
Evidence Generation (SLRs)
RWE

MSR68 Variability and Improvements of Answers Generated with Different Versions of Large Language Models

ISPOR US 2024
Value in Health
Benbow et al.
Objectives

Since OpenAI’s release of the GPT-3.5 large language model (LLM) in November 2022, subsequent updates have introduced new and enhanced models. The impact of response variations among these models on the accuracy of automated network meta-analyses (NMAs) remains uncertain. The objective was to evaluate the variability of, and improvements in, answers generated by different LLMs during data extraction for an NMA of overall survival in non-small cell lung cancer patients.

Methods

Using a range of LLMs via a Python API, we extracted survival data from publications of five studies. We investigated the variability and accuracy of the data extraction by repeatedly extracting the data from the study publications (20 iterations of the Python script per model) and comparing the results with the data extraction conducted (and checked) by systematic literature review and NMA experts.

Results

Each iteration required extraction of 36 data items. For the worst performing model (GPT-3.5 turbo), correct extraction per iteration ranged from 0 to 36, with an overall mean of 57.4%. This significantly improved for GPT-4 Turbo Beta, which correctly extracted between 30 and 36 items per iteration, averaging 98.8%. The best performing model (GPT-4) correctly extracted between 34 and 36 items per iteration, with an overall mean of 99.4%.

Conclusions

GPT models have exhibited notable enhancements in accurately extracting required NMA data. Whilst GPT-4 demonstrated superior performance in this limited test, it was not significantly better than GPT-4 Turbo Beta. The potential release of the production version may further boost GPT-4 Turbo's performance, potentially surpassing that of GPT-4. GPT-4 Turbo also holds promise for more intricate data extraction tasks, given its significantly larger token limit.

Cite This Article

MSR68 Variability and Improvements of Answers Generated with Different Versions of Large Language Models. Benbow, E. et al. Value in Health, Volume 27, Issue 6, S272. https://linkinghub.elsevier.com/retrieve/pii/S1098301524016164

Read Paper
Data Extraction

MSR18 Improving the Performance of Generative AI to Achieve 100% Accuracy in Data Extraction

ISPOR US 2024
Value in Health
Klijn et al.
Objectives

We have previously demonstrated the potential to use large language models (LLMs), such as GPT-4, to automate data extraction for network meta-analysis (NMA). Whilst data extraction accuracy of over 97% was achieved, there is scope to improve the performance and reliability of data extraction to 100% before full implementation in HEOR. The aim of this study was to assess improvements in the accuracy of data extraction from publications reporting overall survival in adult patients with advanced or metastatic non-small cell lung cancer (NSCLC), using a modal approach.

Methods

An a priori defined modal algorithm was postulated, developed, and tested. This used GPT-4, via a Python API, to automatically extract survival data from NSCLC publications multiple times and then calculate the mode of each block of 20 iterations. Results were compared with the data extraction conducted (and checked) by systematic literature review and NMA experts.
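The core of the modal algorithm, taking the mode of each data item across a block of repeat extractions, can be sketched as follows (illustrative only; the item names and values are hypothetical, and the study used blocks of 20 iterations rather than the 5 shown):

```python
from collections import Counter

def modal_extraction(iterations):
    """Take the mode of each data item across a block of repeat
    extraction runs. `iterations` is a list of dicts, one per run,
    mapping item names to extracted values; ties resolve to the
    first-seen value."""
    items = iterations[0].keys()
    return {item: Counter(run[item] for run in iterations).most_common(1)[0][0]
            for item in items}

# One run mis-extracts the hazard ratio (digit transposition), but the
# mode across the block recovers the correct value.
runs = [{"HR": 0.73, "median_OS": 12.6}] * 4 + [{"HR": 0.37, "median_OS": 12.6}]
print(modal_extraction(runs))  # -> {'HR': 0.73, 'median_OS': 12.6}
```

The design intuition is that occasional extraction errors are unlikely to repeat identically across independent runs, so the per-item mode converges on the true value.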

Results

When comparing the results of 400 iterations of the automatic data extraction with the human data extraction, GPT-4 accurately extracted over 99% of the necessary data. However, by implementing the modal algorithm it was possible to achieve a data extraction accuracy of 100% for all 20 blocks of 20 iterations.

Conclusions

Whilst GPT-4 generally extracts the correct data, there are occasions when it fails to extract all required data from a publication. We have demonstrated an approach that improves the extraction rate and, in the case study considered, results in perfect extraction by GPT-4. This represents a useful method to demonstrate the accuracy, repeatability and reliability of data extracted. Work to apply this approach to the other automated stages of network meta-analysis is underway.

Cite This Article

MSR18 Improving the Performance of Generative AI to Achieve 100% Accuracy in Data Extraction. Klijn, S. et al. Value in Health, Volume 27, Issue 6, S262-S263. https://linkinghub.elsevier.com/retrieve/pii/S1098301524015663

Read Paper
Data Extraction

P48 Automating Economic Modelling: Potential of Generative AI for Updating Excel-Based Cost-Effectiveness Models

ISPOR US 2024
Value in Health
Rawlinson et al.
Objectives

Using large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) to edit Microsoft Excel files could revolutionize the way we interact with health economic models. The aim of this study was to assess the accuracy and capability of GPT-4 in automating the adjustment of an HTA-ready Excel cost-effectiveness model (CEM) for muscle-invasive urothelial carcinoma (MIUC) from the setting of one country to another.

Methods

The original adaptation, conducted by humans, had been submitted to HTA authorities globally, who deemed the model appropriate for decision making. For this case study, GPT-4 was used to adapt the MIUC model from a UK base case to a Czech Republic perspective. Prior to conducting the study, the model received minor updates to improve its interpretability, such as clarifying vague descriptive text. GPT-4 was then provided with natural language instructions and tabular data that described the adaptations in a human-oriented manner (without the use of cell references). Based on this, GPT-4 automatically updated input values in the Excel model without human intervention. All edits made by GPT-4 were highlighted, facilitating subsequent review by a health economist. Accuracy was measured by a human checking whether all required adaptations had been performed and whether all updates performed by GPT-4 were correct.

Results

The AI-generated adaptations were performed in 245 seconds. GPT-4 performed 62/64 required updates, and 100% of these updates were performed correctly. This resulted in an overall accuracy score of 97% (adverse event costs, 100% [7/7]; model settings, 100% [2/2]; drug acquisition and administration costs, 82% [9/11]; resource costs, 100% [32/32]; subsequent treatment proportions 100% [12/12]).

Conclusions

This study demonstrates the technical feasibility of using LLMs to automate the editing of Excel-based CEMs. Given that models are set up clearly, this is a promising early indication that highly accurate edits of input values can be achieved.

Cite This Article

P48 Automating Economic Modelling: Potential of Generative AI for Updating Excel-Based Cost-Effectiveness Models. Rawlinson, W. et al. Value in Health, Volume 27, Issue 6, S11. https://linkinghub.elsevier.com/retrieve/pii/S109830152400189X

Read Paper
Automating Economic Models

P46 Can Large Language Models Simulate HTA Committee Discussions? Findings and Challenges from a Case Study in Neoadjuvant Treatment of Resectable Non-Small Cell Lung Cancer

ISPOR US 2024
Value in Health
Reason et al.
Objectives

Health Technology Assessment (HTA) committees play a crucial role in evaluating reimbursement dossiers for the routine use of emerging healthcare technologies and interventions. These committees comprise members with extensive expertise whose knowledge is not readily available to pharmaceutical manufacturers.

Methods

We developed a Large Language Model (LLM)-based simulation in Python using GPT-4 Turbo to replicate an HTA committee discussion, using a real Economic Assessment Group (EAG) report in non-small cell lung cancer (NSCLC) as a reference document. The virtual committee comprised a fixed number of members with varying categorical attributes, including Health Economics and Outcomes Research (HEOR) knowledge, attitudes towards the pharmaceutical industry, occupations, and personal perspectives. These attributes were programmatically modified to generate a range of virtual personalities. The LLM facilitated the committee discussion, with each member contributing and continuing the discussion based on their predefined characteristics. Finally, a chair, simulated deterministically by the LLM, summarised the discussions and formulated a final recommendation on the healthcare intervention under review.

Results

The LLM demonstrated capability in generating realistic and coherent committee discussions. Virtual members maintained distinct and consistent personalities, contributing perspectives aligned with their assigned attributes. However, it was difficult to sustain seeds of disagreement between members, who tended to converge on a consensus towards recommending products. The virtual committee chair effectively summarised discussions and made recommendations that were coherent with the rest of the virtual discussion.

Conclusions

This study highlights the potential and limitations of using LLMs to simulate HTA committee discussions. While LLMs show promise in replicating realistic committee dynamics and maintaining diversity in accordance with distinct member characteristics, further refinement is needed to enhance focus specificity. This approach paves the way for future research in AI applications for training, policy analysis, and exploring decision-making processes requiring committee approval in healthcare settings.

Cite This Article

P46 Can Large Language Models Simulate HTA Committee Discussions? Findings and Challenges from a Case Study in Neoadjuvant Treatment of Resectable Non-Small Cell Lung Cancer. Reason, T. et al. Value in Health, Volume 27, Issue 6, S11. https://linkinghub.elsevier.com/retrieve/pii/S1098301524001876

Read Paper
Proof of Concept Studies

P22 Disrupting Health Economics: Automating Network Meta-Analyses With AI and Large Language Models

ISPOR EU 2023
Value in Health
Reason et al.
Objectives

The advancement of Large Language Models (LLMs), such as GPT-4, provides opportunities for automating data extraction and analysis in systematic reviews and meta-analyses. However, their practical application in Health Economics and Outcomes Research (HEOR) remains unverified. Our study aimed to evaluate GPT-4's accuracy in replicating a Network Meta-Analysis (NMA) result on overall survival of adult patients with advanced non-small cell lung cancer (NSCLC) post platinum-based treatment and pre-immunotherapy.

Methods

Using GPT-4 through a Python API, we extracted survival data from the abstracts of eleven studies, transformed it to the log scale, and generated an R script for NMA using generic code from the R 'multinma' package. GPT-4 updated the code with the new data and produced an executable R script that was run end-to-end in Docker; the Docker output was then parsed to create a mini NMA report. This was compared to the original human-conducted NMA.
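The log-scale transformation of reported survival data is a standard preparatory step for NMA. A minimal sketch, assuming hazard ratios with 95% CIs are among the extracted inputs (the abstract does not specify exactly which statistics were transformed, so the function below is illustrative):

```python
import math

def log_hr_and_se(hr, lower, upper, z=1.96):
    """Convert a reported hazard ratio and its 95% CI to the log scale,
    recovering the standard error from the CI width on the log scale:
    se = (ln(upper) - ln(lower)) / (2 * z). This (log-HR, SE) pair is
    the form typically fed to an NMA package such as R 'multinma'."""
    return math.log(hr), (math.log(upper) - math.log(lower)) / (2 * z)

# Hypothetical extracted values: HR 0.73 (95% CI 0.62 to 0.87)
lhr, se = log_hr_and_se(0.73, 0.62, 0.87)
```

Working on the log scale makes the effect measure approximately normally distributed, which is what the downstream NMA model assumes.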

Results

The LLM-generated model accurately replicated the overall survival outcome using abstracts and generic R code. It successfully extracted and converted survival data, created the NMA R script, ran it using Docker, and accurately reproduced the original NMA results. This was achieved with a single generic Python script, demonstrating GPT-4's capability to perform end-to-end NMA using unstructured abstracts.

Conclusions

This study offers promising evidence for the potential of AI models like GPT-4 in automating data extraction and NMA. Further studies are necessary to confirm these findings in diverse contexts and investigate AI's potential in enhancing systematic reviews and NMA. Further exploration is also required on multimodal versions, and the ability of LLMs to validate the proportional hazards assumption.

Cite This Article

P22 Disrupting Health Economics: Automating Network Meta-Analyses With AI and Large Language Models. Reason, T. et al. Value in Health, Volume 26, Issue 12, S6. https://doi.org/10.1016/j.jval.2023.09.031

Read Paper
Automating NMA

MSR80 AI-Enabled Risk of Bias Assessment of RCTs in Systematic Reviews: A Case Study

ISPOR EU 2023
Value in Health
Langham et al.
Objectives

The task of assessing the risk of bias (RoB) of studies included in a systematic review is time-consuming and requires considerable expertise and judgement. The potential of large language models (LLMs), such as GPT-4, to assist with and automate RoB assessment, particularly in randomised controlled trials (RCTs), where reporting is standardised, remains unclear. This study evaluated the accuracy of GPT-4 in assessing the RoB of RCTs included in a published network meta-analysis (NMA).

Methods

The RoB of each domain (randomization, blinding, missing outcome, outcome measurement, and selective outcome reporting) for ten RCTs was assessed by an experienced systematic reviewer and by GPT-4, using the revised Cochrane risk-of-bias tool for randomized trials (RoB 2). The risk was calculated using Cochrane algorithms based on answers to signalling questions. GPT-4 extracted text from the PDFs and from information downloaded from ClinicalTrials.gov to answer signalling questions via a set of prompts. The risk estimates provided by the human reviewer and GPT-4 were compared.
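As a rough illustration of how domain-level judgements roll up, a worst-domain-wins rule approximates the overall RoB judgement (a deliberate simplification: the full Cochrane RoB 2 algorithm is more nuanced, e.g. several "Some concerns" domains can escalate the overall rating, and domain names below are shortened):

```python
# Ordered from lowest to highest risk of bias
ORDER = ["Low", "Some concerns", "High"]

def overall_rob(domains):
    """Simplified overall RoB judgement: the overall rating is driven
    by the worst domain-level judgement."""
    return max(domains.values(), key=ORDER.index)

judgements = {"randomization": "Some concerns",
              "blinding": "Low",
              "missing outcome": "Low",
              "outcome measurement": "Low",
              "selective reporting": "Low"}
print(overall_rob(judgements))  # -> Some concerns
```

This is why the discrepancies in Domain 1 (randomisation) noted in the results propagate into disagreement on the overall judgement: a single divergent domain rating can change the roll-up.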

Results

The AI-enabled RoB assessment successfully extracted text and answered signalling questions, with very good agreement (70-100%) across questions. There was 80% agreement in the overall risk of bias judgement, 100% agreement in three domains, 90% in Domain 2 (masking) and 70% in Domain 1 (randomisation). Some discrepancy in the assessment of concealment of allocation and randomisation method was reflected in the overall risk assessment by domain.

Conclusions

This case study provides early evidence for the potential of LLMs to extract and summarise the relevant information required to assess RoB and to quickly deliver information in a concise form for assessing study quality. More work is required on how best to interact with a LLM to ensure that all relevant information is extracted and reported to help assist with quality assessment.

Cite This Article

MSR80 AI-Enabled Risk of Bias Assessment of RCTs in Systematic Reviews: A Case Study. Langham, J. et al. Value in Health, Volume 26, Issue 12, S408. https://linkinghub.elsevier.com/retrieve/pii/S1098301523052695

Read Paper
Evidence Generation (SLRs)
Risk of Bias

MSR46 Breaking Through Limitations: Enhanced Systematic Literature Reviews With Large Language Models

ISPOR EU 2023
Value in Health
Reason et al.
Objectives

The potential for utilising AI to improve the efficiency of systematic reviews (SLRs) is increasingly recognised, and the capabilities of Large Language Models (LLMs) like GPT-4 warrant further investigation. Our objective was to assess the accuracy of GPT-4 in selecting eligible randomised controlled trials (RCTs) from titles and abstracts for an SLR and network meta-analysis (NMA) on overall survival of adult patients with advanced non-small cell lung cancer.

Methods

Titles and abstracts of RCTs identified in a systematic literature search of EMBASE, MEDLINE, and CENTRAL were screened by two human reviewers and by GPT-4. GPT-4 was prompted via a series of prompts delivered through a Python API to identify data relevant to the key inclusion and exclusion criteria and to assess eligibility. The results of screening by AI and human reviewers were compared to assess the level of agreement and the successful identification of publications used in the final NMA.

Results

After deduplication, GPT-4 screened 1,994 abstracts, identifying 14.6% as fulfilling the criteria for inclusion (compared with 6% for the human reviewers). Agreement between GPT-4 and the reviewers yielded 80.9% sensitivity, 89.2% specificity, and 88.8% accuracy. Both the reviewers and GPT-4 identified all studies providing data for the NMA. AI screening took 4 hours to process and deliver output.

Conclusions

This study shows the potential of using LLMs to quickly and correctly identify a shortlist of studies from titles and abstracts given specific instructions about type of study design, population, and intervention. This may help minimise the risk of human error and improve the accessibility of results. Further studies are required to test the generalisability of these results and to test how variations in prompts affect sensitivity and specificity.

Cite This Article

MSR46 Breaking Through Limitations: Enhanced Systematic Literature Reviews With Large Language Models. Reason, T. et al. Value in Health, Volume 26, Issue 12, S402. https://linkinghub.elsevier.com/retrieve/pii/S109830152305235X

Read Paper
Evidence Generation (SLRs)

P1 Automating Economic Modelling: A Case Study of AI's Potential With Large Language Models

ISPOR EU 2023
Value in Health
Reason et al.
Background

The potential to integrate large language models (LLMs) such as GPT-4 into script extraction and generation could revolutionise the creation and deployment of economic models. The aim of this study was to assess the validity of a partitioned survival model produced by GPT-4 against a published model, comparing treatments for advanced or metastatic non-small cell lung cancer (NSCLC) in patients who had disease progression after platinum-based treatment and had not received prior immunotherapy (CTLA-4 or PD-[L]1 inhibitors).

Methods

GPT-4 was prompted to replicate an existing 3-state partitioned survival model for NSCLC by modifying a generic shell R script. The model data, assumptions, and choice of parametric models to fit OS and PFS, all extracted from the publication, were provided to GPT-4. The output of the AI-generated model was compared with the original published model to evaluate validity and accuracy.

Results

GPT-4 successfully adapted a shell script, extracting data from pre-generated tables, to produce a 3-state partitioned survival cost-effectiveness model script for NSCLC. The resultant script compiled with no edits by a human modeller. Total costs and total QALYs for the AI-generated model were within 1-10% of the outcomes in the publication, and the observed differences were primarily explained by model inputs, such as data quality and availability (digitised survival curves vs individual patient data, and an unpublished input value), rather than inaccuracies in the AI-generated model.

Conclusions

The final model closely resembled the published version and yielded similar outputs, substantiating the reliability of the AI-generated model. This study offers evidence for the incorporation of LLMs, such as GPT-4, in the automation of economic modelling, specifically in script generation. Future research is necessary to corroborate results and further explore the potential of AI to streamline and enhance the quality of economic modelling.

Cite This Article

P1 Automating Economic Modelling: A Case Study of AI's Potential With Large Language Models. Reason, T. et al. Value in Health, Volume 26, Issue 12, S1. https://doi.org/10.1016/j.jval.2023.09.005

Read Paper
Automating Economic Models
