
Full description

Saved in:
Bibliographic Details
Main Authors: Chenxu Wang, Shuhan Li, Nuoxi Lin, Xinyu Zhang, Ying Han, Xiandi Wang, Di Liu, Xiaomei Tan, Dan Pu, Kang Li, Guangwu Qian, Rong Yin
Format: Article
Online access: https://doaj.org/article/e0cbe54df9bd4c11a4cdbeff2efaf92b
_version_ 1859433675366596608
author Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
author_facet Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
date_str_mv 2025-01-01T00:00:00Z
description Background: With increasing interest in applying large language models (LLMs) in the medical field, their potential use as standardized patients in medical assessment has rarely been evaluated. Specifically, we explored the potential of ChatGPT, a representative LLM, to transform medical education by serving as a cost-effective alternative to standardized patients, specifically for history-taking tasks. Objective: This study aims to explore ChatGPT's viability and performance as a standardized patient, using prompt engineering to refine its accuracy and utility in medical assessments. Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 inquiry-quality groups (good, medium, and poor). Responses were categorized by their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they addressed the inquiries. In the second phase, we evaluated ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, with a comparative analysis of its performance between the original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested on other scripts for another 60 runs, together with an exploration of how the language used affected the chatbot's performance. Results: The feasibility test confirmed ChatGPT's ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy.
Score differences between the poor (mean 74.7, SD 5.44) and medium (mean 82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (mean 85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (α) of .05, while the difference between the medium and good inquiry groups was not statistically significant (P=.16). The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies: scoring accuracy improved 4.926-fold compared with the unrevised prompt, the score difference percentage dropped from 29.83% to 6.06%, and the SD dropped from 0.55 to 0.068. Performance on a separate script was acceptable, with an average score difference percentage of 3.21%. Moreover, performance differences between test groups using various language combinations were insignificant. Conclusions: ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. With proper prompts, ChatGPT's scoring accuracy and response realism improved significantly, approaching feasibility for actual clinical use. The language adopted had no significant influence on the chatbot's output.
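The abstract reports a "score difference percentage" against standard reference scores without publishing the computation. A minimal sketch of how such a metric could be defined is shown below; the function name, the interpretation (mean absolute deviation from the reference, as a percentage of the reference), and all sample numbers are hypothetical illustrations, not the authors' actual code or data.

```python
from statistics import mean


def score_difference_pct(chatbot_scores, reference_scores):
    """Mean absolute difference between chatbot-assigned and reference
    scores, expressed as a percentage of the reference score."""
    diffs = [abs(c - r) / r * 100 for c, r in zip(chatbot_scores, reference_scores)]
    return mean(diffs)


# Hypothetical runs against a fixed reference score of 85:
reference = [85, 85, 85, 85]
original_prompt = [60, 110, 62, 108]  # wide deviations before prompt revision
revised_prompt = [81, 89, 82, 88]     # much closer after prompt revision

before = score_difference_pct(original_prompt, reference)
after = score_difference_pct(revised_prompt, reference)
improvement = before / after  # fold reduction in scoring discrepancy
```

Under this definition, the article's reported drop from 29.83% to 6.06% would correspond to roughly a 4.9-fold reduction in discrepancy, matching the 4.926-fold accuracy improvement it cites.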
doi_str 10.2196/59435
format Article
id oai_oai_doaj.org_article_e0cbe54df9bd4c11a4cdbeff2efaf92b
issn_str_mv 1438-8871
language_str_mv EN
oai_datestamp_str 2025-01-01T16:30:39Z
oai_identifier_str oai:doaj.org/article:e0cbe54df9bd4c11a4cdbeff2efaf92b
publisher_str JMIR Publications
relation_str_mv https://www.jmir.org/2025/1/e59435
https://doaj.org/toc/1438-8871
source_str JOURNAL_A
source_txt Journal of Medical Internet Research, Vol 27, p e59435 (2025)
spellingShingle Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment
Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
subject_str_mv Computer applications to medicine. Medical informatics
R858-859.7
Public aspects of medicine
RA1-1270
title Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment
type_str article
url https://doaj.org/article/e0cbe54df9bd4c11a4cdbeff2efaf92b