
Full description

Saved in:
Bibliographic Details
Main Authors: Chenxu Wang, Shuhan Li, Nuoxi Lin, Xinyu Zhang, Ying Han, Xiandi Wang, Di Liu, Xiaomei Tan, Dan Pu, Kang Li, Guangwu Qian, Rong Yin
Format: Article
Online access: https://doaj.org/article/e0cbe54df9bd4c11a4cdbeff2efaf92b
_version_ 1859433675366596608
author Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
author_facet Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
date_str_mv 2025-01-01T00:00:00Z
description Background: With increasing interest in applying large language models (LLMs) in the medical field, their potential use as standardized patients in medical assessment has rarely been evaluated. Specifically, we explored the potential of ChatGPT, a representative LLM, to transform medical education by serving as a cost-effective alternative to standardized patients, specifically for history-taking tasks. Objective: This study aims to explore ChatGPT's viability and performance as a standardized patient, using prompt engineering to refine its accuracy and utility in medical assessments. Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 inquiry-quality groups (good, medium, and poor). Responses were categorized by their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they addressed the inquiries. In the second phase, we evaluated ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, with a comparative analysis of its performance between the original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested on other scripts for another 60 runs, together with an exploration of how the language used affected the chatbot's performance. Results: The feasibility test confirmed ChatGPT's ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy.
Score differences between the poor (mean 74.7, SD 5.44) and medium (mean 82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (mean 85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (α) of .05, while the difference between the medium and good inquiry groups was not statistically significant (P=.16). The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies: scoring accuracy improved 4.926-fold compared with the unrevised prompt, the score difference percentage dropped from 29.83% to 6.06%, and the SD dropped from 0.55 to 0.068. Performance on a separate script was acceptable, with an average score difference percentage of 3.21%. Moreover, performance differences between test groups using various language combinations were insignificant. Conclusions: ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. With proper prompts, ChatGPT's scoring accuracy and response realism improved significantly, approaching feasibility for actual clinical use. The language adopted had no significant influence on the chatbot's output.
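The abstract reports a "score difference percentage" against standard reference scores without publishing the computation. A minimal sketch of how such a metric could be defined is shown below; the function name, the interpretation (mean absolute deviation from the reference, as a percentage of the reference), and all sample numbers are hypothetical illustrations, not the authors' actual code or data.

```python
from statistics import mean


def score_difference_pct(chatbot_scores, reference_scores):
    """Mean absolute difference between chatbot-assigned and reference
    scores, expressed as a percentage of the reference score."""
    diffs = [abs(c - r) / r * 100 for c, r in zip(chatbot_scores, reference_scores)]
    return mean(diffs)


# Hypothetical runs against a fixed reference score of 85:
reference = [85, 85, 85, 85]
original_prompt = [60, 110, 62, 108]  # wide deviations before prompt revision
revised_prompt = [81, 89, 82, 88]     # much closer after prompt revision

before = score_difference_pct(original_prompt, reference)
after = score_difference_pct(revised_prompt, reference)
improvement = before / after  # fold reduction in scoring discrepancy
```

Under this definition, the article's reported drop from 29.83% to 6.06% would correspond to roughly a 4.9-fold reduction in discrepancy, matching the 4.926-fold accuracy improvement it cites.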
doi_str 10.2196/59435
format Article
id oai_oai_doaj.org_article_e0cbe54df9bd4c11a4cdbeff2efaf92b
issn_str_mv 1438-8871
language_str_mv EN
oai_datestamp_str 2025-01-01T16:30:39Z
oai_identifier_str oai:doaj.org/article:e0cbe54df9bd4c11a4cdbeff2efaf92b
publisher_str JMIR Publications
relation_str_mv https://www.jmir.org/2025/1/e59435
https://doaj.org/toc/1438-8871
source_str JOURNAL_A
source_txt Journal of Medical Internet Research, Vol 27, p e59435 (2025)
spellingShingle Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment
Chenxu Wang
Shuhan Li
Nuoxi Lin
Xinyu Zhang
Ying Han
Xiandi Wang
Di Liu
Xiaomei Tan
Dan Pu
Kang Li
Guangwu Qian
Rong Yin
subject_str_mv Computer applications to medicine. Medical informatics
R858-859.7
Public aspects of medicine
RA1-1270
title Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment
type_str article
url https://doaj.org/article/e0cbe54df9bd4c11a4cdbeff2efaf92b