Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Anne Herrmann-Werner; Teresa Festl-Wietek; Friederike Holderried; Lea Herschbach; Jan Griewatz; Ken Masters; Stephan Zipfel; Moritz Mahling

doi:10.2196/52113

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.

Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.

Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.

Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
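The proportions quoted in the Results can be recomputed from the counts given in the abstract. The following is a minimal sketch rather than the authors' analysis code: the dictionary labels and the `percent` helper are illustrative names introduced here, and the denominators (307, 19, 68) are taken exactly as quoted above.

```python
# Counts as quoted in the abstract: (numerator, denominator).
# The labels are illustrative; they do not come from the paper.
reported = {
    "detailed prompt accuracy": (284, 307),
    "short prompt accuracy": (278, 307),
    "lowest single-exam score": (15, 19),
    "errors at 'remember' level": (29, 68),
    "errors at 'understand' level": (23, 68),
}


def percent(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)


# Recompute every share to one decimal place.
shares = {label: percent(n, d) for label, (n, d) in reported.items()}

for label, value in shares.items():
    n, d = reported[label]
    print(f"{label}: {n}/{d} = {value}%")
```

Rounded to whole percentages, the two accuracy figures match the 93% and 91% stated in the abstract, and the lowest exam score matches the stated 78.9%. The abstract gives the Bloom's taxonomy categories only as counts; under the same arithmetic they correspond to roughly 42.6% and 33.8% of the 68 categorized errors.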

Original language	English
Article number	e52113
Pages (from-to)	e52113
Journal	Journal of Medical Internet Research
Volume	26
Issue number	1
DOIs	https://doi.org/10.2196/52113
Publication status	Published - Jan 23 2024
Externally published	Yes

ASJC Scopus subject areas

Health Informatics

Access to Document

10.2196/52113

Cite this

Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. / Herrmann-Werner, Anne; Festl-Wietek, Teresa; Holderried, Friederike et al.
In: Journal of Medical Internet Research, Vol. 26, No. 1, e52113, 23.01.2024, p. e52113.

@article{9b22bb47096e4817a8a8439fe25eb938,

title = "Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study",

abstract = "Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. 
When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.",

keywords = "answer, artificial intelligence, assessment, Bloom{\textquoteright}s taxonomy, ChatGPT, classification, error, exam, examination, generative, Generative Pre-trained Transformer 4, GPT-4, language model, learning outcome, LLM, MCQ, medical education, medical exam, multiple-choice question, natural language processing, NLP, psychosomatic, question, response, taxonomy, Humans, Education, Medical, Psychosomatic Medicine, Medicine, Research Design",

author = "Anne Herrmann-Werner and Teresa Festl-Wietek and Friederike Holderried and Lea Herschbach and Jan Griewatz and Ken Masters and Stephan Zipfel and Moritz Mahling",

year = "2024",

month = jan,

day = "23",

doi = "10.2196/52113",

language = "English",

volume = "26",

pages = "e52113",

journal = "Journal of Medical Internet Research",

issn = "1439-4456",

publisher = "Journal of Medical Internet Research",

number = "1",

}

TY - JOUR

T1 - Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions

T2 - Mixed-Methods Study

AU - Herrmann-Werner, Anne

AU - Festl-Wietek, Teresa

AU - Holderried, Friederike

AU - Herschbach, Lea

AU - Griewatz, Jan

AU - Masters, Ken

AU - Zipfel, Stephan

AU - Mahling, Moritz

PY - 2024/1/23

Y1 - 2024/1/23

AB - Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. 
When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

KW - answer

KW - artificial intelligence

KW - assessment

KW - Bloom’s taxonomy

KW - ChatGPT

KW - classification

KW - error

KW - exam

KW - examination

KW - generative

KW - Generative Pre-trained Transformer 4

KW - GPT-4

KW - language model

KW - learning outcome

KW - LLM

KW - MCQ

KW - medical education

KW - medical exam

KW - multiple-choice question

KW - natural language processing

KW - NLP

KW - psychosomatic

KW - question

KW - response

KW - taxonomy

KW - Humans

KW - Education, Medical

KW - Psychosomatic Medicine

KW - Medicine

KW - Research Design

UR - http://www.scopus.com/inward/record.url?scp=85183224105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85183224105&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/10239a51-9c1d-39fc-9751-372375be29d8/

U2 - 10.2196/52113

DO - 10.2196/52113

M3 - Article

C2 - 38261378

AN - SCOPUS:85183224105

SN - 1439-4456

VL - 26

SP - e52113

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

IS - 1

M1 - e52113

ER -