Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Anne Herrmann-Werner; Teresa Festl-Wietek; Friederike Holderried; Lea Herschbach; Jan Griewatz; Ken Masters; Stephan Zipfel; Moritz Mahling

doi:10.2196/52113

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.

Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.

Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.

Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
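The proportions quoted in the abstract can be rechecked from the reported counts. The following is a minimal sketch, not part of the study: the counts are taken from the abstract, while the names `REPORTED`, `percent`, and `SHARES` are hypothetical and chosen only for this illustration.

```python
# Counts as reported in the abstract: (numerator, denominator).
# The dictionary layout and labels are illustrative assumptions.
REPORTED = {
    "detailed prompt, correct": (284, 307),
    "short prompt, correct": (278, 307),
    "lowest single exam": (15, 19),
    "errors at 'remember' level": (29, 68),
    "errors at 'understand' level": (23, 68),
}


def percent(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)


# Percentage for every reported count, rounded to one decimal place.
SHARES = {label: percent(n, d) for label, (n, d) in REPORTED.items()}

if __name__ == "__main__":
    for label, value in SHARES.items():
        print(f"{label}: {value}%")
```

To one decimal place, the two success rates come out at 92.5% and 90.6%, which round to the 93% and 91% given in the abstract. The two largest error categories account for about 42.6% and 33.8% of the 68 classified errors.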

Original language	English
Article number	e52113
Pages (from-to)	e52113
Journal	Journal of Medical Internet Research
Volume	26
Issue number	1
DOIs	https://doi.org/10.2196/52113
Publication status	Published - Jan 23 2024
Externally published	Yes

Keywords

answer
artificial intelligence
assessment
Bloom’s taxonomy
ChatGPT
classification
error
exam
examination
generative
Generative Pre-trained Transformer 4
GPT-4
language model
learning outcome
LLM
MCQ
medical education
medical exam
multiple-choice question
natural language processing
NLP
psychosomatic
question
response
taxonomy
Humans
Education, Medical
Psychosomatic Medicine
Medicine
Research Design

ASJC Scopus subject areas

Health Informatics

Cite this

@article{9b22bb47096e4817a8a8439fe25eb938,

title = "Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study",

abstract = "Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. 
When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.",

keywords = "answer, artificial intelligence, assessment, Bloom{\textquoteright}s taxonomy, ChatGPT, classification, error, exam, examination, generative, Generative Pre-trained Transformer 4, GPT-4, language model, learning outcome, LLM, MCQ, medical education, medical exam, multiple-choice question, natural language processing, NLP, psychosomatic, question, response, taxonomy, Humans, Education, Medical, Psychosomatic Medicine, Medicine, Research Design",

author = "Anne Herrmann-Werner and Teresa Festl-Wietek and Friederike Holderried and Lea Herschbach and Jan Griewatz and Ken Masters and Stephan Zipfel and Moritz Mahling",

year = "2024",

month = jan,

day = "23",

doi = "10.2196/52113",

language = "English",

volume = "26",

pages = "e52113",

journal = "Journal of Medical Internet Research",

issn = "1439-4456",

publisher = "Journal of Medical Internet Research",

number = "1",

}

TY - JOUR

T1 - Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions

T2 - Mixed-Methods Study

AU - Herrmann-Werner, Anne

AU - Festl-Wietek, Teresa

AU - Holderried, Friederike

AU - Herschbach, Lea

AU - Griewatz, Jan

AU - Masters, Ken

AU - Zipfel, Stephan

AU - Mahling, Moritz

PY - 2024/1/23

Y1 - 2024/1/23

AB - Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P = .002 for the detailed prompt and P < .001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. 
When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

KW - answer

KW - artificial intelligence

KW - assessment

KW - Bloom’s taxonomy

KW - ChatGPT

KW - classification

KW - error

KW - exam

KW - examination

KW - generative

KW - Generative Pre-trained Transformer 4

KW - GPT-4

KW - language model

KW - learning outcome

KW - LLM

KW - MCQ

KW - medical education

KW - medical exam

KW - multiple-choice question

KW - natural language processing

KW - NLP

KW - psychosomatic

KW - question

KW - response

KW - taxonomy

KW - Humans

KW - Education, Medical

KW - Psychosomatic Medicine

KW - Medicine

KW - Research Design

UR - http://www.scopus.com/inward/record.url?scp=85183224105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85183224105&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/10239a51-9c1d-39fc-9751-372375be29d8/

U2 - 10.2196/52113

DO - 10.2196/52113

M3 - Article

C2 - 38261378

AN - SCOPUS:85183224105

SN - 1439-4456

VL - 26

SP - e52113

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

IS - 1

M1 - e52113

ER -
