A validation framework for an online English language Exit Test: A case study using Moodle as an assessment management system

Zakiya Salim Hamed Al-Naddabi

doi:https://doi.org/10.14264/uql.2018.154

Center for Preparatory Studies

Research output: Doctoral Thesis

86 Downloads (Pure)

Abstract

Technology-enhanced language tests are increasingly being hosted on course management systems (CMSs) like Moodle. Despite the increased use of CMS-hosted tests and rising concerns over the reliability and construct validity of computerised tests due to a potential testing mode effect (Chapelle & Douglas, 2006; Fulcher, 2003), validation research on these tests is lacking. This study therefore seeks to fill the gap with empirical validation research, using a case study of administering and validating a CMS-hosted test. The test was a technology-enhanced English Language Proficiency Exit Test hosted on Moodle (hereafter called the Moodle-hosted test) and administered to a group of EFL students (N = 207) at Sultan Qaboos University in Oman. The overall aim of the study was to provide a validity argument about using a Moodle-hosted test for its intended purpose by empirically establishing reliability and construct validity evidence. To achieve this aim, a study framework was applied following the principles of the Assessment Use Argument (AUA) framework of Bachman (2005) and Bachman and Palmer (2010). Applying the framework as a pragmatic tool for validation research led to the structuring of an evidence-based argument about test reliability and construct validity, drawing on multiple sources of evidence (Kane, 1992) collected via a mixed-method design.

The results of Rasch analysis revealed that a quarter of the test items, which were of the gap-filling type requiring typed responses, were overly difficult and had unacceptably high measurement error values. Although the study outcomes demonstrated warrants of statistically acceptable reliability estimates, two threats to reliability and construct validity were identified: construct-irrelevance and construct under-representation. The overly difficult items introduced construct-irrelevant difficulty, as some test takers found the construct difficult and the resulting scores might have been invalidly low. Thirty percent of the test items also had unacceptable fit statistics, suggesting that they did not contribute independently to test reliability and that they assessed student performance inconsistently. Items with unacceptable fit statistics indicated a departure from unidimensionality, as the test might have measured construct-irrelevant sub-dimensions other than the single dimension of language proficiency. Construct under-representation was identified by finding gaps between item difficulty and person ability measures, suggesting that the test did not capture examinees’ ability levels well. As the difficulty of the items did not match the ability levels of the test takers, the test construct might have been under-represented by the set of items, and better-quality items might be needed to address a range of ability levels. Given this evidence that the test had reliability and construct validity issues, the test scores might not be reliable and valid indicators of the target test construct. Further investigation examined a number of factors that could be potential sources of reliability and construct validity issues interfering with test performance results in the Moodle-hosted technology-enhanced testing mode.
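To make the Rasch terminology above concrete, the following is a minimal sketch of the standard dichotomous Rasch model together with the usual infit and outfit mean-square statistics for a single item. It is an illustration only and not the analysis pipeline used in the thesis, which the abstract does not describe; the function names and the sample data are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are on the same logit scale; the probability depends
    only on the gap between person ability and item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 scores for each person on this item (hypothetical data).
    abilities: person ability estimates in logits, same order as responses.
    Values near 1.0 indicate responses consistent with the model.
    """
    squared_residuals = []
    variances = []
    for score, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        variances.append(expected * (1.0 - expected))
        squared_residuals.append((score - expected) ** 2)
    # Outfit: unweighted mean of squared standardised residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


if __name__ == "__main__":
    # An item two logits harder than the person: success is unlikely.
    print(rasch_probability(1.0, 3.0))
    # Hypothetical responses from five test takers to one item.
    print(item_fit([0, 0, 1, 1, 1], [-1.5, -0.5, 0.0, 0.8, 1.6], 0.2))
```

In these terms, an item whose difficulty lies far above every person's ability measure yields mostly low success probabilities and little information about examinees, which is the kind of item-person targeting gap the abstract reports.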

Based on a comparison of test scores with examinees’ post-test questionnaire responses, the study revealed that test performance was significantly affected by the testing mode due to construct-irrelevant technology-related factors. These were strong rebuttals to the reliability and construct validity claims in the validity argument. The study found that several construct-irrelevant technology-related variables significantly affected test performance, including: 1) test takers’ familiarity with and level of experience with technology, familiarity with Moodle tests, and computer literacy; 2) the functionality of headphones during the exam; 3) test takers’ attitudes towards the testing format; 4) the need to type responses for constructed-response test items; and 5) test time sufficiency and the use of a count-down timer. Other construct-irrelevant technology-related issues that did not significantly interfere with test performance were also considered issues of concern: 1) screen layout and scrolling; 2) note-taking and text highlighting features; and 3) eye fatigue. Because negative evidence indicated that the testing mode effect threatened reliability and construct validity and created unfairness or bias issues, it was concluded in the validity argument that decisions based on the Moodle-hosted test scores cannot be justified as reliable or valid. The research questions were answered in the validity argument based on combined evidence from the study outputs, including test and post-test questionnaire responses. A significant finding from this study was therefore that statistical analysis of test responses alone is insufficient for developing computerised tests that are holistically fit for purpose.

This study contributes knowledge to the field as its findings lay out significant implications and recommendations about the testing mode effect. Practitioners and researchers may wish to adopt these implications and recommendations as guidelines for creating, developing, implementing, and researching reliable and valid large-scale high-stakes tests delivered on Moodle, other course management systems, or any other computerised test delivery tools. To ensure policy-makers are informed about whether using test outcomes can be justifiably fair to students, future validation research studies should be conducted so that potential issues with this testing mode can be further identified and addressed.

Original language	English
Qualification	Doctor of Philosophy
Awarding institution	School of Education, The University of Queensland, Australia
Supervisors/Advisors	Hillier, Mathew, Supervisor, External person; Iwashita, Noriko, Supervisor, External person; Campbell, Chris, Supervisor, External person
Thesis sponsors	Ministry of Higher Education, Oman
Award date	20 December 2017
Digital Object Identifiers (DOIs)	https://doi.org/10.14264/uql.2018.154
Publication status	Published - 20 December 2017

Access to Document

https://doi.org/10.14264/uql.2018.154

PhD_final_thesis_Zakiya Al Nadabi, Final published version, 4.06 MB

https://espace.library.uq.edu.au/view/UQ:702967

Other files and links

https://scholar.google.com/scholar?q=intitle:A%20validation%20framework%20for%20an%20online%20English%20language%20Exit%20Test:%20A%20case%20study%20using%20Moodle%20as%20an%20assessment%20management%20system

Cite this

@phdthesis{4f115a38c6434bd1b98ce45e4eae6312,

title = "A validation framework for an online English language Exit Test: A case study using Moodle as an assessment management system",

keywords = "Testing mode effect, Reliability, Construct validity, Construct-irrelevance, Construct under-representation, Course management system, Moodle, Validity framework, Validation",

author = "{Salim Hamed Al-Naddabi}, Zakiya",

year = "2017",

month = dec,

day = "20",

doi = "https://doi.org/10.14264/uql.2018.154",

language = "English",

school = "School of Education, The University of Queensland, Australia",

}

TY - THES

T1 - A validation framework for an online English language Exit Test: A case study using Moodle as an assessment management system

AU - Salim Hamed Al-Naddabi, Zakiya

PY - 2017/12/20

Y1 - 2017/12/20

KW - Testing mode effect

KW - Reliability

KW - Construct validity

KW - Construct-irrelevance

KW - Construct under-representation

KW - Course management system

KW - Moodle

KW - Validity framework

KW - Validation

UR - https://scholar.google.com/scholar?q=intitle:A%20validation%20framework%20for%20an%20online%20English%20language%20Exit%20Test:%20A%20case%20study%20using%20Moodle%20as%20an%20assessment%20management%20system

U2 - https://doi.org/10.14264/uql.2018.154

DO - https://doi.org/10.14264/uql.2018.154

M3 - Doctoral Thesis

ER -
