ORIGINAL RESEARCH ARTICLE
Year: 2018 | Volume: 31 | Issue: 2 | Page: 65-71
Choosing medical assessments: Does the multiple-choice question make the grade?
Hannah Pham1, Monique Trigg2, Shaopeng Wu1, Alice O'Connell1, Christopher Harry1, John Barnard2, Peter Devitt1
1 Department of Surgery, University of Adelaide, Adelaide, South Australia
2 Excel Psychological and Educational Consultancy, Victoria, Australia
Date of Web Publication: 30-Nov-2018
Adelaide Medical School, University of Adelaide, Adelaide, South Australia 5000
Source of Support: None, Conflict of Interest: None
Background: The multiple-choice question (MCQ) has been shown to measure the same constructs as the short-answer question (SAQ), yet the use of the latter persists. The study aims to evaluate whether assessment using the MCQ alone provides the same outcomes as testing with the SAQ. Methods: A prospective study design was used. A total of 276 medical students participated in a mock examination consisting of forty MCQs paired to forty SAQs, each pair matched in cognitive skill level and content. Each SAQ was marked by three independent markers. The impact of item-writing flaws (IWFs) on examination outcome was also evaluated. Results: The intraclass correlation coefficient (ICC) was 0.75 for the year IV examinations and 0.68 for the year V examinations. MCQs were more prone to IWFs than SAQs, but the effect when present in the latter was greater. Removal of questions containing IWFs from the year V SAQ allowed 39% of students who would otherwise have failed to pass. Discussion: The MCQ can test higher order skills as effectively as the SAQ and can be used as a single format in written assessment provided quality items testing higher order cognitive skills are used. IWFs can have a critical role in determining pass/fail results.
Keywords: Item-writing flaws, inter-rater reliability, medical assessment, multiple-choice question, short-answer question
How to cite this article:
Pham H, Trigg M, Wu S, O'Connell A, Harry C, Barnard J, Devitt P. Choosing medical assessments: Does the multiple-choice question make the grade?. Educ Health 2018;31:65-71
Background
The type of written test instrument best suited to assessing medical student knowledge is highly contested. Commonly used test instruments include the multiple-choice question (MCQ), traditionally thought to be better at testing knowledge recall, and the short-answer question (SAQ), considered better at testing higher order cognitive skills such as the ability to synthesize, analyze, and apply knowledge.
A well-constructed MCQ should include a detailed clinical stem, an unambiguous question, and four or five answer options from which the candidate must select only the most appropriate answer. It is a familiar format to educators and often used for its high reliability and resource effectiveness. However, the MCQ has garnered a largely negative reputation. This is on account of the belief that all a student has to do is correctly recognize the answer from the list of options, rather than spontaneously generate it (a phenomenon termed “cueing”). Furthermore, the MCQ is often perceived to be far-removed from the real-life demands of the practicing clinician and therefore less valid. MCQs are also susceptible to internal sources of error, including item-writing flaws (IWFs) which can impact student performance.
The SAQ may vary from requiring a simple, short response to demanding demonstration of a candidate's clinical reasoning and decision-making skills, often posed within complex case scenarios. Exposing such intermediate steps in thinking may theoretically assist educators in remediation and thus serve an educational purpose. However, inherent problems with the open-ended format (including the SAQ) include resource intensiveness in construction and marking of written responses, subjectivity, and interobserver variability in scoring candidates.
The MCQ is often perceived as inherently inferior to the SAQ. This is driven by the notion that the SAQ measures a greater breadth of cognitive skill, and that spontaneously generating an answer is more cognitively taxing and more closely simulates real-life practice. Yet, a recurring finding in the literature on the MCQ is its equivalence to the open-ended format in measuring student performance. Many observational, retrospective studies have shown that open-ended formats, such as the SAQ, short-essay question, and modified essay question, do not confer any significant advantage over the MCQ. It has repeatedly been shown that for the majority of students, performance operates independently of testing format. However, some studies have shown that at extremes of the performance spectrum, the testing format had a strong influence on the course grade.
Faculties have nonetheless persisted with utilizing a variety of tools for written assessment, likely to balance what are regarded as the merits and flaws of each tool. This may reduce any potentially detrimental effect of using a single format that is poorly constructed. One correlational analytical study made the argument for using a variety of tools to improve both validity and reliability, thus enabling examiners to “draw fair conclusions about a student's ability.”
MCQs have also been shown to frequently contain IWFs and are often written at lower cognitive skill levels, which can have a critical impact on student performance. This may explain why faculties have persisted with the open-ended format, despite the drawbacks of poor reliability and resource intensiveness. Perhaps the reluctance to move away from traditional thinking about the MCQ reflects the difficulty of item-writing itself.
There is a paucity of prospective studies within this area in the current literature where carefully matched MCQ and SAQ items are tested on a single cohort. This study aims to evaluate whether the MCQ alone provides the same outcomes as the SAQ when testing higher order cognitive skills. This would give confidence to educators in the design phase of constructing MCQ-only written assessments, particularly as the drawbacks of the open-ended format continue to strain faculties.
Methods
Ethical approval for conducting the project was granted by the University of Adelaide's Human Research Ethics Committee.
Year IV and year V students in the Bachelor of Medicine, Bachelor of Surgery (MBBS) course at the University of Adelaide were invited to participate in the study.
A total of forty MCQs paired to forty SAQs were constructed and subjected to a peer-review process. Paired items were matched according to content area and cognitive skill level, but question stems were not identical. Cognitive skill level was rated by three independent reviewers against the modified Bloom's taxonomy [Figure 1], with consensus agreement. The cognitive level of paired items was either equivalent (e.g., Level I paired to Level I) or no more than one level apart (e.g., Level II paired to Level III). The content of questions targeted the MBBS curriculum and was pitched at end-of-year summative level. Level II and III questions were considered “higher order.”
Each MCQ consisted of five answer options in a single-best-answer format and was worth one mark. The MCQs were collated to form an “MCQ examination” worth a total of forty possible marks. Each SAQ enabled an open-ended response and was worth one, two, or three marks. The SAQs were similarly collated into a “SAQ examination” worth a total of 91 possible marks for the year IV examination and 90 possible marks for the year V examination. The majority of questions were context rich and embedded within a clinical case scenario.
The study was in the form of a mock examination. The examinations were administered to students via a computer in one of four test sessions scheduled across 2 days in September 2015. Students nominated their preferred session on registration. In each session, students were allocated into two groups by surname and advised to complete either the MCQ or SAQ examination first. The order was reversed for the remaining candidates. This was done to reduce potential bias associated with the test order, e.g., prompting and fatigue.
Students who sat the examination on-site were subject to standard examination conditions. Supervision for students based off-site (in one of the multiple rural placement sites) could not be provided. These students were therefore permitted to undertake the examination in their own time.
The MCQ examination was marked via a computer using a simple algorithm that compared student responses with the correct responses. The SAQ examination was assessed by three independent markers. A total of six markers were recruited to mark the SAQ examinations. Markers 1–3 were assigned to assess the year IV SAQ examination responses. Markers 4–6 were assigned to assess the year V SAQ examination responses.
Markers were instructed to batch mark each of the SAQ questions to maximize scoring consistency. A detailed key/model answer guide was provided to all markers, who were instructed not to deviate from the given answers.
Analysis of item-writing flaws
IWFs were identified on completion of the examination. [Figure 2] illustrates some examples derived from a taxonomy of item-writing guidelines.
SAQ marks (for 1, 2, and 3 mark questions) were converted to a total possible mark out of 1.0, with equal weighting given to all three markers. Intraclass correlation coefficients (ICCs) were calculated to test inter-rater reliability of markers 1–3 and markers 4–6 for the SAQ, as well as inter-rater reliability of total SAQ and MCQ scores.
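To illustrate the inter-rater reliability analysis, the following is a minimal sketch of a one-way random-effects ICC computation in plain Python. The marking data are invented, and the study's exact ICC model (e.g., one-way vs. two-way effects) is not stated here, so this should be read as an indicative calculation only, not the study's actual procedure.

```python
# Minimal one-way random-effects ICC(1,1) sketch.
# ratings[i][j] = normalized score given to student i by marker j.
# The data below are hypothetical, not the study's actual marks.

def icc_oneway(ratings):
    n = len(ratings)       # number of subjects (students)
    k = len(ratings[0])    # number of raters (markers)
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    # Between-subject mean square
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-subject mean square
    msw = sum((x - row_means[i]) ** 2
              for i, r in enumerate(ratings) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

ratings = [
    [0.5, 0.5, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.5, 1.0],
    [0.0, 0.0, 0.5],
]
print(round(icc_oneway(ratings), 2))
```

An ICC near 1.0 would indicate the three markers rank and score students almost identically; values in the 0.6–0.8 range, as reported for the MCQ/SAQ totals here, indicate strong but imperfect agreement.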
Final examination scores were calculated and compared using logistic generalized estimating equation, linear mixed effects, and linear regression models. Simple kappa coefficients were calculated for student performance outcomes (with “pass” defined as a score ≥50% and “fail” <50%).
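The kappa calculation on pass/fail outcomes can be sketched with a simple unweighted Cohen's kappa. The labels below are hypothetical and serve only to illustrate how agreement beyond chance is measured.

```python
# Unweighted Cohen's kappa for pass/fail agreement between two
# examination formats. Labels are hypothetical, not the study's data.

def cohens_kappa(a, b):
    n = len(a)
    categories = set(a) | set(b)
    # Observed proportion of agreement
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal proportion of each category
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

mcq = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
saq = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(mcq, saq), 2))
```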
Student performance before and after removal of IWFs from the examinations was compared using McNemar's test, in particular to determine whether pass/fail margins shifted. The impact of IWFs on high performers (defined as the top 10%) was also analyzed.
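McNemar's test on paired pass/fail outcomes reduces to the two discordant counts. A minimal sketch follows; the counts are invented for illustration and are not the study's figures.

```python
import math

# McNemar's test on paired pass/fail outcomes before and after removing
# flawed items. Only the discordant pairs enter the statistic:
#   b = passed the nonstandard exam but failed the standard exam
#   c = failed the nonstandard exam but passed the standard exam
# The counts below are invented for illustration.

def mcnemar(b, c):
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = mcnemar(b=2, c=15)
print(round(chi2, 2), round(p, 4))
```

A small p-value here means removing flawed items shifted pass/fail status in one direction far more often than the other, which is the pattern reported for the year V SAQ below.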
The effects of examination order (MCQ first or SAQ first) and demographic group (female/male and on-site/off-site) on student performance were analyzed.
Results
Of 199 year IV and 175 year V students, 136 year IV (72 females and 64 males; 15 off-site and 121 on-site) and 140 year V (85 females and 55 males; 35 off-site and 95 on-site) students enrolled and successfully completed all the parts of the examination.
Overall student performance
[Table 1] summarizes the overall student performance.
Year IV and V students who passed the SAQ did not necessarily pass the MCQ, and vice versa. Where outcomes diverged, the pattern was typically a pass on one format and a fail on the other, rather than a fail on both.
Performance of the top 10%
In the year IV examinations, all high performers passed both examinations (MCQ: SAQ pass ratio of 13/13). Of the lower performers (excluding the top 10%), only 34.4% passed both examinations (MCQ: SAQ pass ratio of 22/64). In the year V examinations, all high performers passed both examinations (MCQ: SAQ pass ratio of 14/14) compared to 52.8% of lower performers passing both examinations (MCQ: SAQ pass ratio of 57/108).
Correlation of multiple-choice question and short-answer question performance
For the year IV and V examinations, there was a statistically significant association between total MCQ percentage and total SAQ percentage (P < 0.0001) using ICC. The ICC for year IV total MCQ and SAQ scores was 0.75 (95% confidence interval [CI]: 0.65–0.82). The ICC for year V total MCQ and SAQ scores was 0.68 (95% CI: 0.55–0.77).
Correlation of multiple-choice question and short-answer question performance by cognitive skill level tested
Each individual question was categorized using modified Bloom's taxonomy, summarized in [Table 2].
Table 2: Number of questions per cognitive skill level (defined by modified Bloom's taxonomy) across all examinations
For the year IV and V SAQ examinations, there was no difference between student performances on different cognitive skill level questions using ordinal generalized estimating equation models (global P = 0.3015 and P = 0.7096, respectively). However, students performed better on Level III compared to Level I and II questions in the year V MCQ examination (global P < 0.0001, odds ratio: 0.78, 95% CI: 0.71–0.87). No statistically significant relationship was seen for the year IV MCQ examination (global P = 0.1098; Chi-square P = 0.0925).
There was a statistically significant association between higher order questions in the year IV MCQ and higher order questions in the year IV SAQ, with an ICC of 0.77 (95% CI: 0.68–0.84). There was no association between lower order questions, with an ICC of 0.13 (95% CI: −0.22–0.38).
There was a statistically significant association between both higher and lower order questions in the year V MCQ and SAQ, with ICCs of 0.62 (95% CI: 0.47–0.73) and 0.32 (95% CI: 0.05–0.51), respectively.
Order of examination and impact on student performance
There was a statistically significant difference in total SAQ score between order groups (global P = 0.0335). Students who sat the year IV MCQ first had a mean total SAQ score of 1.59 marks higher than students who sat the year IV SAQ first (estimate = 1.59, 95% CI: 0.12–3.07). There was no statistically significant association between orders of tests in the year V examinations.
Inter-rater reliability of the short-answer question
Markers 1–3 (year IV) demonstrated a very strong agreement as measured by ICC. All but one question in the year IV SAQ had an ICC >0.87. Markers 4–6 (year V) also demonstrated a strong agreement. All but one question in the year V SAQ had an ICC >0.80.
Item-writing flaws and effect on student performance
There were 11 flawed MCQs and 7 flawed SAQs in the year IV examination, compared to 9 flawed MCQs and 6 flawed SAQs in the year V examinations.
Examinations which included questions containing IWFs were defined as “nonstandard.” Examinations which excluded questions containing IWFs were defined as “standard.” Student pass rates were compared across the nonstandard and standard examinations (pass defined as over the 50% cutoff mark).
There was no statistically significant difference in the mean scores on the year IV MCQ nonstandard and standard examinations (global P = 0.4062). However, there was a statistically significant difference on the year IV SAQ nonstandard and standard examinations (global P = 0.0293), although it did not benefit students who would otherwise have failed.
There was no statistically significant difference in the mean scores on the year V MCQ nonstandard and standard examinations (global P = 0.7237). However, there was a statistically significant difference on the year V SAQ nonstandard and standard examinations (global P < 0.0001), with a mean difference of 4.9% (estimate = −4.9, 95% CI: −6.3 to −3.5). McNemar's test rejected the null hypothesis of no change in pass/fail status (P = 0.0008). Thus, students who would otherwise have failed benefited.
[Table 3] and [Table 4] show the number of students who passed and failed when IWFs were removed from all the examinations.
Table 4: Performance on the year V standard and nonstandard examinations
For the year V SAQ examination, of 69 students who failed the nonstandard examination, 27 passed when IWFs were removed. This represents 39% of those who failed and who could have passed if IWFs had been removed from the year V SAQ. For the year V MCQ examination, 18 students failed the nonstandard examination compared to 25 students on the standard examination. This represents an additional 5% of all students who failed when IWFs were removed.
Location and effect on student performance
There was no statistically significant difference in performance across location of all examinations.
Discussion
This study has shown that a well-constructed MCQ can match the SAQ in its ability to test and measure students' use of their higher order cognitive skills. The association between total SAQ and MCQ percentage (suggested by a strong ICC of 0.75 for year IV and 0.68 for year V) indicates that the two examinations measure similar, though not identical, constructs, with no advantage of one over the other. The results suggest that the difference in response format was irrelevant to student performance.
There are obvious drawbacks to an open-ended examination in terms of lower reliability (testing fewer items in the same time frame) and consumption of resources in construction, delivery, and marking. The findings of this study reiterate existing findings in the literature, in particular those of Schuwirth and van der Vleuten, who demonstrated that how a question is asked is more important than how the question is answered. That is, the stimulus format is far more important in eliciting cognitive processes in the candidate than the response format, and so the MCQ may sufficiently measure what the SAQ can, provided that context-rich stems are used. The present study used a controlled prospective study design with questions carefully matched in content area and cognitive skill level. Careful MCQ construction with attention to cognitive skill level and content can result in a test that is equivalent to the SAQ in its ability to measure higher order cognitive skills, with the inherent benefits of reliability and ease of construction.
While the questions in this study were deliberately written to concentrate on the assessment of higher order skills, it should not be assumed that either MCQs or SAQs automatically do this. An analysis of final-year examination papers from an Australian university showed that 51%–54% of their open-ended questions did no more than test knowledge recall. The use of a test blueprint and peer-review process likely assisted in achieving this focus. Students performed better in the MCQ paper, with higher pass rates compared to the SAQ. One explanation for this difference is the requirement of the SAQ to provide justification for each answer to obtain additional marks. Students who were unable to demonstrate the intermediate steps in thinking and justification for their answers missed out on those marks. The implication of generally lower scores on the SAQ would need to be considered both when determining pass marks and when utilizing results to guide remediation. If a single-format approach is used, the appropriateness and degree of “scaling” should be carefully considered.
Students performed better on Level III MCQ questions compared to Level I and II. One explanation is that students were able to refine their “guesses” on Level III MCQs because these questions typically included context-rich stems that were necessary to produce a clinical scenario that could test higher order cognitive skills. Bloom's taxonomy assumes a defined hierarchy of knowledge (remember, understand, and apply), but this may not always hold true when considering Level III questions (i.e., applying knowledge). A student more apt at demonstrating their reasoning may infer the correct response on Level III questions. The same student may have poor knowledge recall, thus earning zero marks on a Level I question. Furthermore, guessing on (five-option) MCQs results in a 20% likelihood of a correct answer, compared to an SAQ, which may award marks for demonstrating reasoning even if the final answer is incorrect.
The ICC for the year IV SAQ compared to MCQ total percentage in items testing higher order cognitive skills was 0.77, in contrast to 0.13 for items testing lower order cognitive skills; similarly, for year V, 0.62 compared with 0.32. This suggests that the MCQ measures the same constructs as the SAQ more effectively at higher order skill levels than at lower order skill levels. Faculties considering a single-format (i.e., MCQ only) approach to medical examinations should bear this in mind when constructing items. It has been shown that the correction of IWFs improves the discriminatory ability of the question. In fact, questions written at lower cognitive skill levels are more likely to contain IWFs. One study showed that high-achieving students were more likely than borderline students to be negatively impacted by flawed items. The latter benefitted from the poorly written test items, possibly due to “test wiseness,” which enables examinees to identify the correct answer without any content-specific knowledge.
This study has shown that at one extreme of performance (the top 10%), the testing format did not influence overall performance. For the remaining students, however, format may have impacted overall performance, depending on the scale used to determine the cutoff mark, which for this study was set at 50%.
There was a small difference in student performance (1.59 marks higher on the SAQ) when the MCQ was delivered first, which highlights the effect of examination order. However, this should not pose a problem in practice, as examination content would not be matched as it was in this study. It is interesting to note the comparable performance of students in on-site and off-site settings. Educators may therefore consider supervision less critical in a formative test setting when the aim is to provide students with examination preparation.
The overall incidence of IWFs in the MCQ and SAQ used in the current study was much lower than reported in previous studies. Our analysis has shown that 39% of students who failed the year V SAQ could have passed if IWFs had been removed, in comparison to another study which reported that 10%–15% could have passed. This suggests that the presence of IWFs made the SAQ examination more difficult, causing students to fail. This is a critical finding when considering sources of error in examinations. By contrast, 5% of all students who sat the year V MCQ would have failed had IWFs been removed, suggesting that IWFs made that examination overall “easier” for these students and enabled them to pass.
The SAQ examination was less prone to IWFs than the MCQ, yet when present their effect was greatly magnified. One possible explanation for the lower incidence of IWFs in SAQs may be the role of student responses in bringing IWFs to the attention of item-writers. This problem may be circumvented by encouraging faculties to invest in item-writing training and flaw-analysis workshops, which have been shown to improve the quality of MCQs and reduce the frequency of IWFs.
Markers are an additional source of error, as they introduce internal adjustments which create inter-rater variability. The high ICCs, which suggest low inter-rater variability among markers nested within each year level for the SAQ, are a reassuring finding. In one examination, significant differences in 50% of questions were reported due to inter-marker variation that influenced median scores of the examination as a whole. Markers who are experts in their field may introduce subjectivity by deviating from the proposed model/key answer. The high ICC in this study may be due to the use of model/key answers with clear marking guides detailing both appropriate and inappropriate student answers.
Conclusion
In summary, this study has shown that the MCQ can effectively measure the same constructs as the SAQ when testing higher order cognitive skills. The present study has done so using a prospective design. The MCQ may be used as a single format in written assessment provided high-quality questions are used, with careful consideration of generally higher MCQ scores and thus appropriateness of benchmarking. Once again, student performance has been shown to be independent of the testing format.
The purported advantages of the SAQ, including transparency in reasoning, do not form an essential component of a summative medical examination. This is provided items testing higher order cognitive skills are used. If the aim of the examination is to provide certification, the MCQ should be used as it confers inherent advantages including high reliability. If the aim is diagnostic or educational, then material should be provided for students to complete in their own time.
Those constructing assessments should be made aware of the potentially significant effect of IWFs. Although SAQs contained fewer IWFs, their effect on student performance was disproportionate when compared to the MCQ. Faculties should be encouraged to invest in training in constructing good MCQs to offset or even substitute the resources consumed by persisting with the SAQ.
We wish to acknowledge Ms. Suzanne Edwards BN, GDipMa, GDipMStat Statistician, Data, Design and Statistics Service, Adelaide Health Technology Assessment, School of Public Health, University of Adelaide, for performing the statistical analysis.
We also wish to acknowledge Dr. David Darwent MD PhD, Royal Adelaide Hospital, for critically reviewing this manuscript.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Schuwirth LW, van der Vleuten CP, Donkers HH. A closer look at cueing effects in multiple-choice questions. Med Educ 1996;30:44-9.
Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 2008;42:198-206.
Palmer EJ, Duggan P, Devitt PG, Russell R. The modified essay question: Its exit from the exit examination? Med Teach 2010;32:e300-7.
Falk B, Ancess J, Darling-Hammond L. Authentic Assessment in Action: Studies of Schools and Students at Work. United States of America: Teachers College Press; 1995.
Hift RJ. Should essays and other “open-ended”-type questions retain a place in written summative assessment in clinical medicine? BMC Med Educ 2014;14:249.
Walke YS, Kamat AS, Bounsule SA. A retrospective comparative study of multiple choice questions versus short answer questions as assessment tool in evaluating the performance of students in medical pharmacology. Int J Basic Clin Pharmacol 2014;3:1020-3.
Mujeeb AM, Pardeshi ML, Ghongane BB. Comparative assessment of multiple choice questions versus short essay questions in pharmacology examinations. Indian J Med Sci 2010;64:118-24.
Pepple DJ, Young LE, Carroll RG. A comparison of student performance in multiple-choice and long essay questions in the MBBS stage I physiology examination at the university of the West Indies (Mona campus). Adv Physiol Educ 2010;34:86-9.
Mahmood H. Correlation of MCQ and SEQ scores in written undergraduate ophthalmology assessment. J Coll Physicians Surg Pak 2015;25:185-8.
Baig M, Ali SK, Ali S, Huda N. Evaluation of multiple choice and short essay question items in basic medical sciences. Pak J Med Sci 2014;30:3-6.
Bloom BS. Taxonomy of Educational Objectives. Handbook I: Cognitive Domain. New York: David McKay; 1956.
Haladyna TM, Downing SM. A taxonomy of multiple-choice item-writing rules. Appl Meas Educ 1989;2:37-50.
Schuwirth LW, van der Vleuten CP. Different written assessment methods: What can be said about their strengths and weaknesses? Med Educ 2004;38:974-9.
Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: Modified essay or multiple choice questions? Research paper. BMC Med Educ 2007;7:49.
Ali SH, Ruit KG. The impact of item flaws, testing at low cognitive level, and low distractor functioning on multiple-choice question quality. Perspect Med Educ 2015;4:244-51.
Rush BR, Rankin DC, White BJ. The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Med Educ 2016;16:250.
Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Educ Today 2006;26:662-71.
Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract 2005;10:133-43.
Abdulghani HM, Ahmad F, Irshad M, Khalil MS, Al-Shaikh GK, Syed S, et al. Faculty development programs improve the quality of multiple choice questions items' writing. Sci Rep 2015;5:9556.