ORIGINAL RESEARCH ARTICLE
Year: 2015, Volume: 28, Issue: 1, Pages: 29-34

Comparison of resident performance in interpreting mammography results using a probabilistic or a natural frequency presentation: A multi-institutional randomized experimental study

Pavlos Msaouel^{1}, Theocharis Kappos^{2}, Athanasios Tasoulis^{3}, Alexandros P Apostolopoulos^{4}, Ioannis Lekkas^{5}, Elli-Sophia Tripodaki^{6}, Nikolaos C Keramaris^{1}

^{1} Greek Junior Doctors and Health Scientists Society, Athens, Greece
^{2} 2nd Department of Surgery, "Metaxa" Cancer Memorial Hospital, Pireaus, Greece
^{3} Department of Clinical Therapeutics, "Alexandra" Hospital, Voula, Greece
^{4} Department of Orthopaedics, "Asklipiio" Hospital, Voula, Greece
^{5} Otorhinolaryngology Clinic, "251 G.N.A." Hospital, Athens, Greece
^{6} 1st Department of Internal Medicine, "Evangelismos" Hospital, Athens, Greece

Background: Residents are being increasingly challenged on how best to integrate diagnostic information in making decisions about patient care. The aim of this study is to assess the ability of residents to accurately integrate statistical data from a screening mammography test in order to estimate breast cancer probability and to investigate whether a simple alteration of the representation mode of probabilities into natural frequencies facilitates these computations.

Methods: A multi-institutional randomized controlled study of residents was performed in eight major hospitals in the city of Athens. Residents were asked to estimate the positive predictive value of the screening mammography test given its sensitivity and 1-specificity as well as the prevalence of breast cancer in the relevant population. One version of the scenario was presented in the single-event probability format that is commonly used in the medical literature, while the other used the natural frequency representation. The two questionnaire versions were randomly assigned to the participants.
Results: Out of 200 residents, 153 completed and returned the questionnaire (response rate 76.5%). Although more than one-third of the residents reported excellent or close to excellent familiarity with sensitivity and positive predictive value, the majority of responses (79.1%) were incorrect. However, a significantly higher proportion of residents in the natural frequency group (n = 88) selected the correct response compared with residents (n = 65) in the single-event probability group (28.4% vs 10.8%; 95% confidence interval of the difference between the two proportions = 5.6-29.7%; P < 0.01).

Discussion: Residents more often correctly understand test performance accuracy when test characteristics are presented to them as natural frequencies than when they are presented, as is more common, as single-event probabilities. Educators and journal editors should be aware of this facilitative effect.
Background

Biostatistical information can be represented either as single-event probabilities (e.g., if a woman has breast cancer, then the probability that she will get a positive mammogram is 80%) or as natural frequencies (e.g., 8 out of 10 women with breast cancer will get a positive mammogram). The accuracy of diagnostic tests is commonly reported in terms of sensitivity, specificity and positive predictive value (PPV) using the conditional probability format, which does not specify the reference class used for each calculation: the reference class used to determine a test's sensitivity is people with illness, whereas the reference class used for specificity is people without illness. This switch of reference class makes conditional and single-event probability representations of data confusing, as they require physicians to perform a larger number of demanding arithmetic operations (multiplication, addition or division), particularly in clinical scenarios requiring simple Bayesian calculations. [1],[2],[3],[4] Thus, such information may be systematically miscalculated by physicians and subsequently miscommunicated to patients, resulting in erroneous choices of medical therapy, inappropriate utilization of medical resources and disproportionate health care expenditures. [5],[6]

The two versions of the diagnostic scenario presented in [Table 1] provide data on the prevalence of a disease (breast cancer) in a population (women aged >40 years). This background information, called the base rate, must be taken into account when determining the PPV of a diagnostic test. However, clinicians often neglect the base rate, a common cognitive bias that leads to miscalculations when making inferences about probabilities in Bayesian reasoning problems such as the mammography scenario described in [Table 1]. Bayes' mathematical formula is essential for such computations when data are presented as probabilities.
However, this formula can be difficult to learn and memorize. On the other hand, conversion of probabilities to the natural frequency format can make Bayesian reasoning easier. Although both formats are mathematically equivalent, the natural frequency presentation is computationally and conceptually simpler, as it automatically incorporates base rate information and utilizes an information format that may be more aligned with what human minds evolved to understand and process. [3],[4],[7],[8],[9],[10],[11],[12]

{Table 1}

A number of reports have indicated that medical trainees at different levels may have distinct learning needs and preferences. [13],[14],[15],[16],[17],[18] It is therefore important to determine the effectiveness of specific learning techniques during particular stages of medical training. The present study focused on residents and investigated potential differences in their statistical reasoning performance when data from a mammography screening problem are presented either as probabilities or as natural frequencies [Table 1]. The mammography screening problem tests an elementary but important Bayesian calculation that lies at the heart of many everyday clinical interactions. [3],[4],[19],[20]

Methods

Participants and data collection

[Figure 1] shows a summary of the study profile. The study complied with the Standards of the Declaration of Helsinki as well as Greek requirements for survey studies. The study was conducted from January to March 2010 and the participants were chosen from eight major Greek hospitals in the city of Athens [Table 2] that train residents who had graduated from all seven Greek medical schools. The Greek system does not allow official residency training in private hospitals and therefore all institutions surveyed in this study were public. Every resident was given a number according to official enrolment lists that had been kindly provided by each hospital.
Out of 1272 eligible residents in all hospitals, 200 (25 from each hospital) were randomly selected for the sample by a computerized method, a number that represents approximately 2% of the total number of residents training in Greece. [21] The probability version of the mammography problem was included in questionnaires seen by half of the sample, while the other half received the natural frequencies version. The two versions were randomly assigned to participants by blindly placing each questionnaire into A4 envelopes, which were then hand shuffled by the researcher. The residents were asked to return the completed questionnaires to a sealed box provided in each hospital.

{Figure 1}{Table 2}

Survey measures

The first seven questions queried residents' sociodemographics, specialty choices, previous education outside Greece, current training level, residency year and past training in biostatistics. We combined residencies according to their conceptual and occupational relations to form three different medical "fields": (a) internal medicine (n = 71; dermatology, pediatrics, neurology and general practice were also included in this group); (b) surgical specialties (n = 67); and (c) diagnostic and laboratory specialties (n = 15). The next set of questions assessed residents' familiarity with the statistical concepts of sensitivity and PPV by asking them to rate their familiarity with each concept on a 7-point Likert scale ranging from 'none' (score of 1) to 'excellent' (score of 7). We next presented the mammography scenario, derived from the one used by Gigerenzer and Hoffrage, [3] either in the probability or the natural frequency form [Table 1] and asked respondents to calculate and select, in a multiple-choice format, the probability (given the sensitivity and 1-specificity of the test and the prevalence of the disease) that a positive mammogram result meant that the woman has breast cancer, which corresponds to the PPV of the screening test.
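The calculation asked of respondents can be sketched in Python as follows. This is an illustration only, not part of the study materials. Because Table 1 is not reproduced in this text, the input values are assumptions: a prevalence of 1%, a sensitivity of 80% and a false-positive rate (1-specificity) of 9.6%, the figures used in the classic formulation of the mammography problem. The function names are hypothetical.

```python
def ppv_from_probabilities(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: P(cancer | positive mammogram) from single-event probabilities."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * false_positive_rate
    return true_positive / (true_positive + false_positive)


def ppv_from_natural_frequencies(true_positives, false_positives):
    """Same quantity from counts: women with cancer among all women who test positive."""
    return true_positives / (true_positives + false_positives)


# Assumed inputs (Table 1 is not available in this text).
ppv_prob = ppv_from_probabilities(
    prevalence=0.01, sensitivity=0.80, false_positive_rate=0.096
)

# The same assumed scenario as natural frequencies, per 1,000 women:
# 10 have breast cancer and 8 of them test positive;
# 990 do not have breast cancer and about 95 of them test positive.
ppv_freq = ppv_from_natural_frequencies(true_positives=8, false_positives=95)

print(f"PPV from probabilities:       {ppv_prob:.3f}")
print(f"PPV from natural frequencies: {ppv_freq:.3f}")
```

Both routes give the same PPV up to the rounding of the counts. The natural frequency route needs only one addition and one division, and the base rate is already built into the counts.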
The correct estimate is approximately 7.8%; thus, the only valid response among the multiple choices provided in the questionnaire [Table 1] would be 'Approximately 8%'. We chose the multiple-choice format because it elicits higher response rates and is the preferred response format among residents, as it requires less response time compared with free-response items. [22],[23] The distractor items were chosen based on the response patterns observed in previous studies that tested the mammography screening scenario on physicians. [19],[20] One easy distractor ('100%') and two difficult distractors ('approximately 70%' and 'higher than 80%') were used in addition to the correct response.

Sample size

Sample size was calculated according to Cohen's guidelines by calculating Cohen's effect size w measure for Chi-square tests. [24] Previous studies have reported relatively large effect sizes w ranging from 0.45 to 0.92. [3],[11],[12] In order to detect an even smaller effect size of w = 0.3 (which is a medium effect size by Cohen's criteria [24]) at α = 0.05 with 95% statistical power, we would require a minimum sample size of 145 respondents. Prior response rates in surveys conducted by our group ranged between 77.8% and 90.9%. [25],[26] We therefore administered the questionnaire to 200 residents, expecting that even a lower than anticipated response rate of 75% would provide 150 completed responses.

Statistical analysis

A priori sample size calculations were performed using the G*Power 3.0.5 software. [27] Data analysis was performed using R (R Foundation for Statistical Computing, Vienna, Austria). [28] Median differences between two groups were analyzed using the Mann-Whitney U test for two unpaired samples. Categorical variables were compared using Pearson's Chi-square or Fisher's exact tests where appropriate. Differences between two proportions and median differences between two groups are represented as 95% confidence intervals (CI).
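The a priori sample size figure given above can be checked with a short Python sketch. It is not the G*Power procedure itself. It assumes a Chi-square test with one degree of freedom (a 2 x 2 table of presentation format by correct/incorrect response), which the text does not state explicitly. For one degree of freedom, the power of the test can be written with the standard normal distribution, so only the standard library is needed. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chi2_power_df1(w, n, alpha=0.05):
    """Power of a 1-df Chi-square test for Cohen's effect size w and total sample size n.

    With one degree of freedom the test statistic is a squared normal variable,
    so the noncentral Chi-square tail reduces to two normal tail areas.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # square root of the Chi-square critical value
    shift = w * sqrt(n)                         # square root of the noncentrality n * w**2
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


def min_sample_size(w, power=0.95, alpha=0.05):
    """Smallest total sample size whose power reaches the target (assumed df = 1)."""
    n = 2
    while chi2_power_df1(w, n, alpha) < power:
        n += 1
    return n


required_n = min_sample_size(w=0.3, power=0.95, alpha=0.05)
print(required_n)
```

Under the one-degree-of-freedom assumption, this search returns the same minimum of 145 respondents reported in the text.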
Bivariate analyses were performed to identify individual resident characteristics that may be associated with performance on the mammography problem. Variables with P < 0.05 on bivariate comparisons were further included in a multivariate analysis performed with a stepwise binary logistic regression test using the dichotomous coding of responses (correct/incorrect) as the dependent variable. A two-sided P value of < 0.05 was considered statistically significant.

Results

A total of 153 completed questionnaires were returned (response rate 76.5%), from 112 males and 41 females with a median age of 32 years. Sixty-five of the completed questionnaires included the probability version of the Bayesian inference mammography test, while 88 completed questionnaires contained the natural frequency version. A comparison of respondents' characteristics between the two randomized groups is provided in [Table 2]. No significant differences in resident sociodemographic, educational and residency profile characteristics were found between the two groups. There was a significant difference in the response rates between the two groups, as more residents completed and returned the natural frequency questionnaires (95% CI = 11.7-34.3%, P < 0.01). Perceived familiarity with the statistical concepts of sensitivity and PPV was rated as "excellent" or "close to excellent" by 88/153 (57.5%) and 60/153 residents (39.2%), respectively. The vast majority of residents (119 out of 153) reported that they received biostatistics training in lecture courses during medical school. Only six respondents (3.9%) received formal biostatistics training during residency.

The distribution of the responses in the probability and natural frequency groups is shown in [Figure 2]. The two representation groups differed significantly with regard to the proportions of the five answers (P < 0.01). The majority of residents (n = 121; 79.1%) answered the mammography problem incorrectly.
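As an illustration of how a 95% CI for a difference between two proportions can be obtained, the following Python sketch applies the simple Wald (normal approximation) interval to the response rates above: 88 of 100 natural frequency questionnaires and 65 of 100 probability questionnaires returned. The text does not state which interval method was used, so the Wald method is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_diff(x1, n1, x2, n2, level=0.95):
    """Wald confidence interval for the difference p1 - p2 of two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    diff = p1 - p2
    return diff - z * standard_error, diff + z * standard_error


# Returned questionnaires: natural frequency version vs probability version.
lower, upper = wald_ci_diff(88, 100, 65, 100)
print(f"{100 * lower:.1f}% to {100 * upper:.1f}%")
```

Under this assumption, the computed interval agrees with the one reported for the difference in response rates (11.7% to 34.3%).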
However, a significantly higher proportion of residents in the natural frequency group selected the correct response compared with residents in the probability group (28.4% vs 10.8%; 95% CI = 5.6-29.7%, P < 0.01). The most popular answer was 'higher than 80%', while the least common incorrect response was '100%'. Furthermore, the probability group tended to select 'I don't know' more frequently [Figure 2].

{Figure 2}

Bivariate analyses indicated that only residents' gender was significantly associated with performance on the mammography problem. More specifically, men were more likely than women to correctly answer the mammography problem (25.9% vs 7.3%; 95% CI = 7.2-30%, P = 0.012). Multivariate analysis was further performed in order to estimate the independent main effects of data presentation format and residents' gender on Bayesian reasoning performance [Table 3]. As shown in [Table 3], both the data presentation format and gender were independently associated with residents' performance in the mammography scenario (OR = 0.20, 95% CI = 0.06-0.72, P = 0.013 for females compared with males and OR = 3.62, 95% CI = 1.43-9.16, P = 0.007 for the natural frequency format compared with the probabilistic format). Thus, women were less likely than men to identify the correct answer regardless of the statistical presentation format used.

{Table 3}

Discussion

The present study demonstrated that residents are significantly more accurate at integrating statistical data for solving a Bayesian diagnostic test scenario when the information is presented as frequencies rather than percentages. This effect was elicited despite the fact that the answer selections were given as percentages in both cases.
Although more than one-third of the residents reported excellent or close to excellent knowledge of sensitivity and PPV, and despite having daily experience reporting diagnostic test characteristics to patients, only a minority of residents could correctly apply this information to calculate the PPV of a very common screening test. Previous reports have shown that experienced physicians, including faculty members of Harvard Medical School, had equally great difficulties with similar medical scenarios posed as single-event probabilities. [19],[20],[29],[30] Notably, residents were more likely to complete and return the natural frequency version of the questionnaire, suggesting that this format may also reduce the cognitive effort and psychological intimidation of the Bayesian problem. This may be further supported by the higher tendency of residents in the probability group to declare that they did not know the answer to the mammography scenario.

The natural frequency representation has previously been shown to improve the statistical reasoning of physicians and medical students. [11],[12],[31] However, a number of investigations have demonstrated significant differences in learning needs, styles and preferences between undergraduate medical students and residents as well as between residents and specialists. [13],[14],[15],[16],[17] The present study extended to residents previous observations of improved statistical inference using natural frequency scenarios. [19],[20],[30],[31]

The majority of residents miscalculated the PPV of the mammography test. Thus, our results raise the question to what extent misinterpretations of the diagnostic value of screening tests contribute to erroneous diagnosis and misuse of laboratory and imaging tests as well as invasive procedures. Identifying educational tools to improve perception and understanding of statistical data is fundamental for both physicians and patients.
It should be noted that although the natural frequency format significantly improved the Bayesian reasoning performance of residents, less than one-third of respondents in the natural frequency group chose the correct answer. This performance is lower than what was observed in the seminal study by Gigerenzer et al. [3] that used the natural frequency version of the mammography screening problem in university students, or in a study that used natural frequency presentations of different Bayesian reasoning tasks on obstetricians. [11] Furthermore, midwives have shown no benefit from natural frequency representations of diagnostic test data, [11] indicating that the effects of such representation modifications may not be consistently strong across different healthcare groups.

Further analysis of the effects of individual resident variables in Bayesian reasoning indicated a main effect of residents' gender, as women were less likely to select the correct answer compared with men. A previous survey of internal medicine residents reported that male residents were more likely to correctly answer biostatistics knowledge quizzes. [32] However, to our knowledge no previous study has specifically investigated the effect of gender on the percentage of correct inferences in the mammography or other similar Bayesian scenarios. [3],[8],[9],[10],[11] The results of the present study, although limited by the small sample of women residents surveyed, may indicate the effect of gender-specific social influences, cross-cultural variations, participant recruitment methods, intrinsic cognitive differences or other unknown factors and biases [33],[34],[35],[36] on the relationship between residents' gender and Bayesian reasoning. In addition, the fact that the mammography screening problem tested a strongly gender-specific disease (breast cancer) may also have influenced our results.
Due to the small numbers of female respondents and correct answers, there was insufficient power to evaluate the interaction between gender and representation format.

Limitations

While the response rate in the present study was considerably higher than what is typical for physician studies, [37] the possibility of response biases cannot be excluded. In order to fully protect residents' anonymity, we were unable to collect any further data on nonrespondents. Residents who felt less comfortable with statistical data may have been less willing to complete and return the questionnaire. It should also be noted that the two tested representations of the mammography scenario have some word ordering differences. Although it is unlikely, the possibility that such subtle variances may have also affected the results cannot be excluded. Furthermore, the fact that all three distractors markedly overestimate the PPV of the mammography test may provide clues to the correct answer. Consequently, test taking skills of residents may affect the results. However, such guessing biases due to individual differences in test taking skills are compensated by the randomization process used in the present study.

Conclusions

The present data demonstrate that most residents are susceptible to misinterpreting test results due to base rate neglect bias. However, the natural frequency format can make residents more accurate in solving a simple Bayesian problem, which is frequently encountered in daily clinical practice. Medical trainees can be taught how to effectively translate single-event probabilities into natural frequencies using simple classroom tutoring. [20],[31] Nevertheless, a substantial number of residents in the natural frequency group still failed to reach the correct answer. Therefore, this simple technique should be supplemented with further educational interventions as well as mechanical aids aimed at minimizing the effect of base rate neglect and other statistical biases.
References


