Abstract
Since its description in 1975, the Objective Structured Clinical Examination (OSCE) has gained popularity as an objective tool for the assessment of medical students, residents, and trainees. With the development of the anesthesiology residents' milestones and the preparation for the Next Accreditation System, there is increased interest in the OSCE as a tool for evaluating the six core competencies and the corresponding milestones proposed by the Accreditation Council for Graduate Medical Education.
In this article the authors review the history of OSCE and its current application in medical education and in different medical and surgical specialties. They also review the use of OSCE by anesthesiology programs and certification boards in the United States and internationally. In addition, they discuss the psychometrics of test design and implementation with emphasis on reliability and validity measures as they relate to OSCE.
Since its introduction in the 1970s, the Objective Structured Clinical Examination (OSCE) has been used as an assessment tool in medical education, physician training, and certification exams. There has been increased interest in OSCE in recent years, accompanying planned changes in residents' education, evaluation, and certification.
In 1999, the Accreditation Council for Graduate Medical Education (ACGME) defined six core competencies that outlined the scope of medical residents' development and described the required domains within which to assess all residents. These have become well established and incorporated into graduate medical education. More recently, the ACGME elaborated incremental, concrete steps, called milestones, by which to measure progress within these domains.1 The milestones are a set of specialty-specific educational outcomes to be achieved at defined intervals. In the Next Accreditation System, annual residency program review will include an assessment of residents' progress along the milestones.1 The ACGME will implement the Next Accreditation System in anesthesiology training by June 2014.
The anesthesiology milestones will delineate five developmental levels, each level characterized by the additional knowledge, skill, or behavior required for a physician at that stage. The assessment of residents’ progress along the milestones may incorporate written and oral examination, observation in the clinical setting, OSCE, or any combination thereof. In addition, the American Board of Anesthesiology plans to introduce OSCE into the final part of the applied examination, including use of standardized patients, mannequins, or computer-based assessment. Accordingly, anesthesiology training programs and residents must understand the design, applications, and limitations of OSCE to prepare for the upcoming changes in program accreditation and practitioner certification.
What Is an OSCE?
Miller2 proposed a hierarchical framework for assessing physicians long before the milestones were introduced. “Knowing” is the most fundamental form of understanding, and factual knowledge is amenable to assessment by written tests. When a physician combines knowledge with clinical judgment, this “knowing how” incorporates data to make an informed decision about patient management. The anesthesiology oral exam format aims to assess this competence. “Showing how” is what residents will do in practical situations, and this lends itself to assessment by direct clinical observation. Clinical observation, however, may be lacking in scope and in objectivity. OSCE and simulation scenarios can provide valuable additional assessments of this level of knowing, which includes clinical judgment and “practical skills.”3 Ultimately, residents should be able to translate this knowledge into the clinical setting and demonstrate their skills and knowledge with actual patients (“doing”). Assessment of this level of behavior remains the most challenging to accomplish reliably and accurately.2
Formalized by Harden et al.4 in 1975, the OSCE consists of a series of stations, each 5 to 10 min in duration. These stations provide an array of clinical scenarios and tasks that require procedural skills or data interpretation. The presence of two or more examiners using a standardized checklist promotes objectivity and consistency in the assessment of the trainee's clinical problem solving and patient-management skills.2,5–7 The OSCE incorporates several methods of assessment, such as multiple-choice questions, open-ended questions, and simulation, as well as the classic use of standardized patients.2,5,6 The hallmark of the OSCE is its focus on the assessment of clinical competence,5 or, in Miller's classification, the ability of trainees to demonstrate their knowledge in practice. By ensuring both the uniformity of the administered exam and the exposure to different examiners, the OSCE provides a more comprehensive evaluation of the trainee.5 Further, the OSCE design promotes objectivity by using multiple examiners, establishing evaluation criteria, and incorporating reproducible scoring sheets.8 The use of standardized patients further promotes uniformity in the delivery of the exam and allows testing of rare clinical scenarios.4 Finally, the OSCE design provides for immediate feedback to both the resident and the educator after completion of the stations,8,9 thus reinforcing the learning9 and identifying areas of weakness in the training or in the exam itself, making it a valuable formative evaluation tool.8
OSCE in Medical Education
Because of these useful characteristics, the OSCE has been incorporated into the certification processes of several international certifying agencies since 1994. The Clinical Skills Examination portion of the United States Medical Licensing Examination is a model of OSCE that has been used for the assessment of foreign medical graduates since the late 1990s,10 and subsequently for the assessment of American medical students since 2004.11 In addition, most U.S. medical schools have included OSCE in their curricula.12 Although most reports of OSCE have described its use in medical student education, OSCE has been applied in graduate medical education either to complement information obtained from existing evaluation methods or to circumvent the subjective nature of other forms of assessment.13 Several reports have evaluated the feasibility, reliability, and validity of using OSCE in the medical and surgical fields.14–20 In reported experiences with OSCE, formats have included a combination of methods, including written problem-solving stations,15 standardized patients,14–16,20 hands-on stations requiring demonstration of technical skills, and interpretation of laboratory results.20 These differences in format and content target the various objectives expected of diverse groups of trainees.
The OSCE was found to be a reliable tool to assess physical examination skills,18,19 clinical judgment and diagnosis,15,20 interpretation of radiological and laboratory findings,20 technical skills,17–20 and even billing and documentation exercises.16 Communication skills, including end-of-life discussions, disclosure of medical errors,14,21 and patient feedback,16 have also been assessed.
OSCE in Anesthesiology
Relatively few reports exist on the use of OSCE in anesthesiology training or certification. In anesthesiology, the OSCE has been used to assess physical examination skills, clinical and history-taking skills, airway management,3,22 resuscitation,3,23 blood product transfusion,24 anatomy, and statistics.3
OSCE was included in the final examination of the Royal College of Anaesthetists in the United Kingdom in the mid-1990s.3 The exam consists of 17 stations used to assess a range of skills such as resuscitation, handling and troubleshooting anesthesia equipment, data interpretation, history taking and communication, physical examination, identification of anatomy, and understanding and using statistical tools.3 It was subsequently incorporated into the primary part of this two-part certification exam.25
The Israeli National Board Examination in Anesthesiology first included OSCE in 2003. The exam format and process were created in a joint effort by the Israeli Board of Anesthesiology Examination Committee, the Israel Center for Medical Simulation, and the Israeli National Institute for Testing and Evaluation.22 The examination content was designed using the approach described by Newble,6 in which OSCE developers (1) identify the clinical competencies to be assessed, (2) design the tasks to evaluate each competency, and (3) create a blueprint for the test to be administered.6 In the Israeli experience, expert opinion determined the clinical skills to be assessed. The examination task force selected representative tasks, which were subsequently incorporated into five simulation-based OSCE stations: regional anesthesia, trauma management, resuscitation, intensive care medicine/ventilation, and critical events in the operating room.26 The passing threshold required completion of previously determined critical actions in addition to successful demonstration of 70% of the checklist items.22 A high pass rate was observed despite the use of a predetermined passing threshold, an approach reported to result in lower pass rates than expert-consensus methods such as the Angoff or borderline methods.27
The Israeli Board of Anesthesiology later included a regional anesthesia station. The development of this station was identified as a highly iterative process, requiring repeated revision for optimal design and delivery.26 The authors also recognized the need for consultation with experts to ensure psychometric rigor,26 as will be discussed later in this review.
In 2010, the Royal College of Physicians and Surgeons of Canada incorporated simulation, in the form of video presentations, into its oral examination in anesthesiology.28 The test format and content continue to evolve, underscoring the importance of revisiting the original OSCE design throughout the process.
Psychometrics of OSCE
Psychometrics of any assessment tool can be evaluated along four measures: feasibility, objectivity, reliability, and validity.12,29 The majority of the reported experience with OSCE in medical education has focused on only one or a few aspects of psychometrics, without a systematic approach to reliability and validity. Many reports have highlighted the development and application of the examination form itself and underemphasized the importance of evaluation of the examination. Achieving reliability and content validity of OSCE is of particular importance when the OSCE is used as a measure of trainees’ progress in residency or as part of the board-certification process.2,6 However, OSCE design and feasibility need to be taken into consideration as well.
Feasibility
Instructional systems design traditionally follows a multistage, iterative model frequently referred to by the acronym ADDIE: Assess, Develop, Design, Implement, and Evaluate.30,31 The process is illustrated in figure 1.
The process of Objective Structured Clinical Examination design begins with a needs assessment of the program, its faculty, the involved trainees, and the requirements of regulating agencies. During the program development phase, key concepts to be evaluated are identified, and specific goals and objectives are formulated. In the design and implementation phase, specific tasks and their corresponding scoring sheets are designed, and the location, personnel, equipment, finances, and duration of the Objective Structured Clinical Examination stations are decided. Throughout the process, continuous evaluation of the program is performed and the program is altered as needed.
Assessing Needs
Planning for any new educational program should start with a needs assessment of the learner, the organization, and other regulating agencies. Hence, when planning an OSCE, residency training programs should take into consideration their residents' needs, their departmental and organizational goals, the ACGME design for the curriculum, and the American Board of Anesthesiology design for board certification. It is tempting to assume that the needs of all stakeholders converge on preparing residents for the planned final step of their board certification. However, program-specific needs should be assessed before designing an OSCE, such as whether the OSCE will be part of the formative and summative evaluations of the trainees and whether the results will affect a resident's progression through residency. Needs and interests can be identified both by forming advisory groups31 of program directors, key faculty involved in education, and resident representatives, and by surveying the experience of other programs and other specialties. In addition, expected changes in credentialing, such as the inclusion of OSCE in the American Board of Anesthesiology certifying exam, may prompt the development of additional means of evaluation.
Developing a Program
Goals and objectives of the OSCE are developed to address the competencies and milestones identified during the needs-assessment phase. On the basis of the identified needs, the instructional program’s goals and objectives are formulated explicitly and clearly, and are shared with the program, the learners, and the faculty. Objectives are specific and detailed explanations of the stated goals, and are usually described using Bloom’s taxonomy.32 Bloom’s taxonomy, originally published in 1956,32 and later revised and refined,33,34 allowed for a common language to be used in education and for assessment of educational endeavors.32,33 In the taxonomy of education, cognitive processes are viewed as a linear progression from least to most complex and are defined by the use of “verbs” to illustrate the category. Learners progress from knowing to “understanding” the concepts and their relationships, “applying” the learning, “analyzing” the principles and their organization, “synthesizing” the information and producing a plan of action, and finally “evaluating” the learning and the situation.32,33 Objectives therefore describe the skills, knowledge, and attitudes that will be assessed by the OSCE as well as their level of complexity, depending on the trainee level. This is particularly relevant in adequately defining tasks to match the anticipated milestones. Setting clear goals and objectives is important for the design of the learning activity, and for the evaluation of its progress. These should be revisited frequently to avoid inflexibility in the design, to incorporate other previously omitted goals and objectives, and to redefine those that are not relevant.31
Design and Implementation
In the design phase, specific tasks are elaborated to accurately assess the stated goals and objectives, and a plan for implementation is put in place. The process is refined in this phase by defining the tasks to be included, the faculty involved in testing, and the logistics of the OSCE implementation. Identifying the skills, knowledge, or attitudes to be evaluated by the OSCE is key because the format of OSCE stations needs to be tailored to the task being assessed.2 It has been suggested that the OSCE is best suited to assess clinical and practical skills, rather than attitudes or factual knowledge,6 but several reports describe its use in the assessment of communication skills as well.13,14,16 Tasks are then constructed to assess the given competencies, keeping in mind that performance on one task may not reflect performance on other tasks, even those that are closely related.6 A number of tasks should therefore be planned for each broad competency in order to ensure validity of the test.6 In designing the activity, methods for standardizing the OSCE should be sought, such as the use of an objective checklist for evaluating participants, training of the examiners to ensure interrater reliability, and establishment of policies and procedures for administering the exam. The logistics of the OSCE include deciding on the duration and format of each station, as well as the duration of the entire examination. Details such as the location of testing and the training of faculty involved in the OSCE should be addressed. Finally, space, time, and cost are important considerations in OSCE design. Design and implementation are financially costly, especially when standardized patients are employed.12,29 Adhering to a timeline for design and implementation avoids delays and frustrations and also allows for timely evaluation of the activity.31
Evaluation
Evaluation of the instructional design is an ongoing process throughout implementation and design, as well as after completion of the activity. Kirkpatrick proposed a four-step evaluation model, with each step providing increased complexity: reactions, learning, behavior, and results.35 The first level describes the participants' attitudes, satisfaction, and emotional response to the learning activity. This can be assessed by surveying participating residents to evaluate their subjective response to the exam. Studies have shown mixed responses to OSCE regarding the level of stress experienced by trainees, their overall enjoyment of the activity, and their perception of the content validity of the test.7,8,12 The second level of the evaluation model is a measure of change in the learning of residents, demonstrated either by better performance on retesting or by improved performance on other measures of assessment, such as in-training exams and oral exams. A correlation between use of the OSCE and improved performance would also serve as a validity measure of the designed test. However, establishing such correlations has been historically difficult.3 At the third level of evaluation, a change in behavior in clinical practice is sought, as evaluated by faculty. Finally, the ultimate outcome one hopes to establish is that the instructional design leads to improved patient care; however, this is difficult both to define and to measure.
Constant reflection and evaluation need to accompany the entire process, leading to repeated reappraisal of the program, its purpose, and its design. Input from the learners, the evaluators, and the program should be incorporated.31 In addition, although the OSCE is used primarily as an assessment tool, the exam can uncover areas of weakness in the curriculum, which can subsequently be addressed by the program and the trainee.13
Reliability and Objectivity
OSCE reliability and validity have been investigated in several reports since the 1990s. Reliability is often referred to as the consistency of a test, that is, the reproducibility of the exam score.29 Validity, in contrast, is conceptualized as the accuracy of the exam score, or the extent to which the test measures what it purports to measure.29 These simplistic definitions, however, hide the greater complexity inherent in test assessment.
Test reliability comprises several components: interrater reliability, internal reliability, test–retest reliability, and intermethod reliability. Interrater reliability is a measure of the degree of agreement between different raters when grading the same examinee at a specific station. It has been suggested that interrater reliability can be improved by using a standardized scoring checklist with clearly stated objectives.2,4,36 Indeed, this is one of the reported strengths of the OSCE methodology. However, controversy persists over whether the checklist methodology of Harden performs as well as global ratings on measures of reliability.29 Checklist scoring systems fail to acknowledge the ability of a trainee to solve a given clinical problem by implicit pattern recognition rather than by a step-wise approach.12 In addition, standardization of the scoring is challenging and can lead to either a high pass rate or a high failure rate, depending on the method used to decide what constitutes acceptable performance.27 Reaching expert consensus is considered a better approach to standard setting than an arbitrarily chosen cutoff, because performance evaluation then assesses the learner's engagement with the task rather than simply comparing the learner with peers.8
There are two main approaches to standard setting: item-based (criterion-referenced) and trainee-based (norm-referenced). In item-based standard setting, such as the Angoff method, a panel of experts decides how a borderline candidate is likely to perform on any given task,27 and this is used as a guide to evaluate the actual performance of trainees. Trainee-based methods evaluate the overall performance of the trainees rather than focusing on specific tasks; the mean of all borderline scores achieved by trainees on a task is considered the passing score for the given station.6 Although Wass et al.37 found norm-referenced scoring to be superior to criterion-referenced scoring, this finding has not been consistent. Finally, although interrater reliability is an important goal, care must be taken not to consider the test "reliable" based solely on a high degree of correlation between the scores of different evaluators.
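To make the distinction concrete, the following minimal sketch works through both approaches for a single hypothetical 10-item station; the judges' estimates, checklist scores, and global ratings are invented for illustration and are not drawn from any of the examinations cited above.

# Hypothetical standard setting for one 10-item OSCE station.
# Angoff (item-based): each judge estimates the probability that a
# borderline candidate would complete each checklist item correctly.
angoff_estimates = [
    [0.6, 0.7, 0.5, 0.8, 0.9, 0.4, 0.6, 0.7, 0.5, 0.8],  # judge 1
    [0.5, 0.8, 0.6, 0.7, 0.8, 0.5, 0.7, 0.6, 0.6, 0.7],  # judge 2
    [0.7, 0.6, 0.5, 0.9, 0.8, 0.5, 0.6, 0.7, 0.4, 0.8],  # judge 3
]
# The cut score is the sum, over items, of the mean estimate across judges.
angoff_cut = sum(sum(item) / len(angoff_estimates) for item in zip(*angoff_estimates))

# Borderline (trainee-based): examiners give each candidate a checklist score
# and a global rating; the pass mark is the mean checklist score of the
# candidates rated "borderline".
checklist_scores = {"A": 8.5, "B": 6.0, "C": 5.5, "D": 9.0, "E": 6.5}
global_ratings = {"A": "pass", "B": "borderline", "C": "borderline", "D": "pass", "E": "borderline"}
borderline_scores = [s for c, s in checklist_scores.items() if global_ratings[c] == "borderline"]
borderline_cut = sum(borderline_scores) / len(borderline_scores)

print(f"Angoff cut score: {angoff_cut:.1f} of 10 items")          # 6.5
print(f"Borderline-group cut score: {borderline_cut:.1f} of 10")  # 6.0

In this invented example, the Angoff panel would require roughly 6.5 of 10 checklist items, whereas the borderline-group method would set the pass mark at 6.0; with real judge and candidate data, the two methods may of course diverge more substantially.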
The internal reliability of the exam is at least as important yet more difficult to achieve.36 Internal reliability is characterized by the extent to which performance across different test stations remains consistent. In the context of a typical OSCE, high internal reliability implies that the scores obtained on various items are capturing the knowledge, skills, and attitudes from the same conceptual domain, or closely interrelated domains.38 Although the reliability can be influenced by many factors including motivation, stress, attention, and distraction,12,29 the overarching concern is that the exam score conveys the degree of mastery of the intended competency. Internal reliability can be improved by increasing the number of stations12,29,36 or by focusing the content of the different stations on assessing the same conceptual domain or competence.8,39 Some evidence suggests that a series of OSCEs administered over time, when evaluated collectively, demonstrate improved reliability.12
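The effect of adding stations can be illustrated with the Spearman–Brown prophecy formula from classical test theory, which is not discussed in the cited reports and is shown here only as an illustration. If an exam has reliability \(\rho\) and its length is multiplied by a factor \(n\) with comparable stations, the predicted reliability is

\[
\rho^{*} \;=\; \frac{n\,\rho}{1 + (n-1)\,\rho}.
\]

For example, doubling a five-station exam with a reliability of 0.6 (n = 2) predicts a reliability of approximately 0.75, provided the added stations sample the same conceptual domain.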
Quantitative assessment of internal reliability is often reported as Cronbach alpha. This statistic can be conceptually understood as the average correlation obtained after examining all possible divisions of the test items into two groups.38,40 The magnitude of this score, from 0 to 1, is a reflection of the internal consistency of the exam scores. Values greater than 0.8 are desirable, especially in high-stakes situations such as pass–fail decisions.15 However, one must also be aware that the resulting number does not reflect an inherent property of the exam itself,36 but rather of the scores obtained on that exam. Accordingly, it is a reflection of the specific population of examinees who took the exam. If a test is piloted in a group of postgraduate year 1 trainees (or junior attendings) and then subsequently used for a group of postgraduate year 4 trainees, reliability may change. In addition, a trainee's performance on one task is not a good predictor of his or her subsequent performance on other tasks.36
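For reference, one common formulation of Cronbach alpha, shown here for illustration rather than quoted from the cited sources, is

\[
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sigma_X^{2}}\right),
\]

where \(k\) is the number of stations (or scored items), \(\sigma_i^{2}\) is the variance of the examinees' scores on station \(i\), and \(\sigma_X^{2}\) is the variance of their total exam scores. Because these variances depend on who sat the exam, the same instrument can yield different alpha values in different cohorts, which is the point made above about piloting in one group of trainees and then administering to another.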
Finally, test–retest reliability is an estimate of the trainee’s likelihood of achieving a similar score on a given station on repeat performance.39 Performance on a given station is marginally improved by increasing time allowed for a task, but is significantly improved when feedback is provided.7 Achieving reliability is the necessary first step in establishing the validity of an examination.40
Validity
One cannot assume that because a test meets acceptable criteria for reliability, it is valid.40 Validity, in its simplest form, asks whether the exam provides accurate feedback about an examinee's skill level in the domain or "construct" of interest. This may be assessed by examining several subcategories of validity. Face validity is, simply stated, the extent to which the exam has the appearance of measuring what it is intended to measure.29 The assessment of face validity can be done without in-depth examination of exam content or the input of content experts and is, therefore, the least rigorously evaluated form of validity.41 Content validity, in contrast, requires that subject-matter experts review (or design) exam items to ensure that they not only reflect the topic or domain of interest but also adequately capture the entirety of the subject matter in that domain.29 In order to draw conclusions about the level of expertise in the desired knowledge, skills, or attitudes, the exam items have to be adequately representative of the full spectrum of those elements within that domain. The goal of content validity in OSCE construction may be achieved with the use of a "blueprint," which carefully deconstructs the competency to be tested during the design phase.6 As might be expected in light of this complexity, it has been suggested that an OSCE with fewer than 10 stations is less likely to achieve content validity.6,8 In addition, the overall competence of the trainee, as related to level of training or overall expertise, can influence performance on a specific station,42 and an OSCE with high construct validity may differentiate between learners' levels of training.8,15,19
Validity can also be measured by the relationship between exam performance and measures external to the exam.41 Concurrent validity refers to the correlation between performance on an OSCE for a given competence and contemporaneous performance on an alternative, well-established method of evaluating the same competence.29 Predictive validity provides a similar measure of comparison, though it seeks to correlate exam scores with an alternative assessment performed at a future time. Examining correlations between performance on the OSCE and on other test modalities can help avoid the problem of case specificity, in which the tasks tested are specific and fail to represent the general competency to be assessed.6
However, when measures of concurrent validity have been reported in the literature, results have varied between 0.1 and 1, with a majority of studies reporting a correlation coefficient of less than 0.7.12 This limited concurrent validity may be related to the type of competency being assessed.12,29 Correlation coefficients, and hence concurrent validity, improve when specific subsets of an OSCE reflecting specific conceptual subsets of a competency are well matched to the comparison exam of interest. For example, Matsell et al.7 showed that the OSCE had high concurrent validity for pediatric residents in the areas of knowledge and patient management when compared with standardized multiple-choice tests, but the OSCE did not correlate with other evaluation methods, such as faculty evaluation, in tasks assessing clinical skills and problem-solving ability. For this reason, some researchers have recommended that the OSCE be used as an additional evaluation tool rather than an alternative to current assessment methods of clinical competencies.6,29,36
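As a minimal sketch of how such coefficients are obtained, the following example computes Pearson correlations between hypothetical OSCE subscores and two comparison measures; all scores are invented and serve only to illustrate why matching the conceptual subset of the OSCE to the comparison exam matters.

from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Knowledge-oriented OSCE stations vs. a multiple-choice in-training exam
# (conceptually matched): expect a high coefficient.
osce_knowledge = [62, 70, 75, 81, 85, 88, 90, 95]
written_exam   = [58, 66, 72, 78, 80, 86, 91, 94]

# Clinical-skills OSCE stations vs. a coarse global faculty rating
# (conceptually mismatched): expect a much weaker coefficient.
osce_skills    = [70, 85, 64, 90, 78, 72, 88, 81]
faculty_rating = [3, 3, 4, 3, 4, 3, 4, 3]

print(f"knowledge vs. written exam: r = {pearson_r(osce_knowledge, written_exam):.2f}")
print(f"skills vs. faculty rating:  r = {pearson_r(osce_skills, faculty_rating):.2f}")

In this fabricated example the matched comparison yields a coefficient near 1, while the mismatched comparison yields a coefficient near zero, mirroring the pattern reported by Matsell et al.7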
When considering the use of OSCE in the assessment of residents' progression along the ACGME milestones, as well as performance on the American Board of Anesthesiology certification exam, there is an underlying assumption that an OSCE provides superior predictive validity. Indeed, it is this superior prediction of clinical performance that appears to be subtly implied by the recent enthusiasm for OSCE. As OSCE is introduced into multiple training programs, however, adequate internal reliability must be ensured, and strong content validity should be achieved through rigorous planning and blueprinting. The true magnitude of the predictive validity of OSCE with regard to postgraduate clinical performance remains to be seen, and its future assessment should not be forgotten. Designing these exams will be a challenging endeavor for anesthesiology programs and, necessarily, a highly iterative process requiring frequent reevaluation of exam psychometrics. For this reason, Bromley et al.,3 as well as others, have recommended that economies of scale be employed in exam development, both to minimize development costs and to maximize quality, by sharing well-designed and tested stations between institutions.36
Applications
In practical terms, an OSCE may comprise as few as 5 or as many as 20 stations, depending on the specialty, with each station requiring 5 to 10 min for completion. This allows simultaneous administration to a larger group of trainees in a limited time frame. In addition, different formats can be combined within the OSCE stations, such as the use of standardized patients, laboratory data, equipment, or slides, as previously described by Newble.6 In the Department of Anesthesiology at Columbia University Medical Center, we have begun limited use of OSCE for the assessment of clinical skills, technical skills, and patient-management skills. Clinical skills of interest to anesthesiology programs may include taking a medical history, obtaining informed consent for invasive procedures, and interpreting electrocardiograms, radiographic studies, and hemodynamic data. Technical skills can be assessed using simulation for airway management, double-lumen endotracheal tube placement, and basic echocardiography image acquisition and interpretation. Some stations, such as interpretation of laboratory or electrocardiographic studies, could be completed without the presence of an examiner, with the results collected in a written format.13 Technical stations, however, such as demonstrating the placement of a double-lumen tube or management of a difficult airway, require the presence of examiners. An objective assessment of the trainee is made by completing a predetermined checklist to evaluate completion of all required steps. Agarwal et al.13 suggested further division of the stations into basic and advanced skills. Such a refinement may facilitate the assessment of progress along the developmental milestones according to trainee level. Specific examples of the potential use of OSCE are provided in table 1, detailing suggested tasks for assessing the described concepts. Table 2 illustrates the use of a blueprint for designing an OSCE in anesthesiology to assess the core competencies.
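As one illustration of how such a predetermined checklist might be scored, the sketch below combines a critical-action requirement with an overall completion threshold of 70%, following the general approach described for the Israeli examination above; the station, item wording, critical-action designations, and threshold are hypothetical rather than taken from any published exam.

# Hypothetical checklist for a double-lumen endotracheal tube station.
# Item wording, critical-action flags, and the 70% threshold are illustrative only.
CHECKLIST = [
    ("Confirms patient identity and planned procedure", False),
    ("Checks tube cuffs and lubricates the tube", False),
    ("Selects an appropriate tube size", True),
    ("Performs laryngoscopy and places the tube atraumatically", True),
    ("Inflates the tracheal cuff and confirms ventilation", True),
    ("Advances and positions the bronchial lumen", True),
    ("Confirms position by auscultation or bronchoscopy", True),
    ("Verbalizes a plan for intraoperative repositioning", False),
]
PASS_FRACTION = 0.70  # minimum fraction of checklist items to be completed

def score_station(completed_indices):
    """Return (passed, fraction_completed) for the indices of completed items."""
    completed = set(completed_indices)
    criticals_done = all(i in completed
                         for i, (_, critical) in enumerate(CHECKLIST) if critical)
    fraction = len(completed) / len(CHECKLIST)
    return criticals_done and fraction >= PASS_FRACTION, fraction

# Example: a resident completes the first seven items but omits the last.
passed, fraction = score_station(range(7))
print(f"Completed {fraction:.0%} of items; station passed: {passed}")

Dividing items into basic and advanced subsets, as suggested by Agarwal et al.,13 could be represented in the same structure by an additional flag per item and separate thresholds for each subset.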
Conclusion
As George Miller noted, "no single assessment method can provide all the data required for judgment of anything so complex as the delivery of professional services by a successful physician."2 When well designed, the OSCE is a reliable tool with only "modest" validity8 and should accordingly be viewed as a valuable, although insufficient, addition to residency assessment. The OSCE allows for a flexible yet structured examination characterized by objective evaluation of trainees,5 preparing them for the board examination and providing programs with a means of regular trainee and program assessment. However, programs should not rely on the OSCE by itself to provide a comprehensive assessment of a trainee's competence,3,12 but rather should view it as complementary to existing exam modalities. In addition, the cost and the logistics of designing and implementing the OSCE should be considered carefully. Future studies should examine the use of OSCE as both a formative and a summative assessment tool, evaluate its effect on the learning and behavioral outcomes of trainees, and compare it with other established methods of assessment. Some of the recognized strengths and potential weaknesses of the OSCE are listed in table 3. As more programs begin designing and implementing OSCE to accompany the changing accreditation system, experience with OSCE in anesthesiology education may be shared between programs and in the literature to further enrich the education community, to fill the knowledge gap about the applications of OSCE in anesthesiology, and to better prepare trainees and residency programs for the ACGME competencies and milestones.