Evidence- and Experience-Based Best Practices: Situational Judgment Tests | HumRRO

June 8th, 2020 | HumRRO Blogs
Researchers at HumRRO have produced one of the first practitioner-oriented guides on developing situational judgment tests (SJTs). Drawing on the scientific literature and their own extensive research and real-world experience developing and implementing SJTs in high-stakes assessment contexts for public and private sector clients, Deborah L. Whetzel, Ph.D., Taylor S. Sullivan, Ph.D., and Rodney A. McCloy, Ph.D., wrote “Situational Judgment Tests: An Overview of Development Practices and Psychometric Characteristics,” published in the journal Personnel Assessment and Decisions. According to ScholarWorks@BGSU, the article has been downloaded nearly 400 times in the United States and internationally since its publication in March.
SJTs assess individual judgment by presenting examinees with scenarios that describe problems to solve, along with a list of plausible response options. Examinees then evaluate how well each response option addresses the problem described in the scenario.
The paper discusses a variety of issues that affect SJTs, including reliability, validity, group differences, presentation modes, faking, and coaching, and provides best-practice guidance for practitioners.
“Consistent with HumRRO’s mission to give back to the profession, we are sharing experience- and evidence-based conclusions and suggestions for improving the development of SJTs,” said Sullivan.
It is clear from both psychometric properties and examinee response behavior that not all SJT designs are equally effective, and not all designs may be appropriate for all intended uses and assessment goals. To help practitioners and researchers alike, the authors provide best practices for developing SJTs:
SJT Best-Practice Guidelines
Scenarios
The use of critical incidents to develop SJT scenarios enhances their realism.
Specific scenarios rely on fewer assumptions, yielding higher levels of validity.
Brief scenarios reduce candidate reading load, which may reduce group differences.
Avoid sensitive topics and balance variety of characters.
Avoid overly simplistic scenarios that yield only one plausible response.
Avoid overly complex scenarios that provide more information than needed.
Response options
Generate response options that have a range of effectiveness levels.
If developing a construct-based SJT, be careful about option transparency.
List only one action in each response option (avoid double-barreled responses).
Distinguish between active bad (do something wrong) and passive bad (do nothing).
Check for tone (use of loaded words can give clues as to effectiveness).
Response instructions
Use knowledge-based (“should do”) instructions for high-stakes settings. (Candidates will engage in impression management and will respond based on what they think should be done even if they would personally respond differently.)
Use behavioral tendency (“would do”) instructions if assessing non-cognitive constructs, such as personality.
Response format
Use a format where examinees rate each option, as this method provides the most information for a given scenario, yields higher reliability, and elicits the most favorable candidate reactions.
Single-response SJTs are easily classified into dimensions and have reliability and validity comparable to other SJTs, but they can carry a higher reading load because each scenario is associated with a single response.
Scoring
Empirical and rational keys have similar levels of reliability and validity.
Rational keys based on SME input are used most often.
Develop “overlength” forms (more scenarios and options per scenario than you will need) and score only those items that function properly.
Use 10–12 raters with a variety of perspectives to establish the scoring key. Outliers may skew results if fewer raters are used.
Use means and standard deviations to select options (means will provide effectiveness levels; standard deviation will provide level of SME agreement).
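The mean-and-SD approach to option selection can be sketched in a few lines of Python. The ratings, the 7-point scale, and the agreement cutoff below are all hypothetical illustrations, not values from the paper; in practice you would tune the cutoff to your rating scale and rater panel.

```python
import statistics

AGREEMENT_SD_CUTOFF = 1.5  # assumed threshold; tune to your rating scale


def summarize_option(ratings):
    """Return the mean (effectiveness level), SD (SME agreement), and a keep flag."""
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    return {"mean": mean, "sd": sd, "keep": sd <= AGREEMENT_SD_CUTOFF}


# Hypothetical panel of 10 SMEs rating each option's effectiveness (1-7 scale).
sme_ratings = {
    "option_a": [6, 7, 6, 6, 7, 6, 5, 6, 7, 6],  # effective, high agreement
    "option_b": [2, 3, 2, 1, 2, 3, 2, 2, 1, 2],  # ineffective, high agreement
    "option_c": [4, 6, 2, 7, 3, 5, 1, 6, 4, 2],  # raters disagree -> drop
}

for option, ratings in sme_ratings.items():
    print(option, summarize_option(ratings))
```

The mean becomes the scoring-key value for the option, while a large SD flags options the SMEs could not agree on, which are candidates for removal from the overlength form.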
Reliability
Coefficient alpha (internal consistency) is not appropriate for multidimensional SJTs.
Use a split-half approach, with a Spearman-Brown correction, assuming the SJT content is balanced.
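Assuming the halves are content-balanced, a split-half estimate with the Spearman-Brown correction might look like the sketch below. The odd-even split and the example scores are illustrative assumptions, not data from the paper.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two lists of half-test scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def split_half_reliability(scores):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    `scores` is a list of per-examinee item-score lists; the odd-even split
    assumes content is balanced across the two halves.
    """
    odd = [sum(items[0::2]) for items in scores]
    even = [sum(items[1::2]) for items in scores]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown step-up to full length


# Hypothetical item-level scores for four examinees on a four-item SJT.
scores = [[1, 1, 2, 2], [2, 3, 3, 3], [3, 2, 4, 3], [4, 4, 5, 5]]
print(f"split-half reliability: {split_half_reliability(scores):.3f}")
```

The correction compensates for the fact that each half is only half the length of the full test; the estimate is only meaningful when the split preserves the content balance of the form.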
Validity
Because SJTs have small incremental validity over cognitive ability and personality, consider using them in tandem with other assessments to boost validity.
SJTs have been used effectively in military settings for selection and promotion.
SJTs likely measure a general personality factor.
SJTs correlate with other constructs, such as cognitive ability and personality.
Group differences
SJTs have smaller racial group differences than cognitive ability tests.
Women perform slightly better than men on SJTs on average.
Behavioral tendency instructions have smaller group differences than knowledge instructions.
Rating all options has lower group differences than ranking or selecting best/worst.
Presentation methods
Avatar- and video-based SJTs have several advantages in terms of higher face validity and lower group differences, but they may have lower reliability by inserting irrelevant contextual information.
Using avatars may be less costly, but developers should consider the uncanny valley effect when using three-dimensional human images.
Faking
Faking does affect rank ordering of candidates and who is hired.
Particularly in high-stakes settings, knowledge-based (“should do”) instructions appear to do a better job of mitigating faking than behavioral tendency (“would do”) instructions.
SJTs generally appear less vulnerable to faking than traditional personality measures.
Coaching
Use scoring adjustments, such as key stretching and within-person standardization, to reduce the effect of coaching examinees on how to maximize SJT responses.
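As an illustration of one such adjustment, within-person standardization rescales each examinee's option ratings against their own mean and spread, so a coached examinee who shifts every rating toward the keyed extremes gains nothing. The data below are hypothetical; key stretching would be a separate adjustment not shown here.

```python
import statistics


def within_person_standardize(ratings):
    """Z-score an examinee's option ratings against their own mean and SD.

    Removes idiosyncratic scale use, so only the *pattern* of ratings
    (which options are rated higher than others) contributes to the score.
    """
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)
    if sd == 0:
        return [0.0] * len(ratings)  # no variability -> no usable pattern
    return [(r - mean) / sd for r in ratings]


# A coached examinee who uniformly inflates every rating by 2 points ends up
# with the same standardized profile as the honest version of themselves.
honest = [5, 2, 6, 3]
coached = [7, 4, 8, 5]  # hypothetical uniform inflation
assert within_person_standardize(honest) == within_person_standardize(coached)
```

Because uniform shifts and stretches of the rating scale are removed, this adjustment blunts the simplest coaching strategies while leaving genuine judgment differences intact.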
Taylor Sullivan, Ph.D., contributed to this blog.
About the Authors:
Deborah Whetzel, Ph.D.
Manager
Rod McCloy, Ph.D.
Chief Scientist