What Makes an Assessment Test Valid and Reliable?
Not all assessments are equal. Learn the difference between valid and invalid tests, and why it's crucial for your hiring.
By Ingmar van Maurik · Founder & CEO, Making Moves
Why it matters
An assessment is only valuable if it measures what it claims to measure and gives consistent results. Sounds logical, but the reality is that many companies deploy assessments without knowing whether they actually predict job performance.
The consequence: hiring decisions based on noise. You think you're hiring data-driven, but in reality, you're using an instrument that predicts no better than a coin flip — and sometimes even worse, because it creates a false sense of certainty.
In this article, we explain what validity and reliability actually mean, how to measure them, and why generic assessments often fall short. We also show how, with a system of your own, you can build assessments that actually predict who will succeed.
Validity: are you measuring what you want to measure?
Validity is the foundation of every assessment. It answers the question: does this test actually predict job performance? There are multiple forms of validity, each with a specific function.
Predictive validity
This is the gold standard in assessment psychometrics. You compare candidates' test scores with their actual performance on the job later, for example the rating in their performance review six months after starting.
Predictive validity is expressed as a correlation coefficient (r). In psychometrics, benchmarks along these lines are commonly used:

| Correlation (r) | Interpretation |
|------------------------|---------------|
| Below 0.11 | Unlikely to be useful |
| 0.11-0.20 | Useful depending on the circumstances |
| 0.21-0.35 | Likely to be useful |
| Above 0.35 | Very beneficial |
The best generic cognitive tests achieve an r of 0.30-0.50. But company-specific assessments can score significantly higher because they're calibrated to what success means in your specific context.
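As an illustration, here is a minimal sketch of how such a correlation could be computed. The numbers are made up for the example; in practice you would pair your own historical hires' assessment scores with their later review ratings.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up example data: assessment score at application and the performance
# rating from the review 6-12 months later, same people in both arrays.
assessment_scores = np.array([62, 71, 55, 80, 90, 48, 67, 75, 59, 84])
performance_ratings = np.array([3.1, 3.8, 2.9, 4.2, 4.5, 2.4, 3.3, 4.0, 3.0, 4.4])

# Predictive validity is simply the correlation between the two.
r, p_value = pearsonr(assessment_scores, performance_ratings)
print(f"Predictive validity r = {r:.2f} (p = {p_value:.3f})")
```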
Construct validity
Does the test measure the construct it claims to measure? That sounds simple but is tricky in practice: a test labeled "analytical ability", for example, may in reality be measuring reading speed or familiarity with the question format.
Construct validity is typically examined through convergent validity (does the test correlate with other measures of the same construct?), discriminant validity (does it stay uncorrelated with unrelated constructs?), and factor analysis of the item structure.
Criterion validity
How well does the test predict a specific, concrete criterion? That criterion can be, for example, sales results, manager performance ratings, retention after one year, or time-to-productivity.
It's important to recognize that different criteria require different predictors. A test that predicts productivity doesn't automatically predict retention.
Content validity
Does the test cover relevant content for the role? An assessment for a software developer should test skills that show up in the actual work, such as reading and debugging code, solving realistic technical problems, and reasoning about design trade-offs.
Not for: general verbal intelligence or abstract pattern recognition that has no relation to daily work activities.
Reliability: is it consistent?
Reliability asks the question: does the test produce comparable results on repeated administration? A test cannot be valid without being reliable — but a reliable test is not automatically valid.
Test-retest reliability
Does the same person score similarly when taking the test at two different times? This is measured with the test-retest correlation: the same group takes the test twice, some weeks apart, and the two sets of scores are correlated. As a rule of thumb, a correlation of 0.70 or higher is considered acceptable for stable traits.
Important: some constructs are inherently less stable (e.g., mood vs. personality), which affects expected test-retest reliability.
Internal consistency
Do all questions within a section measure the same construct? This is measured with Cronbach's alpha, which summarizes how strongly the items in a scale hang together. Values of 0.70 and above are generally considered acceptable and 0.80 and above good, while values approaching 1.0 can signal that items are redundant.
Low internal consistency means some questions measure something different from the rest, making the total score unreliable.
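For illustration, a minimal sketch of the alpha calculation on a small, made-up response matrix (rows are candidates, columns are the items of one section):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a matrix of shape (respondents, items)."""
    k = items.shape[1]                              # number of items in the section
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Made-up responses: 6 candidates answering 4 items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 2, 3, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")  # >= 0.70 is a common minimum
```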
Inter-rater reliability
For assessments requiring human judgment (simulations, presentations, case interviews): do different evaluators reach the same conclusion? If they don't, a candidate's score says as much about who happened to do the evaluation as about the candidate.
The solution for low inter-rater reliability: structured scoring rubrics and evaluator training. Or better yet: deploy AI scoring where possible, which is inherently consistent.
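One common way to quantify inter-rater reliability for categorical judgments is Cohen's kappa, which corrects raw agreement for chance. A small sketch with made-up ratings from two evaluators:

```python
import numpy as np

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    categories = sorted(set(rater_a) | set(rater_b))
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Confusion matrix of how often the raters gave each pair of ratings.
    m = np.zeros((len(categories), len(categories)))
    for a, b in zip(rater_a, rater_b):
        m[idx[a], idx[b]] += 1
    observed = np.trace(m) / n                          # observed agreement
    expected = (m.sum(axis=1) @ m.sum(axis=0)) / n**2   # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Made-up example: two evaluators scoring 10 case presentations.
rater_1 = ["pass", "fail", "pass", "borderline", "pass", "fail", "pass", "borderline", "pass", "fail"]
rater_2 = ["pass", "fail", "borderline", "borderline", "pass", "fail", "pass", "pass", "pass", "fail"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```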
Why generic tests often fall short
Most commercial assessments (from providers like SHL, Harver, TestGorilla, and Saville) are validated on generic populations. That creates three structural problems:
The norm group problem
Scores are compared with thousands of random people from diverse industries and roles. But your candidates are not random people, and your role is not a generic role: being in the 80th percentile of a generic norm group says little about how someone compares with the people who actually succeed in this job at your company.
The static model problem
Generic tests are updated every 5-10 years. Your company changes continuously: new tools, new processes, new team structures, and shifting expectations of what each role requires.
A test validated in 2020 may no longer measure what's relevant in 2026.
The one-size-fits-all problem
The same personality test is used for developers, sales managers, finance analysts, and customer service representatives. But the competencies that predict success differ fundamentally from role to role.
Read more in our article on why generic assessments don't work.
The solution: company-specific validation
With your own assessment system, you can address the shortcomings of generic tests:
Building your own norm groups
Instead of comparing scores to a generic population, you build norm groups per role and department: a candidate is compared with the people who are already successful in that exact role at your company. The reference point becomes "good for this job here", not "average across the working population".
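A minimal sketch of what that looks like in practice: the same raw score is interpreted against the score distribution of successful employees in each role (all role names and numbers below are hypothetical).

```python
import numpy as np

def percentile_in_norm_group(candidate_score: float, norm_scores: np.ndarray) -> float:
    """Percentage of the norm group scoring at or below the candidate."""
    return 100 * np.mean(norm_scores <= candidate_score)

# Hypothetical norm groups: scores of current, successful employees per role.
norm_groups = {
    "software_developer": np.array([61, 68, 72, 75, 79, 81, 84, 88]),
    "account_manager":    np.array([55, 58, 63, 66, 70, 74, 77, 82]),
}

candidate_score = 76
for role, scores in norm_groups.items():
    pct = percentile_in_norm_group(candidate_score, scores)
    print(f"{role}: score {candidate_score} sits at the {pct:.0f}th percentile")
```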
Calculating predictive validity with your own data
This is the ultimate test: do your assessments actually predict success? With your own data, you can correlate assessment scores with later performance reviews, calculate predictive validity per role and per assessment component, and drop or re-weight the parts that turn out not to predict anything.
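As a sketch (again with made-up numbers), checking which assessment components actually carry predictive weight for a given role could look like this:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical hiring data for one role: per hire, two component scores
# plus the performance rating from the 6-month review.
cognitive   = np.array([70, 55, 82, 64, 91, 58, 77, 69, 85, 62])
work_sample = np.array([68, 60, 88, 59, 90, 52, 80, 71, 87, 57])
performance = np.array([3.4, 2.9, 4.3, 3.0, 4.6, 2.7, 3.9, 3.3, 4.4, 2.8])

# Which component actually predicts performance in this role?
for name, scores in {"cognitive": cognitive, "work_sample": work_sample}.items():
    r, p = pearsonr(scores, performance)
    print(f"{name}: r = {r:.2f} (p = {p:.3f})")
```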
Continuous calibration after every hire
After every hire, the model is validated:
1. Candidate scores on the assessment
2. Candidate is hired (or rejected)
3. After 6 months: performance review
4. Calculate correlation: was the prediction correct?
5. Adjust the model based on results
This means your assessment system gets smarter over time — an advantage generic tests cannot provide by definition.
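One naive way to implement step 5, shown purely as a sketch rather than a description of any specific product, is to re-weight assessment components by how strongly each has correlated with observed performance so far (all data below is hypothetical):

```python
import numpy as np

def recalibrate_weights(component_scores: np.ndarray, performance: np.ndarray) -> np.ndarray:
    """Re-weight assessment components by how strongly each one has correlated
    with observed performance (components with zero or negative correlation get weight 0)."""
    correlations = np.array([
        np.corrcoef(component_scores[:, i], performance)[0, 1]
        for i in range(component_scores.shape[1])
    ])
    weights = np.clip(correlations, 0, None)
    if weights.sum() == 0:
        return np.full_like(weights, 1 / len(weights))  # fall back to equal weights
    return weights / weights.sum()

# Hypothetical history: rows = past hires, columns = [cognitive, work_sample, personality].
history_scores = np.array([
    [70, 68, 55],
    [55, 60, 72],
    [82, 88, 61],
    [64, 59, 70],
    [91, 90, 58],
    [58, 52, 66],
])
review_ratings = np.array([3.4, 2.9, 4.3, 3.0, 4.6, 2.7])

print("Recalibrated weights:", recalibrate_weights(history_scores, review_ratings).round(2))
```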
Bias analyses on your own population
With your own data, you can actively monitor whether scores and pass rates differ systematically between demographic groups, which assessment components drive those differences, and whether adjustments are needed before they affect hiring decisions.
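As a sketch, a basic adverse-impact check might compare pass rates per group against the best-performing group (the data and group labels below are made up):

```python
import pandas as pd

# Hypothetical assessment outcomes with a protected attribute (e.g., gender).
df = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "passed": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
})

# Pass rate per group, and the ratio against the highest-scoring group.
pass_rates = df.groupby("group")["passed"].mean()
impact_ratios = pass_rates / pass_rates.max()
print(impact_ratios)

# A common screening heuristic (the "four-fifths rule") flags ratios below 0.80
# as a signal of possible adverse impact that warrants closer analysis.
print("Flagged groups:", list(impact_ratios[impact_ratios < 0.8].index))
```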
The difference in practice
| Aspect | Generic assessment | Company-specific assessment |
|--------|-------------------|---------------------------|
| Norm group | Thousands of random test-takers | Successful employees in the same role |
| Validation | Once, on a generic population | Continuously, on your own hires |
| Updates | Every 5-10 years | Recalibrated after every hire |
| Role fit | Same test for every role | Calibrated per role and department |
| Bias monitoring | Not on your population | On your own candidate population |
Key takeaways
An assessment without validation is an expensive gamble. You give it the appearance of objectivity, but in reality, you base decisions on unproven assumptions. A validated custom assessment, on the other hand, is a strategic weapon in your hiring.
The core points:

- Validity means the test measures what it claims to measure and demonstrably predicts job performance.
- Reliability means results are consistent across time, items, and evaluators; without reliability there can be no validity.
- Generic assessments rely on generic norm groups and static models, which rarely reflect what success means in your specific roles.
- A company-specific system builds its own norm groups, calculates predictive validity on its own data, recalibrates after every hire, and monitors bias on its own population.
Want to know how valid your current assessments are? Or want a system that continuously learns and improves? Get in touch or see how our AI hiring system builds assessment validation into the process.