Model Evaluation Benchmarks
Systematic tests, benchmark suites, and evaluation harnesses for measuring model capabilities, robustness, fairness, and risks.
Core metadata
- ID: model_evaluation_benchmarks
- Era: Modern
- First known date: 2017 (decade)
- Region: Global / multiple regions
- Review status: source_checked
- Maturity: established
Prerequisites
- Large Language Models (large_language_models)
- ML Benchmark Datasets (ml_benchmark_datasets)
- Probability & Statistical Inference (probability_statistics_inference)
Dependents
- AI Diagnostic Decision Support (ai_diagnostic_decision_support)
- AI Safety & Alignment Methods (ai_safety_alignment_methods)
- Robust Explainable AI (XAI) (explainable_ai_xai)
- Verifiable AI Reasoning Systems (verifiable_ai_reasoning_systems)
Fields
Field lanes
- Artificial Intelligence & Machine Learning: Data & Evaluation
Node sources
- Improving Transparency in AI Language Models: A Holistic Evaluation (Stanford HAI, 2022, generic_overview) • Supports: node, maturity
- Artificial Intelligence Risk Management Framework (AI RMF 1.0) (NIST, 2023, official_agency) • Supports: node, maturity
Prerequisite edge evidence
Edge/source evidence summary:
- Prerequisite edges: 3
- Average edge confidence: 68%
- Prerequisite sources: 3
- expert_inference: 3
| Prerequisite | Type | Confidence | Evidence level | Note | Sources |
|---|---|---|---|---|---|
| ML Benchmark Datasets (ml_benchmark_datasets) | enabling | 68% | expert_inference | ML Benchmark Datasets provides a capability that enables this technology without being the only possible path. | |
| Large Language Models (large_language_models) | enabling | 68% | expert_inference | Large Language Models provides a capability that enables this technology without being the only possible path. | |
| Probability & Statistical Inference (probability_statistics_inference) | enabling | 68% | expert_inference | Probability & Statistical Inference provides a capability that enables this technology without being the only possible path. | |
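The edge count, average confidence, and evidence-level tally in the summary can be recomputed from the prerequisite rows. The following is a minimal sketch, assuming the rows are transcribed by hand into tuples; the `edges` list, the `summarize` function, and the output key names are hypothetical and not part of the page's canonical JSON schema. The "Prerequisite sources" figure is not derived here because the table's Sources column is empty.

```python
from collections import Counter
from statistics import mean

# Rows transcribed from the prerequisite table:
# (prerequisite ID, edge type, confidence in percent, evidence level).
# The tuple layout is an assumption made for this sketch.
edges = [
    ("ml_benchmark_datasets", "enabling", 68, "expert_inference"),
    ("large_language_models", "enabling", 68, "expert_inference"),
    ("probability_statistics_inference", "enabling", 68, "expert_inference"),
]


def summarize(rows):
    """Recompute the summary figures from the edge rows."""
    return {
        # Number of prerequisite edges.
        "prerequisite_edges": len(rows),
        # Mean confidence across edges, rounded to a whole percent.
        "average_confidence_pct": round(mean(conf for _, _, conf, _ in rows)),
        # How many edges fall under each evidence level.
        "evidence_levels": dict(Counter(level for *_, level in rows)),
    }


summary = summarize(edges)
print(summary)
```

Because all three edges carry the same 68% confidence, the average equals each individual value; the calculation only becomes informative once edges are rated differently.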