Model Evaluation Benchmarks

Systematic tests, benchmark suites, and evaluation harnesses for measuring model capabilities, robustness, fairness, and risks.

Core metadata

ID: model_evaluation_benchmarks
Era: Modern
First known date: 2017 (decade)
Region: Global / multiple regions
Review status: source_checked
Maturity: established

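Prerequisites

ML Benchmark Datasets (ml_benchmark_datasets)
Large Language Models (large_language_models)
Probability & Statistical Inference (probability_statistics_inference)
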
Dependents

Fields

Artificial Intelligence & Machine Learning

Field lanes

Artificial Intelligence & Machine Learning: Data & Evaluation

Node sources

Improving Transparency in AI Language Models: A Holistic Evaluation (Stanford HAI, 2022, generic_overview) • Supports: node, maturity
Artificial Intelligence Risk Management Framework (AI RMF 1.0) (NIST, 2023, official_agency) • Supports: node, maturity

Prerequisite edge evidence

Edge/source evidence summary:

Prerequisite edges: 3
Average edge confidence: 68%
Prerequisite sources: 3
expert_inference: 3

Prerequisite	Type	Confidence	Evidence level	Note	Sources
ML Benchmark Datasets (ml_benchmark_datasets)	enabling	68%	expert_inference	ML Benchmark Datasets provides a capability that enables this technology without being the only possible path.	Improving Transparency in AI Language Models: A Holistic Evaluation (Stanford HAI, 2022, generic_overview) • Supports: node, maturity, edge
Large Language Models (large_language_models)	enabling	68%	expert_inference	Large Language Models provides a capability that enables this technology without being the only possible path.	Improving Transparency in AI Language Models: A Holistic Evaluation (Stanford HAI, 2022, generic_overview) • Supports: node, maturity, edge
Probability & Statistical Inference (probability_statistics_inference)	enabling	68%	expert_inference	Probability & Statistical Inference provides a capability that enables this technology without being the only possible path.	Improving Transparency in AI Language Models: A Holistic Evaluation (Stanford HAI, 2022, generic_overview) • Supports: node, maturity, edge
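
The summary figures can be cross-checked against the rows of the table. The following is a minimal sketch, not the generator's actual code: the tuple layout, the EDGES constant, and the summarize function are hypothetical names introduced here for illustration, and the values are transcribed by hand from the three rows above.

```python
from statistics import mean

# Rows transcribed from the prerequisite table:
# (prerequisite id, edge type, confidence in percent, evidence level).
# The tuple layout is an assumption made for this sketch.
EDGES = [
    ("ml_benchmark_datasets", "enabling", 68, "expert_inference"),
    ("large_language_models", "enabling", 68, "expert_inference"),
    ("probability_statistics_inference", "enabling", 68, "expert_inference"),
]


def summarize(edges):
    """Recompute the edge count, mean confidence, and evidence-level tally."""
    levels = {}
    for _node_id, _edge_type, _confidence, level in edges:
        levels[level] = levels.get(level, 0) + 1
    return {
        "prerequisite_edges": len(edges),
        "average_confidence_pct": round(mean(c for _, _, c, _ in edges)),
        "evidence_levels": levels,
    }


if __name__ == "__main__":
    print(summarize(EDGES))
```

Because every edge carries the same confidence and the same evidence level, the recomputed average equals the per-row value and the tally has a single entry. That uniformity suggests the confidences were assigned by a shared default and not estimated edge by edge, though the page itself does not say so.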
