Evaluation
This page describes how submissions to the PROCEDURE Track are evaluated and ranked. The evaluation philosophy follows the Metrics Reloaded recommendations [Maier-Hein et al., 2024], balancing interpretability of the ranking against methodological precision of the detailed performance analyses. The protocol is shared with the FRAME and SEGMENT Tracks with one key difference: the PROCEDURE Track maintains two parallel leaderboards — a Technical leaderboard and a Clinical leaderboard — reflecting the dual research questions of the FOCUS challenge.
| At a glance. Each test case is a Visual Question Answering (VQA) instance. Submissions are scored with Accuracy, aggregated into capability- and robustness-stratified buckets, and combined into a final ranking using the Copeland method. The procedure is run twice: once over all test questions for the Technical leaderboard, and once over the clinically relevant subset for the Clinical leaderboard. Ties within the top three places of each leaderboard are resolved by bootstrap-based head-to-head win rates. |
Why this evaluation design
| Interpretability | A single, transparent primary metric, Accuracy, makes leaderboard positions easy to read and verify, while a richer stratified analysis is reported alongside for nuance. |
| Fairness across categories | Equal-weight aggregation across capabilities and robustness buckets prevents large question categories from dominating the ranking. |
| Robustness first | Out-of-distribution (OOD) questions, including unseen procedure types and unseen question formulations, are weighted equally to in-distribution (ID) questions to reward generalization. |
| Theoretical guarantees | Because no aggregation scheme satisfies every desirable property simultaneously (Arrow's impossibility theorem), we adopt the Copeland method, which has favorable properties relative to alternatives such as Borda counts [Rofin et al., 2023]. |
Two leaderboards
The PROCEDURE Track is the most clinically demanding part of the FOCUS challenge, requiring models to reason over long video contexts up to entire procedures. To separate general foreign object understanding from direct clinical utility, we run the full ranking procedure on two question sets:
| Technical leaderboard | Computed from all VQA pairs in the test set. This leaderboard measures a model's general capability for long-horizon foreign object understanding, including questions of primarily technical interest. Example: When was the second needle inserted? |
| Clinical leaderboard | Computed from clinically relevant VQA pairs only. This leaderboard measures a model's ability to answer questions with direct intraoperative quality-assurance value. Example: Where is the sponge that was inserted at 00:42:17 still located? |
Every test case carries a binary clinical relevance flag set during annotation by expert surgeons. Both leaderboards use the same metric, aggregation, and ranking procedure described below; they differ only in the underlying question set.
Prize money for the PROCEDURE Track is split approximately equally between the two leaderboards, and a team qualifies for the final test phase by beating both baselines on at least one of the leaderboards during pre-evaluation.
Primary metric: Accuracy
The primary metric used for ranking is Accuracy, defined as the proportion of correctly answered VQA cases. We chose Accuracy because it directly operationalizes the assessment goal of correctness, is the de facto standard for VQA benchmarking, and yields results that are comparable across categorical and open-ended question types.
The well-known disadvantages of Accuracy, including sensitivity to class prevalence and threshold choice, are mitigated by the stratified data splits and hierarchical aggregation described below, rather than by replacing the metric.
Closed-ended questions
Closed-ended questions are scored by exact match against the reference answer after format-specific verification and parsing. Each question declares its answer format during dataset generation. A response that does not pass the format's verification step is counted as incorrect.
The full set of formats is implemented in the focus.data.formats module of the orena-focus Python package and includes:
| Format | Accepted response | Comparison rule |
| binary | yes / no, case-insensitive | Exact match on parsed boolean |
| number | Non-negative integer | Exact match on parsed integer |
| percentage | Non-negative number, optional % suffix | Tolerance-aware match on parsed float |
| fo_class | A registered foreign-object class name or none | Case-insensitive match on canonical class name |
| time | hh:mm:ss timestamp | Tolerance-aware match |
| multiple_choice / open_ended / matching | One of a predefined option set, or free-form text up to 300 characters | LLM-as-a-judge |
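The verify-parse-compare flow for closed-ended formats can be illustrated with a minimal sketch. The function names below (parse_binary, parse_number, score_exact) are hypothetical and written for this page only; they are not the API of focus.data.formats, whose actual verification rules may be stricter or more lenient.

```python
import re
from typing import Callable, Optional


def parse_binary(response: str) -> Optional[bool]:
    """Return True/False for a yes/no response, or None if verification fails."""
    text = response.strip().lower()
    if text == "yes":
        return True
    if text == "no":
        return False
    return None


def parse_number(response: str) -> Optional[int]:
    """Return a non-negative integer, or None if the response is not one."""
    text = response.strip()
    return int(text) if re.fullmatch(r"[0-9]+", text) else None


def score_exact(response: str, reference: str, parser: Callable) -> bool:
    """Exact match on parsed values; a response that fails parsing is incorrect."""
    parsed = parser(response)
    if parsed is None:
        return False
    return parsed == parser(reference)
```

For instance, score_exact("Yes", "yes", parse_binary) counts as correct, whereas a response such as "maybe" fails verification and is scored as incorrect.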
Open-ended questions: LLM-as-a-judge
For questions whose responses cannot be checked by exact match, namely the formats multiple_choice, open_ended, and matching, semantic correctness is assessed using an LLM-as-a-judge protocol. A sample judge implementation is provided in focus.evaluation.judges.
Up to three independent judge LLMs evaluate every open-ended response; the final verdict is determined by majority vote. The voting routine short-circuits as soon as an absolute majority is reached, which keeps inference cost low without changing the outcome. This paradigm has become common practice in modern VLM benchmarks and tolerates clinically meaningful linguistic variability; internal pilot experiments showed only minor disagreement across state-of-the-art judge LLMs.
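The short-circuiting majority vote can be sketched as follows. This is an illustration only: the Judge callable signature and the majority_verdict name are assumptions made for this example and do not describe the interface in focus.evaluation.judges. The handling of an unresolved vote (possible only with an even number of judges) is also an assumption.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (question, reference, response) -> verdict
Judge = Callable[[str, str, str], bool]


def majority_verdict(
    judges: Sequence[Judge], question: str, reference: str, response: str
) -> bool:
    """Query judges in order and stop once an absolute majority is reached."""
    needed = len(judges) // 2 + 1  # absolute majority, e.g. 2 of 3
    yes = no = 0
    for judge in judges:
        if judge(question, reference, response):
            yes += 1
        else:
            no += 1
        if yes >= needed:
            return True
        if no >= needed:
            return False
    # Only reachable with an even number of judges and a split vote;
    # treating this as incorrect is an assumption of this sketch.
    return False
```

With three judges, the third judge is queried only when the first two disagree, so the verdict is identical to a full vote while saving one inference call in the agreeing case.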
To prevent tuning towards specific judges, the exact set of judge models used for the official evaluation will not be disclosed until after the challenge concludes. The evaluation code is released openly, but the judge identity will remain redacted.
| Anti-gaming policy. Any attempt to manipulate the LLM-as-a-judge through adversarial prompting, jailbreaking, or other techniques aimed at unfairly influencing evaluation will result in immediate disqualification of the team. |
Tolerance-aware accuracy
For questions where the reference annotation has inherent uncertainty, most notably temporal references in the time format and numeric estimates in the percentage format, a prediction is counted as correct if it falls within a predefined tolerance window, configured per question via the format's threshold_seconds or threshold_pp parameter. Tolerance thresholds were derived from inter-rater variability and clinical input, similar to the application-dependent accuracy formulation of [Dergachyova et al., 2016].
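A tolerance-aware comparison might look like the sketch below. The parameter names threshold_seconds and threshold_pp come from the text above; the helper functions themselves are hypothetical, and treating the window as inclusive at its boundary is an assumption rather than documented behavior.

```python
def parse_hhmmss(text: str) -> int:
    """Convert an hh:mm:ss timestamp to seconds; raise ValueError if malformed."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid timestamp: {text!r}")
    return hours * 3600 + minutes * 60 + seconds


def time_match(response: str, reference: str, threshold_seconds: int) -> bool:
    """Correct if the predicted time lies within the tolerance window."""
    try:
        predicted = parse_hhmmss(response)
    except ValueError:
        return False  # fails format verification, counted as incorrect
    return abs(predicted - parse_hhmmss(reference)) <= threshold_seconds


def percentage_match(response: str, reference: float, threshold_pp: float) -> bool:
    """Correct if the predicted percentage is within threshold_pp points."""
    text = response.strip()
    if text.endswith("%"):
        text = text[:-1]  # the % suffix is optional
    try:
        value = float(text)
    except ValueError:
        return False
    if value < 0:
        return False
    return abs(value - reference) <= threshold_pp
```

With a five-second window, a prediction of 00:42:20 against a reference of 00:42:17 is accepted, while 00:42:30 is not.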
Missing submissions
Any VQA case for which a submission does not produce a response, including timeouts exceeding the per-question time budget, is treated as incorrect.
Aggregation: stratified buckets
Each test case carries three pieces of meta-information used for aggregation:
| 1. Robustness level | In-distribution (ID) or out-of-distribution (OOD) with respect to procedure type and question formulation. OOD cases come from procedure types not represented in the training set and from question phrasings not seen during training. |
| 2. Clinical relevance | A binary flag indicating whether the question is clinically relevant or only of technical interest. This flag determines which leaderboard a case contributes to. |
| 3. Primary capability | Each case is mapped to exactly one primary capability based on its question intent, using the FOCUS taxonomy. |
Capability mapping
| Capability | Question intent | Example |
| Object recognition and instance matching | Which object? In which state? Where? | Which type of foreign object was inserted around 00:17:30? |
| Temporal grounding | When? How long? | When was the second sponge introduced? |
| Aggregation | How many? | How many distinct needles have entered the abdominal cavity so far? |
| Event and procedural understanding | Which action? | Has the specimen bag been removed by the end of the recording? |
| Complex reasoning | Why? What happens if? | Are there any foreign objects unaccounted for at this point in the procedure? |
For every model and every leaderboard, mean Accuracy is computed within each capability × robustness bucket. With 5 capabilities and 2 robustness levels, this yields up to 10 bucket scores per model per leaderboard.
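The bucket computation can be sketched in a few lines. The case dictionary layout (keys capability, robustness, clinical, correct) and the bucket_scores function are assumptions made for illustration, not the data model of the orena-focus package. A missing response is represented here as correct=None and counted as incorrect, in line with the rule on missing submissions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket_scores(
    cases: Iterable[dict], clinical_only: bool = False
) -> Dict[Tuple[str, str], float]:
    """Mean Accuracy per (capability, robustness) bucket for one model."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for case in cases:
        # The Clinical leaderboard only uses clinically relevant cases.
        if clinical_only and not case["clinical"]:
            continue
        bucket = (case["capability"], case["robustness"])
        totals[bucket][0] += 1 if case.get("correct") else 0
        totals[bucket][1] += 1
    return {bucket: hits / n for bucket, (hits, n) in totals.items()}
```

Calling bucket_scores(cases) would give the Technical leaderboard buckets, and bucket_scores(cases, clinical_only=True) the Clinical ones. Fewer than 10 buckets appear if a capability has no cases at a given robustness level.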
Ranking procedure
The ranking is computed in three steps, and is run independently for each leaderboard, Technical and Clinical.
Step 1 — Per-bucket ranking with significance adjustment
Within each bucket, models are ranked by mean Accuracy. Pairwise significance tests using cluster-aware bootstrapping are then used to collapse ranks when performance differences are not significant. Two models with statistically indistinguishable bucket-level Accuracy receive the same rank, so that irrelevant differences within the noise floor of the test set do not propagate into the final ranking.
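One possible reading of this step is sketched below. The text does not specify the exact test or the collapsing rule, so both are assumptions: significance is decided by whether a percentile bootstrap interval of the Accuracy difference excludes zero, with whole videos resampled as clusters, and a model's rank is one plus the number of models that are significantly better than it. The official procedure may differ in any of these choices.

```python
import random
from typing import Callable, Dict, List


def cluster_bootstrap_significant(
    scores_a: Dict[str, List[int]],
    scores_b: Dict[str, List[int]],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> bool:
    """scores_x maps video id -> 0/1 case scores (same videos for both models)."""
    rng = random.Random(seed)
    videos = sorted(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample whole videos so that cases from one video stay together.
        sample = [rng.choice(videos) for _ in videos]
        a = [s for v in sample for s in scores_a[v]]
        b = [s for v in sample for s in scores_b[v]]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower > 0 or upper < 0  # interval excludes zero


def collapsed_ranks(
    accuracy: Dict[str, float], significant: Callable[[str, str], bool]
) -> Dict[str, int]:
    """Rank = 1 + number of models with significantly higher Accuracy."""
    return {
        m: 1 + sum(
            1
            for o in accuracy
            if o != m and accuracy[o] > accuracy[m] and significant(o, m)
        )
        for m in accuracy
    }
```

Under this rule, two models whose difference is not significant share a rank, and the next significantly worse model skips ahead accordingly (for example ranks 1, 1, 3).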
Step 2 — Aggregating buckets via the Copeland method
The per-bucket rankings are combined into a single overall ranking using the Copeland method [Rofin et al., 2023]:
- For every ordered pair of models A and B, count the number of buckets in which A is ranked strictly higher than B, and vice versa.
- Model A dominates model B if A is ranked higher more often than B is.
- Each model's Copeland score is the number of models it dominates minus the number of models that dominate it.
- Higher Copeland scores are better; models are ordered by descending Copeland score.
This approach was chosen over linear schemes such as the Borda rule (equivalent to averaging ranks) because it is more robust to irrelevant alternatives. In other words, adding or removing a weak submission does not arbitrarily perturb the ordering at the top.
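The Copeland aggregation described in the list above can be written compactly. The sketch assumes that each bucket ranking is a dictionary from model name to rank, with 1 as the best rank and tied models sharing a rank; the function name copeland_scores is illustrative and not taken from the evaluation package.

```python
from typing import Dict, List


def copeland_scores(bucket_ranks: List[Dict[str, int]]) -> Dict[str, int]:
    """Copeland score = models dominated minus models dominating."""
    models = sorted(bucket_ranks[0])
    scores = {m: 0 for m in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            # Count buckets where one model is ranked strictly higher.
            a_wins = sum(1 for ranks in bucket_ranks if ranks[a] < ranks[b])
            b_wins = sum(1 for ranks in bucket_ranks if ranks[b] < ranks[a])
            if a_wins > b_wins:  # a dominates b
                scores[a] += 1
                scores[b] -= 1
            elif b_wins > a_wins:  # b dominates a
                scores[b] += 1
                scores[a] -= 1
    return scores


buckets = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 1, "C": 3},
]
ranking = sorted(copeland_scores(buckets).items(), key=lambda kv: -kv[1])
```

In this toy example A and B each win one bucket against the other and tie in the third, so neither dominates; both dominate C. The resulting scores are A = 1, B = 1, C = -2, which leaves a tie at the top of the ranking.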
Step 3 — Tie-breaking via bootstrap win rate
Ties within the top three positions of each Copeland ranking are resolved by directly comparing the tied models:
- Within every bucket, draw K bootstrap samples with replacement of the case-level scores, respecting the clustering of cases within source videos.
- For each bootstrap sample, identify the tied model with the highest Accuracy in that bucket. This is one win.
- The win rate of a model in a bucket is the fraction of bootstrap samples in which it wins. The model's overall win rate is the mean win rate across all buckets of the leaderboard in question.
- Tied models are ordered by descending overall win rate.
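The tie-breaking steps above can be sketched as follows. The data layout (model -> video id -> list of 0/1 case scores) and the function names are assumptions for illustration. The text does not say what happens when several tied models share the highest Accuracy in a bootstrap sample; this sketch splits the win equally among them, which is an assumption.

```python
import random
from typing import Dict, List

BucketScores = Dict[str, Dict[str, List[int]]]  # model -> video -> case scores


def bucket_win_rates(scores: BucketScores, k: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Fraction of K cluster-bootstrap samples in which each tied model wins."""
    rng = random.Random(seed)
    models = sorted(scores)
    videos = sorted(scores[models[0]])
    wins = {m: 0.0 for m in models}
    for _ in range(k):
        # Resample source videos with replacement to respect clustering.
        sample = [rng.choice(videos) for _ in videos]
        acc = {}
        for m in models:
            drawn = [s for v in sample for s in scores[m][v]]
            acc[m] = sum(drawn) / len(drawn)
        best = max(acc.values())
        leaders = [m for m in models if acc[m] == best]
        for m in leaders:
            wins[m] += 1.0 / len(leaders)  # assumption: shared wins are split
    return {m: wins[m] / k for m in models}


def overall_win_rates(buckets: List[BucketScores], k: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Mean per-bucket win rate across all buckets of one leaderboard."""
    per_bucket = [bucket_win_rates(b, k, seed + i) for i, b in enumerate(buckets)]
    models = sorted(per_bucket[0])
    return {m: sum(r[m] for r in per_bucket) / len(per_bucket) for m in models}
```

The tied models would then be ordered by descending value of overall_win_rates. Only the models involved in the tie are passed in, so the comparison stays head-to-head.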
Outperforming the baselines
Two baseline submissions are provided:
| 1. Frontier closed-source VLM | A state-of-the-art frontier model, for example GPT-class or Gemini-class, applied zero-shot and selected by best validation performance, contingent on the availability of such a model with sufficient context length at the time of the challenge. |
| 2. Fine-tuned open-source VLM | A strong open-source VLM fine-tuned on the challenge training data by the organizers. |
During the pre-evaluation phase, a team is considered to outperform a baseline if its mean Accuracy across all buckets is higher than the baseline's. Teams that beat both baselines on at least one of the two leaderboards during pre-evaluation are admitted to the final test phase. Prizes in the final phase require beating both baselines on the corresponding leaderboard.
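The pre-evaluation admission rule can be expressed as a small check. The sketch assumes that "mean Accuracy across all buckets" is the unweighted mean of the bucket scores, consistent with the equal-weight aggregation described earlier, and that "higher" means strictly higher. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List

Buckets = Dict[str, float]  # bucket name -> mean Accuracy in that bucket


def mean_bucket_accuracy(buckets: Buckets) -> float:
    """Unweighted mean over buckets, so every bucket counts equally."""
    return sum(buckets.values()) / len(buckets)


def qualifies(team: Dict[str, Buckets], baselines: Dict[str, List[Buckets]]) -> bool:
    """True if the team beats every baseline on at least one leaderboard."""
    for board, team_buckets in team.items():
        team_mean = mean_bucket_accuracy(team_buckets)
        if all(team_mean > mean_bucket_accuracy(b) for b in baselines[board]):
            return True
    return False
```

A team that matches a baseline exactly does not count as outperforming it under this reading.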
The full Copeland-based ranking, not just mean Accuracy, is applied in the final test phase.
Open-source evaluation code
The evaluation code for the pre-evaluation phase is released publicly as part of the orena-focus Python package, so that participants can reproduce per-case scoring locally on the released training data. The aggregation and ranking code for the final phase will be released soon. The specific judge LLMs used for open-ended scoring remain undisclosed during the challenge to prevent over-fitting to a particular judge.