Evaluation

This page describes how submissions to the PROCEDURE Track are evaluated and ranked. The evaluation philosophy follows the Metrics Reloaded recommendations [Maier-Hein et al., 2024], balancing interpretability of the ranking against methodological precision of the detailed performance analyses. The protocol is shared with the FRAME and SEGMENT Tracks with one key difference: the PROCEDURE Track maintains two parallel leaderboards — a Technical leaderboard and a Clinical leaderboard — reflecting the dual research questions of the FOCUS challenge.

At a glance. Each test case is a Visual Question Answering (VQA) instance. Submissions are scored with Accuracy, aggregated into capability- and robustness-stratified buckets, and combined into a final ranking using the Copeland method. The procedure is run twice: once over all test questions for the Technical leaderboard, and once over the clinically relevant subset for the Clinical leaderboard. Ties within the top three places of each leaderboard are resolved by bootstrap-based head-to-head win rates.

Why this evaluation design

Interpretability

A single, transparent primary metric, Accuracy, makes leaderboard positions easy to read and verify, while a richer stratified analysis is reported alongside for nuance.

Fairness across categories

Equal-weight aggregation across capabilities and robustness buckets prevents large question categories from dominating the ranking.

Robustness first

Out-of-distribution (OOD) questions, including unseen procedure types and unseen question formulations, are given the same weight as in-distribution (ID) questions to reward generalization.

Theoretical guarantees

Because no aggregation scheme satisfies every desirable property simultaneously (Arrow's impossibility theorem), we adopt the Copeland method, which has favorable properties relative to alternatives such as Borda counts [Rofin et al., 2023].


Two leaderboards

The PROCEDURE Track is the most clinically demanding part of the FOCUS challenge, requiring models to reason over long video contexts, up to and including entire procedures. To separate general foreign object understanding from direct clinical utility, we run the full ranking procedure on two question sets:

Technical leaderboard

Computed from all VQA pairs in the test set. This leaderboard measures a model's general capability for long-horizon foreign object understanding, including questions of primarily technical interest.

Example: When was the second needle inserted?

Clinical leaderboard

Computed from clinically relevant VQA pairs only. This leaderboard measures a model's ability to answer questions with direct intraoperative quality-assurance value.

Example: Where is the sponge that was inserted at 00:42:17 still located?

Every test case carries a binary clinical relevance flag set during annotation by expert surgeons. Both leaderboards use the same metric, aggregation, and ranking procedure described below; they differ only in the underlying question set.

Prize money for the PROCEDURE Track is split approximately equally between the two leaderboards, and a team qualifies for the final test phase by beating both baselines on at least one of the leaderboards during pre-evaluation.


Primary metric: Accuracy

The primary metric used for ranking is Accuracy, defined as the proportion of correctly answered VQA cases. We chose Accuracy because it directly operationalizes the assessment goal of correctness, is the de facto standard for VQA benchmarking, and yields results that are comparable across categorical and open-ended question types.

The well-known disadvantages of Accuracy, including sensitivity to class prevalence and threshold choice, are mitigated by the stratified data splits and hierarchical aggregation described below, rather than by replacing the metric.

Closed-ended questions

Closed-ended questions are scored by exact match against the reference answer after format-specific verification and parsing. Each question declares its answer format during dataset generation. A response that does not pass the verification step of its format is counted as incorrect.

The full set of formats is implemented in the focus.data.formats module of the orena-focus Python package and includes:

Format | Accepted response | Comparison rule
binary | yes / no (case-insensitive) | Exact match on parsed boolean
number | Non-negative integer | Exact match on parsed integer
percentage | Non-negative number, optional % suffix | Tolerance-aware match on parsed float
fo_class | A registered foreign-object class name, or none | Case-insensitive match on canonical class name
time | hh:mm:ss timestamp | Tolerance-aware match
multiple_choice / open_ended / matching | One of a predefined option set, or free-form text up to 300 characters | LLM-as-a-judge
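As an illustration of the verify, parse, and compare flow for the binary and number formats, here is a minimal sketch. It does not reproduce the focus.data.formats API; the function names parse_binary, parse_number, and score_closed are hypothetical and chosen only for this example.

```python
import re
from typing import Callable, Optional


def parse_binary(response: str) -> Optional[bool]:
    """Return True/False for a yes/no response, or None if verification fails."""
    text = response.strip().lower()  # case-insensitive comparison
    if text == "yes":
        return True
    if text == "no":
        return False
    return None


def parse_number(response: str) -> Optional[int]:
    """Return a non-negative integer, or None if verification fails."""
    text = response.strip()
    # Only plain ASCII digits are accepted, so "-3" and "3.0" are rejected.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None


def score_closed(response: str, reference: str, parser: Callable) -> bool:
    """Exact match on the parsed value; an unparseable response is incorrect."""
    parsed = parser(response)
    if parsed is None:
        return False
    return parsed == parser(reference)


print(score_closed("Yes", "yes", parse_binary))
print(score_closed("-3", "3", parse_number))
```

Comparing parsed values rather than raw strings is what makes "Yes" and "yes" equivalent while still rejecting malformed responses.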

Open-ended questions: LLM-as-a-judge

For questions whose responses cannot be checked by exact match, namely formats multiple_choice, open_ended, and matching, semantic correctness is assessed using an LLM-as-a-judge protocol. A sample judge implementation is provided in focus.evaluation.judges.

Up to three independent judge LLMs evaluate every open-ended response; the final verdict is determined by majority vote. The voting routine short-circuits as soon as an absolute majority is reached, which keeps inference cost low without changing the outcome. This paradigm has become common practice in modern VLM benchmarks and tolerates clinically meaningful linguistic variability; internal pilot experiments showed only minor disagreement across state-of-the-art judge LLMs.
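The short-circuiting majority vote can be sketched as follows. This is not the implementation in focus.evaluation.judges: the Judge callable signature and the name majority_verdict are assumptions made for the example, and the sketch assumes an odd number of judges so that a majority always exists.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (question, reference, response) -> verdict.
Judge = Callable[[str, str, str], bool]


def majority_verdict(
    judges: Sequence[Judge], question: str, reference: str, response: str
) -> bool:
    """Query judges in order and stop once one verdict holds an absolute majority."""
    needed = len(judges) // 2 + 1  # 2 of 3, 3 of 5, ...
    correct_votes = 0
    incorrect_votes = 0
    for judge in judges:
        if judge(question, reference, response):
            correct_votes += 1
        else:
            incorrect_votes += 1
        # Remaining judges cannot change the outcome, so skip them.
        if correct_votes >= needed:
            return True
        if incorrect_votes >= needed:
            return False
    # Only reachable with zero judges or an even split.
    raise ValueError("no absolute majority; use an odd, non-zero number of judges")
```

With three judges, the third one is only queried when the first two disagree, which is where the cost saving comes from.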

To prevent tuning towards specific judges, the exact set of judge models used for the official evaluation will not be disclosed until after the challenge concludes. The evaluation code is released openly, but the judge identity will remain redacted.

Anti-gaming policy. Any attempt to manipulate the LLM-as-a-judge through adversarial prompting, jailbreaking, or other techniques aimed at unfairly influencing evaluation will result in immediate disqualification of the team.

Tolerance-aware accuracy

For questions where the reference annotation has inherent uncertainty, most notably temporal references in the time format and numeric estimates in the percentage format, a prediction is counted as correct if it falls within a predefined tolerance window. The window is configured per question via the format's threshold_seconds or threshold_pp parameter. Tolerance thresholds were derived from inter-rater variability and clinical input, similar to the application-dependent accuracy formulation of [Dergachyova et al., 2016].
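A tolerance-aware check for the time format might look like the sketch below. The parameter name threshold_seconds comes from the text above; the helper names, the symmetric window, and the inclusive boundary are assumptions for illustration and may differ from the official implementation.

```python
def parse_hhmmss(text: str) -> int:
    """Convert an 'hh:mm:ss' timestamp to seconds; raise ValueError if malformed."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not an hh:mm:ss timestamp: {text!r}")
    hours, minutes, seconds = (int(p) for p in parts)
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes/seconds out of range: {text!r}")
    return hours * 3600 + minutes * 60 + seconds


def time_within_tolerance(response: str, reference: str, threshold_seconds: int) -> bool:
    """Correct if the response lies within +/- threshold_seconds of the reference.

    Assumption: the window is symmetric and includes its boundary.
    """
    try:
        predicted = parse_hhmmss(response)
    except ValueError:
        return False  # failed format verification counts as incorrect
    return abs(predicted - parse_hhmmss(reference)) <= threshold_seconds


print(time_within_tolerance("00:17:40", "00:17:30", threshold_seconds=15))
```

The percentage format would follow the same pattern, with threshold_pp applied to the absolute difference of the parsed floats.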

Missing submissions

Any VQA case for which a submission does not produce a response, including timeouts exceeding the per-question time budget, is treated as incorrect.


Aggregation: stratified buckets

Each test case carries three pieces of meta-information used for aggregation:

1. Robustness level

In-distribution (ID) or out-of-distribution (OOD) with respect to procedure type and question formulation. OOD cases come from procedure types not represented in the training set and from question phrasings not seen during training.

2. Clinical relevance

A binary flag indicating whether the question is clinically relevant or only of technical interest. This flag determines which leaderboard a case contributes to.

3. Primary capability

Each case is mapped to exactly one primary capability based on its question intent, using the FOCUS taxonomy.

Capability mapping

Capability | Question intent | Example
Object recognition and instance matching | Which object? In which state? Where? | Which type of foreign object was inserted around 00:17:30?
Temporal grounding | When? How long? | When was the second sponge introduced?
Aggregation | How many? | How many distinct needles have entered the abdominal cavity so far?
Event and procedural understanding | Which action? | Has the specimen bag been removed by the end of the recording?
Complex reasoning | Why? What happens if? | Are there any foreign objects unaccounted for at this point in the procedure?

For every model and every leaderboard, mean Accuracy is computed within each capability × robustness bucket. With 5 capabilities and 2 robustness levels, this yields up to 10 bucket scores per model per leaderboard.
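The bucket computation can be sketched as below. The case dictionary layout (keys capability, robustness, clinical, correct) is a hypothetical representation chosen for the example, not the data schema of the orena-focus package. A missing response is represented as None and counted as incorrect, in line with the rule for missing submissions.

```python
from collections import defaultdict


def bucket_accuracies(cases, clinical_only=False):
    """Mean Accuracy per (capability, robustness) bucket.

    clinical_only=False corresponds to the Technical leaderboard (all cases),
    clinical_only=True to the Clinical leaderboard (flagged cases only).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [n_correct, n_cases]
    for case in cases:
        if clinical_only and not case["clinical"]:
            continue
        bucket = (case["capability"], case["robustness"])
        # None (no response or timeout) is falsy, so it counts as incorrect.
        totals[bucket][0] += 1 if case["correct"] else 0
        totals[bucket][1] += 1
    return {bucket: n_correct / n for bucket, (n_correct, n) in totals.items()}


cases = [
    {"capability": "Aggregation", "robustness": "ID", "clinical": True, "correct": True},
    {"capability": "Aggregation", "robustness": "ID", "clinical": False, "correct": None},
    {"capability": "Temporal grounding", "robustness": "OOD", "clinical": True, "correct": False},
]
print(bucket_accuracies(cases))
print(bucket_accuracies(cases, clinical_only=True))
```

Buckets that contain no cases simply do not appear in the result, which is why the text speaks of up to 10 bucket scores.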


Ranking procedure

The ranking is computed in three steps, and is run independently for each leaderboard, Technical and Clinical.

Step 1 — Per-bucket ranking with significance adjustment

Within each bucket, models are ranked by mean Accuracy. Pairwise significance tests using cluster-aware bootstrapping are then used to collapse ranks when performance differences are not significant. Two models with statistically indistinguishable bucket-level Accuracy receive the same rank, so that differences lying within the noise floor of the test set do not propagate into the final ranking.
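One possible form of the pairwise test is a paired cluster bootstrap on the Accuracy difference, sketched below. The exact test and significance level used by the organizers are not specified here, so the 95% percentile interval, the default n_boot, and the function names are assumptions. Whole source videos are resampled so that correlated cases from the same video stay together.

```python
import random


def cluster_bootstrap_diff(scores_a, scores_b, video_ids, n_boot=1000, seed=0):
    """95% percentile interval for Accuracy(A) - Accuracy(B) within one bucket.

    scores_a / scores_b are 0/1 case-level scores aligned with video_ids.
    """
    diffs_by_video = {}
    for a, b, video in zip(scores_a, scores_b, video_ids):
        diffs_by_video.setdefault(video, []).append(a - b)
    videos = list(diffs_by_video)
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample videos (clusters) with replacement, keeping their cases intact.
        drawn = rng.choices(videos, k=len(videos))
        sample = [d for video in drawn for d in diffs_by_video[video]]
        estimates.append(sum(sample) / len(sample))
    estimates.sort()
    lower = estimates[int(round(0.025 * n_boot))]
    upper = estimates[int(round(0.975 * n_boot)) - 1]
    return lower, upper


def significantly_different(scores_a, scores_b, video_ids, **kwargs):
    """Assumed decision rule: significant if the interval excludes zero."""
    lower, upper = cluster_bootstrap_diff(scores_a, scores_b, video_ids, **kwargs)
    return lower > 0 or upper < 0
```

Under this rule, two models whose interval contains zero would share a rank in that bucket.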

Step 2 — Aggregating buckets via the Copeland method

The per-bucket rankings are combined into a single overall ranking using the Copeland method [Rofin et al., 2023]:

  • For every pair of models A and B, count the number of buckets in which A is ranked strictly higher than B, and the number in which B is ranked strictly higher than A.
  • Model A dominates model B if A is ranked higher than B in more buckets than B is ranked higher than A.
  • Each model's Copeland score is the number of models it dominates minus the number of models that dominate it.
  • Higher Copeland scores are better; models are ordered by descending Copeland score.
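The Copeland aggregation described in the list above can be sketched as follows. The input layout (a mapping from bucket to per-model rank, where a lower rank is better and tied models share a rank) and the function name are assumptions for this example, and the sketch assumes every model has a rank in every bucket.

```python
from itertools import combinations


def copeland_scores(bucket_ranks):
    """Copeland score per model: models dominated minus models dominating it."""
    models = sorted({m for ranks in bucket_ranks.values() for m in ranks})
    scores = dict.fromkeys(models, 0)
    for a, b in combinations(models, 2):
        # Count buckets where one model is ranked strictly higher (lower rank number).
        a_wins = sum(ranks[a] < ranks[b] for ranks in bucket_ranks.values())
        b_wins = sum(ranks[b] < ranks[a] for ranks in bucket_ranks.values())
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
        # Equal counts: neither model dominates, scores stay unchanged.
    return scores


bucket_ranks = {
    "bucket_1": {"X": 1, "Y": 2, "Z": 3},
    "bucket_2": {"X": 1, "Y": 1, "Z": 2},  # X and Y collapsed to the same rank
    "bucket_3": {"X": 2, "Y": 1, "Z": 3},
}
print(copeland_scores(bucket_ranks))  # {'X': 1, 'Y': 1, 'Z': -2}
```

In this toy example X and Y each win one bucket against the other, so neither dominates; both dominate Z and end up tied, which is the situation the tie-breaking step addresses.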

This approach was chosen over linear schemes such as the Borda rule, which is equivalent to averaging ranks, because it is more robust to irrelevant alternatives. In other words, adding or removing a weak submission does not arbitrarily perturb the ordering at the top.

Step 3 — Tie-breaking via bootstrap win rate

Ties within the top three positions of each Copeland ranking are resolved by directly comparing the tied models:

  • Within every bucket, draw K bootstrap samples (with replacement) of the case-level scores, respecting the clustering of cases within source videos.
  • For each bootstrap sample, identify the tied model with the highest Accuracy in that bucket. This counts as one win for that model.
  • The win rate of a model in a bucket is the fraction of bootstrap samples in which it wins. The model's overall win rate is the mean win rate across all buckets of the leaderboard in question.
  • Tied models are ordered by descending overall win rate.
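The tie-breaking steps above can be sketched as below. The value of K, the data layout, and the function names are assumptions for illustration. The text does not say what happens when several tied models share the highest Accuracy in a bootstrap sample; this sketch splits the win equally among them, which is an assumption.

```python
import random


def bucket_win_rates(scores, video_ids, k=1000, seed=0):
    """Bootstrap win rate per tied model within one bucket.

    scores maps model name -> list of 0/1 case scores aligned with video_ids.
    """
    cases_by_video = {}
    for index, video in enumerate(video_ids):
        cases_by_video.setdefault(video, []).append(index)
    videos = list(cases_by_video)
    rng = random.Random(seed)
    wins = dict.fromkeys(scores, 0.0)
    for _ in range(k):
        # Resample whole videos so cases from one video stay together.
        drawn = rng.choices(videos, k=len(videos))
        sample = [i for video in drawn for i in cases_by_video[video]]
        accuracy = {m: sum(s[i] for i in sample) / len(sample) for m, s in scores.items()}
        best = max(accuracy.values())
        leaders = [m for m, acc in accuracy.items() if acc == best]
        for model in leaders:
            wins[model] += 1 / len(leaders)  # assumption: shared wins are split
    return {model: w / k for model, w in wins.items()}


def overall_win_rates(per_bucket_rates):
    """Mean win rate across buckets; input is a list of per-bucket dicts."""
    models = per_bucket_rates[0]
    return {m: sum(b[m] for b in per_bucket_rates) / len(per_bucket_rates) for m in models}
```

Tied models would then be ordered by descending value of overall_win_rates.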

Outperforming the baselines

Two baseline submissions are provided:

1. Frontier closed-source VLM

A state-of-the-art frontier model (for example, GPT-class or Gemini-class), applied zero-shot and selected by best validation performance. This baseline is contingent on the availability of such a model with sufficient context length at the time of the challenge.

2. Fine-tuned open-source VLM

A strong open-source VLM fine-tuned on the challenge training data by the organizers.

During the pre-evaluation phase, a team is considered to outperform a baseline if its mean Accuracy across all buckets is higher than the baseline's. Teams that beat both baselines on at least one of the two leaderboards during pre-evaluation are admitted to the final test phase. Prizes in the final phase require beating both baselines on the corresponding leaderboard.

The full Copeland-based ranking, not just mean Accuracy, is applied in the final test phase.
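The pre-evaluation admission rule can be sketched as follows. The sketch reads "mean Accuracy across all buckets" as the unweighted mean of the bucket scores, which matches the equal-weight aggregation principle but is an assumption, as are the function names and the dictionary layout.

```python
def mean_bucket_accuracy(bucket_scores):
    """Unweighted mean over bucket-level Accuracy values (assumed reading)."""
    values = list(bucket_scores.values())
    return sum(values) / len(values)


def beats_baselines(team_buckets, baseline_buckets_list):
    """True if the team's mean is strictly higher than every baseline's mean."""
    team_mean = mean_bucket_accuracy(team_buckets)
    return all(team_mean > mean_bucket_accuracy(b) for b in baseline_buckets_list)


def admitted_to_final(team_by_board, baselines_by_board):
    """Admission requires beating both baselines on at least one leaderboard."""
    return any(
        beats_baselines(team_by_board[board], baselines_by_board[board])
        for board in team_by_board
    )


team = {
    "technical": {"b1": 0.75, "b2": 0.25},
    "clinical": {"b1": 0.5, "b2": 0.5},
}
baselines = {
    "technical": [{"b1": 0.5, "b2": 0.75}, {"b1": 0.25, "b2": 0.25}],
    "clinical": [{"b1": 0.25, "b2": 0.5}, {"b1": 0.25, "b2": 0.25}],
}
print(admitted_to_final(team, baselines))  # True: both baselines beaten on the Clinical leaderboard
```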


Open-source evaluation code

The evaluation code for the pre-evaluation phase is released publicly as part of the orena-focus Python package, so that participants can reproduce per-case scoring locally on the released training data. The aggregation and ranking code for the final phase will be released soon. The specific judge LLMs used for open-ended scoring remain undisclosed during the challenge to prevent over-fitting to a particular judge.