Design Evaluation: Reliability and Correctness in an Assistant Thread/Run Coordinator with Tool Output Submission
Evaluation Overview
This evaluation focuses on designing and validating a coordinator that manages Assistants-style conversation threads and run lifecycles, including “requires_action” handling and tool output submission with correct correlation and ordering. The emphasis is on experimental design: how to measure end-to-end correctness and robustness when multi-step agent workflows interact with external tools and asynchronous run state.
You’ll be evaluated on your ability to:
- Design measurement strategies for multi-step agent orchestration (beyond single-turn accuracy)
- Interpret mixed experimental results and identify what the metrics do and don’t capture
- Spot gaps in evaluation methodology and propose targeted follow-up experiments
- Reason about robustness under retries, partial failures, and asynchronous execution
- Justify tradeoffs (polling vs event-driven, strict validation vs tolerance, throughput vs consistency)
- Anticipate deployment risks when external tools have side effects
System Design
We operate a multi-tenant agent platform where user requests create threads and runs, the model can request tool calls, and the system must submit tool outputs back to the correct run step to resume execution. The team is evaluating a proposed change to the coordinator and tool output submission handler to improve throughput and reduce stalled runs while keeping run-step correlation correct.
Current State
- A Thread/Run Coordinator polls run status and resumes runs by submitting tool outputs when runs enter requires_action.
- Tool results are normalized and submitted via a Tool Output Submission Handler that accepts tool execution results from a queue.
- Correlation is maintained using (thread_id, run_id, tool_call_id) plus an internal “step index” derived from the last seen run state.
- Idempotency is implemented at the submission handler level using a short-lived cache keyed by (run_id, tool_call_id) (TTL 10 minutes).
- External tools include both read-only tools (search, retrieval) and side-effectful tools (ticket creation, refunds, outbound notifications).
Proposed Change
- Move from pure polling to a hybrid model: webhook-driven run state updates with polling fallback.
- Add an asynchronous submission queue with batching (submit multiple tool outputs per run in one call when possible).
- Relax strict ordering enforcement: accept tool outputs as they arrive and rely on the run-step correlator to attach them to the correct step.
- Replace the in-memory idempotency cache with a shared dedup store (Redis) to support horizontal scaling.
- Add a “run completion watchdog” to auto-retry submissions when a run remains in requires_action beyond a threshold.
The Design Question
Engineering and product want to ship the hybrid coordinator to reduce latency and increase throughput before a planned customer onboarding in three weeks. Early results show fewer stalled runs and better p95 latency, but there are open questions about whether the evaluation captures end-to-end correctness for side-effectful tools under real operational conditions. During the evaluation, you may be asked to reason through this question and explore alternative approaches.
Experimental Setup
- Workload construction: 50k synthetic “agent tasks” replayed from production templates (customer support triage, order lookup, ticket creation, refund eligibility), with tool calls stubbed to a tool gateway that can simulate latency and timeouts.
- Environments: One staging cluster (8 coordinator pods, 6 tool-executor workers) and one canary slice in production (5% of traffic for 48 hours).
- Baselines: (A) current polling coordinator + synchronous tool output submission; (B) proposed hybrid coordinator + async submission queue + Redis dedup.
- Metrics tracked: run completion rate, time in requires_action, tool submission success rate, duplicate tool gateway requests (counted by identical payload hash within 60 seconds), and user-visible latency (p50/p95).
- Correctness checks: compare the final assistant message against a rubric-based evaluator for 2k sampled runs; verify tool outputs were accepted by the Assistants API (HTTP 200) and that runs reached a terminal state.
- Fault injection: 1% of tool calls experience a forced timeout; 0.2% of tool-executor workers are restarted during the run (randomly distributed), with automatic retry enabled at the job runner layer.
Results & Data
Artifact 1 — Staging benchmark summary (50k tasks, 6 hours)
| Metric | Baseline A (Polling+Sync) | Proposed B (Hybrid+Async) |
|---|---|---|
| Run completion rate (terminal within 2 min) | 96.8% | 98.9% |
| Median time in requires_action | 4.6 s | 1.9 s |
| p95 end-to-end run latency | 28.4 s | 19.7 s |
| Tool output submission HTTP 200 rate | 99.4% | 99.6% |
| “Stalled” runs (still requires_action at 2 min) | 1,610 | 540 |
| Coordinator CPU (avg) | 62% | 48% |
Artifact 2 — Tool gateway “duplicate request” counter (staging, same 50k tasks)
Definition: two POSTs with identical payload hash to the same tool endpoint within 60 seconds.
| Tool category | Baseline A | Proposed B |
|---|---|---|
| Read-only (search/retrieval) | 0.08% of calls | 0.11% of calls |
| Side-effectful (create/update) | 0.03% of calls | 0.27% of calls |
| Payment/refund endpoints (subset) | 0.00% of calls | 0.09% of calls |
Counts (Proposed B): 312 duplicates out of 115,904 side-effectful calls.
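The duplicate counter's definition (two POSTs with identical payload hash to the same endpoint within 60 seconds) can be sketched as follows. The request representation and function names are assumptions for illustration, not the gateway's actual implementation.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """Hash of canonical JSON so key ordering cannot mask a duplicate."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def count_duplicates(requests, window_seconds: float = 60.0) -> int:
    """requests: iterable of (timestamp, endpoint, payload_dict), timestamp-sorted.

    Counts POSTs whose (endpoint, payload hash) repeats within the window.
    """
    last_seen: dict[tuple[str, str], float] = {}
    duplicates = 0
    for ts, endpoint, payload in requests:
        key = (endpoint, payload_hash(payload))
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= window_seconds:
            duplicates += 1
        last_seen[key] = ts  # refresh the window on every sighting
    return duplicates
```

A limitation worth flagging for the review: this metric is blind to duplicates spaced more than 60 seconds apart, so late watchdog retries or post-TTL resubmissions would not appear in Artifact 2 at all.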
Artifact 3 — Canary A/B (production 5% traffic, 48 hours)
| Metric | Control (A) | Canary (B) |
|---|---|---|
| p95 user-visible latency | 2.9 s | 2.3 s |
| Runs reaching terminal state within 60 s | 94.1% | 95.6% |
| Support tickets tagged “assistant stuck” (per 10k sessions) | 7.4 | 4.1 |
| Tool gateway timeout rate | 0.7% | 0.8% |
| Tool gateway duplicate request counter (per 10k sessions) | 1.2 | 2.0 |
Artifact 4 — Sampled run-step correlation audit (2k runs, staging)
Method: compare the tool_call_id(s) requested by the model in requires_action to the tool outputs submitted by the handler; mark “matched” if every requested tool_call_id has exactly one submitted output.
| Outcome | Baseline A | Proposed B |
|---|---|---|
| All tool_call_ids matched exactly once | 98.6% | 98.9% |
| Missing tool output(s) | 0.9% | 0.6% |
| Extra submitted output(s) | 0.5% | 0.5% |
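The Artifact 4 matching rule can be sketched as below. The three-way bucketing (missing takes precedence over extra) is an assumption made so the outcomes are mutually exclusive, consistent with the table summing to 100%; the function name is illustrative.

```python
from collections import Counter


def classify_run(requested_ids, submitted_ids) -> str:
    """Classify one run per the audit's exactly-once matching rule.

    requested_ids: tool_call_ids the model asked for in requires_action.
    submitted_ids: tool_call_ids the handler actually submitted (with repeats).
    """
    requested = set(requested_ids)
    submitted = Counter(submitted_ids)
    if any(submitted[tc] == 0 for tc in requested):
        return "missing"  # at least one requested call got no output
    if any(tc not in requested or n > 1 for tc, n in submitted.items()):
        return "extra"    # an unrequested or repeated output was submitted
    return "matched"      # every requested id submitted exactly once
```

One caveat for interpreting Artifact 4: "extra" counts only outputs visible to the handler's submission log; duplicates absorbed upstream by the Assistants API would not surface here, which is why this audit and the tool gateway counter can disagree.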
Artifact 5 — Distribution of duplicate tool gateway requests vs worker restarts (staging fault injection)
| Condition bucket | Duplicate rate (side-effectful tools) |
|---|---|
| No injected timeouts, no restarts | 0.04% |
| 1% injected timeouts only | 0.18% |
| 0.2% worker restarts only | 0.21% |
| Both timeouts + restarts | 0.46% |
Total side-effectful calls in this experiment: 60,120.
Current Observations
- The team has noticed the proposed hybrid coordinator reduces time spent in requires_action and improves p95 end-to-end latency in both staging and canary.
- Initial results show the run-step correlation audit is similar between baseline and proposed, with slightly fewer missing tool outputs in the proposed approach.
- The team has noticed the tool gateway duplicate request counter is higher in the proposed approach, concentrated in side-effectful endpoints, while read-only tools remain close to baseline.
- Initial results show duplicate rates increase in experiments that include timeouts and worker restarts, with the highest rate when both are present.
- The team has noticed that most sessions with duplicate tool gateway requests still reach a terminal run state within the SLA window, and the rubric-based evaluator scores for final assistant messages are similar between A and B on the sampled set.
- Initial results show the canary has fewer “assistant stuck” tickets, while the duplicate request counter per 10k sessions is higher than control.
Constraints
- A customer onboarding milestone is scheduled in three weeks; product wants the latency and “stuck run” improvements for that launch.
- The team has limited ability to run large-scale production experiments beyond a 5% canary due to risk policies around side-effectful tools.
- External tool providers have heterogeneous support for idempotency semantics; some endpoints accept an idempotency key, others do not.
- Observability is partial: the Assistants API provides run and step status, but the system’s tool gateway logs are sampled at 10% in production.
- Budget allows for one additional week of focused engineering work before a go/no-go decision.
Your Role & What Happens Next
You are acting as the senior engineer leading the design review for the coordinator and tool output submission handler changes, with responsibility for evaluating whether the proposed approach is ready for broader rollout and what evidence is still missing.
In the next stage, you will be asked to interpret the experimental artifacts, assess whether the methodology supports the team’s current confidence level, identify key risks that aren’t directly measured, and outline a practical plan for closing the most important evidence gaps under the stated constraints.