Assistant Thread/Run Coordination Reliability in Tool-Driven Agent Flows
Evaluation Overview
This evaluation focuses on reliability and correctness in an Assistants-style orchestration layer that manages threads, run lifecycles, and tool-result submission—especially around requires_action states and multi-step tool execution. The scenario emphasizes reasoning about distributed behavior, correlation/ordering, and safe handling of retries in systems that can produce real-world side effects.
You’ll be evaluated on your ability to:
- Trace end-to-end run state across thread/run lifecycle events and tool submissions
- Reason about idempotency, deduplication, and correlation IDs in multi-step agent workflows
- Debug ambiguous incidents using logs/metrics/audit trails without overfitting to one signal
- Identify correctness gaps between orchestrator state and external tool state
- Propose pragmatic mitigations (code + operational controls) under real constraints
- Communicate tradeoffs between throughput, consistency, and implementation complexity
System Overview
Architecture
- The product exposes an “agent” API that maps each user conversation to an Assistants-style thread, and each user turn to a run.
- A Thread/Run Coordinator owns run creation, status tracking, and resumption when a run enters requires_action.
- A Tool Output Submission Handler accepts tool execution results, validates/normalizes them, correlates them to the correct run step, and submits them back to the run to continue.
- Tools are executed out-of-process via a tool gateway (HTTP) that calls internal services (e.g., ticketing, email, billing sandbox) and returns structured JSON results.
- The coordinator uses a mix of polling (for run status) and event callbacks (tool completion events) to drive progress.
- An idempotency/deduplication layer exists for some internal events, but tool execution spans multiple services and queues.
Implementation
The coordinator and submission handler run as stateless services behind an internal load balancer. Run state is primarily provider-managed (thread/run state), with application-side correlation metadata stored in Redis and Postgres for observability and deduplication.
- Run status is polled every 2s while “active”; requires_action triggers tool dispatch and waits for tool completion.
- Tool results are normalized into a canonical schema and submitted via submit_tool_outputs with tool_call_id and a per-run correlation ID.
- The tool gateway uses HTTP timeouts (p95 target 3s, hard timeout 10s) and retries some requests on network errors.
- Submission handler processes tool completion events from a queue (at-least-once delivery), then submits to the provider API.
- Logs include thread_id, run_id, tool_call_id (when available), and an internal correlation_id generated at run start.
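The poll/dispatch cycle described above can be sketched roughly as follows. The client methods (get_run, dispatch_tool), the run-payload shape, and the dispatched-set guard are all illustrative assumptions, not the actual service code; only the 2s interval and the requires_action trigger come from the description.

```python
import time

POLL_INTERVAL_S = 2  # run status polled every 2s while active (per the description)

TERMINAL_STATUSES = ("completed", "failed", "cancelled", "expired")


def drive_run(client, thread_id, run_id, corr_id):
    """Illustrative coordinator loop: poll run status, dispatch tools when the
    run enters requires_action, and stop on a terminal state. All client APIs
    and field names are hypothetical."""
    dispatched = set()  # guard against re-dispatching a call while its output is pending
    while True:
        run = client.get_run(thread_id, run_id)
        if run["status"] == "requires_action":
            for call in run["required_tool_calls"]:
                if call["id"] in dispatched:
                    continue  # already in flight; the run stays requires_action until outputs arrive
                dispatched.add(call["id"])
                # Dispatch out-of-process via the tool gateway; results come back
                # asynchronously on a queue and are submitted separately.
                client.dispatch_tool(
                    name=call["name"],
                    tool_call_id=call["id"],
                    corr_id=corr_id,
                )
        elif run["status"] in TERMINAL_STATUSES:
            return run["status"]
        time.sleep(POLL_INTERVAL_S)
```

The dispatched-set guard matters here: without it, every poll that still sees requires_action would re-dispatch the same tool call.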
What Happened
On Tuesday between 10:14–10:27 UTC, several customers reported duplicate side effects from agent actions (e.g., two identical support tickets created, or the same “send confirmation email” action occurring twice) even though the UI showed a single assistant response for the turn. The issue was intermittent and clustered around a brief period of elevated latency in the agent backend. During the evaluation, you may be asked to investigate this and similar scenarios.
Example Artifacts
Artifact 1: Coordinator run lifecycle log (appears mostly normal)
2026-03-12T10:16:02.118Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a create_run model=gpt-4.1
2026-03-12T10:16:03.904Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=in_progress
2026-03-12T10:16:05.221Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=requires_action tool_calls=1
2026-03-12T10:16:05.233Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a dispatch_tool name=create_ticket tool_call_id=call_A1
2026-03-12T10:16:08.041Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=in_progress
2026-03-12T10:16:09.612Z coordinator INFO thread=thr_8f2 run=run_4c1 corr=cr_91a status=completed
Artifact 2: Tool gateway access log (subtle anomaly)
10:16:05.410Z POST /tools/create_ticket 200 2.1s
x-corr-id=cr_91a x-run-id=run_4c1 x-thread-id=thr_8f2
body_hash=9b3f... tool_call_id=call_A1
10:16:07.982Z POST /tools/create_ticket 200 1.9s
x-corr-id=cr_91a x-run-id=run_4c1 x-thread-id=thr_8f2
body_hash=9b3f... tool_call_id=call_A1
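One quick way to surface the anomaly in Artifact 2 is to group gateway log entries by (tool_call_id, body_hash) and flag keys seen more than once. A sketch, assuming the access log has already been parsed into dicts with those fields:

```python
from collections import Counter


def find_duplicate_dispatches(entries):
    """Flag (tool_call_id, body_hash) pairs appearing more than once in
    parsed tool-gateway log entries -- a sign the same tool call reached
    the gateway twice."""
    counts = Counter((e["tool_call_id"], e["body_hash"]) for e in entries)
    return [key for key, n in counts.items() if n > 1]
```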
Artifact 3: External system audit trail (ticketing)
[
{
"event": "ticket.created",
"ticket_id": "TCK-774201",
"created_at": "2026-03-12T10:16:07.531Z",
"source": "agent-tool-gateway",
"request_fingerprint": "9b3f..."
},
{
"event": "ticket.created",
"ticket_id": "TCK-774202",
"created_at": "2026-03-12T10:16:09.104Z",
"source": "agent-tool-gateway",
"request_fingerprint": "9b3f..."
}
]
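The request_fingerprint in the audit trail is a payload hash; a common construction is SHA-256 over a canonicalized JSON body. The exact scheme used by the gateway is not documented, so the following is only a plausible sketch:

```python
import hashlib
import json


def request_fingerprint(body: dict) -> str:
    """Hash a canonicalized JSON payload: sorted keys and compact separators
    ensure semantically identical bodies yield identical fingerprints."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Matching fingerprints on both tickets (as in Artifact 3) indicate byte-identical request bodies, i.e., a replayed request rather than two distinct user actions.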
Artifact 4: Submission handler log (looks reasonable, but incomplete)
2026-03-12T10:16:07.610Z submitter INFO corr=cr_91a run=run_4c1 tool_call_id=call_A1 normalize_ok bytes=842
2026-03-12T10:16:07.742Z submitter INFO corr=cr_91a run=run_4c1 tool_call_id=call_A1 submit_tool_outputs status=200
2026-03-12T10:16:08.003Z submitter WARN corr=cr_91a run=run_4c1 tool_call_id=call_A1 duplicate_event queue_msg_id=qm_5521 action=drop
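The WARN line above drops a duplicate keyed by queue_msg_id, which only catches redelivery of the same queue message; a retried dispatch arrives under a fresh message ID and slips through. A sketch of deduplicating on the tool call itself instead (in-memory here; a shared store such as Redis SET NX with a TTL would be the production analogue):

```python
class ToolCallDeduper:
    """Deduplicate submissions by (run_id, tool_call_id) rather than
    queue_msg_id, so retried dispatches carrying new message IDs are
    still caught. In-memory sketch only; stateless replicas would need
    a shared store (e.g., Redis SET NX with a TTL)."""

    def __init__(self):
        self._seen = set()

    def first_seen(self, run_id: str, tool_call_id: str) -> bool:
        """Return True exactly once per (run_id, tool_call_id)."""
        key = (run_id, tool_call_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```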
Observed Signals
- Duplicate downstream records appear within 1–3 minutes of each other with near-identical payload fingerprints, while the assistant UI shows only one completed run for the user turn.
- Tool gateway logs show two HTTP POSTs with the same body_hash and the same x-run-id/x-thread-id during the incident window; some entries include the same tool_call_id.
- Orchestrator tracing shows repeated spans for “tool dispatch” on a small subset of runs during a period of elevated backend latency; other runs in the same timeframe behave normally.
- Customer support reports include a few cases where users insist they clicked once; other cases include users who retried after seeing a spinner, creating ambiguity about client-side resubmission.
- The submission handler reports dropping some duplicate queue events, but external side effects still show duplicates for a subset of those runs.
- Infrastructure metrics show a brief spike in replica churn/restarts in the agent backend around 10:15–10:20 UTC, alongside increased p95 request latency.
Constraints
- Production impact is high: duplicated side effects (tickets/emails/billing sandbox actions) require manual cleanup and erode trust.
- Logging is partially redacted for privacy; full tool payloads are not stored, only hashes/fingerprints and selected headers.
- The system must maintain throughput (no global locks); p95 end-to-end turn latency target is 6 seconds.
- Changes should be incremental and safe to roll out behind feature flags; a full architecture rewrite is out of scope.
- External tools are not uniformly idempotent; some support idempotency keys, others do not.
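For the subset of tools that do accept idempotency keys, a stable key can be derived from identifiers the system already logs, so that any retry path (gateway retry, queue redelivery, coordinator re-dispatch) presents the same key. A hypothetical sketch:

```python
import hashlib


def idempotency_key(run_id: str, tool_call_id: str) -> str:
    """Derive a stable idempotency key from identifiers that are constant
    across retries of the same tool call. An idempotency-aware tool can
    then collapse duplicate requests server-side."""
    return hashlib.sha256(f"{run_id}:{tool_call_id}".encode("utf-8")).hexdigest()
```

For tools without idempotency support, the same key can still gate dispatch on the gateway side (first-writer-wins in a shared store) rather than relying on the downstream service.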
Your Role & What Happens Next
You’re stepping in as an engineer responsible for the Assistant Thread/Run Coordinator and the Tool Output Submission Handler, with the goal of improving correctness and operational reliability in multi-step tool-driven runs.
During the evaluation, you’ll be asked to reason through the run lifecycle, correlate artifacts across services, and propose changes that prevent stalling at requires_action while also reducing the chance of duplicated side effects—without sacrificing latency and throughput.