Run Coordinator: Tool Output Continuity at Scale

What This Covers

You’ll reason about an Assistants-style thread/run coordinator and a tool-output submission handler under transient infrastructure instability. The focus is on debugging run lifecycles, correlating tool results to run steps, and preventing user-visible side effects.

The System

  • A backend service manages conversation threads and runs, polling run status and resuming runs when they enter requires_action.
  • Tools are executed by internal workers; results flow through a Tool Output Submission Handler that validates payloads and submits outputs back to the run.
  • The submission handler maintains correlation IDs (thread_id, run_id, tool_call_id) and enforces step ordering.
  • The coordinator is deployed on a horizontally scaled serving layer; replicas can restart during load spikes.
  • External tools include side-effectful actions (e.g., ticket creation, email sending), each of which returns an audit ID.
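
The lifecycle above can be sketched as a polling loop. This is a hedged, self-contained illustration, not the service's real API: the client methods (fetch_run, execute_tool, submit_tool_outputs) and the status names beyond requires_action are assumptions for the sake of the sketch.

```python
import time

def poll_run(client, thread_id, run_id, interval_s=0.01):
    """Poll a run until it reaches a terminal state, resuming it whenever it
    enters requires_action by executing pending tool calls and submitting
    their outputs, correlated by tool_call_id."""
    while True:
        run = client.fetch_run(thread_id, run_id)
        if run["status"] == "requires_action":
            outputs = [
                {"tool_call_id": call["id"], "output": client.execute_tool(call)}
                for call in run["pending_tool_calls"]
            ]
            client.submit_tool_outputs(thread_id, run_id, outputs)
        elif run["status"] in ("completed", "failed", "cancelled", "expired"):
            return run
        time.sleep(interval_s)
```

Note that nothing in this loop distinguishes a first submit from a retried one; that gap is exactly where the incident below lives.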

What Happened

On Tuesday, between 14:05 and 14:25 UTC, Support reported a burst of duplicate outbound actions for a small subset of users (e.g., duplicate tickets or emails created within seconds of each other). The on-call also saw runs spending longer in requires_action before completing. During the evaluation, you'll investigate this and related scenarios.

Anchor Artifact

2026-03-19T14:12:08.441Z  serve/replica-llm-7  WARN  Replica restarting (exit=137, OOMKilled=true)
2026-03-19T14:12:09.003Z  coordinator         INFO  run.status thread=thd_9c2a run=run_31f7 status=requires_action pending_tool_calls=1
2026-03-19T14:12:09.017Z  tool-exec           INFO  POST /tools/create_ticket req=gw_8b91 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.028Z  tool-exec           WARN  create_ticket timeout req=gw_8b91 elapsed_ms=10011
2026-03-19T14:12:19.041Z  tool-exec           INFO  POST /tools/create_ticket req=gw_8b92 user=usr_1042 payload_hash=9f3c... timeout=10s
2026-03-19T14:12:19.612Z  tool-exec           INFO  201 Created req=gw_8b92 tool_audit_id=tkt_772901
2026-03-19T14:12:19.640Z  submitter           INFO  submit_tool_outputs thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12 bytes=684
2026-03-19T14:12:20.104Z  tool-exec           INFO  201 Created req=gw_8b91 tool_audit_id=tkt_772900
2026-03-19T14:12:20.131Z  submitter           WARN  duplicate submit attempt thread=thd_9c2a run=run_31f7 tool_call_id=call_A1 step=stp_12

Observed Signals

  • Monitoring shows a short spike in replica restarts and p95 run completion time (from 4.2s to 11.8s) during the same window.
  • The team noticed that some external systems contained near-identical records created seconds apart, while the app UI showed only one confirmation.
  • A few customers reported “I clicked once” while Support chat transcripts show occasional repeated user messages.

Constraints

  • You can’t change the external tool providers’ behavior this week.
  • Any mitigation must preserve throughput (no global locking) and keep median latency within 10% of baseline.
  • You must assume tool execution and tool-output submission can be retried by infrastructure.
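
One common shape for a mitigation under these constraints is to key the side effect on the stable correlation IDs rather than the per-attempt request ID, so a retry reuses the same key. The sketch below is illustrative only: the in-memory dict stands in for a shared store (e.g., a compare-and-set in Redis), and the function names are assumptions, not existing code.

```python
import hashlib

# Illustrative in-memory idempotency store: key -> tool_audit_id.
# A real deployment needs a shared store with TTL and an atomic
# check-and-reserve, or this dedupes only within one replica.
_sent: dict = {}

def idempotency_key(run_id: str, tool_call_id: str, payload_hash: str) -> str:
    # Stable across retries: a timed-out attempt and its retry share all
    # three inputs (as gw_8b91 and gw_8b92 share payload_hash in the logs).
    raw = f"{run_id}:{tool_call_id}:{payload_hash}"
    return hashlib.sha256(raw.encode()).hexdigest()

def create_ticket_once(run_id, tool_call_id, payload_hash, do_create):
    """Perform the side effect at most once per (run, tool_call, payload)."""
    key = idempotency_key(run_id, tool_call_id, payload_hash)
    if key in _sent:
        return _sent[key]          # retry path: return the prior audit ID
    audit_id = do_create()         # first attempt: perform the side effect
    _sent[key] = audit_id
    return audit_id
```

A caveat worth surfacing in your answer: a client-side check like this does not close the in-flight race in the anchor artifact, where the first request (gw_8b91) timed out but later succeeded. Fully closing that requires the provider to honor an idempotency key, or a reservation written to the shared store before the call is issued.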

What You’ll Be Evaluated On

  • How you triage and narrow hypotheses from partial logs/metrics
  • Run/thread lifecycle reasoning, including requires_action handling and step correlation
  • Designing safe retry/idempotency and deduplication strategies across components
  • Proposing mitigations that reduce user impact quickly, plus longer-term fixes
  • Identifying what additional telemetry you’d add (and where) to confirm root cause
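
For the last bullet, one concrete shape of the missing telemetry is a structured submit event carrying the full correlation tuple plus an attempt counter, so duplicate submits can be joined across the coordinator, tool-exec, and submitter logs. The field names below are assumptions, not an existing schema.

```python
import json
import time

def submit_event(thread_id, run_id, tool_call_id, step_id, attempt, outcome):
    """Emit one structured log line per tool-output submit attempt."""
    return json.dumps({
        "ts": time.time(),
        "event": "submit_tool_outputs",
        "thread_id": thread_id,
        "run_id": run_id,
        "tool_call_id": tool_call_id,
        "step_id": step_id,
        "attempt": attempt,   # distinguishes the first submit from retries
        "outcome": outcome,   # e.g. "ok" | "duplicate" | "error"
    }, sort_keys=True)
```

With attempt and outcome recorded at every hop, the 14:12 window above would show attempt=2 succeeding before attempt=1, confirming the retry-race hypothesis directly instead of inferring it from payload hashes.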